54 datasets found

Meta data and supporting documentation
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available on line here: https://github.com/warrenjl/SpGPCW. Format: Abstract The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. File format: R workspace file. Metadata (including data dictionary) • y: Vector of binary responses (1: preterm birth, 0: control) • x: Matrix of covariates; one row for each simulated individual • z: Matrix of standardized pollution exposures • n: Number of simulated individuals • m: Number of exposure time periods (e.g., weeks of pregnancy) • p: Number of columns in the covariate design matrix • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate). This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
n
Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
datadryad.org
zip
Updated Jan 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Harvard Medical School
Massachusetts General Hospital
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
l
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road...
catalog.data.gov
s.cnmilf.com
Updated Apr 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2025). Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road Runoff 20250218 [Dataset]. https://catalog.data.gov/dataset/dataset-screening-causal-assessment-of-brook-trout-occurrence-and-road-runoff-20250218
Explore at:
Dataset updated
Apr 25, 2025
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
Pedigree of all data and processing included in the manuscript. Open zip file then access pedigree folder for file describing all other folders, links, and data dictionary Items: NOTES: Description of work and other worksheets. Pedigree: Summary source files used to create figures and tables. DataFiles: Data files used in the R code for creating the figures and tables. DataDictionary: Data file titles in all data files Data: Data file uploaded to Science Hub Output: Files generated from R scripts Plot: Plots generated from R scripts and other software R_Scripts: Clean R scripts used to analyze the data, generate figures and tables Result: Tables generated from R scripts
Bitter Creek Analysis Pedigree Data
catalog.data.gov
s.cnmilf.com
Updated Sep 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2022). Bitter Creek Analysis Pedigree Data [Dataset]. https://catalog.data.gov/dataset/bitter-creek-analysis-pedigree-data
Explore at:
Dataset updated
Sep 25, 2022
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
These data sets contain raw and processed data used in for analyses, figures, and tables in the Region 8 Memo: Characterization of chloride and conductivity levels in the Bitter Creek Watershed, WY. However, these data may be used for other analyses alone or in combination with other or new data. These data were used to assess whether chloride levels are naturally high in streams in the Bitter Creek, WY watershed and how chloride concentrations expected to protect 95 percent of aquatic genera in these streams compare to Wyoming’s chloride criteria applicable to the Bitter Creek watershed. Owing to the arid conditions, background conductivity and chloride levels were characterized for surface flow and ground water flow conditions. Natural chloride levels were found to be less than current water quality criteria for Wyoming. Although the report was prepared for USEPA Region 8 and OST, Office of Water, the report will be of interest to the WDEQ, Sweetwater County Conservation District, and the regulated community. No formal metadata standard was used. Pedigree.xlsx contains: 1. NOTES: Description of work and other worksheets. 2. Pedigree_Summary: Source files used to create figures and tables. 3. DataFiles: Data files used in the R code for creating the figures and tables 4. R_Script: Summary of the R scripts. 5. DataDictionary: Data file titles in all data files Folders: _Datasets Data file uploaded to Environmental Dataset Gateway "A list of subfolders: _R: Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 6 memo R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions the ""_R"" folder stores R scripts for input and output files and an R project file.. Users can open the R project and run R scripts directly from the ""_R"" folder or the XC95 folder by installing R, RStudio, and associated R packages."
l
LScDC Word-Category RIG Matrix
figshare.le.ac.uk
pdf
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.12133431.v2
Dataset updated
Apr 28, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
Simulation Data Set
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”. Metadata (including data dictionary) • y: Vector of binary responses (1: adverse outcome, 0: control) • x: Matrix of covariates; one row for each simulated individual • z: Matrix of standardized pollution exposures • n: Number of simulated individuals • m: Number of exposure time periods (e.g., weeks of pregnancy) • p: Number of columns in the covariate design matrix • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate) Code Abstract We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities. Description “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript). Optional Information (complete as necessary) Required R packages: • For running “CWVS_LMC.txt”: • msm: Sampling from the truncated normal distribution • mnormt: Sampling from the multivariate normal distribution • BayesLogit: Sampling from the Polya-Gamma distribution • For running “Results_Summary.txt”: • plotrix: Plotting the posterior means and credible intervals Instructions for Use Reproducibility (Mandatory) What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study. How to use the information: • Load the “Simulated_Dataset.RData” workspace • Run the code contained in “CWVS_LMC.txt” • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”. Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set: Data The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publically available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022
catalog.data.gov
datasets.ai
Updated Jul 18, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2025). Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022 [Dataset]. https://catalog.data.gov/dataset/data-set-st-louis-river-watershed-mn-conductivity-assessment-march-2022
Explore at:
Dataset updated
Jul 18, 2025
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Area covered
Saint Louis River, Minnesota
Description
Data used to evaluate potential downstream impacts of the NorthMet Mine, by USEPA Office of Research and Development is providing, for USEPA Region 5’s use, including a characterization of stream specific conductivity (SC) levels, least disturbed background SC, and SC levels that may exceed the Fond du Lac Band’s WQ standards and adversely affect aquatic life, including brook trout (Salvelinus fontinalis), lake sturgeon (Acipenser fulvescens), and benthic macroinvertebrates. Keywords: Conductivity, St. Louis River, benthic invertebrates; mining The attached Excel Pedigree includes: _Datasets: Data file uploaded to EPA Science Hub and/or Environmental Data Set Gateway _R : Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 5 memo 20220325 R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions. The "_R" folder contains four subfolders. Each subfolder has several R scripts, input and output files, and an R project file. Users can run R scripts directly from each subfolder by installing R, RStudio, and associated R packages. Data Dictionary: See tab DataDictionary in Excel file Datasets: Simplified language is used in the text to identify parent data sets. Source and File names are retained in this pedigree in original form to enable R-scripts to retain functionality. • Thingvold et al. (1975-1977) • Griffith (1998-2009) • Predicted background (2000-2015) • Water Quality Portal (1996-2021) • Water Quality Portal Less Disturbed (1996-2021) • Minnesota Pollution Control Agency (MPCA) (1996-2013) • Mid-Atlantic Highlands (1990 to 2014). This dataset is associated with the following publication: Cormier, S., and Y. Wang. Appendix C: ORD Specific Conductance Memo, from Susan Cormier to Tera Fong. March 15, 2022. Assessment of effects of increased ion concentrations in the St. Louis River Watershed with special attention to potential mining influence and the jurisdiction of the Fond du Lac Band of Lake Superior Chippewa. U.S. Environmental Protection Agency, Washington, DC, USA, 2022.
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data and have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables conveydemographics (281 variables),dietary consumption (324 variables),physiological functions (1,040 variables),occupation (61 variables),questionnaires (1444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),medications (29 variables),mortality information linked from the National Death Index (15 variables),survey weights (857 variables),environmental exposure biomarker measurements (598 variables), andchemical comments indicating which measurements are below or above the lower limit of detection (505 variables).csv Data Record: The curated NHANES datasets and the data dictionaries includes 23 .csv files and 1 excel file.The curated NHANES datasets involves 20 .csv formatted files, two for each module with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments."dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES."dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.“dictionary_drug_codes.csv” contains the dictionary for descriptors on the drugs codes.“nhanes_inconsistencies_documentation.xlsx” is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file.“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.“example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.“example_3 - run_multiple_regressions.Rmd” demonstrates how run multiple regression models with and without adjusting for the sampling design.
m
Input files required for R code to analyse data for the PAFA project
figshare.manchester.ac.uk
csv
Updated Nov 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaufa Shareef (2025). Input files required for R code to analyse data for the PAFA project [Dataset]. http://doi.org/10.48420/29340302.v1
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.48420/29340302.v1
Dataset updated
Nov 22, 2025
Dataset provided by
University of Manchester
Authors
Shaufa Shareef
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the input files required for the R code used to analyse data for the Patterns and prevalence of food allergy in adulthood in the UK (PAFA) project. This includes:pafa_data_dictionary_anonymised.csv: The data dictionary describing each column in the anonymised PAFA dataset. "snomed_field_name" lists all column names in the dataset; "field_name_extended" lists the original column name in the REDCap data download, which was then recoded to include SNOMED and FoodEx2 codes for future analyses; "variable_field_name" denotes the corresponding coded field name in the REDCap form; "field_type" denotes the type of REDCap field; "field_label" describes the field name in plain language; "choices_calculations_or_slider_labels" describes the choices provided to the participant for that question.foodex2_codes_with_other.csv: A CSV file with key-value pairs for identifying foods coded in the dataset.
H
Name Dictionaries for "wru" R Package
dataverse.harvard.edu
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evan Rosenman; Santiago Olivella; Kosuke Imai (2022). Name Dictionaries for "wru" R Package [Dataset]. http://doi.org/10.7910/DVN/7TRYAC
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/7TRYAC
Dataset updated
Aug 1, 2022
Dataset provided by
Harvard Dataverse
Authors
Evan Rosenman; Santiago Olivella; Kosuke Imai
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We provide four dictionaries that provide the racial distributions associated with names in the United States. These dictionaries are used by the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations. The probabilities cover five racial categories: White, Black, Hispanic, Asian, and Other. We provide two surname dictionaries. The first provides entries P(race | surname) for about 160K names, derived from the 2010 Census surname list, aggregated with the Census Spanish surname list. The second provides analogous probabilities for 1.48MM surnames. This dictionary is created by starting with the Census-based dictionary and supplementing it with race distributions estimated from the voter files of six Southern states -- Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina -- that collect race data. We also provide dictionaries estimating P(race | first name) and P(race | middle name). These dictionaries -- which contain 1.04MM and 1.16MM names respectively -- are sourced exclusively from the voter files of the six Southern states. References Kabir Khanna, Brandon Bertelsen, Santiago Olivella, Evan Rosenman and Kosuke Imai (2022). wru: Who are You? Bayesian Prediction of Racial Category Using Surname, First Name, Middle Name, and Geolocation. R package version 1.0.0. https://CRAN.R-project.org/package=wru
[Superseded] Intellectual Property Government Open Data 2019
researchdata.edu.au
data.gov.au
Updated Jun 6, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IP Australia (2019). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://researchdata.edu.au/superseded-intellectual-property-data-2019/2994670
Explore at:
Dataset updated
Jun 6, 2019
Dataset provided by
Data.govhttps://data.gov/
Authors
IP Australia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
What is IPGOD?\r

The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.\r \r \r

How do I use IPGOD?\r

IPGOD is large, with millions of data points across up to 40 tables, making them too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scalar.\r \r \r

IP Data Platform\r

IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform\r \r

References\r

\r The following pages can help you gain the understanding of the intellectual property administration and processes in Australia to help your analysis on the dataset.\r \r * Patents\r * Trade Marks\r * Designs\r * Plant Breeder’s Rights\r \r \r

Updates\r

\r

Tables and columns\r

\r Due to the changes in our systems, some tables have been affected.\r \r * We have added IPGOD 225 and IPGOD 325 to the dataset!\r * The IPGOD 206 table is not available this year.\r * Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.\r \r

Data quality improvements\r

\r Data quality has been improved across all tables.\r \r * Null values are simply empty rather than '31/12/9999'.\r * All date columns are now in ISO format 'yyyy-mm-dd'.\r * All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.\r * All tables are encoded in UTF-8.\r * All tables use the backslash \ as the escape character.\r * The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
FTICR-MS Data from Multi-continent River Water and Sediment and from Coastal...
osti.gov
knb.ecoinformatics.org
+2more
Updated Dec 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. DOE > Office of Science > Early Career Research Program (2021). FTICR-MS Data from Multi-continent River Water and Sediment and from Coastal River Fresh and Saline Sediment Associated with: “Dissolved Organic Matter Functional Trait Relationships are Conserved Across Rivers” [Dataset]. http://doi.org/10.15485/1824222
Explore at:
Unique identifier
https://doi.org/10.15485/1824222
Dataset updated
Dec 31, 2021
Dataset provided by
Office of Sciencehttp://www.er.doe.gov/
Department of Energy Biological and Environmental Research Program
Environmental System Science Data Infrastructure for a Virtual Ecosystem
Early Career Research Program: Watershed Perturbation-Response Traits Derived Through Ecological Theory
Description
This data package is associated with the publication “Dissolved Organic Matter Functional Trait Relationships are Conserved Across Rivers” submitted to PNAS (Stegen et al., 2023). The study aims to understand large-scale spatial structure of the dissolved organic matter (DOM) thermodynamic traits and inter-trait relationships by investigating (1) river water and sediments collected along 97 rivers spanning 3 continents and (2) coastal sediment collected from fresh and saline locations in Pacific and Gulf/Atlantic rivers. Sediment extracts and water samples were analyzed using ultrahigh resolution Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS). This dataset is comprised of three folders (1) Coastal, (2) WHONDR_S19S, and (3) Data_Dictionaries. Coastal contains (1) a subfolder with processed FTICR-MS data as csv files and sample collection metadata, (2) a subfolder with R scripts used to process the data and create associated figures, (3) a subfolder with the raw, unprocessed FTICR-MS data as .xml files, and (4) a readme file with more information about the dataset and instructions for using Formularity (https://omics.pnl.gov/software/formularity). WHONDRS_S19S contains (1) a csv file with processed FTICR data, (2) a csv with sample collection metadata, (3) a csv with sample geospatial data, (4) a csv with simulated lambda model outputs, (5) a subfolder with R scripts used to process the data and create associated figures, and (6) a readme file with more information regarding WHONDRS raw FTICR data and processing scripts. Data_Dictionaries contains data dictionaries for each csv file in the data package. The 97 global river corridors were part of a WHONDRS (https://whondrs.pnnl.gov) study. The raw, unprocessed FTICR-MS data with additional data can be found at doi:10.15485/1729719 for sediments and doi:10.15485/1603775 for water. This data package contains the processed data used in the associated manuscript. The coastal data has not been previously published, and this data package contains both the raw and processed data. Version 3 of this data package published February 2023 includes updates to the title of the manuscript, additional data and data dictionary and updated scripts linked to new analysis.
Data from: Bike Sharing Dataset
kaggle.com
zip
Updated Sep 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset/data
Explore at:
zip(22674 bytes)Available download formats
Dataset updated
Sep 10, 2024
Authors
Ram Vishnu R
Description
Problem Statement:

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider BoomBikes has recently suffered considerable dip in their revenue due to the Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.

In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

Which variables are significant in predicting the demand for shared bikes.

How well those variables describe the bike demands

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

Business Goal:

You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

Data Preparation:

You can observe in the dataset that some of the variables like 'weathersit' and 'season' have values as 1, 2, 3, 4 which have specific labels associated with them (as can be seen in the data dictionary). These numeric values associated with the labels may indicate that there is some order to them - which is actually not the case (Check the data dictionary and think why). So, it is advisable to convert such feature values into categorical string values before proceeding with model building. Please refer the data dictionary to get a better understanding of all the independent variables.

You might notice the column 'yr' with two values 0 and 1 indicating the years 2018 and 2019 respectively. At the first instinct, you might think it is a good idea to drop this column as it only has two values so it might not be a value-add to the model. But in reality, since these bike-sharing systems are slowly gaining popularity, the demand for these bikes is increasing every year proving that the column 'yr' might be a good variable for prediction. So think twice before dropping it.

Model Building:

In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number casual users who have made a rental. The variable 'registered' on the other hand shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.

Model Evaluation:

When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set. python from sklearn.metrics import r2_score r2_score(y_test, y_pred) - where y_test is the test data set for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set. - Please perform this step as the R-squared score on the test set holds as a benchmark for your model.
Dataset for: Research data management in academic institutions: a scoping...
zenodo.org
data.niaid.nih.gov
+1more
csv, pdf
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
L Perrier; L Perrier; E Blondal; E Blondal; A. P Ayala; A. P Ayala; D Dearborn; D Dearborn; T Kenny; T Kenny; D Lightfoot; D Lightfoot; R Reka; M Thuna; M Thuna; L Trimble; H MacDonald; H MacDonald; R Reka; L Trimble (2024). Dataset for: Research data management in academic institutions: a scoping review [Dataset]. http://doi.org/10.5281/zenodo.557043
Explore at:
csv, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.557043
Dataset updated
Aug 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
L Perrier; L Perrier; E Blondal; E Blondal; A. P Ayala; A. P Ayala; D Dearborn; D Dearborn; T Kenny; T Kenny; D Lightfoot; D Lightfoot; R Reka; M Thuna; M Thuna; L Trimble; H MacDonald; H MacDonald; R Reka; L Trimble
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

This dataset contains the raw data for the manuscript:
Perrier L, Blondal E, Ayala AP, Dearborn D, Kenny T, Lightfoot D, Reka R, Thuna M, Trimble L, MacDonald H. Research data management in academic institutions: A scoping review. PLOS One. 2017 May 23;12(5):e0178261. doi: 10.1371/journal.pone.0178261.

Full-text available at: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0178261

Data and Documentation Files

Five files make up the dataset:

Data Dictionary: RDMScopingReview_DataDictionary.pdf

Data Abstraction Sheet: RDMScopingReview_StudyCharacteristics.csv

Data Abstraction Sheet: RDMScopingReview_Setting.csv

Data Abstraction Sheet: RDMScopingReview_DataCollectionTools.csv

Data Abstraction Sheet: RDMScopingReview_Outcomes.csv

Contact: Laure Perrier: orcid.org/0000-0001-9941-7129
Z
Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats...
data.niaid.nih.gov
zenodo.org
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mast, Austin R.; Paul, Deborah L.; Rios, Nelson; Bruhn, Robert; Dalton, Trevor; Krimmel, Erica R.; Pearson, Katelin D.; Sherman, Aja; Shorthouse, David P.; Simmons, Nancy B.; Soltis, Pam; Upham, Nathan; Abibou, Djihbrihou (2024). Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3974999
Explore at:
Dataset updated
Jul 18, 2024
Dataset provided by
Florida State University
American Museum of Natural History
Agriculture and Agri-Food Canada
University of Florida
Yale University Peabody Museum of Natural History
Arizona State University
Authors
Mast, Austin R.; Paul, Deborah L.; Rios, Nelson; Bruhn, Robert; Dalton, Trevor; Krimmel, Erica R.; Pearson, Katelin D.; Sherman, Aja; Shorthouse, David P.; Simmons, Nancy B.; Soltis, Pam; Upham, Nathan; Abibou, Djihbrihou
License
https://creativecommons.org/licenses/publicdomain/https://creativecommons.org/licenses/publicdomain/
Area covered
World
Description
This repository is associated with NSF DBI 2033973, RAPID Grant: Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2033973). Specifically, this repository contains (1) raw data from iDigBio (http://portal.idigbio.org) and GBIF (https://www.gbif.org), (2) R code for reproducible data wrangling and improvement, (3) protocols associated with data enhancements, and (4) enhanced versions of the dataset published at various project milestones. Additional code associated with this grant can be found in the BIOSPEX repository (https://github.com/iDigBio/Biospex). Long-term data management of the enhanced specimen data created by this project is expected to be accomplished by the natural history collections curating the physical specimens, a list of which can be found in this Zenodo resource.

Grant abstract: "The award to Florida State University will support research contributing to the development of georeferenced, vetted, and versioned data products of the world's specimens of horseshoe bats and their relatives for use by researchers studying the origins and spread of SARS-like coronaviruses, including the causative agent of COVID-19. Horseshoe bats and other closely related species are reported to be reservoirs of several SARS-like coronaviruses. Species of these bats are primarily distributed in regions where these viruses have been introduced to populations of humans. Currently, data associated with specimens of these bats are housed in natural history collections that are widely distributed both nationally and globally. Additionally, information tying these specimens to localities are mostly vague, or in many instances missing. This decreases the utility of the specimens for understanding the source, emergence, and distribution of SARS-COV-2 and similar viruses. This project will provide quality georeferenced data products through the consolidation of ancillary information linked to each bat specimen, using the extended specimen model. The resulting product will serve as a model of how data in biodiversity collections might be used to address emerging diseases of zoonotic origin. Results from the project will be disseminated widely in opensource journals, at scientific meetings, and via websites associated with the participating organizations and institutions. Support of this project provides a quality resource optimized to inform research relevant to improving our understanding of the biology and spread of SARS-CoV-2. The overall objectives are to deliver versioned data products, in formats used by the wider research and biodiversity collections communities, through an open-access repository; project protocols and code via GitHub and described in a peer-reviewed paper, and; sustained engagement with biodiversity collections throughout the project for reintegration of improved data into their local specimen data management systems improving long-term curation.

This RAPID award will produce and deliver a georeferenced, vetted and consolidated data product for horseshoe bats and related species to facilitate understanding of the sources, distribution, and spread of SARS-CoV-2 and related viruses, a timely response to the ongoing global pandemic caused by SARS-CoV-2 and an important contribution to the global effort to consolidate and provide quality data that are relevant to understanding emergent and other properties the current pandemic. This RAPID award is made by the Division of Biological Infrastructure (DBI) using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."

Files included in this resource

9d4b9069-48c4-4212-90d8-4dd6f4b7f2a5.zip: Raw data from iDigBio, DwC-A format

0067804-200613084148143.zip: Raw data from GBIF, DwC-A format

0067806-200613084148143.zip: Raw data from GBIF, DwC-A format

1623690110.zip: Full export of this project's data (enhanced and raw) from BIOSPEX, CSV format

bionomia-datasets-attributions.zip: Directory containing 103 Frictionless Data packages for datasets that have attributions made containing Rhinolophids or Hipposiderids, each package also containing a CSV file for mismatches in person date of birth/death and specimen eventDate. File bionomia-datasets-attributions-key_2021-02-25.csv included in this directory provides a key between dataset identifier (how the Frictionless Data package files are named) and dataset name.

bionomia-problem-dates-all-datasets_2021-02-25.csv: List of 21 Hipposiderid or Rhinolophid records whose eventDate or dateIdentified mismatches a wikidata recipient’s date of birth or death across all datasets.

flagEventDate.txt: file containing term definition to reference in DwC-A

flagExclude.txt: file containing term definition to reference in DwC-A

flagGeoreference.txt: file containing term definition to reference in DwC-A

flagTaxonomy.txt: file containing term definition to reference in DwC-A

georeferencedByID.txt: file containing term definition to reference in DwC-A

identifiedByNames.txt: file containing term definition to reference in DwC-A

instructions-to-get-people-data-from-bionomia-via-datasetKey: instructions given to data providers

RAPID-code_collection-date.R: code associated with enhancing collection dates

RAPID-code_compile-deduplicate.R: code associated with compiling and deduplicating raw data

RAPID-code_external-linkages-bold.R: code associated with enhancing external linkages

RAPID-code_external-linkages-genbank.R: code associated with enhancing external linkages

RAPID-code_external-linkages-standardize.R: code associated with enhancing external linkages

RAPID-code_people.R: code associated with enhancing data about people

RAPID-code_standardize-country.R: code associated with standardizing country data

RAPID-data-dictionary.pdf: metadata about terms included in this project’s data, in PDF format

RAPID-data-dictionary.xlsx: metadata about terms included in this project’s data, in spreadsheet format

rapid-data-providers_2021-05-03.csv: list of data providers and number of records provided to rapid-joined-records_country-cleanup_2020-09-23.csv

rapid-final-data-product_2021-06-29.zip: Enhanced data from BIOSPEX, DwC-A format

rapid-final-gazetteer.zip: Gazetteer providing georeference data and metadata for 10,341 localities assessed as part of this project

rapid-joined-records_country-cleanup_2020-09-23.csv: data product initial version where raw data has been compiled and deduplicated, and country data has been standardized

RAPID-protocol_collection-date.pdf: protocol associated with enhancing collection dates

RAPID-protocol_compile-deduplicate.pdf: protocol associated with compiling and deduplicating raw data

RAPID-protocol_external-linkages.pdf: protocol associated with enhancing external linkages

RAPID-protocol_georeference.pdf: protocol associated with georeferencing

RAPID-protocol_people.pdf: protocol associated with enhancing data about people

RAPID-protocol_standardize-country.pdf: protocol associated with standardizing country data

RAPID-protocol_taxonomic-names.pdf: protocol associated with enhancing taxonomic name data

RAPIDAgentStrings1_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol

recordedByNames.txt: file containing term definition to reference in DwC-A

Rhinolophid-HipposideridAgentStrings_and_People2_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol

wikidata-notes-for-bat-collectors_leachman_2020: please see https://zenodo.org/record/4724139 for this resource
e
Aesthetic value of the Great Barrier Reef 2017 - data collection (NESP TWQ...
catalogue.eatlas.org.au
researchdata.edu.au
Updated Mar 6, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Griffith Institute for Tourism, Griffith University (2019). Aesthetic value of the Great Barrier Reef 2017 - data collection (NESP TWQ 3.2.3, Griffith Institute for Tourism Research) [Dataset]. https://catalogue.eatlas.org.au/geonetwork/srv/api/records/b6d7dd10-da27-43da-a923-353d47782854
Explore at:
www:link-1.0-http--related, www:link-1.0-http--downloaddataAvailable download formats
Dataset updated
Mar 6, 2019
Dataset provided by
Griffith Institute for Tourism, Griffith University
Time period covered
Jan 1, 1987 - Nov 30, 2013
Area covered
Great Barrier Reef
Description
This dataset resulted from two inter-linked research streams. The first stream was related to the application of eye-tracking technology and an online survey in studying natural beauty. The second stream is related to the development of an Artificial Intelligence (AI)-based system recognising and assessing the beauty of natural scenes. Due to differences in data collection and data analysis, details of research methods used for each research stream are described in three separated data records.

This record that describes the common elements and goals of the three parts to the research.

Within these research streams, three datasets were developed. These include: Eye tracking - containing the outcome documents of eye-tracking experiment conducted within the project framework Online survey - includes a survey format document and three subfolders showing how each section of the survey was designed and outcome of each section (i.e. conjoint analysis, picture rating and open question) Algorithm data reflecting how a computer-based system for automated assessment of image attractiveness is developed

Format and methods:

The project dataset has multiple parts containing data of different formats and methods. Details of each dataset are discussed in the corresponding data records.

Data Dictionary:

See data dictionaries in the following data report forms: Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Eye-tracking data report form, Griffith Institute for Tourism Research Report No 15. Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Online survey data report form, Griffith Institute for Tourism Research Report No 15. Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Algorithm data report form, Griffith Institute for Tourism Research Report No 15.

References: Further information can be found in the following publication:

Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Monitoring aesthetic value of the Great Barrier Reef by using innovative technologies and artificial intelligence, Griffith Institute for Tourism Research Report No 15. Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Eye-tracking data report form, Griffith Institute for Tourism Research Report No 15. Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Online survey data report form, Griffith Institute for Tourism Research Report No 15. Becken, S., Connolly R., Stantic B., Scott N., Mandal R., Le D., (2018), Algorithm data report form, Griffith Institute for Tourism Research Report No 15.

Data Location:

This dataset is filed in the eAtlas enduring data repository at: data esp3\3.2.3_Aesthetic-Values-GBR
TM-Link
data.gov.au
csv, zip
Updated Aug 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IP Australia (2020). TM-Link [Dataset]. https://data.gov.au/data/dataset/activity/tm-link
Explore at:
csv(24978728), zip(488855910), csv(1142331453), csv(1235408729), csv(390258394), csv(301798419), csv(394684372), csv(1534407171), csv(801899603)Available download formats
Dataset updated
Aug 12, 2020
Dataset provided by
IP Australiahttp://ipaustralia.gov.au/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Dictionary available here

2020 August UPDATE

Fixed some quality issues with dates in the application table.

Fixed some quality issues with names in the owner table.

2020 March UPDATE

The TM-Link dataset has been updated to include more recent data and also has expanded to include more information. As such, the structure of the dataset has also changed.

TM-Link is an international dataset IP Australia and Swinburne University have developed in collaboration. The dataset provides information from various jurisdictions, modelled under a common schema for greater accessibility to researchers and analysts. TM-Link also links together similar trade marks from different countries based on common information, such as similar trade mark phrases and applicant names. These links identify families of international trade marks, which provide a new and unique insight into international branding trends and export behaviours. IP Australia and Swinburne University are looking to continually develop TM-Link to become a core part of the global IP data landscape. If you have any suggestions or requests to model any additional data points, or improve the current accuracy of the data please let us know via email to ipdataplatform@ipaustralia.gov.au.

For more information on the linking algorithm, please see:

Petrie S, Kollmann T, Codoreanu A, Thomson R & Webster E (2019); International Trademarking and Regional Export Performance. Available at SSRN: https://ssrn.com/abstract=3445244

For more information on TM-Link data collection and descriptive analyses, please see:

Petrie S, Adams M, Mitra‐Kahn B, Johnson M, Thomson R, Jensen PH, Palangkaraya A, & Webster EM (2019); TM-Link: An Internationally Linked Trade Mark Database. Australian Economic Review, Forthcoming. Available at SSRN: https://ssrn.com/abstract=3511526
d
Data from: Decision-Support Framework for Linking Regional-Scale Management...
catalog.data.gov
data.usgs.gov
Updated Nov 20, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species [Dataset]. https://catalog.data.gov/dataset/decision-support-framework-for-linking-regional-scale-management-actions-to-continental-sc
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data release presents the data, JAGS models, and R code used to manipulate data and to produce results and figures presented in the USGS Open File Report, "Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species, (https://doi.org/10.5066/P93YTR3X). The zip folder is provided so that other can reproduce results from the integrated population model, inspect model structure and posterior simulations, conduct analyses not present in the report, and use and modify the code. Raw source data can be sourced from the USGS Bird Banding Laboratory, USFWS Surveys and Monitoring Branch, National Oceanic and Atmospheric administration, and Ducks Unlimited Canada. The zip file contains the following objects when extracted: Readme.txt: A plain text file describing each file in this directory. Figures-Pintail-IPM.r: R code that generates report figures in png, pdf, and eps format. Generates Figures 2-11 and calls source code for figures 12 and 13 found in other files. * get pintail IPM data.r: R source code that must be run to format data for the IPM code file. * getbandrecovs.r: R code that takes Bird Banding Lab data for pintail band releases and recoveries and formats for analysis. This file is called by 'get pintail IPM data.r'. File was originally written by Scott Boomer (USFWS) and modified by Erik Osnas for use for the IPM. * Model_1_post.txt: Text representation of the posterior simulations from Model 1. This file can be read by the R function dget() to produce an R list object that contain posterior draws from Model 1. The list is the BUGSoutput$sims.list object from a call to rjags::jags. * Model_2_post.txt: As above but for Model 2. * Model_S1_post.txt: As above but for Model S1. * Pintail IPM.r: This is the main file that defines the IPM models in JAGS, structures the data for JAGS, defines initial values, and calls runs the models. Outputs are text files that contains JAGS model files, R work spaces that contains all data models, and results, include the output from the jags() function. From this the BUGSoutput$sims.list object was written to text for each model. * MSY_metrics.txt: Summary of results produced from running code in source_figure_12.R. This table is a text representation of a summary of the maximum sustained yield analysis at various mean rainfall levels, used for Table 1 of report and can be reproduced by running the code in source_figure_12.R. To understand the structure of this file, you must consult the code file and understand the structure of the R objects created from that code. Otherwise, consult Figure 12 and Table 1 in report. * source_figure_12.R: R code to produce Figure 12. Code is written to work with Rworkspace output from Model 1, but can be modified to use the Model_1_post.txt file without re-running the model. This would allow use of the same posterior realizations as used in the report. * source_figure_13.R: This is the code used to product the results for Figure 13. Required here is the posterior from Model 1 and data for the Prairie Parkland Model based on Jim Devries/Ducks Unlimited data. These are described in the report text. * Data: A directory that contains the raw data used for this report. * Data/2015_LCC_Networks_shapefile: A directory that contain ESRI shapefiles for used in Figure 1 and to define the boundaries of the Landscape Conservation Cooperatives. Found at (https://www.sciencebase.gov/catalog/item/55b943ade4b09a3b01b65d78) * Data/bndg_1430_yr1960up_DBISC_03042014.csv: A comma delimited file for banded pintail from 1960 to 2014. Obtained from the USGS Bird Banding Lab. This file is used by 'getbandrecovs.r' to produce and 'm-array' used in the Integrated Population Model (IPM). A data dictionary describing the codes for each field can be found here, https://www.pwrc.usgs.gov/BBL/manual/summary.cfm * Data/cponds.csv: A comma delimited file of estimated Canadian ponds based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given is the year, point estimate, and estimated standard error. * Data/enc_1430_yr1960up_DBISC_03042014.csv: A comma delimited file for encounters of banded pintail. Obtained from the USGS Bird Banding Lab. This file is use by 'getbandrecovs.r' to produce and 'm-array' used in the Integrated Population Model (IPM). A data dictionary describing the codes for each field can be found here, (https://www.pwrc.usgs.gov/BBL/manual/enc.cfm) * Data/nopiBPOP19552014.csv: A comma delimited file of estimated northern pintail based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given is the year, pintail point estimate (bpop), and pintail estimated standard error (bpopSE), mean latitude of the pintail population (lat), latitude variance of the pintail population (latVAR), mean longitude of the pintail population (lon), and the variance in longitude of the pintail population (lonVAR). * Data/Summary Climate Data California CV 2.csv: Rainfall data for the California central valley downloaded from National Climate Data Center (www.ncdc.noaa.gov/cdo-web/) as described in report text (https://doi.org/10.5066/P93YTR3X) and publication found at https://doi.org/10.1002/jwmg.21124 . Used in 'get pintail IPM data.r' for IPM. * Data/Summary data MAV.csv: Rainfall data for the Mississippi Aluvial valley downloaded from National Climate Data Center (www.ncdc.noaa.gov/cdo-web/) as described in report text (https://doi.org/10.5066/P93YTR3X) and publication found at https://doi.org/10.1002/jwmg.21124 . Used in 'get pintail IPM data.r' for IPM. * Data/Wing data 1961 2011 NOPI.txt: Comma delimited text file for pintail wing age data for 1961 to 2011 from the Parts Collection Survey. Each row is an individual wing with sex cohorts 4 = male, 5 = female and age cohorts 1 = After Hatch Year and 2 = Hatch Year. Wt is a weighting factor that determines how many harvested pintails this wing represent. See USFWS documentation for the Part Collection survey for descriptions. Summing Wt for each age, sex, and year gives an estimate of the number of pintail harvested. Used in 'get pintail IPM data.r' for IPM. * Data/Wing data 2012 2013 NOPI.csv: Same as 'Wing data 1961 2011 NOPI.txt' but for years 2012 and 2013.

r/UofT Student Community Dataset (2011-2025)

kaggle.com

zip

Updated Jul 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

BENpen1110 (2025). r/UofT Student Community Dataset (2011-2025) [Dataset]. https://www.kaggle.com/datasets/benpen1110/ruoft-student-community-dataset-2011-2025

Explore at:

zip(11848083 bytes)Available download formats

Dataset updated

Jul 19, 2025

Authors

BENpen1110

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

A comprehensive, 14-year longitudinal dataset of posts and comments from the r/UofT subreddit, professionally cleaned and enriched for NLP, social science research, and AI applications.

📌 Overview

This dataset provides a deep and structured view into the academic, social, and cultural life at the University of Toronto, as captured through its largest online community, the r/UofT subreddit. Spanning over a decade, it contains more than 6,000 posts and their corresponding 95,000+ comments, offering an unparalleled resource for understanding student experiences, concerns, and trends over time.

The data has been meticulously collected, cleaned, and enriched with calculated quality metrics, making it a powerful, ready-to-use resource for a wide range of data analysis and machine learning tasks.

✨ Key Features & Highlights

Longitudinal Scope: With data from 2011 to 2025, this dataset is perfect for analyzing long-term trends, seasonal patterns (like exam stress), and the impact of major events (e.g., the COVID-19 pandemic).
Rich, Structured Comment Data: Unlike many Reddit datasets, this one includes a complete, structured JSON array of all comment threads for each post. This preserves the nested reply structure, scores, and author information for deep conversational analysis.
Professionally Cleaned Text: All primary text fields (title, content, and comments_json.body) have undergone a rigorous cleaning pipeline, including HTML decoding, placeholder removal, and whitespace normalization, making them immediately suitable for NLP tasks.
Enriched with Quality & Engagement Metrics: Each post is augmented with calculated scores for title, content, and engagement quality. This allows for easy filtering and curation of the dataset, such as isolating only "high-quality" posts or posts with significant discussion.
Dual Text Versions for Comments: The comments_json for each comment provides both a body (fully cleaned text) and a body_original (unmodified raw text), offering maximum flexibility for different research needs.

🧠 Potential Use Cases

This dataset is a goldmine for various projects:

🧠 Natural Language Processing (NLP)
- Sentiment Analysis: Track student sentiment towards courses, policies, or campus life over time.
- Topic Modeling (LDA, BERTopic): Discover and analyze recurring themes such as "mental health," "housing," "CS POSt," and "career advice."
- Text Summarization: Build models to summarize long discussion threads.
- Conversational AI: Analyze reply chains and social interaction patterns.
🎓 Educational & Social Research
- Student Mental Health: Identify keywords and trends related to stress, anxiety, and burnout.
- Course & Professor Analysis: Aggregate student opinions and reviews on specific courses and faculty.
- Social Dynamics: Study how students seek help, form communities, and discuss social issues.
🤖 AI & Chatbot Development
- RAG / FAQ Chatbot: Create a knowledgeable bot that can answer prospective and current students' questions based on years of real discussions.
- Course Recommender System: Build a system that suggests courses based on student discussions and interests.

🗂️ File Structure

Based on your upload screenshot, the dataset is provided in two files:

uoft_reddit_dataset_[...].csv: The primary data file containing all 6,000+ posts and their metadata. Each row represents a single post.
uoft_reddit_dataset_[...]_summary.json: A summary file, likely containing aggregated statistics, key topics, or other high-level insights about the dataset.

📋 Data Dictionary (Schema Summary)

This is a summary of the key columns. Please refer to the full, detailed Data Dictionary provided separately for a complete definition of all fields, formulas, and cleaning methodologies.

Column	Type	Description
`post_id`	String	Unique Reddit post identifier.
`title`	String	Fully cleaned post title.
`content`	String	Fully cleaned post body text.
`author`	String	Reddit username of the post's author.
`created_date`	String	Human-readable creation date (UTC).
`score`	Integer	Net score (upvotes - downvotes) from Reddit.
`num_comments`	Integer	Total comment count reported by Reddit.
`comments_json`	String (JSON)	A complete, structured JSON array of the comment tree.
`quality_overall`	Float	A composite quality score (0-10) based on title, content, and engagement.
`quality_recommendation`	String	A quality category (`high_quality`, `medium_quality`, etc.).
`engagement_score`	Integer	A calculated metric for post engagement.
`url`	String	The direct, full URL to the Reddit post.

Methodol...

Facebook

Twitter

Click to copy link

Link copied

Cite

U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation

Meta data and supporting documentation

Explore at:

Dataset updated

Nov 12, 2020

Dataset provided by

United States Environmental Protection Agencyhttp://www.epa.gov/

Description

We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available on line here: https://github.com/warrenjl/SpGPCW. Format: Abstract The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. File format: R workspace file. Metadata (including data dictionary) • y: Vector of binary responses (1: preterm birth, 0: control) • x: Matrix of covariates; one row for each simulated individual • z: Matrix of standardized pollution exposures • n: Number of simulated individuals • m: Number of exposure time periods (e.g., weeks of pregnancy) • p: Number of columns in the covariate design matrix • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate). This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).

Clear search

Close search

Google apps

Main menu

Meta data and supporting documentation

Data from: Generalizable EHR-R-REDCap pipeline for a national...

LScD (Leicester Scientific Dictionary)

Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road...

Bitter Creek Analysis Pedigree Data

LScDC Word-Category RIG Matrix

Simulation Data Set

Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022

Cleaned NHANES 1988-2018

Input files required for R code to analyse data for the PAFA project

Name Dictionaries for "wru" R Package

[Superseded] Intellectual Property Government Open Data 2019

What is IPGOD?\r

How do I use IPGOD?\r

IP Data Platform\r

References\r

Updates\r

Tables and columns\r

Data quality improvements\r

FTICR-MS Data from Multi-continent River Water and Sediment and from Coastal...

Data from: Bike Sharing Dataset

Problem Statement:

Business Goal:

Data Preparation:

Model Building:

Model Evaluation:

Dataset for: Research data management in academic institutions: a scoping...

Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats...

Aesthetic value of the Great Barrier Reef 2017 - data collection (NESP TWQ...

TM-Link

Data from: Decision-Support Framework for Linking Regional-Scale Management...

r/UofT Student Community Dataset (2011-2025)

A comprehensive, 14-year longitudinal dataset of posts and comments from the r/UofT subreddit, professionally cleaned and enriched for NLP, social science research, and AI applications.

📌 Overview

✨ Key Features & Highlights

🧠 Potential Use Cases

🗂️ File Structure

📋 Data Dictionary (Schema Summary)

Methodol...

Meta data and supporting documentation