87 datasets found

B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
f
Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s004
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
f
Data_Sheet_3_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_3_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s003
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s003
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
q
Writing Clean Code in R Workshop
qubeshub.org
Updated Oct 15, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
Explore at:
Dataset updated
Oct 15, 2019
Dataset provided by
QUBES
Authors
Max Joseph; Leah Wasser
Description
When working with data, you often spend the most amount of time cleaning your data. Learn how to write more efficient code using the tidyverse in R.
SARS-CoV-2 Surface Cleaning Dataset
catalog.data.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
+1more
Updated Aug 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2024). SARS-CoV-2 Surface Cleaning Dataset [Dataset]. https://catalog.data.gov/dataset/sars-cov-2-surface-cleaning-dataset
Explore at:
Dataset updated
Aug 31, 2024
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
Cleaning efficacy study, for surfaces contaminated with SARS-Co-2. This dataset is associated with the following publication: Nelson, S., R. Hardison, R. Limmer, J. Marx, B.M. Taylor, R. James, M. Stewart, S. Lee, W. Calfee, S. Ryan, and M. Howard. Efficacy of Detergent-Based Cleaning and Wiping against SARS-CoV-2 on High Touch Surfaces. Letters in Applied Microbiology. Blackwell Publishing, Malden, MA, USA, 76(3): ovad033, (2023).
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Z
Data cleaning and analysis for the Master's thesis: DIFFERENCES IN CONSUMER...
data.niaid.nih.gov
zenodo.org
+1more
Updated Aug 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Burnard, Michael (2020). Data cleaning and analysis for the Master's thesis: DIFFERENCES IN CONSUMER PREFERENCES FOR UNWEATHERED AND WEATHERED WOOD [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3981176
Explore at:
Dataset updated
Aug 13, 2020
Dataset provided by
Burnard, Michael
Remesova, Hana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data and analytical support the Master's thesis submitted by Hana Remesova at the University of Primorska Faculty of Mathematics, Natural Sciences, and Information Technologies. The .csv files are data files, the .Rmd file is an R markdown which can be run. The product of knitting the .Rmd file is the .html.
f
Data and tools for studying isograms
figshare.com
Updated Jul 31, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5245810.v1
Dataset updated
Jul 31, 2017
Dataset provided by
figshare
Authors
Florian Breit
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).Below follows a brief description, first, of the included datasets and, second, of the included scripts.1. DatasetsThe data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.1.1 CSV formatThe CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.The CSV files contain one row per data point, with the colums separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

Label Data type Description

isogramy int The order of isogramy, e.g. "2" is a second order isogram

length int The length of the word in letters

word text The actual word/isogram in ASCII

source_pos text The Part of Speech tag from the original corpus

count int Token count (total number of occurences)

vol_count int Volume count (number of different sources which contain the word)

count_per_million int Token count per million words

vol_count_as_percent int Volume count as percentage of the total number of volumes

is_palindrome bool Whether the word is a palindrome (1) or not (0)

is_tautonym bool Whether the word is a tautonym (1) or not (0)

The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

Label

Data type

Description

!total_1grams

int

The total number of words in the corpus

!total_volumes

int

The total number of volumes (individual sources) in the corpus

!total_isograms

int

The total number of isograms found in the corpus (before compacting)

!total_palindromes

int

How many of the isograms found are palindromes

!total_tautonyms

int

How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.1.2 SQLite database formatOn the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:• Compacted versions of each dataset, where identical headwords are combined into a single entry.• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.The intersected dataset is by far the least noisy, but is missing some real isograms, too.The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.To get an idea of the various ways the database can be queried for various bits of data see the R script described below, which computes statistics based on the SQLite database.2. ScriptsThere are three scripts: one for tiding Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).2.1 Source dataThe scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html], (download all.al.gz).For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.2.2 Data preparationBefore processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.Tidying and reformatting can be done by running one of the following commands:python isograms.py --ngrams --indir=INDIR --outfile=OUTFILEpython isograms.py --bnc --indir=INFILE --outfile=OUTFILEReplace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.2.3 Isogram ExtractionAfter preparing the data as above, isograms can be extracted from by running the following command on the reformatted and tidied files:python isograms.py --batch --infile=INFILE --outfile=OUTFILEHere INFILE should refer the the output from the previosu data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.2.4 Creating a SQLite3 databaseThe output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them, if you only want to load one, just create an empty file for the other one).2. Copy the “create-database.sql” script into the same directory as the two data files.3. On the command line, go to the directory where the files and the SQL script are. 4. Type: sqlite3 isograms.db 5. This will create a database called “isograms.db”.See the section 1 for a basic descript of the output data and how to work with the database.2.5 Statistical processingThe repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
f
Dataset for a globally synthesised and flagged bee occurrence dataset and...
open.flinders.edu.au
datasetcatalog.nlm.nih.gov
+1more
txt
Updated Jun 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
James Dorey; Erica E. Fischer; Paige R. Chesshire; Angela Nava-Bolaños; Robert L O'Reilly; Silas Bossert; Shannon M. Collins; Elinor M. Lichtenberg; Tucker, Erika M.; Allan Smith-Pardo; Armando Falcón-Brindis; Diego A. Guevara; Bruno Ribeiro; Diego de Pedro; Keng-Lou James Hung; Katherine A. Parys; Lindsie M. McCabe; Matthew S. Rogan; Robert L. Minckley; Santiago José Elías Velazco; Terry Griswold; Tracy A. Zarrillo; Walter Jetz; Yanina V. Sica; Michael Christopher Orr.; Laura Melissa Guzman; John S. Ascher; Alice Hughes; Neil S. Cobb (2024). Dataset for a globally synthesised and flagged bee occurrence dataset and cleaning workflow [Dataset]. http://doi.org/10.25451/flinders.21709757.v7
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.25451/flinders.21709757.v7
Dataset updated
Jun 17, 2024
Dataset provided by
Flinders University
Authors
James Dorey; Erica E. Fischer; Paige R. Chesshire; Angela Nava-Bolaños; Robert L O'Reilly; Silas Bossert; Shannon M. Collins; Elinor M. Lichtenberg; Tucker, Erika M.; Allan Smith-Pardo; Armando Falcón-Brindis; Diego A. Guevara; Bruno Ribeiro; Diego de Pedro; Keng-Lou James Hung; Katherine A. Parys; Lindsie M. McCabe; Matthew S. Rogan; Robert L. Minckley; Santiago José Elías Velazco; Terry Griswold; Tracy A. Zarrillo; Walter Jetz; Yanina V. Sica; Michael Christopher Orr.; Laura Melissa Guzman; John S. Ascher; Alice Hughes; Neil S. Cobb
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDCR-workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.
d
Data from: Designing data science workshops for data-intensive environmental...
datadryad.org
datasetcatalog.nlm.nih.gov
+1more
zip
Updated Dec 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
Dryad
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
Time period covered
Nov 14, 2020
Description
Surveys from Carpentries style workshops the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, r...
H
Replication Data for: A more efficient approach to converting ASCII files...
dataverse.harvard.edu
Updated Dec 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Harris (2021). Replication Data for: A more efficient approach to converting ASCII files and cleaning data in R with the speedycode package [Dataset]. http://doi.org/10.7910/DVN/X7UKRL
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/X7UKRL
Dataset updated
Dec 1, 2021
Dataset provided by
Harvard Dataverse
Authors
Jacob Harris
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Replication data for working paper: A more efficient approach to converting ASCII files and cleaning data in R with the speedycode package
m
Data from: Datasets for lot sizing and scheduling problems in the...
data.mendeley.com
narcis.nl
Updated Jan 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
Explore at:
Unique identifier
https://doi.org/10.17632/j2x3gbskfw.1
Dataset updated
Jan 19, 2021
Authors
Juan Piñeros
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D. , Morabito, R. , Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D. , Morabito, R. , Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. In this sense, some papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature in order to verify the accuracy and performance of proposed methods or to be a benchmark for other researchers considering this problem. The authors have been using small data sets that do not satisfactorily represent different scenarios of production. Since the demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the proposed methods in the scientific literature in solving real scenarios of the problem. The datasets presented here include data based on real data collected from five beverage companies. We presented four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These dataset is supplementary data for the submitted paper to Data in Brief [4]. [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. Doi: 10.1016/j.compchemeng.2020.107038. [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. Doi: 10.1007/s10696-017-9303-9. [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20. [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).
d
Data from: Who shares? Who doesn’t? Factors associated with openly archiving...
datadryad.org
data.niaid.nih.gov
zip
Updated May 26, 2011
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heather A. Piwowar (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data [Dataset]. http://doi.org/10.5061/dryad.mf1sd
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.mf1sd
Dataset updated
May 26, 2011
Dataset provided by
Dryad
Authors
Heather A. Piwowar
Time period covered
May 26, 2011
Description
Microarray publications and publication attributes157 columns of attributes for 11603 publications identified as creating gene expression microarray data. Tab delimited. Key: PubMed identifier (pmid). See stats.R for data cleaning steps and more details on variables. Data collected in January 2010 using code available at http://github.com/hpiwowar/pypubrawdata.txtJournal policy details for microarray dataData sharing policy details for journals that publish a lot of gene expression microarray data. Policy links, excerpts, and classifications (24 columns) for 156 journals. Some of these classifications are included as columns in rawdata.txt as journal policy attributes.journal_policies_microarray_data.csvStatistical analysis R scriptR script for data cleaning, statistical analysis, and graphics as presented in the paper. Takes rawdata.txt as input and loads helper_functions.R source.stats.RHelper R script functionsHelper functions loaded by stats.R for analysis and graphical output...
r
exampleSoilDataCleaning
resodate.org
Updated Jul 20, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jens Nieschulze (2020). exampleSoilDataCleaning [Dataset]. http://doi.org/10.25625/LISGLR
Explore at:
Unique identifier
https://doi.org/10.25625/LISGLR
Dataset updated
Jul 20, 2020
Dataset provided by
Georg-August-Universität Göttingen
R sniblets
GRO.data
Authors
Jens Nieschulze
Description
an introductory R scrips show casing the use of regular expressions to cope with common data cleaning of variables containing characters
H
Replication Data for: Race, gender, and the politics of incivility
dataverse.harvard.edu
Updated Jun 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sam Gubitz (2020). Replication Data for: Race, gender, and the politics of incivility [Dataset]. http://doi.org/10.7910/DVN/ODPNI8
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ODPNI8
Dataset updated
Jun 10, 2020
Dataset provided by
Harvard Dataverse
Authors
Sam Gubitz
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Use the project file first, then open the cleaning R file to clean the raw data. Then use the R file called OLS analysis to analyze the cleaned data, which was outputted as a .rds file.
l
LScDC (Leicester Scientific Dictionary-Core)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9896579.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LScDC (Leicester Scientific Dictionary-Core Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of core dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. The files provided with this description are also same as described as for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below. # of wordsLScD (v3) 972,060LScDC (v3) 103,998 * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3 ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2[Version 2] Getting StartedThis file describes a sorted and cleaned list of words from LScD (Leicester Scientific Dictionary), explains steps for sub-setting the LScD and basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the CSV file published. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts. The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, usage of enormous number of text data brings the challenge to the performance and the accuracy of data mining applications. The performance and the accuracy of models are heavily depend on the type of words (such as stop words and content words) and the number of words in the corpus. Rare occurrence of words in a collection is not useful in discriminating texts in large corpora as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.To build the LScDC, we decided the following process on LScD: removing words that appear in no more than 10 documents (
r
Street cleaning in City of Yarra
researchdata.edu.au
Updated Oct 2, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Yarra (2019). Street cleaning in City of Yarra [Dataset]. https://researchdata.edu.au/street-cleaning-city-yarra/2980765
Explore at:
Dataset updated
Oct 2, 2019
Dataset provided by
data.gov.au
Authors
City of Yarra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
Our aim is to make Yarra a clean and pleasant place for our residents to live. This data asset has information about sweeping and loose litter removal across residential roads, kerbs and public open spaces within the Yarra municipality. The street cleansing details include cleaning date and time, suburb where the cleaning was done, category of cleaning, volume of litter removed and cleaning duration.\r \r While all due care has been taken to ensure the data asset is accurate and current, Yarra City Council does not warrant that this data is definitive nor free of error and does not accept responsibility for any loss, damage, claim, expense, cost or liability whatsoever arising from reliance upon information provided herein.\r \r Feedback on the data asset - including compliments, complaints and requests for more detail - is welcome.
Initial data analysis checklist for data screening in longitudinal studies.
plos.figshare.com
xls
Updated May 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Initial data analysis checklist for data screening in longitudinal studies. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0295726.t001
Dataset updated
May 29, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Initial data analysis checklist for data screening in longitudinal studies.
H
Mo(Wa)²TER Data Science Workshop Material
dataverse.harvard.edu
Updated Sep 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amanda S. Hering; Kathryn B. Newhart; Derek Weix (2024). Mo(Wa)²TER Data Science Workshop Material [Dataset]. http://doi.org/10.7910/DVN/PKLIOC
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/PKLIOC
Dataset updated
Sep 9, 2024
Dataset provided by
Harvard Dataverse
Authors
Amanda S. Hering; Kathryn B. Newhart; Derek Weix
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset funded by
National Science Foundation
Description
These are the materials developed for the Mo(Wa)²TER Data Science workshop, which is designed for upper level and graduate students in environmental engineering or industry professionals in the water and wastewater treatment (W/WWT) fields. Working through this material will improve a learner’s data analysis and programming skills with the free R language and will focus exclusively on problems arising in W/WWT. Training in basic R coding, data cleaning, visualization, data analysis, statistical modeling, and machine learning are provided. Real W/WWT examples and exercises are given with each topic to strengthen and deepen comprehension. These materials aim to equip students with the skills to handle data science challenges in their future careers. Materials were developed over three offerings of this workshop in 2021, 2022, and 2023. At the time of publication, all code runs, but we provide no guarantees on future versions of R or packages used in this workshop.
Data and Code for "Urban socioeconomic variation influences the ecology and...
zenodo.org
bin, csv, txt
Updated Feb 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ella Martin; Samer El-Galmady; Marc T.J. Johnson; Ella Martin; Samer El-Galmady; Marc T.J. Johnson (2024). Data and Code for "Urban socioeconomic variation influences the ecology and evolution of trophic interactions" [Dataset]. http://doi.org/10.5281/zenodo.10640975
Explore at:
csv, bin, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10640975
Dataset updated
Feb 9, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ella Martin; Samer El-Galmady; Marc T.J. Johnson; Ella Martin; Samer El-Galmady; Marc T.J. Johnson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and code required for all analyses in Urban socioeconomic variation influences the ecology and evolution of trophic interactions.

Gall_Data_2022_new.csv contains data for gall predation and diameter measurements. Goldrod_Gall_Density.csv contains goldenrod and gall density measurements for each site. GallSitesFinal.csv contains location data for all study sites. DisseminationAreaCodes.csv contains codes for each site location needed to obtain census data.

The script DataCleaning.R assembles the above four datasets with environmental and census data to produce the final dataset: MartinElGalmady&Johnson2023_cleandataset.csv and the supplemental dataset with galls with early larval death removed: NoELD_dataset.csv

Analysis.R provides the code for conducting analyses and producing figures using the MartinElGalmady&Johnson2023_cleandataset.csv dataset (or the NoELD_dataset.csv for supplemental analyses with early larval death removed).

Detailed descriptions of each dataset are included in metadata.xlsx (NoELD_dataset has the same rows and columns as the full dataset).

All code was run in R verison 4.2.2