100+ datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
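    As a taste of the Iris module, the following is a minimal base R sketch of the kind of exploration described above (summary statistics, a correlation, a histogram, and a scatter plot). It uses only R's built-in iris dataset; the exact exercises in the published case study may differ.

        # Explore the built-in iris data: sepal and petal measurements for three species
        data(iris)

        summary(iris)                              # per-column summary statistics
        cor(iris$Sepal.Length, iris$Petal.Length)  # correlation between two traits

        hist(iris$Sepal.Length,
             main = "Sepal length across all irises",
             xlab = "Sepal length (cm)")

        plot(iris$Sepal.Length, iris$Petal.Length,
             col = iris$Species,                   # colour points by species
             xlab = "Sepal length (cm)",
             ylab = "Petal length (cm)")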

  3. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million monthly users. Reddit has many different subreddits; here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience on Reddit. The data were collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data were extracted using Python and Pushshift's API, and a little cleaning was done using NumPy and pandas (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:
    - author: Redditor name
    - author_fullname: Redditor full name
    - contest_mode: Contest mode (implements obscured scores and randomized sorting)
    - created_utc: Time the submission was created, represented in Unix time
    - domain: Domain of the submission
    - edited: Whether or not the post has been edited
    - full_link: Link to the post on the subreddit
    - id: ID of the submission
    - is_self: Whether or not the submission is a self post (text-only)
    - link_flair_css_class: CSS class used to identify the flair
    - link_flair_text: Flair on the post, or the link flair's text content
    - locked: Whether or not the submission has been locked
    - num_comments: The number of comments on the submission
    - over_18: Whether or not the submission has been marked as NSFW
    - permalink: A permalink for the submission
    - retrieved_on: Time the record was ingested
    - score: The number of upvotes for the submission
    - description: Description of the submission
    - spoiler: Whether or not the submission has been marked as a spoiler
    - stickied: Whether or not the submission is stickied
    - thumbnail: Thumbnail of the submission
    - question: Question asked in the submission
    - url: The URL the submission links to, or the permalink if a self post
    - year: Year of the submission
    - banned: Whether or not the submission was banned by a moderator

    This dataset can be used for flair prediction, NSFW classification, and different text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and see trends and patterns over the years.
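    As an illustration (not part of the dataset itself), a short R sketch of loading the data and checking the flair distribution before a flair-prediction task might look like the following; the file name askscience.csv is a placeholder for the actual file in the deposit, and only columns listed above are used.

        library(readr)
        library(dplyr)

        posts <- read_csv("askscience.csv")        # placeholder file name

        posts %>%
          count(link_flair_text, sort = TRUE) %>%  # posts per flair
          head(10)

        posts %>%
          group_by(over_18) %>%                    # NSFW vs SFW submissions
          summarise(n = n(), median_score = median(score, na.rm = TRUE))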

  4. Reddit: /r/worldnews (Submissions & Comments)

    • opendatabay.com
    Updated Jun 29, 2025
    Cite
    Datasimple (2025). Reddit: /r/worldnews (Submissions & Comments) [Dataset]. https://www.opendatabay.com/data/ai-ml/4f3e6b7d-569e-48b5-b3e8-6818eb389988
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset offers insight into the ways that public opinion shapes the world news cycle. Gathering posts on topics such as politics, current affairs, socio-economic issues, sports, entertainment and more from the r/worldnews subreddit, it provides engagement data for each post in order to analyze public sentiment. With columns including title, score, url, comms_num, created timestamp and body text for each post in the collection, it is easy to assess discussion-thread topics or dig into individual posts to ascertain which perspectives on world news have the most traction. From questions of foreign policy to environmental action and social movements, this tool makes it possible to analyse how these stories shape our global outlook.


    Research ideas: looking at correlations between post engagement and post topics to better understand the most popular topics on world news; analysing differences in post engagement by geographic region to better understand what is trending in certain areas of the world; and tracking changes in engagement over time as a way to assess public opinion about specific news cycles or events.

    License

    CC0

    Original Data Source: Reddit: /r/worldnews (Submissions & Comments)

  5. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the meaning of research texts and to make it available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
    5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
    6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online
    The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R
    The LSC was collected as TXT files. All documents were imported into R.

    Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories were removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts
    Medicine-related publications in particular use 'structured abstracts'. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used to extract abstracts concatenates these section headings with the first word of the following section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. Such concatenated words were detected by sampling medicine-related publications with human intervention and were split into two words; for instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. The section headings in such abstracts are listed below:
    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
    After correction, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as the Microsoft Word 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
    Publications can include a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name, added by conferences and journals. The tool used to extract and process abstracts in the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
    The cleaning procedure described in the previous step left some abstracts with fewer words than our minimum length criterion (30 words); 474 texts were removed.

    Step 8: Saving the Dataset into CSV Format
    Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
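    For illustration, a minimal R sketch of the idea behind Step 4 (splitting a section heading that was concatenated onto the first word of a section) is shown below. This is not the authors' code; the heading vector and example text are made up.

        library(stringr)

        headings <- c("Background", "Introduction", "Methods", "Results",
                      "Discussion", "Conclusion", "Conclusions")  # subset of the headings listed above

        split_concatenated_headings <- function(text, headings) {
          # Insert a space where a known heading is immediately followed by a
          # capitalised word, e.g. "ConclusionHigher" -> "Conclusion Higher".
          pattern <- paste0("\\b(", paste(headings, collapse = "|"), ")(?=[A-Z][a-z])")
          str_replace_all(text, regex(pattern), "\\1 ")
        }

        split_concatenated_headings(
          "IntroductionThe corpus ... ConclusionHigher citation counts were observed.",
          headings
        )
        #> "Introduction The corpus ... Conclusion Higher citation counts were observed."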

  6. Data and code for "Plastic bag bans and fees reduce harmful bag litter on shorelines"

    • openicpsr.org
    delimited
    Updated Apr 14, 2024
    Cite
    Anna Papp; Kimberly Oremus (2024). Data and code for "Plastic bag bans and fees reduce harmful bag litter on shorelines" [Dataset]. http://doi.org/10.3886/E200661V3
    Explore at:
    Available download formats: delimited
    Dataset updated
    Apr 14, 2024
    Dataset provided by
    University of Delaware
    Columbia University
    Authors
    Anna Papp; Kimberly Oremus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code and data for "Plastic bag bans and fees reduce harmful bag litter on shorelines" by Anna Papp and Kimberly Oremus. Please see the included README file for details. This folder includes code and data to fully replicate Figures 1-5, as well as instructions to rerun the data cleaning steps. Last modified: March 6, 2025. For any questions, please reach out to ap3907@columbia.edu.

    Code (replication/code): To replicate the main figures, run the file for each main figure (a sketch of driving these scripts appears at the end of this description):
    - 1_figure1.R
    - 1_figure2.R
    - 1_figure3.R
    - 1_figure4.R
    - 1_figure5.R
    Update the home directory to match where the directory ("replication" folder) is saved in each file before running it. The code will require you to install packages (see note on versions below). To replicate the entire data cleaning pipeline: first download all required data (explained in the Data section below), then run the code in the code/0_setup folder (refer to the separate README file).

    R version and package versions: The project was developed and executed using R version 4.0.0 (2024-04-24) on macOS 13.5. The code and main figures were created using the following package versions: data.table 1.14.2, dplyr 1.1.4, readr 2.1.2, tidyr 1.2.0, broom 0.7.12, stringr 1.5.1, lubridate 1.7.9, raster 3.5.15, sf 1.0.7, readxl 1.4.0, cobalt 4.4.1.9002, spdep 1.2.3, ggplot2 3.4.4, PNWColors 0.1.0, grid 4.0.0, gridExtra 2.3, ggpubr 0.4.0, knitr 1.48, zoo 1.8.12, fixest 0.11.2, lfe 2.8.7.1, did 2.1.2, didimputation 0.3.0, DIDmultiplegt 0.1.0, DIDmultiplegtDYN 1.0.15, scales 1.2.1, usmap 0.6.1, tigris 2.0.1, dotwhisker 0.7.4.

    Data: Processed data files are provided to replicate the main figures. To replicate from raw data, follow the instructions below.
    - Policies (needs to be recreated, or email for a version): Compiled from bagtheban.com/in-your-state/, rila.org/retail-compliance-center/consumer-bag-legislation, baglaws.com, nicholasinstitute.duke.edu/plastics-policy-inventory, and wikipedia.org/wiki/Plastic_bag_bans_in_the_United_States; and massgreen.org/plastic-bag-legislation.html and cawrecycles.org/list-of-local-bag-bans to confirm legislation in Massachusetts and California.
    - TIDES (needs to be downloaded for full replication): Download cleanup data for the United States from Ocean Conservancy (coastalcleanupdata.org/reports). Download files for 2000-2009, 2010-2014, and then each separate year from 2015 until 2023. Save files in the data/tides directory as year.csv (and 2000-2009.csv, 2010-2014.csv). Also download entanglement data for each year (2016-2023) separately into data/tides/entanglement (each file should be called 'entangled-animals-united-states_YEAR.csv').
    - Shapefiles (needs to be downloaded for full replication): Download shapefiles for processing cleanups and policies. Download county shapefiles from the US Census Bureau; save files in the data/shapefiles directory, with the county shapefile in a folder called county (files called cb_2018_us_county_500k.shp). Download TIGER Zip Code tabulation areas from the US Census Bureau (through data.gov); save files in the data/shapefiles directory, with the zip codes shapefile folder and files called tl_2019_us_zcta510.
    - Other: Helper files with US county and state FIPS codes and lists of US counties and zip codes are in the data/other directory, provided except as follows. Download the zip code list and 2020 IRS population data from United States zip codes and save as uszipcodes.csv in the data/other directory. Download demographic characteristics of zip codes from Social Explorer and save as raw_zip_characteristics.csv in the data/other directory.

    Refer to the .txt files in each data folder to ensure all necessary files are downloaded.
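    A minimal sketch (not part of the replication package) of driving the five figure scripts from R, under the assumption that the "replication" folder has been downloaded and the home directory inside each script has been updated as the README instructs; the home path below is a placeholder.

        home <- "~/replication"                  # placeholder: path to the downloaded folder
        setwd(home)

        figure_scripts <- file.path("code", paste0("1_figure", 1:5, ".R"))
        for (script in figure_scripts) {
          source(script)                         # each script reproduces one of Figures 1-5
        }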

  7. Reddit World News Post Analytics

    • opendatabay.com
    Updated Jul 8, 2025
    Cite
    Datasimple (2025). Reddit World News Post Analytics [Dataset]. https://www.opendatabay.com/data/web-social/4f3e6b7d-569e-48b5-b3e8-6818eb389988
    Explore at:
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics, World
    Description

    This dataset provides insight into how public opinion shapes the world news cycle, offering public opinion engagement data from posts on the r/worldnews subreddit. It gathers posts on various topics such as politics, current affairs, socio-economic issues, sports, and entertainment. The dataset includes engagement metrics for each post, allowing for analysis of public sentiment. It is a valuable tool for assessing discussion threads, delving into individual posts to understand prevalent perspectives on world news, and analysing how stories on foreign policy, environmental action, and social movements influence our global outlook.

    Columns

    The worldnews.csv dataset includes the following columns:
    - title: The title of the post. (String)
    - score: The number of upvotes the post has received. (Integer)
    - id: A unique identifier for the post. (String)
    - url: The URL of the post. (String)
    - comms_num: The number of comments the post has received. (Integer)
    - created: The date and time the post was created. (Datetime)
    - body: The main text content of the post. (String)
    - timestamp: The date and time the post was last updated. (Datetime)
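    A brief, illustrative R sketch of reading worldnews.csv and summarising engagement with the columns above follows; parsing created as a date-time is an assumption about how it is stored, so adjust if the file uses another representation.

        library(readr)
        library(dplyr)
        library(lubridate)

        news <- read_csv("worldnews.csv") %>%
          mutate(created = as_datetime(created))              # assumes a parseable date-time

        cor(news$score, news$comms_num, use = "complete.obs") # upvotes vs comments

        news %>%
          arrange(desc(score)) %>%
          select(title, score, comms_num) %>%
          head(5)                                             # most-upvoted posts of the week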

    Distribution

    The dataset is provided in CSV format. It contains 1,871 unique post IDs. While a total row count for the entire dataset is not explicitly stated, data is available in various ranges for scores, comments, and timestamps, indicating a substantial collection of records. For instance, timestamps span from 8th December 2022 to 15th December 2022.

    Usage

    This dataset is ideal for:
    - Understanding the most popular topics on world news by correlating post engagement with their subject matter.
    - Analysing differences in post engagement across various geographic regions to identify trending global issues.
    - Tracking changes in public opinion by monitoring engagement over time, particularly concerning specific news cycles or events.
    - Conducting deep dives into individual posts to ascertain which perspectives on world news gain the most traction.
    - Analysing how global stories, from foreign policy to environmental action and social movements, shape collective global outlook.

    Coverage

    The dataset offers global coverage of public opinion, as it is sourced from the r/worldnews subreddit. The time range for the included posts spans from 8th December 2022 to 15th December 2022. The scope primarily focuses on posts related to general world news, politics, current affairs, and socio-economic issues.

    License

    CC0

    Who Can Use It

    This dataset is well-suited for data science and analytics professionals, researchers, and anyone interested in:
    - Analysing public sentiment related to world events.
    - Studying the dynamics of online news consumption and engagement.
    - Exploring the relationship between social media discussions and global outlook.
    - Developing Natural Language Processing (NLP) models for text analysis and sentiment detection.

    Dataset Name Suggestions

    • Reddit World News Engagement Data
    • Global Public Opinion on News
    • r/worldnews Submission & Comment Data
    • World News Social Sentiment
    • Reddit World News Post Analytics


    Original Data Source: Reddit: /r/worldnews (Submissions & Comments)

  8. Data cleaning and analysis for the Master's thesis: DIFFERENCES IN CONSUMER PREFERENCES FOR UNWEATHERED AND WEATHERED WOOD

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, html
    Updated Aug 13, 2020
    Cite
    Hana Remesova; Michael Burnard; Michael Burnard; Hana Remesova (2020). Data cleaning and analysis for the Master's thesis: DIFFERENCES IN CONSUMER PREFERENCES FOR UNWEATHERED AND WEATHERED WOOD [Dataset]. http://doi.org/10.5281/zenodo.3981177
    Explore at:
    Available download formats: html, csv, bin
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Hana Remesova; Michael Burnard; Michael Burnard; Hana Remesova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data and analysis files support the Master's thesis submitted by Hana Remesova at the University of Primorska, Faculty of Mathematics, Natural Sciences, and Information Technologies. The .csv files are data files; the .Rmd file is an R Markdown document that can be run. The product of knitting the .Rmd file is the .html file.
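    Reproducing the analysis should therefore amount to knitting the R Markdown file with the .csv files alongside it; a minimal sketch is below, with "analysis.Rmd" as a placeholder for the actual file name in the deposit.

        # install.packages("rmarkdown")   # once, if not already installed
        rmarkdown::render("analysis.Rmd", output_format = "html_document")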

  9. Cleaning against MHV dataset

    • catalog.data.gov
    Updated Mar 4, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Cleaning against MHV dataset [Dataset]. https://catalog.data.gov/dataset/cleaning-against-mhv-dataset
    Explore at:
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    United States Environmental Protection Agency: http://www.epa.gov/
    Description

    Efficacy data against MHV for cleaning surfaces. This dataset is associated with the following publication: Hardison, R., S. Nelson, D. Barriga, J. Ghere, G. Fenton, R. James, M. Stewart, S. Lee, M.W. Calfee, S. Ryan, and M. Howard. Efficacy of Detergent-Based Cleaning Methods Against Coronavirus MHV-A59 on Porous and Non-Porous Surfaces. JOURNAL OF OCCUPATIONAL AND ENVIRONMENTAL HYGIENE. Taylor & Francis, Inc., Philadelphia, PA, USA, 19(2): 91-101, (2022).

  10. Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Explore at:
    Dataset updated
    Oct 15, 2019
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.
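    A short, illustrative tidyverse-style cleaning pipeline in the spirit of the workshop (not taken from its materials) is shown below: tidy the column names, handle missing values, and reshape. The toy data frame is made up.

        library(dplyr)
        library(tidyr)

        messy <- data.frame(
          `Site Name` = c("A", "B", "B", NA),
          temp_c      = c(12.1, NA, 13.4, 11.0),
          rain_mm     = c(3, 5, NA, 2),
          check.names = FALSE
        )

        clean <- messy %>%
          rename(site = `Site Name`) %>%        # consistent, machine-friendly names
          filter(!is.na(site)) %>%              # drop rows with no site
          mutate(temp_c = replace_na(temp_c, mean(temp_c, na.rm = TRUE))) %>%
          pivot_longer(c(temp_c, rain_mm), names_to = "measure", values_to = "value")

        clean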

  11. Dataset for a globally synthesised and flagged bee occurrence dataset and cleaning workflow

    • open.flinders.edu.au
    • researchdata.edu.au
    txt
    Updated Jun 17, 2024
    Cite
    James Dorey; Erica E. Fischer; Paige R. Chesshire; Angela Nava-Bolaños; Robert L O'Reilly; Silas Bossert; Shannon M. Collins; Elinor M. Lichtenberg; Tucker, Erika M.; Allan Smith-Pardo; Armando Falcón-Brindis; Diego A. Guevara; Bruno Ribeiro; Diego de Pedro; Keng-Lou James Hung; Katherine A. Parys; Lindsie M. McCabe; Matthew S. Rogan; Robert L. Minckley; Santiago José Elías Velazco; Terry Griswold; Tracy A. Zarrillo; Walter Jetz; Yanina V. Sica; Michael Christopher Orr.; Laura Melissa Guzman; John S. Ascher; Alice Hughes; Neil S. Cobb (2024). Dataset for a globally synthesised and flagged bee occurrence dataset and cleaning workflow [Dataset]. http://doi.org/10.25451/flinders.21709757.v7
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 17, 2024
    Dataset provided by
    Flinders University
    Authors
    James Dorey; Erica E. Fischer; Paige R. Chesshire; Angela Nava-Bolaños; Robert L O'Reilly; Silas Bossert; Shannon M. Collins; Elinor M. Lichtenberg; Tucker, Erika M.; Allan Smith-Pardo; Armando Falcón-Brindis; Diego A. Guevara; Bruno Ribeiro; Diego de Pedro; Keng-Lou James Hung; Katherine A. Parys; Lindsie M. McCabe; Matthew S. Rogan; Robert L. Minckley; Santiago José Elías Velazco; Terry Griswold; Tracy A. Zarrillo; Walter Jetz; Yanina V. Sica; Michael Christopher Orr.; Laura Melissa Guzman; John S. Ascher; Alice Hughes; Neil S. Cobb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDC R workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates, and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.
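    As a generic illustration of record-level flagging (this is plain dplyr, not the BeeBDC API), the sketch below flags missing coordinates and exact duplicates and then derives a "cleaned" subset alongside the "flagged-but-uncleaned" table, mirroring the two formats described above; the toy records are made up.

        library(dplyr)

        occ <- data.frame(
          scientificName   = c("Apis mellifera", "Apis mellifera", "Bombus terrestris"),
          decimalLatitude  = c(52.1, 52.1, NA),
          decimalLongitude = c(0.1, 0.1, 13.4),
          eventDate        = c("2014-06-01", "2014-06-01", "2015-07-12")
        )

        flagged <- occ %>%
          mutate(
            flag_no_coords = is.na(decimalLatitude) | is.na(decimalLongitude),
            flag_duplicate = duplicated(occ)          # exact duplicate of an earlier record
          )

        cleaned <- flagged %>%
          filter(!flag_no_coords, !flag_duplicate)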

  12. Data from: Datasets for lot sizing and scheduling problems in the fruit-based beverage production process

    • data.mendeley.com
    • narcis.nl
    Updated Jan 19, 2021
    Cite
    Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
    Explore at:
    Dataset updated
    Jan 19, 2021
    Authors
    Juan Piñeros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented here were partially used in "Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings" (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in "A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning" (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in "A heuristic approach to optimize the production scheduling of fruit-based beverages" (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. In this sense, some papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature to verify the accuracy and performance of proposed methods or to serve as a benchmark for other researchers considering this problem. The authors have been using small datasets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the methods proposed in the scientific literature for solving real scenarios of the problem. The datasets presented here include data based on real data collected from five beverage companies. We present four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4].
    [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. doi: 10.1016/j.compchemeng.2020.107038.
    [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. doi: 10.1007/s10696-017-9303-9.
    [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20.
    [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).

  13. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World, World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
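    A minimal sketch of this two-stage design in R (not the distributed script) is shown below: allocate enumeration areas (EAs) to strata in proportion to stratum size, then draw 25 households within each selected EA. The toy frames and column names are assumptions for illustration.

        library(dplyr)

        # Toy sampling frames standing in for the real ones
        ea_frame <- data.frame(
          ea_id = 1:6,
          stratum = c("north_urban", "north_rural", "south_urban",
                      "south_urban", "south_rural", "north_urban"),
          n_households = c(300, 150, 400, 350, 200, 250)
        )
        hh_frame <- data.frame(
          hh_id = seq_len(sum(ea_frame$n_households)),
          ea_id = rep(ea_frame$ea_id, times = ea_frame$n_households)
        )

        n_ea_total <- 4   # scaled-down stand-in for the real number of EAs

        # Stage 1: EAs per stratum, proportional to stratum size (at least one each)
        ea_alloc <- ea_frame %>%
          count(stratum, wt = n_households, name = "stratum_size") %>%
          mutate(n_ea = pmax(1, round(n_ea_total * stratum_size / sum(stratum_size))))

        selected_eas <- ea_frame %>%
          inner_join(ea_alloc, by = "stratum") %>%
          group_by(stratum) %>%
          group_modify(~ slice_sample(.x, n = unique(.x$n_ea))) %>%
          ungroup()

        # Stage 2: 25 households at random within each selected EA
        sample_hh <- hh_frame %>%
          semi_join(selected_eas, by = "ea_id") %>%
          group_by(ea_id) %>%
          slice_sample(n = 25) %>%
          ungroup()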

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  14. Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated May 26, 2011
    Cite
    Heather A. Piwowar (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data [Dataset]. http://doi.org/10.5061/dryad.mf1sd
    Explore at:
    Available download formats: zip
    Dataset updated
    May 26, 2011
    Dataset provided by
    Dryad
    Authors
    Heather A. Piwowar
    Time period covered
    2011
    Description

    Microarray publications and publication attributes (rawdata.txt): 157 columns of attributes for 11603 publications identified as creating gene expression microarray data. Tab delimited. Key: PubMed identifier (pmid). See stats.R for data cleaning steps and more details on variables. Data collected in January 2010 using code available at http://github.com/hpiwowar/pypub.
    Journal policy details for microarray data (journal_policies_microarray_data.csv): Data sharing policy details for journals that publish a lot of gene expression microarray data. Policy links, excerpts, and classifications (24 columns) for 156 journals. Some of these classifications are included as columns in rawdata.txt as journal policy attributes.
    Statistical analysis R script (stats.R): R script for data cleaning, statistical analysis, and graphics as presented in the paper. Takes rawdata.txt as input and loads helper_functions.R.
    Helper R script functions (helper_functions.R): Helper functions loaded by stats.R for analysis and graphical output.

  15. LScDC (Leicester Scientific Dictionary-Core)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LScDC (Leicester Scientific Dictionary-Core)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary), Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we did not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below: LScD (v3) contains 972,060 words; LScDC (v3) contains 103,998 words.
    * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
    ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

    [Version 2] Getting Started

    This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts.

    The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, the use of an enormous amount of text data challenges the performance and accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rare occurrence of words in a collection is not useful in discriminating texts in large corpora, as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we decided on the following process for the LScD: removing words that appear in no more than 10 documents.
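    For illustration only (not the author's code), the sub-setting rule described above can be sketched in R as filtering a word-by-document-frequency table; the column names and toy values below are assumptions about the published CSV.

        library(dplyr)

        lscd <- tibble::tibble(                       # stand-in for the LScD word list
          word        = c("model", "cell", "spectroscopy", "zymogram"),
          n_documents = c(250000, 180000, 4200, 7)
        )

        lscdc <- lscd %>%
          filter(n_documents > 10) %>%                # discard words in no more than 10 documents
          arrange(desc(n_documents))                  # order by number of documents, as in LScDC

        lscdc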

  16. Data from: BDcleaner: a workflow for cleaning taxonomic and geographic errors in occurrence data archived in biodiversity databases

    • data.mendeley.com
    Updated Oct 4, 2019
    Cite
    Jing Jin (2019). BDcleaner: a workflow for cleaning taxonomic and geographic errors in occurrence data archived in biodiversity databases [Dataset]. http://doi.org/10.17632/pghkfm5sm9.1
    Explore at:
    Dataset updated
    Oct 4, 2019
    Authors
    Jing Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These R scripts are the source code for the study "BDcleaner: a workflow for cleaning taxonomic and geographic errors in occurrence data archived in biodiversity databases".

  17. Step-downs analysis: aggregated data and analytical code

    • bridges.monash.edu
    • researchdata.edu.au
    txt
    Updated May 31, 2023
    Cite
    Tyler Lane; Luke Sheehan; Shannon Gray; Dianne Beck; Alex Collie (2023). Step-downs analysis: aggregated data and analytical code [Dataset]. http://doi.org/10.26180/5dba1e5b4277a
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Monash University
    Authors
    Tyler Lane; Luke Sheehan; Shannon Gray; Dianne Beck; Alex Collie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains analytical code and aggregate data for evaluating the impact of step-downs (reductions in workers' compensation payments after several months in the system) on scheme exit, plus the R project file. I have included the cleaning file but no case-level data.

    To use: Download all files into a single folder and open the project file. You should be able to run most RMarkdown files from there. The meta-analysis file depends on outputs from all other analytical files.

    Note on v5: This update reflects analytical changes made in response to peer review and the correction of an existing error. Analytical changes include an additional sensitivity analysis investigating effects on claims unaffected by step-downs, which would indicate confounding. The existing error was that exclusion criteria had not been applied correctly in some cases. Claims affected by step-downs must consider both maximum/minimum caps and compensation rates; otherwise, claims may see minimal effects of step-downs. E.g., if the cap was $2000 per week with an initial rate of 95%, injured workers could only be included if they made $2000 / 95% (about $2,105) or less.

  18. Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It

    • dataverse.harvard.edu
    Updated Nov 5, 2018
    Cite
    Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Grant Allard
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

    Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.

    Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.

    Allard_Analysis_APPAM SBIR project: Forthcoming

    Allard_Spatial Analysis: Forthcoming

    Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.

    Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.

    Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.

    Primary Sources:
    Small Business Administration. “Annual Reports Dashboard,” 2018. https://www.sbir.gov/awards/annual-reports.
    Small Business Administration. “SBIR Awards Data,” 2018. https://www.sbir.gov/api.
    Small Business Administration. “SBIR Solicit Data,” 2018. https://www.sbir.gov/api.

  19. Examplesoildatacleaning - Vdataset - LDM

    • service.tib.eu
    Updated May 16, 2025
    Cite
    (2025). Examplesoildatacleaning - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/goe-doi-10-25625-lisglr
    Explore at:
    Dataset updated
    May 16, 2025
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    An introductory R script showcasing the use of regular expressions to cope with common data cleaning of variables containing characters.
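    A small example in the same spirit (illustrative only, not the deposited script) uses regular expressions to clean a character variable by stripping a prefix and unifying units before converting to numeric; the toy vector is made up.

        depth_raw <- c(" 10 cm", "10cm", "ca. 15 CM", "20 centimetres", NA)

        depth_clean <- trimws(depth_raw)                                      # strip stray whitespace
        depth_clean <- gsub("(?i)ca\\.\\s*", "", depth_clean, perl = TRUE)    # drop "ca." prefixes
        depth_clean <- gsub("(?i)\\s*(cm|centimetres?)\\b", "", depth_clean, perl = TRUE)  # drop units

        depth_cm <- as.numeric(depth_clean)                                   # 10 10 15 20 NA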

  20. Population Pyramid Data and R Script for the US, States, and Counties 1970 - 2017

    • openicpsr.org
    delimited
    Updated Jan 23, 2020
    + more versions
    Cite
    Nathanael Rosenheim (2020). Population Pyramid Data and R Script for the US, States, and Counties 1970 - 2017 [Dataset]. http://doi.org/10.3886/E117081V2
    Explore at:
    Available download formats: delimited
    Dataset updated
    Jan 23, 2020
    Dataset provided by
    Texas A&M University
    Authors
    Nathanael Rosenheim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Counties, States
    Description

    Population pyramids provide a way to visualize the age and sex composition of a geographic region, such as a nation, state, or county. A standard population pyramid divides sex into two bar charts or histograms, one for the male population and one for the female population. The two charts mirror each other and divide age into 5-year cohorts. The shape of a population pyramid provides insights into a region's fertility, mortality, and migration patterns. When a region has high fertility and mortality but low migration, the visualization will look like a pyramid, with the youngest age cohort (0-4 years) representing the largest percent of the population and each older cohort representing a progressively smaller percent of the population.

    In many regions fertility and mortality have decreased significantly since 1970, as people live longer and women have fewer children. With lower fertility and mortality, population pyramids are shaped more like a pillar.

    While population pyramids can be made for any geographic region, when interpreting population pyramids for smaller areas (like counties) the most important force that shapes the pyramid is often in- and out-migration (Wang and vom Hofe, 2006, p. 65). For smaller regions, population pyramids can have unique shapes.

    This data archive provides the resources needed to generate population pyramids for the United States, individual states, and any county within the United States. Population pyramids usually require significant data cleaning and graph making skills to generate one pyramid. With this data archive the data cleaning has been completed and the R script provides reusable code to quickly generate graphs. The final output is an image file with six graphs on one page. The final layout makes it easy to compare changes in population age and sex composition for any state and any county in the US for 1970, 1980, 1990, 2000, 2010, and 2017.
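    For orientation, a minimal ggplot2 sketch of a single population pyramid is shown below (illustrative only; the archived script builds the six-panel state and county comparisons from the cleaned census data). The age cohorts and counts are made up.

        library(ggplot2)
        library(dplyr)

        pyramid <- expand.grid(
          age = c("0-4", "5-9", "10-14", "15-19", "20-24"),
          sex = c("Male", "Female")
        ) %>%
          mutate(
            age = factor(age, levels = c("0-4", "5-9", "10-14", "15-19", "20-24")),
            pop = c(110, 105, 100, 95, 90, 104, 100, 96, 93, 91),
            pop_signed = ifelse(sex == "Male", -pop, pop)   # plot males to the left
          )

        ggplot(pyramid, aes(x = age, y = pop_signed, fill = sex)) +
          geom_col(width = 0.9) +
          coord_flip() +                           # age cohorts on the vertical axis
          scale_y_continuous(labels = abs) +       # show magnitudes, not signs
          labs(x = "Age cohort", y = "Population (thousands)", fill = NULL)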
