100+ datasets found

Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Writing Clean Code in R Workshop
qubeshub.org
Updated Oct 15, 2019
Cite
Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
Dataset updated
Oct 15, 2019
Dataset provided by
QUBES
Authors
Max Joseph; Leah Wasser
Description
When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:
- demographics (281 variables),
- dietary consumption (324 variables),
- physiological functions (1,040 variables),
- occupation (61 variables),
- questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
- medications (29 variables),
- mortality information linked from the National Death Index (15 variables),
- survey weights (857 variables),
- environmental exposure biomarker measurements (598 variables), and
- chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.
- The curated NHANES datasets comprise 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
TidY_PracticE_DatasetS
kaggle.com
zip
Updated Jun 24, 2023
Cite
DEBALINA MITRA (2023). TidY_PracticE_DatasetS [Dataset]. https://www.kaggle.com/datasets/debalinamitra/tidy-practice-datasets
Explore at:
Available download formats: zip (139335 bytes)
Dataset updated
Jun 24, 2023
Authors
DEBALINA MITRA
Description
Original dataset that is shared on Github can be found here. These are hands-on practice datasets that were linked through the Coursera Guided Project Certificate Course for Handling Missing Values in R, a part of Coursera Project Network. The dataset links were shared by the original author and instructor of the course, Arimoro Olayinka Imisioluwa.

Things you could do with this dataset: As a beginner in R, these datasets helped me get the hang of making data clean and tidy and handling missing values (numeric only) using R. Good for anyone looking for a beginner to intermediate level understanding in these subjects.

Here are my notebooks as kernels using these datasets and using a few more preloaded datasets in R, as suggested by the instructor: TidY DatA Practice; MissinG DatA HandlinG - NumeriC
R/r custom clean llc USA Import & Buyer Data
seair.co.in
Updated Jan 11, 2018
Cite
Seair Exim (2018). R/r custom clean llc USA Import & Buyer Data [Dataset]. https://www.seair.co.in
Explore at:
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Jan 11, 2018
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Data cleaning EVI2
figshare.com
txt
Updated May 13, 2019
Cite
Geraldine Klarenberg (2019). Data cleaning EVI2 [Dataset]. http://doi.org/10.6084/m9.figshare.5327527.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5327527.v1
Dataset updated
May 13, 2019
Dataset provided by
Figshare (http://figshare.com/)
Authors
Geraldine Klarenberg
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Scripts to clean EVI2 data obtained from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). Data obtained in 2012.
- outlier detection and removal/replacement
- alignment of 2 periods
The manuscript detailing the methods and resulting data sets has been accepted for publication in Nature Scientific Data (05/11/2019).
Instructions: use the R Markdown html file for instructions!
Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree")
The fractured lab notebook: undergraduate and ecological data management...
search.dataone.org
Updated Nov 14, 2013
Cite
National Center for Ecological Analysis and Synthesis; Carly Strasser (2013). The fractured lab notebook: undergraduate and ecological data management training in the United States [Dataset]. https://search.dataone.org/view/knb.300.9
Dataset updated
Nov 14, 2013
Dataset provided by
Knowledge Network for Biocomplexity
Authors
National Center for Ecological Analysis and Synthesis; Carly Strasser
Time period covered
Mar 29, 2011 - May 25, 2011
Variables measured
Answer, Coding, EndDate, Question, R script, StartDate, First Name, Param name, Description, RespondentID, and 157 more
Description
Data presented here are those collected from a survey of Ecology professors at 48 undergraduate institutions to assess the current state of data management education. The following files have been uploaded:

Scripts (2):
1. DataCleaning_20120105.R is an R script for cleaning up data prior to analysis. This script removes spaces, substitutes text for codes, removes duplicate schools, and converts questions and answers from the survey into simpler parameter names, without any numbers, spaces, or symbols. This script is heavily annotated to help the user understand what is being done to the data files. The script produces the file cleandata_[date].Rdata, which is called in the file DataTrimming_20120105.R.
2. DataTrimming_20120105.R is an R script for trimming extraneous variables not used in final analyses. Some variables are combined as needed and NAs (no answers) are removed. The file is heavily annotated. It produces trimdata_[date].Rdata, which was imported into Excel for summary statistics.

Data files (3):
3. AdvancedSpreadsheet_20110526.csv is the output file from the SurveyMonkey online survey tool used for this project. It is a .csv sheet with the complete set of survey data, although some data (e.g., open-ended responses, institution names) are removed to prevent schools and/or instructors from being identifiable. This file is read into DataCleaning_20120105.R for cleaning and editing.
4. VariableRenaming_20110711.csv is called into the DataCleaning_20120105.R script to convert the questions and answers from the survey into simple parameter names, without any numbers, spaces, or symbols.
5. ParamTable.csv is a list of the parameter names used for analysis and the value codes. It can be used to understand outputs from the scripts above (cleandata_[date].Rdata and trimdata_[date].Rdata).
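
The cleaning steps described for DataCleaning_20120105.R (trimming spaces, substituting text for codes, dropping duplicate schools, simplifying names) can be illustrated with a minimal sketch. This is not the archived R script: it is written in Python, the column names, answer codes, and the `simple_name` and `clean_responses` helpers are hypothetical, and keeping the first response per school is an assumption. The archived script renames variables through VariableRenaming_20110711.csv, whereas this sketch simply strips non-letters.

```python
import re


def simple_name(question):
    """Reduce a survey question to letters only (no numbers, spaces, or symbols)."""
    return re.sub(r"[^A-Za-z]", "", question)


def clean_responses(rows, code_labels, school_key="School"):
    """Trim whitespace, decode answer codes, and keep one row per school.

    rows        -- list of dicts, one per survey response (hypothetical layout)
    code_labels -- mapping from answer code to readable text, e.g. {"1": "Yes"}
    """
    seen, cleaned = set(), []
    for row in rows:
        school = row[school_key].strip()
        if school in seen:
            continue  # assumption: the first response per school is the one kept
        seen.add(school)
        cleaned.append({
            # decode the value if it is a known code, otherwise keep the trimmed text
            simple_name(key): code_labels.get(str(value).strip(), str(value).strip())
            for key, value in row.items()
        })
    return cleaned


if __name__ == "__main__":
    sample = [
        {"School": " Alpha College ", "Q1. Do you teach data management?": "1"},
        {"School": "Alpha College", "Q1. Do you teach data management?": "2"},
    ]
    print(clean_responses(sample, {"1": "Yes", "2": "No"}))
```

A lookup table, as the archived script uses, is the safer choice for real survey data because stripping characters can make two different questions collapse to the same name.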
Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It
dataverse.harvard.edu
Updated Nov 5, 2018
Cite
Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
Unique identifier
https://doi.org/10.7910/DVN/CKTAZX
Dataset updated
Nov 5, 2018
Dataset provided by
Harvard Dataverse
Authors
Grant Allard
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

- Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
- Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
- Allard_Analysis_APPAM SBIR project: Forthcoming
- Allard_Spatial Analysis: Forthcoming
- Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
- Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.
- Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.

Primary Sources
Small Business Administration. “Annual Reports Dashboard,” 2018. https://www.sbir.gov/awards/annual-reports.
Small Business Administration. “SBIR Awards Data,” 2018. https://www.sbir.gov/api.
Small Business Administration. “SBIR Solicit Data,” 2018. https://www.sbir.gov/api.
Shanghai Hanyang Clean Technology R Export Import Data | Eximpedia
eximpedia.app
Updated Oct 8, 2025
Cite
(2025). Shanghai Hanyang Clean Technology R Export Import Data | Eximpedia [Dataset]. https://www.eximpedia.app/companies/shanghai-hanyang-clean-technology-r/57881709
Dataset updated
Oct 8, 2025
Area covered
Shanghai
Description
Shanghai Hanyang Clean Technology R Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
Replication Data for: Race, gender, and the politics of incivility
dataverse.harvard.edu
search.dataone.org
Updated Jun 10, 2020
Cite
Sam Gubitz (2020). Replication Data for: Race, gender, and the politics of incivility [Dataset]. http://doi.org/10.7910/DVN/ODPNI8
Unique identifier
https://doi.org/10.7910/DVN/ODPNI8
Dataset updated
Jun 10, 2020
Dataset provided by
Harvard Dataverse
Authors
Sam Gubitz
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Use the project file first, then open the cleaning R file to clean the raw data. Then use the R file called OLS analysis to analyze the cleaned data, which was outputted as a .rds file.
Data from: Data and code from: A natural polymer material as a pesticide...
catalog.data.gov
s.cnmilf.com
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Data and code from: A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-a-natural-polymer-material-as-a-pesticide-adjuvant-for-mitigating-off-t
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset contains all data and code required to clean the data, fit the models, and create the figures and tables for the laboratory experiment portion of the manuscript:

Kannan, N., Q. D. Read, and W. Zhang. 2024. A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health. Heliyon, in press. https://doi.org/10.1016/j.heliyon.2024.e35510.

In this dataset, we archive results from several laboratory and field trials testing different adjuvants (spray additives) that are intended to reduce particle drift, increase particle size, and slow down the particles from pesticide spray nozzles. We fit statistical models to the droplet size and speed distribution data and statistically compare different metrics between the adjuvants (sodium alginate, polyacrylamide [PAM], and control without any adjuvants).

The following files are included:
- RawDataPAMsodAlgOxfLsr.xlsx: Raw data for primary analyses
- OrganizedDataPaperRevision20240614.xlsx: Raw data to produce density plots presented in Figs. 8 and 9
- raw_data_readme.md: Markdown file with description of the raw data files
- R_code_supplement.R: All R code required to reproduce primary analyses
- R_code_supplement2.R: R code required to produce density plots presented in Figs. 8 and 9

Intermediate R output files are also included so that tables and figures can be recreated without having to rerun the data preprocessing, model fitting, and posterior estimation steps:
- pam_cleaned.RData: Data combined into clean R data frames for analysis
- velocityscaledlogdiamfit.rds: Fitted brms model object for velocity
- lnormfitreduced.rds: Fitted brms model object for diameter distribution
- emm_con_velo_diam_draws.RData: Posterior distributions of estimated marginal means for velocity
- emm_con_draws.RData: Posterior distributions of estimated marginal means for diameter distribution

The following software and package versions were used:
- R version 4.3.1
- CmdStan version 2.33.1
- R packages:
  - brms version 2.20.5
  - cmdstanr version 0.5.3
  - fitdistrplus version 1.1-11
  - tidybayes version 3.0.4
  - emmeans version 1.8.9
A dataset for temporal analysis of files related to the JFK case
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Luczak-Roesch, Markus (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1042153
Dataset updated
Jan 24, 2020
Dataset provided by
Victoria University of Wellington
Authors
Luczak-Roesch, Markus
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

The code to derive the dataset is given as follows:

# BEGIN R DATA PROCESSING SCRIPT

library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

# the meta file containing all metadata for the PDF files (e.g. publication date)
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ",")

meta$Doc.Date <- as.character(meta$Doc.Date)

# drop rows with an empty date or a "/0000" year
# (a logical mask avoids -which(), which would drop every row if nothing matched)
drop <- meta$Doc.Date %in% "" | grepl("/0000", meta$Doc.Date)
meta.clean <- meta[!drop, ]

for (i in seq_len(nrow(meta.clean))) {
  # replace "00" placeholders with "01"; note that this matches "00" anywhere in the string
  meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])

  # short dates carry a two-digit year; expand them to the four-digit layout
  if (nchar(meta.clean$Doc.Date[i]) < 10) {
    meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
  }
}

meta.clean$Doc.Date <- as.POSIXct(strptime(meta.clean$Doc.Date, format = "%m/%d/%Y"))

meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)

for (i in seq_len(nrow(meta.clean))) {
  # for (i in 1:3) {

  pdf_path <- paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i]))
  pdf_prop <- pdftools::pdf_info(pdf_path)

  # one temporary image file per PDF page
  tmp_files <- c()
  for (k in 1:pdf_prop$pages) {
    tmp_files <- c(tmp_files, paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/", k))
  }

  img_file <- pdftools::pdf_convert(pdf_path, format = "tiff", pages = NULL, dpi = 700, filenames = tmp_files)

  # OCR each page image and concatenate the text
  txt <- ""
  for (j in 1:length(img_file)) {
    extract <- ocr(img_file[j], engine = tesseract("eng"))
    # unlink(img_file)
    txt <- paste(txt, extract, collapse = " ")
  }

  # strip punctuation, collapse whitespace ("\\s+" needs the escaped backslash), lower-case
  docs <- rbind(docs, data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[ ]", " ", txt))), to = "UTF-8"), dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"), stringsAsFactors = FALSE), stringsAsFactors = FALSE)
}

write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)

# END R DATA PROCESSING SCRIPT
Data from: Designing data science workshops for data-intensive environmental...
datadryad.org
datasetcatalog.nlm.nih.gov
zip
Updated Dec 8, 2020
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
Dryad
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
Time period covered
Nov 14, 2020
Description
Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, r...
Cyclistic_data_visualization
kaggle.com
Updated Jun 12, 2021
Cite
Mark Woychick (2021). Cyclistic_data_visualization [Dataset]. https://www.kaggle.com/markwoychick/cyclistic-data-visualization
Dataset updated
Jun 12, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mark Woychick
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.

Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.

Content

I used the script noted below to clean the files and then added some steps to create the visualizations that complete my analysis. The additional steps are noted in the corresponding R Markdown file for this data set.

Acknowledgements

Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html

Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

Note: Combined data set has 3,876,042 rows, so you will likely need to run R analysis on your computer (e.g., R Console) rather than in the cloud (e.g., RStudio Cloud)

Inspiration

This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.

One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.

I am also interested to see what others did with this analysis - what were the findings and insights you found?
Scripts for cleaning and analysis of data from SOFC experiment on...
data.4tu.nl
zip
Updated Aug 27, 2024
Cite
Berend van Veldhuizen (2024). Scripts for cleaning and analysis of data from SOFC experiment on inclination test-bench. [Dataset]. http://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
Dataset updated
Aug 27, 2024
Dataset provided by
4TU.ResearchData
Authors
Berend van Veldhuizen
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
2023
Dataset funded by
European Commission
Description
This data set contains the scripts used for importing, trimming, cleaning, analysing, and plotting a large dataset of inclination experiments with an SOFC module. The measurement data is confidential, so it could not be published alongside the scripts. One row of dummy input data is published to illustrate the structure of the analysed data. The analysis is used for the journal paper "Experimental Evaluation of a Solid Oxide Fuel Cell System Exposed to Inclinations and Accelerations by Ship Motions".
The scripts contain:
- A script that reads the data, removes unusable data and transforms into analysable dataframes (Clean and trim.R)
- Two files to make a wide variety of plots (Plotting.R and Specificplots.R)
- A file that performs a Gaussian process regression to estimate the degradation rate (Degradation estimation.R)
AI in Data Cleaning Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Cite
Research Intelo (2025). AI in Data Cleaning Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-data-cleaning-market
Explore at:
Available download formats: csv, pdf, pptx
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
AI in Data Cleaning Market Outlook

According to our latest research, the global AI in Data Cleaning market size reached USD 1.82 billion in 2024, demonstrating remarkable momentum driven by the exponential growth of data-driven enterprises. The market is projected to grow at a CAGR of 28.1% from 2025 to 2033, reaching an estimated USD 17.73 billion by 2033. This exceptional growth trajectory is primarily fueled by increasing data volumes, the urgent need for high-quality datasets, and the adoption of artificial intelligence technologies across diverse industries.
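
The growth figures quoted above can be cross-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / periods) - 1. The sketch below is only an illustration: the report does not say how many compounding periods it uses, so the nine periods (2024 to 2033) are an assumption, and the function names are hypothetical.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by a start value, end value, and period count."""
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value, rate, periods):
    """Value reached by compounding start_value at a fixed annual rate."""
    return start_value * (1 + rate) ** periods


START_2024 = 1.82     # USD billion, as quoted in the report
END_2033 = 17.73      # USD billion, as quoted in the report
STATED_CAGR = 0.281   # 28.1%, as quoted in the report
PERIODS = 2033 - 2024  # assumption: nine annual compounding periods

implied = cagr(START_2024, END_2033, PERIODS)
projected = project(START_2024, STATED_CAGR, PERIODS)

print(f"CAGR implied by the two endpoints: {implied:.1%}")
print(f"2033 value at the stated CAGR: USD {projected:.2f} billion")
```

Changing `PERIODS` shows how sensitive the implied rate is to the choice of base year, which is worth checking before reusing figures from a market report.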

The surging demand for automated data management solutions remains a key growth driver for the AI in Data Cleaning market. As organizations generate and collect massive volumes of structured and unstructured data, manual data cleaning processes have become insufficient, error-prone, and costly. AI-powered data cleaning tools address these challenges by leveraging machine learning algorithms, natural language processing, and pattern recognition to efficiently identify, correct, and eliminate inconsistencies, duplicates, and inaccuracies. This automation not only enhances data quality but also significantly reduces operational costs and improves decision-making capabilities, making AI-based solutions indispensable for enterprises aiming to achieve digital transformation and maintain a competitive edge.

Another crucial factor propelling market expansion is the growing emphasis on regulatory compliance and data governance. Sectors such as BFSI, healthcare, and government are subject to stringent data privacy and accuracy regulations, including GDPR, HIPAA, and CCPA. AI in data cleaning enables these industries to ensure data integrity, minimize compliance risks, and maintain audit trails, thereby safeguarding sensitive information and building stakeholder trust. Furthermore, the proliferation of cloud computing and advanced analytics platforms has made AI-powered data cleaning solutions more accessible, scalable, and cost-effective, further accelerating adoption across small, medium, and large enterprises.

The increasing integration of AI in data cleaning with other emerging technologies such as big data analytics, IoT, and robotic process automation (RPA) is unlocking new avenues for market growth. By embedding AI-driven data cleaning processes into end-to-end data pipelines, organizations can streamline data preparation, enable real-time analytics, and support advanced use cases like predictive modeling and personalized customer experiences. Strategic partnerships, investments in R&D, and the rise of specialized AI startups are also catalyzing innovation in this space, making AI in data cleaning a cornerstone of the broader data management ecosystem.

From a regional perspective, North America continues to lead the global AI in Data Cleaning market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The region’s dominance is attributed to the presence of major technology vendors, robust digital infrastructure, and high adoption rates of AI and cloud technologies. Meanwhile, Asia Pacific is witnessing the fastest growth, propelled by rapid digitalization, expanding IT sectors, and increasing investments in AI-driven solutions by enterprises in China, India, and Southeast Asia. Europe remains a significant market, supported by strict data protection regulations and a mature enterprise landscape. Latin America and the Middle East & Africa are emerging as promising markets, albeit at a relatively nascent stage, with growing awareness and gradual adoption of AI-powered data cleaning solutions.

Component Analysis

The AI in Data Cleaning market is broadly segmented by component into software and services, with each segment playing a pivotal role in shaping the industry’s evolution. The software segment dominates the market, driven by the rapid adoption of advanced AI-based data cleaning platforms that automate complex data preparation tasks. These platforms leverage sophisticated algorithms to detect anomalies, standardize formats, and enrich datasets, thereby enabling organizations to maintain high-quality data repositories. The increasing demand for self-service data cleaning software, which empowers business users to cleanse data without extensive IT intervention, is further fueling growth in this segment. Vendors are continuously enhancing their offerings with intuitive interfaces, integration capabilities, and support for diverse data sources to cater to a wide r
Tooth Growth data set clean
kaggle.com
zip
Updated Jan 12, 2024
Cite
SantosJGND (2024). Tooth Growth data set clean [Dataset]. https://www.kaggle.com/datasets/santosjgnd/tooth-growth-data-set-clean
Explore at:
Available download formats: zip (380 bytes)
Dataset updated
Jan 12, 2024
Authors
SantosJGND
Description
Dataset

This dataset was created by SantosJGND

Contents
Google Ads sales dataset
kaggle.com
Updated Jul 22, 2025
Cite
NayakGanesh007 (2025). Google Ads sales dataset [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/google-ads-sales-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 22, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
NayakGanesh007
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned)

📝 Dataset Overview

This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data, including typos, formatting issues, missing values, and inconsistencies.

It is ideal for practicing:

Data cleaning

Exploratory Data Analysis (EDA)

Marketing analytics

Campaign performance insights

Dashboard creation using tools like Excel, Python, or Power BI

📁 Columns in the Dataset

Column Name: Description
Ad_ID: Unique ID of the ad campaign
Campaign_Name: Name of the campaign (with typos and variations)
Clicks: Number of clicks received
Impressions: Number of ad impressions
Cost: Total cost of the ad (in ₹ or $ format with missing values)
Leads: Number of leads generated
Conversions: Number of actual conversions (signups, sales, etc.)
Conversion Rate: Calculated conversion rate (Conversions ÷ Clicks)
Sale_Amount: Revenue generated from the conversions
Ad_Date: Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY)
Location: City where the ad was served (includes spelling/case variations)
Device: Device type (Mobile, Desktop, Tablet with mixed casing)
Keyword: Keyword that triggered the ad (with typos)

⚠️ Data Quality Issues (Intentional)

This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:

Inconsistent date formats

Spelling errors (e.g., "analitics", "anaytics")

Duplicate rows

Mixed units and symbols in cost/revenue columns

Missing values

Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
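The issues listed above can be illustrated with a small cleaning pass. The following is a minimal sketch using only the Python standard library, not the dataset author's method. The sample rows, the `KNOWN_WORDS` vocabulary, and the third date layout are made up for illustration; the column names come from the column list in the description. It strips currency symbols without converting between ₹ and $, so mixed currencies would still need separate handling.

```python
import re
from datetime import date, datetime
from difflib import get_close_matches
from typing import Optional

# Assumed date layouts; the first two are the ones named in the description.
DATE_FORMATS = ("%Y/%m/%d", "%d-%m-%y", "%Y-%m-%d")
# Hypothetical vocabulary used to repair misspelled campaign words.
KNOWN_WORDS = ["data", "analytics", "course"]

# Made-up rows that imitate the described problems.
SAMPLE_ROWS = [
    {"Campaign_Name": "Data Analitics Course", "Device": "MOBILE",
     "Cost": "$215.90", "Ad_Date": "2024/11/16"},
    {"Campaign_Name": "data anaytics course", "Device": "mobile",
     "Cost": "$215.9", "Ad_Date": "16-11-24"},
    {"Campaign_Name": "Data Analytics Course", "Device": "Desktop",
     "Cost": "", "Ad_Date": "not a date"},
]


def parse_date(raw: str) -> Optional[date]:
    """Try each assumed layout in turn; return None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_cost(raw: str) -> Optional[float]:
    """Strip currency symbols and separators; unparseable values become None."""
    cleaned = re.sub(r"[^\d.]", "", raw or "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def fix_spelling(text: str) -> str:
    """Lowercase the text and swap each word for a close known word, if any."""
    fixed = []
    for word in text.lower().split():
        match = get_close_matches(word, KNOWN_WORDS, n=1, cutoff=0.8)
        fixed.append(match[0] if match else word)
    return " ".join(fixed)


def clean_rows(rows):
    """Normalize each row, then drop rows whose cleaned form was already seen."""
    seen, cleaned = set(), []
    for row in rows:
        record = (
            fix_spelling(row["Campaign_Name"]),
            row["Device"].strip().title(),  # "MOBILE" and "mobile" -> "Mobile"
            parse_cost(row["Cost"]),
            parse_date(row["Ad_Date"]),
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    for record in clean_rows(SAMPLE_ROWS):
        print(record)
```

Deduplicating after normalization, rather than before, is a deliberate choice here: two rows that differ only in casing or date layout are treated as the same record.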

🎯 Use Cases

Data cleaning exercises in Python (Pandas), R, Excel

Data preprocessing for machine learning

Campaign performance analysis

Conversion optimization tracking

Building dashboards in Power BI, Tableau, or Looker

💡 Sample Analysis Ideas

Track campaign cost vs. return (ROI)

Analyze click-through rates (CTR) by device or location

Clean and standardize campaign names and keywords

Investigate keyword performance vs. conversions
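As a rough illustration of the analysis ideas above, the sketch below aggregates already-cleaned rows by one column and derives three ratios. The conversion rate follows the definition in the column list (Conversions ÷ Clicks). The CTR (Clicks ÷ Impressions) and ROI ((Sale_Amount − Cost) ÷ Cost) formulas are common conventions assumed here, since the description does not define them, and the sample rows are invented.

```python
from collections import defaultdict

NUMERIC_FIELDS = ("Clicks", "Impressions", "Conversions", "Cost", "Sale_Amount")

# Made-up, already-cleaned rows (numeric fields parsed, casing normalized).
SAMPLE_ROWS = [
    {"Device": "Mobile", "Clicks": 100, "Impressions": 2000,
     "Conversions": 5, "Cost": 200.0, "Sale_Amount": 500.0},
    {"Device": "Mobile", "Clicks": 100, "Impressions": 2000,
     "Conversions": 15, "Cost": 200.0, "Sale_Amount": 300.0},
    {"Device": "Desktop", "Clicks": 50, "Impressions": 500,
     "Conversions": 5, "Cost": 100.0, "Sale_Amount": 100.0},
]


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summarize_by(rows, key):
    """Sum the numeric fields per group, then compute ratios on the totals."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        for field in NUMERIC_FIELDS:
            totals[row[key]][field] += row[field]

    summary = {}
    for group, t in totals.items():
        summary[group] = {
            "ctr": ratio(t["Clicks"], t["Impressions"]),
            "conversion_rate": ratio(t["Conversions"], t["Clicks"]),
            "roi": ratio(t["Sale_Amount"] - t["Cost"], t["Cost"]),
        }
    return summary


if __name__ == "__main__":
    for device, metrics in summarize_by(SAMPLE_ROWS, "Device").items():
        print(device, metrics)
```

The ratios are computed on group totals rather than averaged per row, so rows with few impressions do not dominate a group's CTR.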

🔖 Tags

Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
Data Insight: Google Analytics Capstone Project
kaggle.com
zip
Updated Mar 2, 2024
Cite
sinderpreet (2024). Data Insight: Google Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/sinderpreet/datainsight-google-analytics-capstone-project
Explore at:
Available download formats: zip (215409585 bytes)
Dataset updated
Mar 2, 2024
Authors
sinderpreet
License
https://cdla.io/permissive-1-0/
Description
Case study: How does a bike-share navigate speedy success?

Scenario:

As a data analyst on Cyclistic's marketing team, our focus is on enhancing annual memberships to drive the company's success. We aim to analyze the differing usage patterns between casual riders and annual members to craft a marketing strategy aimed at converting casual riders. Our recommendations, supported by data insights and professional visualizations, await Cyclistic executives' approval to proceed.

About the company

In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.

Project Overview:

This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.

Dataset Acknowledgment:

We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.

Objective:

The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.
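The casual-versus-member comparison described in the scenario could start with something like the sketch below, which averages ride length per rider type and skips rides with a non-positive duration as anomalies. The field names `started_at`, `ended_at`, and `member_casual` and the sample trips are assumptions for illustration; the description does not list the dataset's columns.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up trips; the last one ends before it starts and should be dropped.
SAMPLE_TRIPS = [
    {"started_at": "2024-01-05 08:00:00", "ended_at": "2024-01-05 08:10:00",
     "member_casual": "member"},
    {"started_at": "2024-01-05 17:30:00", "ended_at": "2024-01-05 17:50:00",
     "member_casual": "member"},
    {"started_at": "2024-01-06 13:00:00", "ended_at": "2024-01-06 13:45:00",
     "member_casual": "casual"},
    {"started_at": "2024-01-06 15:00:00", "ended_at": "2024-01-06 14:00:00",
     "member_casual": "casual"},
]


def mean_ride_minutes(trips):
    """Average ride length in minutes per rider type, ignoring invalid rides."""
    durations = defaultdict(list)
    for trip in trips:
        start = datetime.fromisoformat(trip["started_at"])
        end = datetime.fromisoformat(trip["ended_at"])
        minutes = (end - start).total_seconds() / 60
        if minutes <= 0:
            continue  # treat zero or negative durations as data errors
        durations[trip["member_casual"]].append(minutes)
    return {rider: mean(values) for rider, values in durations.items()}


if __name__ == "__main__":
    print(mean_ride_minutes(SAMPLE_TRIPS))
```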

Methodology:

Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.

Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.

Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset.

Visualization and Reporting: Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.

Findings and Recommendations:

Conclusion:

The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. The project showcases the power of data analytics in transforming data into actionable knowledge and underscores the importance of data-driven decision-making in today's competitive business landscape.

Acknowledgments:

Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.

STRATEGIES USED

Case Study Roadmap - ASK

● What is the problem you are trying to solve? ● How can your insights drive business decisions?

Key Tasks ● Identify the business task ● Consider key stakeholders

Deliverable ● A clear statement of the business task

Case Study Roadmap - PREPARE

● Where is your data located? ● Are there any problems with the data?

Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.

Deliverable ● A description of all data sources used

Case Study Roadmap - PROCESS

● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?

Key tasks ● Choose your tools. ● Document the cleaning process.

Deliverable ● Documentation of any cleaning or manipulation of data

Case Study Roadmap - ANALYZE

● Has your data been properly formatted? ● How will these insights help answer your business questions?

Key tasks ● Perform calculations ● Formatting

Deliverable ● A summary of analysis

Case Study Roadmap - SHARE

● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?

Key tasks ● Present your findings ● Create effective data viz.

Deliverable ● Supporting viz and key findings

**Case Study Roadmap - A...
Analysis scripts and supplementary files: Barriers to implementing clinical...
datasetcatalog.nlm.nih.gov
figshare.com
Updated May 7, 2019
Cite
Kamerman, Peter; Parker, Romy; Wadley, Antonia; Jackson, Kirsty; Devan, Dershnee; Reardon, Cameron; Cameron, Sarah; Madden, Victoria J (2019). Analysis scripts and supplementary files: Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000178984
Explore at:
Dataset updated
May 7, 2019
Authors
Kamerman, Peter; Parker, Romy; Wadley, Antonia; Jackson, Kirsty; Devan, Dershnee; Reardon, Cameron; Cameron, Sarah; Madden, Victoria J
Description
DESCRIPTION

This repository contains analysis scripts (with outputs), figures from the manuscript, and supplementary files for the HIV Pain (HIP) Intervention Study. All analysis scripts (and their outputs, in the /outputs subdirectory) are found in HIP-study.zip, while PDF copies of the analysis outputs that are cited in the manuscript as supplementary material are found in the relevant supplement-*.pdf file.

Note: Participant consent did not provide for the publication of their data, and hence neither the original nor cleaned data have been made available. However, we do not wish to bar access to the data unnecessarily and we will judge requests to access the data on a case-by-case basis. Examples of potential use cases include independent assessments of our analyses, and secondary data analyses. Please contact Peter Kamerman (peter.kamerman@gmail.com) or Dr Tory Madden (torymadden@gmail.com), or open an issue on the GitHub repo (https://github.com/kamermanpr/HIP-study/issues).

BIBLIOGRAPHIC INFORMATION

Repository citation

Kamerman PR, Madden VJ, Parker R, Devan D, Cameron S, Jackson K, Reardon C, Wadley A. Analysis scripts and supplementary files: Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. DOI: 10.6084/m9.figshare.7654637.

Manuscript citation

Parker R, Madden VJ, Devan D, Cameron S, Jackson K, Kamerman P, Reardon C, Wadley A. Barriers to implementing clinical trials on non-pharmacological treatments in developing countries – lessons learnt from addressing pain in HIV. Pain Reports [submitted 2019-01-31]

Manuscript abstract

Introduction: Pain affects over half of people living with HIV/AIDS (LWHA) and pharmacological treatment has limited efficacy. Preliminary evidence supports non-pharmacological interventions. We previously piloted a multimodal intervention in amaXhosa women LWHA and chronic pain in South Africa with improvements seen in all outcomes, in both intervention and control groups.

Methods: A multicentre, single-blind randomised controlled trial with 160 participants recruited was conducted to determine whether the multimodal peer-led intervention reduced pain in different populations of both male and female South Africans LWHA. Participants were followed up at Weeks 4, 8, 12, 24 and 48 to evaluate effects on the primary outcome of pain, and on depression, self-efficacy and health-related quality of life.

Results: We were unable to assess the efficacy of the intervention due to a 58% loss to follow up (LTFU). Secondary analysis of the LTFU found that sociocultural factors were not predictive of LTFU. Depression, however, did associate with LTFU, with greater severity of depressive symptoms predicting LTFU at week 8 (p=0.01).

Discussion: We were unable to evaluate the effectiveness of the intervention due to the high LTFU and the risk of retention bias. The different sociocultural context in South Africa may warrant a different approach to interventions for pain in HIV compared to resource-rich countries, including a concurrent strategy to address barriers to health care service delivery. We suggest that assessment of pain and depression need to occur simultaneously in those with pain in HIV. We suggest investigation of the effect of social inclusion on pain and depression.

USING DOCKER TO RUN THE HIP-STUDY ANALYSIS SCRIPTS

These instructions are for running the analysis on your local machine.

You need to have Docker installed on your computer. To do so, go to docker.com (https://www.docker.com/community-edition#/download) and follow the instructions for downloading and installing Docker for your operating system. Once Docker has been installed, follow the steps below, noting that Docker commands are entered in a terminal window (Linux and OSX/macOS) or command prompt window (Windows). Windows users also may wish to install GNU Make (http://gnuwin32.sourceforge.net/downlinks/make.php) (required for the make method of running the scripts) and Git (https://gitforwindows.org/) version control software (not essential).

Download the latest image

Enter: docker pull kamermanpr/docker-hip-study:v2.0.0

Run the container

Enter: docker run -d -p 8787:8787 -v <PATH>:/home/rstudio --name threshold -e USER=hip -e PASSWORD=study kamermanpr/docker-hip-study:v2.0.0

Where <PATH> refers to the path to the HIP-study directory on your computer, which you either cloned from GitHub (https://github.com/kamermanpr/HIP-study.git), git clone https://github.com/kamermanpr/HIP-study, or downloaded and extracted from figshare (https://doi.org/10.6084/m9.figshare.7654637).

Log in to RStudio Server

- Open a web browser window and navigate to: localhost:8787
- Use the following login credentials:
  - Username: hip
  - Password: study

Prepare the HIP-study directory

The HIP-study directory comes with the outputs for all the analysis scripts in the /outputs directory (html and md formats). However, should you wish to run the scripts yourself, there are several preparatory steps that are required:

1. Acquire the data. The data required to run the scripts have not been included in the repo because participants in the studies did not consent to public release of their data. However, the data are available on request from Peter Kamerman (peter.kamerman@gmail.com). Once the data have been obtained, the files should be copied into a subdirectory named /data-original.
2. Clean the /outputs directory by entering make clean in the Terminal tab in RStudio.

Run the HIP-study analysis scripts

To run all the scripts (including the data cleaning scripts), enter make all in the Terminal tab in RStudio.

To run individual RMarkdown scripts (*.Rmd files)

1. Generate the cleaned data using one of the following methods:
  - Enter make data-cleaned/demographics.rds in the Terminal tab in RStudio.
  - Enter source('clean-data-script.R') in the Console tab in RStudio.
  - Open the clean-data-script.R script through the File tab in RStudio, and then click the 'Source' button on the right of the Script console in RStudio for each script.
2. Run the individual script by:
  - Entering make outputs/.html in the Terminal tab in RStudio, OR
  - Opening the relevant *.Rmd file through the File tab in RStudio, and then clicking the 'knit' button on the left of the Script console in RStudio.

Shutting down

Once done, log out of RStudio Server and enter the following into a terminal to stop the Docker container: docker stop threshold. If you then want to remove the container, enter: docker rm threshold. If you also want to remove the Docker image you downloaded, enter: docker rmi kamermanpr/docker-hip-study:v2.0.0

Cite

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:

167 scholarly articles cite this dataset (View in Google Scholar)

Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.5683/SP3/ZCN177

Dataset updated

Jul 13, 2023

Dataset provided by

Borealis

Authors

Rong Luo

License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.
