Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean AUC of 10x10-fold CV for different models and imputation methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model performance results for random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking, using EMBARC data as the training set and APAT data as the test set, after multiple imputation repeated 10 times.
Life-course multivariable linear regression between allostatic load and parental interest, using multiply imputed data, for women (n = 4 056).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imputation of well log data is a common task in the field. However, a quick review of the literature reveals a lack of standardization in how methods for this problem are evaluated. The goal of the benchmark is to introduce a standard evaluation protocol for any imputation method for well log data.
In the proposed benchmark, three public datasets are used:
Here you can download all three datasets, already preprocessed for use with our implementation (found here).
There are six files for each fold partition of each dataset, listed below; a short loading sketch follows the list.
- datasetname_fold_k_well_log_metadata_train.json: JSON file with general information about the slices in the training partition of fold k. Contains the total number of slices and the number of slices per well.
- datasetname_fold_k_well_log_metadata_val.json: JSON file with general information about the slices in the validation partition of fold k. Contains the total number of slices and the number of slices per well.
- datasetname_fold_k_well_log_slices_train.npy: .npy (NumPy) file, ready to be loaded, with the already-processed training slices of fold k. When loaded, it should have shape (total_slices, 256, number_of_logs).
- datasetname_fold_k_well_log_slices_val.npy: .npy (NumPy) file, ready to be loaded, with the already-processed validation slices of fold k.
- datasetname_fold_k_well_log_slices_meta_train.json: JSON file with per-slice information for all slices in the training partition of fold k. Seven data points are provided for each slice; the last four are discarded (they would contain other information that was not used). The first three are, in order: the origin well name, the starting position of the slice in that well, and the end position of the slice in that well.
- datasetname_fold_k_well_log_slices_meta_val.json: JSON file with per-slice information for all slices in the validation partition of fold k.
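A minimal loading sketch is shown below, assuming NumPy and the standard json module; "datasetname" and the fold index are placeholders, and the layout of the metadata records is an assumption based on the description above.

```python
import json
import numpy as np

# Placeholders: substitute the actual dataset name and fold index.
dataset, k = "datasetname", 0

# Training slices: shape (total_slices, 256, number_of_logs).
slices = np.load(f"{dataset}_fold_{k}_well_log_slices_train.npy")

# Per-slice metadata: the first three entries of each record are assumed to be the
# origin well name, the start position, and the end position of the slice in that well.
with open(f"{dataset}_fold_{k}_well_log_slices_meta_train.json") as f:
    slices_meta = json.load(f)

print(slices.shape)
print(slices_meta[0][:3])
```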
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The historical method used to calculate the statistics presented in this dataset takes into account all the minutes of delay caused by 'major incidents' (internally known as 'relazen') on the rail network, as reported to the Railway Accident and Incident Investigation Body (OEAIF/OOIS) and the Railway Safety and Interoperability Service (NSA Rail Belgium) under the Royal Decree of 16 January 2007 laying down certain rules relating to investigations into railway accidents and incidents.
The criteria defining 'major incidents' (internally known as 'relazen') are as follows:
- 1 passenger train delayed by an incident for 20 minutes or more
- Several passenger trains delayed by an incident for at least 40 minutes
- Incidents leading to the (partial or total) cancellation of trains
- Incidents with an impact on operational safety
There is no unequivocal relationship between the minutes of delay in 'major incidents' and the punctuality rate because:
- The minutes included in 'major incidents' do not necessarily have an actual impact on punctuality (a train can make up its delay as it goes along).
- Some trains arrive at their terminus more than 6 minutes late (and therefore have an actual impact on punctuality) but are not included in the 'major incidents'.
In order to provide an exhaustive overview of the causes and responsibilities for delays, a new dataset has been made available: Monthly causes of loss of punctuality.
The data presented in this new dataset is as follows: for each train delayed by 6 minutes or more on arrival at a tracking point*, an analysis is made of the cause of all the minutes of delay along the route, and a proportional score is awarded for each responsibility identified.
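As an illustration of the proportional scoring described above, the sketch below assumes the simplest reading: each identified responsibility receives a share equal to its minutes of delay divided by the train's total minutes of delay. This is an assumption for illustration, not the official calculation.

```python
# Hypothetical minutes of delay attributed to each cause along one train's route.
delay_minutes_by_cause = {"infrastructure": 4, "rolling stock": 2, "third party": 6}

total_delay = sum(delay_minutes_by_cause.values())

# Proportional score per responsibility (the shares sum to 1).
scores = {cause: minutes / total_delay for cause, minutes in delay_minutes_by_cause.items()}

for cause, score in scores.items():
    print(f"{cause}: {score:.2f}")  # e.g. infrastructure: 0.33
```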
More info in the new dataset's description
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This observational study aimed to analyze external training load in highly trained female football players, comparing starters and non-starters across various cycle lengths and training days. Method: External training load (duration, total distance (TD), high-speed running distance (HSRD), sprint distance (SpD), and acceleration and deceleration distance (AccDecdist)) from 100 female football players (22.3 ± 3.7 years of age) in the Norwegian premier division was collected over two seasons using STATSports APEX, resulting in a final dataset of 10,498 observations after multiple imputation of missing data. Microcycle length was categorized based on the number of days between matches (2 to 7 days apart), while training days were categorized relative to match day (MD, MD+1, MD+2, MD-5, MD-4, MD-3, MD-2, MD-1). Linear mixed modeling was used to assess differences between days and between starters and non-starters. Results: In longer cycles (5–7 days between matches), the middle of the week (usually MD-4 or MD-3) consistently exhibited the highest external training load (~21–79% of MD TD, MD HSRD, MD SpD, and MD AccDecdist); however, with the exception of duration (~108–120% of MD duration), it remained lower than on MD. External training load was lowest on MD+2 and MD-1 (~1–37% of MD TD, MD HSRD, MD SpD, and MD AccDecdist, and ~73–88% of MD peak speed). Non-starters displayed higher loads (~137–400% of starter TD, HSRD, SpD, and AccDecdist) on MD+2 in cycles with 3 to 7 days between matches, with non-significant differences (~76–116%) on other training days. Conclusion: Loading patterns resemble a pyramid or skewed pyramid during longer cycles (5–7 days), with higher training loads towards the middle compared with the start and end of the cycle. Non-starters displayed slightly higher loads on MD+2, with no significant load differentiation from MD-5 onwards.
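For readers who want to reproduce this kind of analysis on the dataset, a minimal linear mixed model sketch is shown below using statsmodels; the file name and the column names (player_id, training_day, starter, total_distance) are hypothetical, not taken from the dataset itself.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; a random intercept per player accounts for
# repeated observations of the same player.
df = pd.read_csv("external_training_load.csv")

model = smf.mixedlm("total_distance ~ C(training_day) * C(starter)",
                    data=df, groups=df["player_id"])
result = model.fit()
print(result.summary())
```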
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple-year training vs. single-year training (AUC).
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023, Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022
Latest edition information
For the second edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023, Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.
Usage notes:
- phenotype: phenotype and experimental design (phen mask dryad.txt)
- SVD genotype imputation: marker matrix for SVD imputation method (SVDimp dryad.txt)
- Mean genotype imputation: marker matrix for mean imputation method (MeanImp dryad.txt)
- KNN genotype imputation: marker matrix for KNN imputation method (KNNimp dryad.txt)
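As a rough illustration of how predictive accuracy (PA) is typically computed in a GS workflow, here is a minimal sketch using scikit-learn's Ridge as a stand-in for rrBLUP; the genotype matrix and heights are simulated placeholders, not the Dryad data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Simulated placeholders: 769 trees x 5000 SNPs coded 0/1/2, and a phenotype
# driven by the first 50 markers plus noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(769, 5000)).astype(float)
y = X[:, :50] @ rng.normal(size=50) + rng.normal(size=769)

pa = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # PA: correlation between predicted and observed phenotypes in the validation fold.
    pa.append(np.corrcoef(model.predict(X[val_idx]), y[val_idx])[0, 1])

print(f"mean predictive accuracy: {np.mean(pa):.2f}")
```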
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
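As an illustration, a minimal pandas sketch of these steps is shown below; the file name, the raw 'Age' column, and the exact bin edges are assumptions, not part of the published dataset.

```python
import pandas as pd

# Hypothetical raw file name for the messy version of the dataset.
df = pd.read_csv("messy_employment_india.csv")

# Inconsistent formatting: unify column names and string spacing/capitalization.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# Incorrect data types: salary from string/object to float.
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing values: drop rows missing critical fields, impute the rest.
df = df.dropna(subset=["Employment Status"])
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())

# Duplicate records: remove exact duplicates.
df = df.drop_duplicates()

# Categorization: bin a numeric 'Age' column (assumed to exist in the raw file) into groups.
df["Age Group"] = pd.cut(df["Age"], bins=[0, 17, 25, 35, 50, 65, 120],
                         labels=["<18", "18-25", "26-35", "36-50", "51-65", "65+"])
```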
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ratio of training time of four models relative to RF.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of Landsat 7 ETM+ surface reflectance products acquired per year (January–December) per tile (cf. Fig 2 for location of tiles).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
🚗 Cleaned Drivers License Dataset
"This dataset is designed for those who believe in building smart, ethical AI models that can assist in real-world decision-making like licensing, risk assessment, and more."
📂 Dataset Overview
This dataset is a cleaned and preprocessed version of a drivers license dataset containing details like:
Age Group
Gender
Reaction Time
Driving Skill Level
Training Received
License Qualification Status
And more, for a total of 20 columns that you can use to train an ML model with this dataset.
The raw dataset contained missing values, categorical anomalies, and inconsistencies, which have been fixed to make this dataset ML-ready.
💡 Key Highlights
✅ Cleaned missing values with intelligent imputation
🧠 Encoded categorical columns with appropriate techniques (OneHot, Ordinal); see the sketch after this list
🔍 Removed outliers using statistical logic
🔧 Feature engineered to preserve semantic meaning
💾 Ready-to-use for classification tasks (e.g., Predicting who qualifies for a license)
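As an illustration of the imputation and encoding steps listed above, here is a minimal scikit-learn sketch; the file name and the split between nominal and ordinal columns are assumptions based on the column descriptions below.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical file name for the cleaned dataset.
df = pd.read_csv("drivers_license_cleaned.csv")

nominal_cols = ["Gender", "Race"]                          # no natural order -> one-hot
ordinal_cols = ["Reactions", "Training", "Driving Skill"]  # ordered categories -> ordinal

preprocess = ColumnTransformer([
    ("nominal", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), nominal_cols),
    ("ordinal", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OrdinalEncoder())]), ordinal_cols),
])

X = preprocess.fit_transform(df[nominal_cols + ordinal_cols])
y = df["Qualified"]  # target: whether the person qualified for a license
```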
📊 Columns Description
| Column Name | Description |
| --- | --- |
| Gender | Gender of the individual (Male, Female) |
| Age Group | Age segment (Teen, Adult, Senior) |
| Race | Ethnicity of the driver |
| Reactions | Reaction speed, categorized (Fast, Average, Slow) |
| Training | Training received (Basic, Advanced) |
| Driving Skill | Skill level (Expert, Beginner, etc.) |
| Qualified | Whether the person qualified for a license (Yes, No) |
🤖 Perfect For
📚 Machine Learning (Classification)
📊 Exploratory Data Analysis (EDA)
📉 Feature Engineering Practice
🧪 Model Evaluation & Experimentation
🚥 AI for Transport & Safety Projects
🏷️ Tags
📌 Author Notes
This dataset is part of a data cleaning and feature engineering project. One column is intentionally left unprocessed for developers to test their own pipeline or transformation strategies 😎.
🔗 Citation
If you use this dataset in your projects, notebooks, or blog posts — feel free to tag me or credit the original dataset and this cleaned version.
If you use this dataset in your work, please cite it as:
Divyanshu_Coder, 2025. Cleaned Driver License Dataset. Kaggle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We share the complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth product (MAIAC AOD) (https://lpdaac.usgs.gov/products/mcd19a2v006/), which has a similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large AOD image covering the entire area of mainland China. Due to clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage of each AOD image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product. We used the sophisticated method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining the complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA-2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of the interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 (range 0.75 to 0.97; average RMSE 0.075, range 0.026 to 0.32) and an average test R^2 of 0.90 (range 0.75 to 0.97; average RMSE 0.075, range 0.026 to 0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation against ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, which further illustrates the reliability of the method.
This database contains four datasets:
- The daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018; the archived resources contain 1461 images stored in 1461 files, and 3 summary Excel files.
- The table "CHN_AOD_INFO.xlsx", describing the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum predicted AOD.
- The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx", describing the statistics of the performance metrics for the interpolation of the high-resolution meteorological dataset.
- The table "Evaluation_Using_AERONET_AOD.xlsx", showing the AERONET evaluation results, including R^2, RMSE, and the monitoring information used in this study.
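As a small illustration of the R^2 and RMSE metrics reported above, the sketch below computes them for a pair of hypothetical observed/predicted AOD arrays (the values are made up, not drawn from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical station (observed) and imputed (predicted) AOD values.
observed = np.array([0.21, 0.35, 0.12, 0.50, 0.28])
predicted = np.array([0.19, 0.33, 0.15, 0.46, 0.30])

r2 = r2_score(observed, predicted)
rmse = np.sqrt(mean_squared_error(observed, predicted))
print(f"R^2 = {r2:.2f}, RMSE = {rmse:.3f}")
```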
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Within-year classification performance (AUC) for all the years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model hyperparameters, their ranges and optimal values obtained.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ratio (%) of crop loss parcels for three groups with different areas and four years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Within-year AUC of three groups with different field parcel sizes.