Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean AUC of 10x10-fold CV for different models and imputation methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model performance results for random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking, using EMBARC data as the training set and APAT data as the test set, after multiple imputation repeated 10 times.
Life-course multivariable linear regression between allostatic load and parental interest, using multiply imputed data, for women (n = 4 056).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imputation of well log data is a common task in the field. However, a quick review of the literature reveals a lack of standardization in how methods for this problem are evaluated. The goal of the benchmark is to introduce a standard evaluation protocol for any imputation method for well log data.
In the proposed benchmark, three public datasets are used:
Here you can download all three datasets, already preprocessed for use with our implementation (found here).
There are six files for each fold partition of each dataset, listed below; a short loading sketch follows the list.
- datasetname_fold_k_well_log_metadata_train.json: JSON file with general information about the slices in the training partition of fold k. Contains the total number of slices and the number of slices per well.
- datasetname_fold_k_well_log_metadata_val.json: JSON file with general information about the slices in the validation partition of fold k. Contains the total number of slices and the number of slices per well.
- datasetname_fold_k_well_log_slices_train.npy: .npy (NumPy) file, ready to be loaded, with the already-processed training slices of fold k. When loaded, it should have shape (total_slices, 256, number_of_logs).
- datasetname_fold_k_well_log_slices_val.npy: .npy (NumPy) file, ready to be loaded, with the already-processed validation slices of fold k.
- datasetname_fold_k_well_log_slices_meta_train.json: JSON file with per-slice information for all slices in the training partition of fold k. Seven data points are provided for each slice; the last four are discarded (they would contain other information that was not used). The first three are, in order: the origin well name, the starting position of the slice in that well, and the end position of the slice in that well.
- datasetname_fold_k_well_log_slices_meta_val.json: JSON file with per-slice information for all slices in the validation partition of fold k.
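A minimal loading sketch is shown below, assuming NumPy and the standard json module; "datasetname" and the fold index are placeholders, and the layout of the metadata records is an assumption based on the description above.

```python
import json
import numpy as np

# Placeholders: substitute the actual dataset name and fold index.
dataset, k = "datasetname", 0

# Training slices: shape (total_slices, 256, number_of_logs).
slices = np.load(f"{dataset}_fold_{k}_well_log_slices_train.npy")

# Per-slice metadata: the first three entries of each record are assumed to be the
# origin well name, the start position, and the end position of the slice in that well.
with open(f"{dataset}_fold_{k}_well_log_slices_meta_train.json") as f:
    slices_meta = json.load(f)

print(slices.shape)
print(slices_meta[0][:3])
```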
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The historical method used to calculate the statistics presented in this dataset takes into account all the minutes of delay caused by 'major incidents' (internally known as 'relazen') on the rail network, as reported to the Railway Accident and Incident Investigation Body (OEAIF/OOIS) and the Railway Safety and Interoperability Service (NSA Rail Belgium) under the Royal Decree of 16 January 2007 laying down certain rules relating to investigations into railway accidents and incidents.
The criteria defining 'major incidents' (internally known as 'relazen') are as follows:
- 1 passenger train delayed by an incident for 20 minutes or more
- Several passenger trains delayed by an incident for at least 40 minutes
- Incidents leading to the (partial or total) cancellation of trains
- Incidents with an impact on operational safety
There is no unequivocal relationship between the minutes of delay in 'major incidents' and the punctuality rate because:
- The minutes included in 'major incidents' do not necessarily have an actual impact on punctuality (a train can make up its delay as it goes along).
- Some trains arrive at their terminus more than 6 minutes late (and therefore have an actual impact on punctuality) but are not included in the 'major incidents'.
In order to provide an exhaustive overview of the causes and responsibilities for delays, a new dataset has been made available: Monthly causes of loss of punctuality.
The data presented in this new dataset is as follows: for each train delayed by 6 minutes or more on arrival at a tracking point*, an analysis is made of the cause of all the minutes of delay along the route, and a proportional score is awarded for each responsibility identified.
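As an illustration of the proportional scoring described above, the sketch below assumes the simplest reading: each identified responsibility receives a share equal to its minutes of delay divided by the train's total minutes of delay. This is an assumption for illustration, not the official calculation.

```python
# Hypothetical minutes of delay attributed to each cause along one train's route.
delay_minutes_by_cause = {"infrastructure": 4, "rolling stock": 2, "third party": 6}

total_delay = sum(delay_minutes_by_cause.values())

# Proportional score per responsibility (the shares sum to 1).
scores = {cause: minutes / total_delay for cause, minutes in delay_minutes_by_cause.items()}

for cause, score in scores.items():
    print(f"{cause}: {score:.2f}")  # e.g. infrastructure: 0.33
```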
More info in the new dataset's description
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This observational study aimed to analyze external training load in highly trained female football players, comparing starters and non-starters across various cycle lengths and training days. Method: External training load (duration, total distance (TD), high-speed running distance (HSRD), sprint distance (SpD), and acceleration and deceleration distance (AccDecdist)) from 100 female football players (22.3 ± 3.7 years of age) in the Norwegian premier division was collected over two seasons using STATSports APEX, resulting in a final dataset of 10,498 observations after multiple imputation of missing data. Microcycle length was categorized based on the number of days between matches (2 to 7 days apart), while training days were categorized relative to match day (MD, MD+1, MD+2, MD-5, MD-4, MD-3, MD-2, MD-1). Linear mixed modeling was used to assess differences between days and between starters and non-starters. Results: In longer cycles (5–7 days between matches), the middle of the week (usually MD-4 or MD-3) consistently exhibited the highest external training load (~21–79% of MD TD, MD HSRD, MD SpD, and MD AccDecdist); however, with the exception of duration (~108–120% of MD duration), it remained lower than on MD. External training load was lowest on MD+2 and MD-1 (~1–37% of MD TD, MD HSRD, MD SpD, and MD AccDecdist, and ~73–88% of MD peak speed). Non-starters displayed higher loads (~137–400% of starter TD, HSRD, SpD, and AccDecdist) on MD+2 in cycles with 3 to 7 days between matches, with non-significant differences (~76–116%) on other training days. Conclusion: Loading patterns resemble a pyramid or skewed pyramid during longer cycles (5–7 days), with higher training loads towards the middle compared with the start and end of the cycle. Non-starters displayed slightly higher loads on MD+2, with no significant load differentiation from MD-5 onwards.
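For readers who want to reproduce this kind of analysis on the dataset, a minimal linear mixed model sketch is shown below using statsmodels; the file name and the column names (player_id, training_day, starter, total_distance) are hypothetical, not taken from the dataset itself.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; a random intercept per player accounts for
# repeated observations of the same player.
df = pd.read_csv("external_training_load.csv")

model = smf.mixedlm("total_distance ~ C(training_day) * C(starter)",
                    data=df, groups=df["player_id"])
result = model.fit()
print(result.summary())
```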
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple-year training vs. single-year training (AUC).
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023, Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022
Latest edition information
For the second edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023, Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.
Usage notes:
- phenotype: phenotype and experimental design (phen mask dryad.txt)
- SVD genotype imputation: marker matrix for SVD imputation method (SVDimp dryad.txt)
- Mean genotype imputation: marker matrix for mean imputation method (MeanImp dryad.txt)
- KNN genotype imputation: marker matrix for KNN imputation method (KNNimp dryad.txt)
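As a rough illustration of how predictive accuracy (PA) is typically computed in a GS workflow, here is a minimal sketch using scikit-learn's Ridge as a stand-in for rrBLUP; the genotype matrix and heights are simulated placeholders, not the Dryad data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Simulated placeholders: 769 trees x 5000 SNPs coded 0/1/2, and a phenotype
# driven by the first 50 markers plus noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(769, 5000)).astype(float)
y = X[:, :50] @ rng.normal(size=50) + rng.normal(size=769)

pa = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # PA: correlation between predicted and observed phenotypes in the validation fold.
    pa.append(np.corrcoef(model.predict(X[val_idx]), y[val_idx])[0, 1])

print(f"mean predictive accuracy: {np.mean(pa):.2f}")
```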
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
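As an illustration, a minimal pandas sketch of these steps is shown below; the file name, the raw 'Age' column, and the exact bin edges are assumptions, not part of the published dataset.

```python
import pandas as pd

# Hypothetical raw file name for the messy version of the dataset.
df = pd.read_csv("messy_employment_india.csv")

# Inconsistent formatting: unify column names and string spacing/capitalization.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# Incorrect data types: salary from string/object to float.
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing values: drop rows missing critical fields, impute the rest.
df = df.dropna(subset=["Employment Status"])
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())

# Duplicate records: remove exact duplicates.
df = df.drop_duplicates()

# Categorization: bin a numeric 'Age' column (assumed to exist in the raw file) into groups.
df["Age Group"] = pd.cut(df["Age"], bins=[0, 17, 25, 35, 50, 65, 120],
                         labels=["<18", "18-25", "26-35", "36-50", "51-65", "65+"])
```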
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ratio of training time of four models relative to RF.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of Landsat 7 ETM+ surface reflectance products acquired per year (January–December) per tile (cf. Fig 2 for location of tiles).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
🚗 Cleaned Drivers License Dataset
"This dataset is designed for those who believe in building smart, ethical AI models that can assist in real-world decision-making like licensing, risk assessment, and more."
📂 Dataset Overview
This dataset is a cleaned and preprocessed version of a drivers license dataset containing details like:
Age Group
Gender
Reaction Time
Driving Skill Level
Training Received
License Qualification Status
And more, for a total of 20 columns that you can use to train an ML model with this dataset.
The raw dataset contained missing values, categorical anomalies, and inconsistencies, which have been fixed to make this dataset ML-ready.
💡 Key Highlights
✅ Cleaned missing values with intelligent imputation
🧠 Encoded categorical columns with appropriate techniques (OneHot, Ordinal); see the sketch after this list
🔍 Removed outliers using statistical logic
🔧 Feature engineered to preserve semantic meaning
💾 Ready-to-use for classification tasks (e.g., Predicting who qualifies for a license)
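As an illustration of the imputation and encoding steps listed above, here is a minimal scikit-learn sketch; the file name and the split between nominal and ordinal columns are assumptions based on the column descriptions below.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical file name for the cleaned dataset.
df = pd.read_csv("drivers_license_cleaned.csv")

nominal_cols = ["Gender", "Race"]                          # no natural order -> one-hot
ordinal_cols = ["Reactions", "Training", "Driving Skill"]  # ordered categories -> ordinal

preprocess = ColumnTransformer([
    ("nominal", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), nominal_cols),
    ("ordinal", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OrdinalEncoder())]), ordinal_cols),
])

X = preprocess.fit_transform(df[nominal_cols + ordinal_cols])
y = df["Qualified"]  # target: whether the person qualified for a license
```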
📊 Columns Description
| Column Name | Description |
| --- | --- |
| Gender | Gender of the individual (Male, Female) |
| Age Group | Age segment (Teen, Adult, Senior) |
| Race | Ethnicity of the driver |
| Reactions | Reaction speed, categorized (Fast, Average, Slow) |
| Training | Training received (Basic, Advanced) |
| Driving Skill | Skill level (Expert, Beginner, etc.) |
| Qualified | Whether the person qualified for a license (Yes, No) |
🤖 Perfect For
📚 Machine Learning (Classification)
📊 Exploratory Data Analysis (EDA)
📉 Feature Engineering Practice
🧪 Model Evaluation & Experimentation
🚥 AI for Transport & Safety Projects
🏷️ Tags
📌 Author Notes
This dataset is part of a data cleaning and feature engineering project. One column is intentionally left unprocessed for developers to test their own pipeline or transformation strategies 😎.
🔗 Citation
If you use this dataset in your projects, notebooks, or blog posts — feel free to tag me or credit the original dataset and this cleaned version.
If you use this dataset in your work, please cite it as:
Divyanshu_Coder, 2025. Cleaned Driver License Dataset. Kaggle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We share the complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth product (MAIAC AOD) (https://lpdaac.usgs.gov/products/mcd19a2v006/), which has a similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large AOD image covering the entire area of mainland China. Due to clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage of each AOD image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product. We used the sophisticated method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining the complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA-2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of the interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 (range 0.75 to 0.97; average RMSE 0.075, range 0.026 to 0.32) and an average test R^2 of 0.90 (range 0.75 to 0.97; average RMSE 0.075, range 0.026 to 0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation against ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, which further illustrates the reliability of the method.
This database contains four datasets:
- The daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018; the archived resources contain 1461 images stored in 1461 files, and 3 summary Excel files.
- The table "CHN_AOD_INFO.xlsx", describing the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum predicted AOD.
- The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx", describing the statistics of the performance metrics for the interpolation of the high-resolution meteorological dataset.
- The table "Evaluation_Using_AERONET_AOD.xlsx", showing the AERONET evaluation results, including R^2, RMSE, and the monitoring information used in this study.
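As a small illustration of the R^2 and RMSE metrics reported above, the sketch below computes them for a pair of hypothetical observed/predicted AOD arrays (the values are made up, not drawn from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical station (observed) and imputed (predicted) AOD values.
observed = np.array([0.21, 0.35, 0.12, 0.50, 0.28])
predicted = np.array([0.19, 0.33, 0.15, 0.46, 0.30])

r2 = r2_score(observed, predicted)
rmse = np.sqrt(mean_squared_error(observed, predicted))
print(f"R^2 = {r2:.2f}, RMSE = {rmse:.3f}")
```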
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Within-year classification performance (AUC) for all the years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model hyperparameters, their ranges and optimal values obtained.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ratio (%) of crop loss parcels for three groups with different areas and four years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Within-year AUC of three groups with different field parcel sizes.