100+ datasets found

Water-quality data imputation with a high percentage of missing values: a...
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 8, 2021
Cite
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
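
As an illustration of the two ideas named above, the following is a minimal sketch of inverse distance weighting and of the Nash-Sutcliffe efficiency (NSE). It is not the authors' implementation: the function names, the `power` parameter, and the assumption that weights come from distances between monitoring stations are all hypothetical choices made for this example.

```python
from typing import Optional, Sequence


def idw_impute(values: Sequence[Optional[float]],
               distances: Sequence[float],
               power: float = 2.0) -> Optional[float]:
    """Estimate a missing reading from readings at other stations.

    values[i] is the reading at neighbouring station i (None if it is
    also missing); distances[i] is the assumed distance to that station.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, distance in zip(values, distances):
        if value is None:
            continue  # a neighbour with no reading contributes nothing
        if distance == 0:
            return value  # co-located station: use its reading directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations.

    Undefined (division by zero) when all observations are equal.
    """
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread
```

An NSE of 1 means the imputed values match the observations exactly, while 0 means they are no better than always predicting the observed mean.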

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
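
The naming convention above could be split into its parts with a regular expression. This is a sketch only: it assumes the square brackets and parentheses appear literally in each column label, and the example label in the usage note (station code `ST1`, unit `mg/L`) is made up for illustration rather than taken from the dataset.

```python
import re
from typing import Dict, Optional

# Assumed layout: [STATION] FULL NAME (SHORT NAME) [UNIT]
_LABEL = re.compile(
    r"^\[(?P<station>[^\]]+)\]\s*"
    r"(?P<full_name>.+?)\s*"
    r"\((?P<short_name>[^)]+)\)\s*"
    r"\[(?P<unit>[^\]]+)\]$"
)


def parse_variable_label(label: str) -> Optional[Dict[str, str]]:
    """Return the four parts of a label, or None if it does not match."""
    match = _LABEL.match(label.strip())
    return match.groupdict() if match else None
```

For a hypothetical label such as `[ST1] Dissolved oxygen (DO) [mg/L]`, the function returns the station, full name, short name, and unit as separate fields.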

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...
search.dataone.org
dataverse.harvard.edu
Updated Nov 23, 2023
Cite
Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
Unique identifier
https://doi.org/10.7910/DVN/UPL4TT
Dataset updated
Nov 23, 2023
Dataset provided by
Harvard Dataverse
Authors
Lall, Ranjit; Robinson, Thomas
Description
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...
plos.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jun 1, 2023
Cite
Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0164464
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Nikolaj Bak; Lars K. Hansen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.
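
The procedure described above can be sketched roughly as follows. This is not the authors' method in detail: column-mean imputation, root-mean-square error, and the inverse-distance similarity weight `1 / (1 + distance)` are all assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def column_means(complete: List[Sequence[float]]) -> List[float]:
    """Mean of each column over the complete cases."""
    return [sum(col) / len(complete) for col in zip(*complete)]


def estimated_error(case: Row, complete: List[Sequence[float]]) -> float:
    """Estimate the imputation error for one incomplete case.

    The case's missing pattern is simulated in every complete case,
    the "true error" of mean imputation is computed there, and the
    errors are averaged with weights based on similarity to the case.
    """
    means = column_means(complete)
    missing = [j for j, v in enumerate(case) if v is None]
    observed = [j for j, v in enumerate(case) if v is not None]
    if not missing:
        return 0.0
    weighted_sum = 0.0
    weight_total = 0.0
    for ref in complete:
        true_error = math.sqrt(
            sum((ref[j] - means[j]) ** 2 for j in missing) / len(missing)
        )
        distance = math.sqrt(sum((ref[j] - case[j]) ** 2 for j in observed))
        weight = 1.0 / (1.0 + distance)  # assumed similarity measure
        weighted_sum += weight * true_error
        weight_total += weight
    return weighted_sum / weight_total


def impute_or_reject(case: Row, complete: List[Sequence[float]],
                     threshold: float) -> Optional[List[float]]:
    """Impute with column means, or return None (reject) if the
    estimated error exceeds the user-chosen threshold."""
    if estimated_error(case, complete) > threshold:
        return None
    means = column_means(complete)
    return [means[j] if v is None else v for j, v in enumerate(case)]
```

The threshold is left to the caller, in line with the point above that no universal value can be set.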
Data from: Benchmarking imputation methods for categorical biological data
zenodo.org
data.niaid.nih.gov
zip
Updated Mar 10, 2024
Cite
Matthieu Gendre; Torsten Hauffe; Catalina Pimiento; Daniele Silvestro (2024). Benchmarking imputation methods for categorical biological data [Dataset]. http://doi.org/10.5281/zenodo.10800016
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10800016
Dataset updated
Mar 10, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Matthieu Gendre; Torsten Hauffe; Catalina Pimiento; Daniele Silvestro
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 9, 2024
Description

Welcome to the Zenodo repository for Publication Benchmarking imputation methods for categorical biological data, a comprehensive collection of datasets and scripts utilized in our research endeavors. This repository serves as a vital resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.

Contents:

empirical_analysis:

Trait Dataset of Elasmobranchs: A collection of trait data for elasmobranch species obtained from FishBase, stored as an RDS file.

Phylogenetic Tree: A phylogenetic tree stored as a TRE file.

Imputations Replicates (Imputation): Replicated imputations of missing data in the trait dataset, stored as RData files.

Error Calculation (Results): Error calculation results derived from imputed datasets, stored as RData files.

Scripts: Collection of R scripts used for the implementation of empirical analysis.

simulation_analysis:

Input Files: Input files utilized for simulation analyses as CSV files

Data Distribution PDFs: PDF files displaying the distribution of simulated data and the missingness.

Output Files: Simulated trait datasets, trait datasets with missing data, and trait imputed datasets with imputation errors calculated as RData files.

Scripts: Collection of R scripts used for the simulation analysis.

TDIP_package:

Scripts of the TDIP Package: All scripts related to the Trait Data Imputation with Phylogeny (TDIP) R package used in the analyses.

Purpose:

This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.

Citation:

When using the datasets or scripts from this repository, we kindly request citing Publication Benchmarking imputation methods for categorical biological data and acknowledging the use of this Zenodo repository.

Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
Retail Product Dataset with Missing Values
kaggle.com
zip
Updated Feb 17, 2025
Cite
Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
Available download formats: zip (47826 bytes)
Dataset updated
Feb 17, 2025
Authors
Himel Sarder
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage

This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
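
A first step with a dataset like this is usually to measure the share of missing values per column. The sketch below assumes the rows have already been loaded as dictionaries and that missing entries appear as `None` or an empty string; both are assumptions, since the file's actual encoding of missing values is not stated here.

```python
from typing import Dict, List, Optional


def missing_rates(rows: List[Dict[str, Optional[object]]]) -> Dict[str, float]:
    """Percentage of missing entries in each column.

    An entry counts as missing when it is None or an empty string.
    """
    total = len(rows)
    rates = {}
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        rates[column] = 100.0 * missing / total
    return rates
```

Columns with very high rates may call for a different strategy (such as dropping the column) than columns with only a few gaps.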
Understanding and Managing Missing Data.pdf
figshare.com
pdf
Updated Jun 9, 2025
Cite
Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.29265155.v1
Dataset updated
Jun 9, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ibrahim Denis Fofanah
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
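
The three mechanisms can be made concrete with a small simulation. The sketch below is not taken from the guide: the function name, the use of the median as a cut-off, and the doubling of the rate above the cut-off are all assumptions for illustration. Each record is a pair `(x, y)` in which `x` is always observed and `y` may be masked.

```python
import random
import statistics
from typing import List, Optional, Sequence, Tuple


def mask_y(pairs: Sequence[Tuple[float, float]], mechanism: str,
           rate: float = 0.3, seed: int = 0) -> List[Optional[float]]:
    """Return the y values with some replaced by None.

    MCAR: every y is masked with the same probability.
    MAR:  masking depends only on the observed x (here: x above its median).
    MNAR: masking depends on the unobserved y itself (y above its median).
    """
    rng = random.Random(seed)
    x_median = statistics.median(x for x, _ in pairs)
    y_median = statistics.median(y for _, y in pairs)
    masked = []
    for x, y in pairs:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = min(1.0, 2 * rate) if x > x_median else 0.0
        elif mechanism == "MNAR":
            p = min(1.0, 2 * rate) if y > y_median else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p else y)
    return masked
```

Under MNAR in this sketch, the mean of the surviving `y` values is pulled downward because only large values are removed, which is the kind of bias that mean imputation cannot repair.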
Data Cleaning - Feature Imputation
kaggle.com
zip
Updated Aug 13, 2022
Cite
Mr.Machine (2022). Data Cleaning - Feature Imputation [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-cleaning-feature-imputation
Available download formats: zip (116097 bytes)
Dataset updated
Aug 13, 2022
Authors
Mr.Machine
Description
Data cleaning (or data cleansing) means cleaning data by imputing missing values, smoothing noisy data, and identifying or removing outliers. Missing values generally arise from collection errors or data corruption.

More details: Feature Engineering - Handling Missing Values

The Wine_Quality.csv dataset has numerical missing data, and the students_Performance.mv.csv dataset has both numerical and categorical missing data.
Deep learning based Missing Data Imputation
scidb.cn
Updated Mar 4, 2024
Cite
Mahjabeen Tahir (2024). Deep learning based Missing Data Imputation [Dataset]. http://doi.org/10.57760/sciencedb.16599
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.16599
Dataset updated
Mar 4, 2024
Dataset provided by
Science Data Bank
Authors
Mahjabeen Tahir
Description
The code provided is related to training an autoencoder, evaluating its performance, and using it for imputing missing values in a dataset. Let's break down each part:

Training the Autoencoder (train_autoencoder function):
- This function takes an autoencoder model and the input features as input.
- It trains the autoencoder using the input features as both input and target output (hence features, features).
- The autoencoder is trained for a specified number of epochs (epochs) with a given batch size (batch_size).
- The shuffle=True argument ensures that the data is shuffled before each epoch to prevent the model from memorizing the input order.
- After training, it returns the trained autoencoder model and the training history.

Evaluating the Autoencoder (evaluate_autoencoder function):
- This function takes a trained autoencoder model and the input features as input.
- It uses the trained autoencoder to predict the reconstructed features from the input features.
- It calculates Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores between the original and reconstructed features.
- These metrics provide insights into how well the autoencoder is able to reconstruct the input features.

Imputing with the Autoencoder (impute_with_autoencoder function):
- This function takes a trained autoencoder model and the input features as input.
- It identifies missing values (e.g., -9999) in the input features.
- For each row with missing values, it predicts the missing values using the trained autoencoder.
- It replaces the missing values with the predicted values.
- The imputed features are returned as output.

To reuse this code:
- Load your dataset and preprocess it as necessary.
- Build an autoencoder model using the build_autoencoder function.
- Train the autoencoder using the train_autoencoder function with your input features.
- Evaluate the performance of the autoencoder using the evaluate_autoencoder function.
- If your dataset contains missing values, use the impute_with_autoencoder function to impute them with the trained autoencoder.
- Use the trained autoencoder for any other relevant tasks, such as feature extraction or anomaly detection.
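
The imputation step described above might look roughly like the sketch below. It is not the dataset's own code: `impute_with_model` and its `reconstruct` argument are hypothetical names, and `reconstruct` is any callable that maps a row to a reconstructed row of the same length, standing in for the trained autoencoder's prediction call.

```python
from typing import Callable, List, Sequence

MISSING = -9999  # sentinel for missing values, as in the description above


def impute_with_model(rows: Sequence[Sequence[float]],
                      reconstruct: Callable[[Sequence[float]], Sequence[float]],
                      missing: float = MISSING) -> List[List[float]]:
    """Replace sentinel entries with the model's reconstructed values.

    Observed entries are kept as they are; only positions equal to the
    sentinel are overwritten with the corresponding predicted value.
    """
    imputed = []
    for row in rows:
        if missing not in row:
            imputed.append(list(row))  # nothing to impute in this row
            continue
        predicted = reconstruct(row)
        imputed.append(
            [p if value == missing else value
             for value, p in zip(row, predicted)]
        )
    return imputed
```

Note that in this sketch the sentinel itself is passed to the model as an input value; a real pipeline would likely replace it with a neutral value (such as a column mean) before prediction.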
DataSheet_1_A Deep Learning Approach for Missing Data Imputation of Rating...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jun 6, 2023
Cite
Chung-Yuan Cheng; Wan-Ling Tseng; Ching-Fen Chang; Chuan-Hsiung Chang; Susan Shur-Fen Gau (2023). DataSheet_1_A Deep Learning Approach for Missing Data Imputation of Rating Scales Assessing Attention-Deficit Hyperactivity Disorder.pdf [Dataset]. http://doi.org/10.3389/fpsyt.2020.00673.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fpsyt.2020.00673.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Chung-Yuan Cheng; Wan-Ling Tseng; Ching-Fen Chang; Chuan-Hsiung Chang; Susan Shur-Fen Gau
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A variety of tools and methods have been used to measure behavioral symptoms of attention-deficit/hyperactivity disorder (ADHD). Missing data is a major concern in ADHD behavioral studies. This study used a deep learning method to impute missing data in ADHD rating scales and evaluated the ability of the imputed dataset (i.e., the imputed data replacing the original missing values) to distinguish youths with ADHD from youths without ADHD. The data were collected from 1220 youths, 799 of whom had an ADHD diagnosis, and 421 were typically developing (TD) youths without ADHD, recruited in Northern Taiwan. Participants were assessed using the Conners’ Continuous Performance Test, the Chinese versions of the Conners’ rating scale-revised: short form for parent and teacher reports, and the Swanson, Nolan, and Pelham, version IV scale for parent and teacher reports. We used deep learning, with information from the original complete dataset (referred to as the reference dataset), to perform missing data imputation and generate an imputation order according to the imputed accuracy of each question. We evaluated the effectiveness of imputation using support vector machine to classify the ADHD and TD groups in the imputed dataset. The imputed dataset can classify ADHD vs. TD up to 89% accuracy, which did not differ from the classification accuracy (89%) using the reference dataset. Most of the behaviors related to oppositional behaviors rated by teachers and hyperactivity/impulsivity rated by both parents and teachers showed high discriminatory accuracy to distinguish ADHD from non-ADHD. Our findings support a deep learning solution for missing data imputation without introducing bias to the data.
Finding_And_Visualizing_Missing_Data_Python
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). Finding_And_Visualizing_Missing_Data_Python [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/finding-and-visualizing-missing-data-python
Available download formats: zip (371581 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries.
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
Dataset: Efficient improvement for water quality analysis with large amount...
data.mendeley.com
Updated Jul 26, 2022
Cite
David Sierra Porta (2022). Dataset: Efficient improvement for water quality analysis with large amount of missing data [Dataset]. http://doi.org/10.17632/8y42cbc7h8.1
Unique identifier
https://doi.org/10.17632/8y42cbc7h8.1
Dataset updated
Jul 26, 2022
Authors
David Sierra Porta
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Water is vital for life and local water pollution can damage the environment and affect human health. Governments and private institutions monitor and regulate water quality to protect the environment and populations. The consequences of pollution can reach far and wide, costing companies significant amounts in cleanup costs and loss of reputation. Most countries have official accredited laboratories and sampling teams that use varied technology, global expertise and local knowledge to provide water quality monitoring for different types of water and different and varied sampling locations. However, one of the main problems associated with monitoring and assessing water quality and meeting minimum standards of potability or usability is the analysis of samples based on local data. The problem lies in the fact that in many cases the data, due to the methodology or technique used or the expertise of the human resource that handles the samples, ends up configured in sets that have a large amount of missing information or data without information. This implies a problem depending on the analysis to be carried out. If you want to estimate a water quality index based on the samples, then you may have biased calculations due to the loss of information.

This dataset has been used for the generation of the manuscript: Efficient improvement for water quality analysis with large amount of missing data. D. Sierra-Porta, M. Tobón-Ospino. This manuscript is being submitted to Sustainable Production and Consumption (2022 Elsevier), Publication of the Institution of Chemical Engineers.
Data_Sheet_2_The Optimal Machine Learning-Based Missing Data Imputation for...
frontiersin.figshare.com
docx
Updated May 31, 2023
Cite
Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_2_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s002
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpubh.2021.680054.s002
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An adequate imputation of missing data would significantly preserve the statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a great tool to infer the missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, the Cox proportional hazards model using various types requires deliberate study, and the validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors according to different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest” based on the unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Statistical analysis improperly conducted with missing data may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazard model with machine learning-based imputations.
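
The two accuracy statistics mentioned above are simple to compute. The sketch below uses hypothetical function names and assumes both statistics are evaluated only on the entries that were actually imputed: RMSE for numeric entries and PFC for categorical ones.

```python
import math
from typing import Hashable, Sequence


def rmse(true_values: Sequence[float], imputed: Sequence[float]) -> float:
    """Root mean square error between true and imputed numeric values."""
    squared = [(t - i) ** 2 for t, i in zip(true_values, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def pfc(true_values: Sequence[Hashable], imputed: Sequence[Hashable]) -> float:
    """Proportion of falsely classified entries among categorical values."""
    wrong = sum(1 for t, i in zip(true_values, imputed) if t != i)
    return wrong / len(true_values)
```

Both statistics are 0 for a perfect imputation; PFC is bounded by 1, while RMSE is in the units of the variable.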
Table 1_A random forest dynamic threshold imputation method for handling...
frontiersin.figshare.com
pdf
Updated Aug 5, 2025
Cite
Xiaofeng You; Jianqin Yang; Xinai Xu (2025). Table 1_A random forest dynamic threshold imputation method for handling missing data in cognitive diagnosis assessments.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2025.1487111.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fpsyg.2025.1487111.s001
Dataset updated
Aug 5, 2025
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Xiaofeng You; Jianqin Yang; Xinai Xu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The handling of missing data in cognitive diagnostic assessment is an important issue. The Random Forest Threshold Imputation (RFTI) method proposed by You et al. in 2023 is specifically designed for cognitive diagnostic models (CDMs) and built on the random forest imputation. However, in RFTI, the threshold for determining imputed values to be 0 is fixed at 0.5, which may result in uncertainty in this imputation. To address this issue, we proposed an improved method, Random Forest Dynamic Threshold Imputation (RFDTI), which possesses two dynamic thresholds for dichotomous imputed values. A simulation study showed that the classification of attribute profiles when using RFDTI to impute missing data was always better than the four commonly used traditional methods (i.e., person mean imputation, two-way imputation, expectation–maximization algorithm, and multiple imputation). Compared with RFTI, RFDTI was slightly better for MAR or MCAR data, but slightly worse for MNAR or MIXED data, especially with a larger missingness proportion. An empirical example with MNAR data demonstrates the applicability of RFDTI, which performed similarly to RFTI and much better than the other four traditional methods. An R package is provided to facilitate the application of the proposed method.
Replication Data for: Comparative investigation of time series missing data...
dataverse.harvard.edu
dataone.org
Updated Jul 24, 2020
Cite
LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/GQHURF
Dataset updated
Jul 24, 2020
Dataset provided by
Harvard Dataverse
Authors
LEIZHEN ZANG; Feng XIONG
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data. The authors used Internet and public service data as the test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on the public service, but also further ensured that the machine-learning imputation method was better than random and multiple imputation, greatly improving the model’s explanatory power. The panel data after machine-learning imputation with better continuity in the time trend is feasibly analyzed, which can also be analyzed using the dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.
Prediction of radionuclide diffusion enabled by missing data imputation and...
scidb.cn
Updated May 6, 2025
Cite
Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao (2025). Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00710
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.j00186.00710
Dataset updated
May 6, 2025
Dataset provided by
Science Data Bank
Authors
Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao
Description
Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning models. A regression-based missing data imputation method using light gradient boosting machine algorithm was employed to impute over 60% of the missing data.
Removing missing values NFL Play by Play 2009-2017
kaggle.com
zip
Updated Jul 25, 2025
Cite
Abdallah Ahmed A. (2025). Removing missing values NFL Play by Play 2009-2017 [Dataset]. https://www.kaggle.com/datasets/abdallahahmeda/missing-values-nfl-play-by-play-2009-2017
Available download formats: zip (135296739 bytes)
Dataset updated
Jul 25, 2025
Authors
Abdallah Ahmed A.
Description
This is my first-ever project on datasets; it was a task assigned to me by my machine learning tutor. I only imputed and removed missing values depending on the context.

Notes: Down: ffill (logical order)

Time,TimeSecs,SideofField : FFill

Playtimediff: Median (has skews)

yrdln,yrdline100,: Mean

GoalToGo,FirstDown: Mode

postteam,DefensiveTeam : assign "None" to NA because it's logical for it to be NA.

Desc: FFill

ExPointResult,TwoPointConv,DefTwoPoint,PuntResult: Assign another name "None" to every NA

Passer,Passer_ID : Remove all rows that have either passer or passer_id missing, but not both, then change the rows where both are NA to "None"

PassOutcome,PassLength: Remove all rows that have either passoutcome or passlength missing, then change NA to "None", as the missingness is logical.

PassLength: setting all NA to None

Interceptor: assign "None" to NA.

PassLocation: assign "None" to NA.

RunLocation,RunGap: use mode for both when RushAttempt.notna() = True, otherwise set to None.

ReturnResult,Returner,BlockingPlayer,FieldGoalResult,FieldGoalDistance,RecFumbTeam,RecFumbPlayer,ChalReplayResult,PenalizedTeam,PenaltyType,PenalizedPlayer,Timeout_Team : Dropping these columns entirely as they have 90%+ missing values.

Tackler1,Tackler2: assign "None" to NA.

DefTeamScore,PosTeamScore,ScoreDiff,AbsScoreDiff: FFill (before and after values are consistently the same unless new match)

No_Score_Prob,Opp_Field_Goal_Prob,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,EPA,Win_Prob: assign "0.0" to missing values

Away_WP_post,Away_WP_pre,Home_WP_post,Away_WP_post,WPA,airWPA,yacWPA: Mean

*"None" values are chosen instead of deletion due to the missing value being conditional and not a data gathering error. They are then to be encoded
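
The fill strategies named in the notes above (forward fill, median, mode, and a constant such as "None") can be sketched in plain Python as follows. This is not the author's code: the helper names are hypothetical, and each column is assumed to be a list in which missing entries are `None`.

```python
import statistics
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def ffill(values: Sequence[Optional[T]]) -> List[Optional[T]]:
    """Forward fill: carry the last observed value into later gaps.

    Leading gaps stay None because nothing has been observed yet.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_with(values: Sequence[Optional[T]], fill: T) -> List[T]:
    """Replace every gap with a fixed value, e.g. the string "None"."""
    return [fill if value is None else value for value in values]


def fill_median(values: Sequence[Optional[float]]) -> List[float]:
    """Fill gaps with the median of the observed values (robust to skew)."""
    observed = [v for v in values if v is not None]
    return fill_with(values, statistics.median(observed))


def fill_mode(values: Sequence[Optional[T]]) -> List[T]:
    """Fill gaps with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    return fill_with(values, statistics.mode(observed))
```

In a table with several matches, a forward fill would normally be applied per match so that values are not carried across match boundaries.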
Replication Data for: Machine Learning Predictions as Regression Covariates
dataverse.harvard.edu
Updated Sep 28, 2022
Christian Fong; Matthew Tyler (2022). Replication Data for: Machine Learning Predictions as Regression Covariates [Dataset]. http://doi.org/10.7910/DVN/QQHBHY
Unique identifier
https://doi.org/10.7910/DVN/QQHBHY
Dataset updated
Sep 28, 2022
Dataset provided by
Harvard Dataverse
Authors
Christian Fong; Matthew Tyler
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/QQHBHY
Description
In text, images, merged surveys, voter files, and elsewhere, data sets are often missing important covariates, either because they are latent features of observations (such as sentiment in text) or because they are not collected (such as race in voter files). One promising approach for coping with this missing data is to find the true values of the missing covariates for a subset of the observations and then train a machine learning algorithm to predict the values of those covariates for the rest. However, plugging in these predictions without regard for prediction error renders regression analyses biased, inconsistent, and overconfident. We characterize the severity of the problem posed by prediction error, describe a procedure to avoid these inconsistencies under comparatively general assumptions, and demonstrate the performance of our estimators through simulations and a study of hostile political dialogue on the Internet. We provide software implementing our approach.
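The bias described in this abstract can be seen in a toy simulation. This is only an illustrative sketch, not the authors' estimator or software: the binary covariate, the 20% symmetric misclassification rate, the true coefficient of 2.0, and the helper `ols_slope` are all assumptions chosen for the example.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
n, beta, error_rate = 20000, 2.0, 0.2  # illustrative values, not from the paper

# True binary covariate and an outcome that depends on it linearly.
x_true = [rng.randint(0, 1) for _ in range(n)]
y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]

# A hypothetical classifier that flips the true label 20% of the time.
x_pred = [1 - x if rng.random() < error_rate else x for x in x_true]

slope_true = ols_slope(x_true, y)
slope_pred = ols_slope(x_pred, y)
print(f"slope using the true covariate:      {slope_true:.2f}")
print(f"slope using the predicted covariate: {slope_pred:.2f}")
```

In this particular setup the slope on the predicted covariate concentrates near (1 - 2 × 0.2) × 2.0 = 1.2 rather than 2.0, and collecting more observations does not close the gap, which is the kind of inconsistency that plugging in predictions can cause.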
Multi-Label Datasets with Missing Values
data.niaid.nih.gov
Updated Mar 19, 2023
Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
Dataset updated
Mar 19, 2023
Dataset provided by
UEMA
UFOPA
Fuji Electric Co. Ltd.
Authors
Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This collection consists of six multi-label datasets from the UCI Machine Learning Repository.

Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.
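As a rough illustration of what "Missing Completely at Random" amputation means, the sketch below removes each cell independently with a fixed probability. It is an assumption-laden example, not the procedure used to build these files: the function `ampute_mcar`, the per-cell Bernoulli draw, and the toy table are all hypothetical.

```python
import random

def ampute_mcar(rows, missing_rate, seed=0):
    """Return a copy of rows in which each cell is set to None
    independently with probability missing_rate (MCAR)."""
    rng = random.Random(seed)
    return [
        [None if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]

# Toy complete table: 1000 rows, 3 numeric columns.
rows = [[float(i), float(i) * 2.0, float(i % 3)] for i in range(1000)]

observed_rates = {}
for rate in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    amputed = ampute_mcar(rows, rate)
    missing = sum(value is None for row in amputed for value in row)
    observed_rates[rate] = missing / (len(rows) * 3)
    print(f"target rate {rate:.2f}, observed rate {observed_rates[rate]:.3f}")
```

Because missingness here does not depend on any observed or unobserved value, the observed rate only fluctuates around the target by sampling noise.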

File names are represented as follows:

amp_DB_MR.arff

where:

DB = original dataset; MR = missing rate.
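The naming scheme can be unpacked with a small helper. This is a sketch under stated assumptions: it assumes MR is written as an integer percentage, and both `parse_amputed_name` and the example file name are hypothetical rather than taken from the archive.

```python
import re

# Pattern for "amp_DB_MR.arff"; assumes MR is an integer percentage.
FILENAME_PATTERN = re.compile(r"^amp_(?P<db>.+)_(?P<mr>\d+)\.arff$")

def parse_amputed_name(filename):
    """Split a file name into (original dataset name, missing rate in percent)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match["db"], int(match["mr"])

# Hypothetical file name used only to exercise the helper.
print(parse_amputed_name("amp_example_10.arff"))
```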

For more details, please read:

IEEE Access article (in review process)
Data from: Learning to see the wood for the trees: machine learning,...
datadryad.org
search.dataone.org
zip
Updated Sep 14, 2020
Simon Wills; Charlie J. Underwood; Paul M. Barrett (2020). Learning to see the wood for the trees: machine learning, decision trees and the classification of isolated theropod teeth [Dataset]. http://doi.org/10.5061/dryad.1zcrjdfq9
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.1zcrjdfq9
Dataset updated
Sep 14, 2020
Dataset provided by
Dryad
Authors
Simon Wills; Charlie J. Underwood; Paul M. Barrett
Time period covered
Sep 12, 2020
Description
Data to test the models was sourced from:

HENDRICKX, C., MATEUS, O. and ARAÚJO, R. 2015. The dentition of megalosaurid theropods. Acta Palaeontologica Polonica, 60, 627–642.

LARSON, DEREK W., BROWN, CALEB M. and EVANS, DAVID C. 2016. Dental Disparity and Ecological Stability in Bird-like Dinosaurs prior to the End-Cretaceous Mass Extinction. Current Biology, 26, 1325–1333.
Table_1_Comparison of machine learning and logistic regression as predictive...
frontiersin.figshare.com
xlsx
Updated Jun 13, 2023
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao (2023). Table_1_Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study.XLSX [Dataset]. http://doi.org/10.3389/fcvm.2022.959649.s003
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fcvm.2022.959649.s003
Dataset updated
Jun 13, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models for the lack of effective treatment. Predictive models based on machine learning algorithms demonstrate promising potential, while there is a controversial discussion about whether machine learning methods should be recommended preferably, compared to traditional statistical models.

Methods: We employed both logistic regression and six machine learning methods as binary predictive models for a dataset containing 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After the imputation of missing values, statistical description and comparison were conducted preliminarily to explore the characteristics of documented 73 variables. Sequentially, correlation analysis and feature selection were performed as preprocessing steps to filter contributing variables for developing models. The models were evaluated by multiple criteria.

Results: We first figured out that the influential variables screened by preprocessing steps did not overlap with those determined by statistical differences. Secondly, the most accurate imputation method is K-Nearest Neighbor, and the imputation process did not affect the performance of the developed models much. Finally, the performance of models was investigated. The random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power for prediction evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration ability verified, as by the calibration curve.

Conclusion: Machine learning algorithms can accomplish prediction modeling and demonstrate superior discrimination, while Logistic Regression can be calibrated well. Statistical analysis and machine learning are two scientific domains sharing similar themes. The predictive abilities of such developed models vary according to the characteristics of datasets, which still need larger sample sizes and more influential predictors to accumulate evidence.


Water-quality data imputation with a high percentage of missing values: a machine learning approach

Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
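To make the two ingredients of this result concrete, here is a minimal sketch of an inverse-distance-weighted estimate and of the NSE score, read here as the Nash-Sutcliffe efficiency. It is not the authors' implementation: the function names, the power of 2, and the toy numbers are assumptions, and the paper may define distances and neighbours differently.

```python
def idw_estimate(neighbors, power=2.0):
    """Inverse distance weighting.

    neighbors: list of (distance, value) pairs with distance > 0.
    Closer observations receive larger weights 1 / distance**power.
    """
    weights = [1.0 / (distance ** power) for distance, _ in neighbors]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squares around the observed mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Toy case: two neighbouring observations at distances 1 and 2.
print(idw_estimate([(1.0, 10.0), (2.0, 20.0)]))

# A perfect reconstruction scores 1; predicting the mean everywhere scores 0.
print(nse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))
print(nse([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]))
```

An NSE above 0.8 therefore means the imputed values leave less than 20% of the variance of the held-out observations unexplained.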

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
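Column headers in this form can be split into their parts with a regular expression. The sketch below is an assumption about how the pattern is laid out in the CSV files: `parse_column` is a hypothetical helper, and the station code and unit in the example are invented.

```python
import re

# "[STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]"
COLUMN_PATTERN = re.compile(
    r"^\[(?P<station>[^\]]+)\]\s*"
    r"(?P<name>.+?)\s*"
    r"\((?P<short>[^)]+)\)\s*"
    r"\[(?P<unit>[^\]]*)\]$"
)

def parse_column(header):
    """Return a dict with station, full name, short name, and unit."""
    match = COLUMN_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"header does not follow the expected pattern: {header!r}")
    return match.groupdict()

# Invented header used only to exercise the helper.
print(parse_column("[ST1] Dissolved oxygen (DO) [mg/L]"))
```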

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
