100+ datasets found

Water-quality data imputation with a high percentage of missing values: a...
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 8, 2021
Cite
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Available download formats
csv
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms included both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
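
For readers unfamiliar with the two names in the sentence above, the following is a minimal Python sketch, not the authors' code. It assumes the standard Nash-Sutcliffe efficiency, NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2), and a plain inverse-distance-weighted average with power 2. The function names, the power, and the choice of distance (the description does not say whether distance is measured in time or space) are illustrative assumptions.

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1.0 is a perfect match, 0.0 is no better
    than always predicting the mean of the observations."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - residual / variance


def idw_estimate(distances: Sequence[float], values: Sequence[float],
                 power: float = 2.0) -> float:
    """Inverse-distance-weighted average of known values.

    `distances` holds the (assumed) distance from the missing point to each
    known point; closer points receive larger weights.
    """
    if len(distances) != len(values) or not values:
        raise ValueError("inputs must be non-empty and of equal length")
    for d, v in zip(distances, values):
        if d == 0:
            return v  # an exact match needs no interpolation
    weights = [1.0 / d ** power for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Under this definition, the "NSE greater than 0.8" criterion would be checked by calling `nse` on held-out original values and their imputed counterparts.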

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
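
The naming pattern above can be split into its parts with a regular expression. This is a hedged sketch: the exact spacing in the real CSV headers is not shown in the description, so the pattern tolerates optional whitespace, and the station code and unit used in the usage note are made-up placeholders.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: [STATION] FULL NAME (SHORT NAME) [UNIT]
HEADER_RE = re.compile(
    r"^\[(?P<station>[^\]]+)\]\s*"
    r"(?P<full_name>.+?)\s*"
    r"\((?P<short_name>[^)]+)\)\s*"
    r"\[(?P<unit>[^\]]*)\]$"
)


class ColumnInfo(NamedTuple):
    station: str
    full_name: str
    short_name: str
    unit: str


def parse_header(header: str) -> Optional[ColumnInfo]:
    """Return the parsed parts of a column header, or None if it does not
    follow the assumed pattern (for example, a date column)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return ColumnInfo(**match.groupdict())
```

For example, `parse_header("[ST1] Dissolved oxygen (DO) [mg/L]")` yields a `ColumnInfo` with station `ST1`, short name `DO`, and unit `mg/L`; `ST1` and `mg/L` are hypothetical values chosen for illustration.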

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...
search.dataone.org
dataverse.harvard.edu
Updated Nov 23, 2023
Cite
Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
Unique identifier
https://doi.org/10.7910/DVN/UPL4TT
Dataset updated
Nov 23, 2023
Dataset provided by
Harvard Dataverse
Authors
Lall, Ranjit; Robinson, Thomas
Description
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Understanding and Managing Missing Data.pdf
figshare.com
pdf
Updated Jun 9, 2025
Cite
Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
Available download formats
pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.29265155.v1
Dataset updated
Jun 9, 2025
Dataset provided by
Figshare: http://figshare.com/
Authors
Ibrahim Denis Fofanah
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
Retail Product Dataset with Missing Values
kaggle.com
zip
Updated Feb 17, 2025
Cite
Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
Available download formats
zip (47826 bytes)
Dataset updated
Feb 17, 2025
Authors
Himel Sarder
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage

This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
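
As a first step in that kind of practice, the per-column missing rate can be measured directly. The sketch below uses only the Python standard library. It assumes the archive contains a CSV file with a header row and that missing cells appear as empty strings or common placeholders such as `NaN`; neither detail is confirmed by the description, so adjust `MISSING_TOKENS` to the real file.

```python
import csv
from typing import Dict, TextIO

# Assumption: these case-insensitive tokens mark a missing cell.
MISSING_TOKENS = {"", "na", "nan", "null", "none"}


def missing_rates(handle: TextIO) -> Dict[str, float]:
    """Return the fraction of missing cells for every column of a CSV file."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total_rows = 0
    for row in reader:
        total_rows += 1
        for name in counts:
            value = row.get(name)
            # A short row yields None for the columns it lacks.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    if total_rows == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total_rows for name, count in counts.items()}
```

A typical call would be `with open(path, newline="") as f: print(missing_rates(f))`, where `path` points at the extracted CSV file. The result can then be compared with the percentages quoted in the description.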
Data Cleaning - Feature Imputation
kaggle.com
zip
Updated Aug 13, 2022
Cite
Mr.Machine (2022). Data Cleaning - Feature Imputation [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-cleaning-feature-imputation
Available download formats
zip (116097 bytes)
Dataset updated
Aug 13, 2022
Authors
Mr.Machine
Description
Data cleaning (or data cleansing) means cleaning the data by imputing missing values, smoothing noisy data, and identifying or removing outliers. Missing values generally arise from collection errors or data corruption.

Some details: Feature Engineering - Handling Missing Values

The Wine_Quality.csv dataset has missing numerical data, and the students_Performance.mv.csv dataset has missing numerical and categorical data.
Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...
plos.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jun 1, 2023
Cite
Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
Available download formats
pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0164464
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Nikolaj Bak; Lars K. Hansen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.
Data from: Benchmarking imputation methods for categorical biological data
zenodo.org
data.niaid.nih.gov
zip
Updated Mar 10, 2024
Cite
Matthieu Gendre; Torsten Hauffe; Catalina Pimiento; Daniele Silvestro (2024). Benchmarking imputation methods for categorical biological data [Dataset]. http://doi.org/10.5281/zenodo.10800016
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.10800016
Dataset updated
Mar 10, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Matthieu Gendre; Torsten Hauffe; Catalina Pimiento; Daniele Silvestro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 9, 2024
Description
Description:

Welcome to the Zenodo repository for the publication "Benchmarking imputation methods for categorical biological data", a collection of the datasets and scripts used in our research. This repository is a resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.

Contents:

empirical_analysis:

Trait Dataset of Elasmobranchs: A collection of trait data for elasmobranch species obtained from FishBase, stored as an RDS file.

Phylogenetic Tree: A phylogenetic tree stored as a TRE file.

Imputations Replicates (Imputation): Replicated imputations of missing data in the trait dataset, stored as RData files.

Error Calculation (Results): Error calculation results derived from imputed datasets, stored as RData files.

Scripts: Collection of R scripts used for the implementation of empirical analysis.

simulation_analysis:

Input Files: Input files used for the simulation analyses, stored as CSV files.

Data Distribution PDFs: PDF files displaying the distribution of simulated data and the missingness.

Output Files: Simulated trait datasets, trait datasets with missing data, and trait imputed datasets with imputation errors calculated as RData files.

Scripts: Collection of R scripts used for the simulation analysis.

TDIP_package:

Scripts of the TDIP Package: All scripts related to the Trait Data Imputation with Phylogeny (TDIP) R package used in the analyses.

Purpose:

This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.

Citation:

When using the datasets or scripts from this repository, we kindly request that you cite the publication "Benchmarking imputation methods for categorical biological data" and acknowledge the use of this Zenodo repository.

Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
DataSheet_1_A Deep Learning Approach for Missing Data Imputation of Rating...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jun 6, 2023
Cite
Chung-Yuan Cheng; Wan-Ling Tseng; Ching-Fen Chang; Chuan-Hsiung Chang; Susan Shur-Fen Gau (2023). DataSheet_1_A Deep Learning Approach for Missing Data Imputation of Rating Scales Assessing Attention-Deficit Hyperactivity Disorder.pdf [Dataset]. http://doi.org/10.3389/fpsyt.2020.00673.s001
Available download formats
pdf
Unique identifier
https://doi.org/10.3389/fpsyt.2020.00673.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Chung-Yuan Cheng; Wan-Ling Tseng; Ching-Fen Chang; Chuan-Hsiung Chang; Susan Shur-Fen Gau
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A variety of tools and methods have been used to measure behavioral symptoms of attention-deficit/hyperactivity disorder (ADHD). Missing data is a major concern in ADHD behavioral studies. This study used a deep learning method to impute missing data in ADHD rating scales and evaluated the ability of the imputed dataset (i.e., the imputed data replacing the original missing values) to distinguish youths with ADHD from youths without ADHD. The data were collected from 1220 youths, 799 of whom had an ADHD diagnosis, and 421 were typically developing (TD) youths without ADHD, recruited in Northern Taiwan. Participants were assessed using the Conners’ Continuous Performance Test, the Chinese versions of the Conners’ rating scale-revised: short form for parent and teacher reports, and the Swanson, Nolan, and Pelham, version IV scale for parent and teacher reports. We used deep learning, with information from the original complete dataset (referred to as the reference dataset), to perform missing data imputation and generate an imputation order according to the imputed accuracy of each question. We evaluated the effectiveness of imputation using support vector machine to classify the ADHD and TD groups in the imputed dataset. The imputed dataset can classify ADHD vs. TD up to 89% accuracy, which did not differ from the classification accuracy (89%) using the reference dataset. Most of the behaviors related to oppositional behaviors rated by teachers and hyperactivity/impulsivity rated by both parents and teachers showed high discriminatory accuracy to distinguish ADHD from non-ADHD. Our findings support a deep learning solution for missing data imputation without introducing bias to the data.
Finding_And_Visualizing_Missing_Data_Python
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). Finding_And_Visualizing_Missing_Data_Python [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/finding-and-visualizing-missing-data-python
Available download formats
zip (371581 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries.
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
Deep learning based Missing Data Imputation
scidb.cn
Updated Mar 4, 2024
Cite
Mahjabeen Tahir (2024). Deep learning based Missing Data Imputation [Dataset]. http://doi.org/10.57760/sciencedb.16599
Unique identifier
https://doi.org/10.57760/sciencedb.16599
Dataset updated
Mar 4, 2024
Dataset provided by
Science Data Bank
Authors
Mahjabeen Tahir
Description
The code provided is related to training an autoencoder, evaluating its performance, and using it for imputing missing values in a dataset. Let's break down each part:

Training the Autoencoder (train_autoencoder function):
- This function takes an autoencoder model and the input features as input.
- It trains the autoencoder using the input features as both input and target output (hence features, features).
- The autoencoder is trained for a specified number of epochs (epochs) with a given batch size (batch_size).
- The shuffle=True argument ensures that the data is shuffled before each epoch to prevent the model from memorizing the input order.
- After training, it returns the trained autoencoder model and the training history.

Evaluating the Autoencoder (evaluate_autoencoder function):
- This function takes a trained autoencoder model and the input features as input.
- It uses the trained autoencoder to predict the reconstructed features from the input features.
- It calculates Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores between the original and reconstructed features.
- These metrics provide insights into how well the autoencoder is able to reconstruct the input features.

Imputing with the Autoencoder (impute_with_autoencoder function):
- This function takes a trained autoencoder model and the input features as input.
- It identifies missing values (e.g., -9999) in the input features.
- For each row with missing values, it predicts the missing values using the trained autoencoder.
- It replaces the missing values with the predicted values.
- The imputed features are returned as output.

To reuse this code:
- Load your dataset and preprocess it as necessary.
- Build an autoencoder model using the build_autoencoder function.
- Train the autoencoder using the train_autoencoder function with your input features.
- Evaluate the performance of the autoencoder using the evaluate_autoencoder function.
- If your dataset contains missing values, use the impute_with_autoencoder function to impute them with the trained autoencoder.
- Use the trained autoencoder for any other relevant tasks, such as feature extraction or anomaly detection.
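
The imputation step described above can be sketched without any deep-learning dependency. This is not the dataset author's code: `impute_rows` and the `reconstruct` callable are hypothetical names, with `reconstruct` standing in for whatever a trained autoencoder's prediction call returns for one row. Only the -9999 sentinel comes from the description.

```python
from typing import Callable, List, Sequence

MISSING = -9999.0  # sentinel value mentioned in the description

Row = List[float]


def impute_rows(rows: Sequence[Sequence[float]],
                reconstruct: Callable[[Row], Row],
                missing: float = MISSING) -> List[Row]:
    """Replace sentinel cells with the model's reconstruction of that cell.

    Observed cells are left untouched; rows without missing cells are
    copied as-is and never passed to the model.
    """
    imputed: List[Row] = []
    for original in rows:
        row = list(original)
        holes = [i for i, value in enumerate(row) if value == missing]
        if holes:
            predicted = reconstruct(row)
            for i in holes:
                row[i] = predicted[i]
        imputed.append(row)
    return imputed
```

How the sentinel is masked or scaled before it reaches the model is not stated in the description, so the sketch passes the row through unchanged. In practice the sentinel would usually be replaced by a neutral value first.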
Table 1_A random forest dynamic threshold imputation method for handling...
frontiersin.figshare.com
pdf
Updated Aug 5, 2025
Cite
Xiaofeng You; Jianqin Yang; Xinai Xu (2025). Table 1_A random forest dynamic threshold imputation method for handling missing data in cognitive diagnosis assessments.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2025.1487111.s001
Available download formats
pdf
Unique identifier
https://doi.org/10.3389/fpsyg.2025.1487111.s001
Dataset updated
Aug 5, 2025
Dataset provided by
Frontiers Media: http://www.frontiersin.org/
Authors
Xiaofeng You; Jianqin Yang; Xinai Xu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The handling of missing data in cognitive diagnostic assessment is an important issue. The Random Forest Threshold Imputation (RFTI) method proposed by You et al. in 2023 is specifically designed for cognitive diagnostic models (CDMs) and built on the random forest imputation. However, in RFTI, the threshold for determining imputed values to be 0 is fixed at 0.5, which may result in uncertainty in this imputation. To address this issue, we proposed an improved method, Random Forest Dynamic Threshold Imputation (RFDTI), which possesses two dynamic thresholds for dichotomous imputed values. A simulation study showed that the classification of attribute profiles when using RFDTI to impute missing data was always better than the four commonly used traditional methods (i.e., person mean imputation, two-way imputation, expectation–maximization algorithm, and multiple imputation). Compared with RFTI, RFDTI was slightly better for MAR or MCAR data, but slightly worse for MNAR or MIXED data, especially with a larger missingness proportion. An empirical example with MNAR data demonstrates the applicability of RFDTI, which performed similarly to RFTI and much better than the other four traditional methods. An R package is provided to facilitate the application of the proposed method.
Replication Data for: Comparative investigation of time series missing data...
dataverse.harvard.edu
dataone.org
Updated Jul 24, 2020
Cite
LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
Unique identifier
https://doi.org/10.7910/DVN/GQHURF
Dataset updated
Jul 24, 2020
Dataset provided by
Harvard Dataverse
Authors
LEIZHEN ZANG; Feng XIONG
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data. The authors used Internet and public service data as the test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on the public service, but also further ensured that the machine-learning imputation method was better than random and multiple imputation, greatly improving the model’s explanatory power. The panel data after machine-learning imputation with better continuity in the time trend is feasibly analyzed, which can also be analyzed using the dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.
Synthetic dataset on eco-innovation for handling missing data
data.mendeley.com
Updated Sep 19, 2025
Cite
Isadora Valentim Vieira da Motta (2025). Synthetic dataset on eco-innovation for handling missing data [Dataset]. http://doi.org/10.17632/v88pwnjz79.1
Unique identifier
https://doi.org/10.17632/v88pwnjz79.1
Dataset updated
Sep 19, 2025
Authors
Isadora Valentim Vieira da Motta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset article describes the curation and preprocessing of the 2024 Eco-Innovation Index (EII) dataset, published by the European Commission. The raw dataset (in .xlsx format) was filtered to focus on the 2024 report, and missing values in the "Water Productivity" indicator were addressed via two imputation methods: (1) EU27 mean substitution and (2) cluster-based mean imputation using K-means, an unsupervised machine learning algorithm.
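
The two imputation strategies named above can be sketched as follows. This is an illustration under assumptions, not the dataset author's procedure. It treats mean substitution as the mean of the observed values (the description does not say whether the published EU27 aggregate was used instead), and it takes cluster labels as given rather than running K-means itself. All function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Hashable, List, Optional, Sequence


def impute_global_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Fill every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_cluster_mean(values: Sequence[Optional[float]],
                        labels: Sequence[Hashable]) -> List[float]:
    """Fill each missing value with the mean of its own cluster.

    `labels` are assumed to come from a prior clustering step (K-means in
    the description). A cluster with no observed value falls back to the
    global mean.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    global_fill = mean(observed)
    by_cluster: Dict[Hashable, List[float]] = {}
    for value, label in zip(values, labels):
        if value is not None:
            by_cluster.setdefault(label, []).append(value)
    fills = {label: mean(vs) for label, vs in by_cluster.items()}
    return [fills.get(label, global_fill) if value is None else value
            for value, label in zip(values, labels)]
```

The cluster-based variant differs from plain mean substitution only in which rows contribute to the fill value, which is why it tends to preserve differences between groups of countries.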
Ecommerce Dataset for Data Analysis
kaggle.com
zip
Updated Sep 19, 2024
Cite
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Available download formats
zip (2028853 bytes)
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.

Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning.
Results of the ML models were obtained by deleting missing values from the...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jan 3, 2024
Cite
Turki Aljrees (2024). Results of the ML models were obtained by deleting missing values from the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t004
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295632.t004
Dataset updated
Jan 3, 2024
Dataset provided by
PLOS: http://plos.org/
Authors
Turki Aljrees
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of the ML models were obtained by deleting missing values from the dataset.
Dataset: Efficient improvement for water quality analysis with large amount...
data.mendeley.com
Updated Jul 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Sierra Porta (2022). Dataset: Efficient improvement for water quality analysis with large amount of missing data [Dataset]. http://doi.org/10.17632/8y42cbc7h8.1
Unique identifier
https://doi.org/10.17632/8y42cbc7h8.1
Dataset updated
Jul 26, 2022
Authors
David Sierra Porta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Water is vital for life and local water pollution can damage the environment and affect human health. Governments and private institutions monitor and regulate water quality to protect the environment and populations. The consequences of pollution can reach far and wide, costing companies significant amounts in cleanup costs and loss of reputation. Most countries have official accredited laboratories and sampling teams that use varied technology, global expertise and local knowledge to provide water quality monitoring for different types of water and different and varied sampling locations. However, one of the main problems associated with monitoring and assessing water quality and meeting minimum standards of potability or usability is the analysis of samples based on local data. The problem lies in the fact that in many cases the data, due to the methodology or technique used or the expertise of the human resource that handles the samples, ends up configured in sets that have a large amount of missing information or data without information. This implies a problem depending on the analysis to be carried out. If you want to estimate a water quality index based on the samples, then you may have biased calculations due to the loss of information.

This dataset has been used for the generation of the manuscript: Efficient improvement for water quality analysis with large amount of missing data. D. Sierra-Porta, M. Tobón-Ospino. This manuscript is being submitted to Sustainable Production and Consumption (2022 Elsevier), Publication of the Institution of Chemical Engineers.
Machine learning models.
plos.figshare.com
xls
Updated Jan 3, 2024
Cite
Turki Aljrees (2024). Machine learning models. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295632.t002
Dataset updated
Jan 3, 2024
Dataset provided by
PLOS: http://plos.org/
Authors
Turki Aljrees
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
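
To make the KNN imputation idea concrete, here is a small pure-Python stand-in. It is not the study's code and not a library imputer. It assumes missing cells are `None`, uses a root-mean-square distance over the features both rows have observed, and averages the values of the `k` nearest donor rows. The function names and the default `k` are illustrative.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def _distance(a: Row, b: Row) -> float:
    """Root-mean-square difference over features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # nothing in common, so the rows are not comparable
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows: Sequence[Row], k: int = 2) -> List[List[Optional[float]]]:
    """Fill each missing cell with the mean of that column over the k
    nearest rows that have the column observed.

    Cells with no comparable donor row are left as None.
    """
    result = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                result[i][j] = sum(nearest) / len(nearest)
    return result
```

As a consistency check on the reported metrics, precision 0.98 and recall 0.96 give F1 = 2 × 0.98 × 0.96 / (0.98 + 0.96) ≈ 0.97, which matches the stated F1 score.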
Multi-Label Datasets with Missing Values
data.niaid.nih.gov
Updated Mar 19, 2023
Cite
Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato (2023). Multi-Label Datasets with Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7748932
Dataset updated
Mar 19, 2023
Dataset provided by
UFOPA
Fuji Electric Co. Ltd.
UEMA
Authors
Antonio F. L. Jacob Jr.; Fabrício A. do Carmo; Ádamo L. de Santana; Ewaldo Santana; Fábio M. F. Lobato
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This collection consists of six multi-label datasets from the UCI Machine Learning Repository.

Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.

File names are represented as follows:

amp_DB_MR.arff

where:

DB = original dataset; MR = missing rate.
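The "Missing Completely at Random" amputation described above can be sketched as follows: every cell has the same chance of being removed, regardless of its value or of any other variable. This is only an illustration under assumptions. The function name `ampute_mcar`, the fixed seed, and the choice to mask an exact share of all cells are hypothetical; the source does not say how the authors implemented the mechanism or which columns they amputed.

```python
import random

def ampute_mcar(rows, missing_rate, seed=0):
    """Return a copy of rows with a fixed share of cells set to None (MCAR)."""
    rng = random.Random(seed)
    # Every (row, column) position is equally likely to be masked.
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(len(cells) * missing_rate)
    masked = set(rng.sample(cells, n_missing))
    return [[None if (i, j) in masked else value
             for j, value in enumerate(row)]
            for i, row in enumerate(rows)]

# Hypothetical complete table: 10 rows x 4 columns = 40 cells.
complete = [[float(i * 4 + j) for j in range(4)] for i in range(10)]

# One amputed copy per missing rate listed in the description.
amputed = {rate: ampute_mcar(complete, rate / 100)
           for rate in (5, 10, 15, 20, 25, 30)}
n_missing_25 = sum(v is None for row in amputed[25] for v in row)
```

Because the mask is drawn independently of the data, the observed cells remain a random subsample of the original values, which is what distinguishes MCAR from the other missingness mechanisms.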

For more details, please read:

IEEE Access article (in review process)
County Social Determinants of Health Data Pre-Processed to Facilitate...
openicpsr.org
Updated Apr 23, 2025
Cite
William Crown; Rachel Adams; Mary Jo Larson (2025). County Social Determinants of Health Data Pre-Processed to Facilitate Machine Learning/Multivariate Analysis [Dataset]. http://doi.org/10.3886/E227481V2
Unique identifier
https://doi.org/10.3886/E227481V2
Dataset updated
Apr 23, 2025
Dataset provided by
Brandeis University
Boston University
Authors
William Crown; Rachel Adams; Mary Jo Larson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2018 - Dec 31, 2019
Area covered
US Counties
Description
These datasets contain data from the AHRQ Social Determinants of Health (SDOH) Database (https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html), processed to facilitate machine learning/multivariate analyses focusing on the healthcare context of counties. The datasets derive from the AHRQ 2019 and 2018 county-level SDOH files. Three sets of files are provided. The first set, "Raw", has the source SDOH data with a few core pre-processing steps applied. The second set, "Full", has variables characterizing the health and healthcare context of counties (rather than outcomes), with further processing steps applied to facilitate multivariate and machine learning analytics (e.g. handling of missing data, normalizing, standardizing). The third set, "Reduced", incorporates those same data processing steps and adds a further data reduction step in which groups of highly intercorrelated variables were removed and replaced with corresponding principal component scores, one for each group. These files would be useful for investigators interested in characterizing and comparing the broad SDOH context of US counties.
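The reduction step described for the third set can be pictured with the sketch below, which replaces one group of correlated columns with a single first-principal-component score per row. It is a minimal illustration under assumptions, not the dataset authors' procedure: the names `standardize` and `first_pc_scores`, the toy columns, the use of standardized variables, and the power-iteration solver are all choices made here.

```python
import math

def standardize(columns):
    """Center each column and scale it to unit sample standard deviation."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        out.append([(x - mean) / sd for x in col])
    return out

def first_pc_scores(columns, iterations=200):
    """Score each row on the first principal component of a column group."""
    z = standardize(columns)
    p, n = len(z), len(z[0])
    # Correlation matrix of the group.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration; starting from the first axis avoids a start vector
    # that is orthogonal to the leading direction for correlated groups.
    w = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return [sum(w[i] * z[i][r] for i in range(p)) for r in range(n)]

# Hypothetical group of three highly intercorrelated county variables.
group = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],
    [10.0, 21.0, 29.0, 41.0, 50.0],
]
scores = first_pc_scores(group)  # one score per county replaces three columns
```

Replacing a correlated group with one score keeps most of the shared variation while removing the near-duplicate columns that can destabilize multivariate models.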
Results of the ML models using PCA imputer.
plos.figshare.com
xls
Updated Jan 3, 2024
Cite
Turki Aljrees (2024). Results of the ML models using PCA imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295632.t006
Dataset updated
Jan 3, 2024
Dataset provided by
PLOS http://plos.org/
Authors
Turki Aljrees
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.

Cite

Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169

Water-quality data imputation with a high percentage of missing values: a machine learning approach

Available download formats: csv

Unique identifier

https://doi.org/10.5281/zenodo.4731169

Dataset updated

Jun 8, 2021

Dataset provided by

Zenodo http://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
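To make the two ideas above concrete, the sketch below shows a generic inverse distance weighting estimate and the Nash-Sutcliffe efficiency (NSE) used to judge it. It is a minimal illustration, not the authors' implementation: the function names, the power of 2, the reading of "distance" as the distance between monitoring stations, and all numbers are assumptions.

```python
def idw_estimate(neighbors, power=2.0):
    """Estimate a missing value from (distance, value) pairs at other stations."""
    numerator = denominator = 0.0
    for distance, value in neighbors:
        if distance == 0:
            return value  # a co-located observation is used directly
        weight = 1.0 / distance ** power  # closer stations count more
        numerator += weight * value
        denominator += weight
    return numerator / denominator

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance

# Hypothetical example: two stations observed the variable on the same date.
estimate = idw_estimate([(1.0, 10.0), (2.0, 20.0)])  # weights 1 and 0.25

# Hypothetical hold-out check: hide known values, impute them, then score.
held_out = [7.9, 8.4, 9.1, 8.8]
imputed = [8.0, 8.3, 9.0, 8.9]
score = nse(held_out, imputed)
```

An NSE above 0.8, the threshold quoted above, means the imputed values leave less than 20% of the variance of the held-out observations unexplained.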

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
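A column label in that pattern can be split into its parts with a regular expression, as sketched below. The example label, the station code `S1`, and the function name `parse_column` are hypothetical; the exact spelling of station codes and units in the CSV files is not given here, so the unit bracket is treated as optional.

```python
import re

# Pattern for "[STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]".
COLUMN_PATTERN = re.compile(
    r"^\[(?P<station>[^\]]+)\]\s*"      # station code in square brackets
    r"(?P<name>.+?)\s*"                 # full variable name
    r"\((?P<short>[^)]+)\)"             # short name in parentheses
    r"(?:\s*\[(?P<unit>[^\]]*)\])?$"    # unit in square brackets, if present
)

def parse_column(label):
    """Return the parts of a column label as a dict, or None if it does not match."""
    match = COLUMN_PATTERN.match(label.strip())
    return match.groupdict() if match else None

# Hypothetical label following the documented pattern.
parsed = parse_column("[S1] Dissolved oxygen (DO) [mg/L]")
```

Splitting the labels this way makes it straightforward to group columns by station or by variable before comparing original and imputed values.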

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

