Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of every empirical research. Researchers developed several techniques to handle missing data to avoid information loss and biases. Over the past 50 years, these methods have become more and more efficient and also more complex. Building on previous review studies, this paper aims to analyze what kind of missing data handling methods are used among various scientific disciplines. For the analysis, we used nearly 50.000 scientific articles that were published between 1999 and 2016. JSTOR provided the data in text format. Furthermore, we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation is steadily growing in the examination period. Additionally, simpler methods, like listwise and pairwise deletion, are still in widespread use.
Facebook
TwitterReplication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A variety of tools and methods have been used to measure behavioral symptoms of attention-deficit/hyperactivity disorder (ADHD). Missing data is a major concern in ADHD behavioral studies. This study used a deep learning method to impute missing data in ADHD rating scales and evaluated the ability of the imputed dataset (i.e., the imputed data replacing the original missing values) to distinguish youths with ADHD from youths without ADHD. The data were collected from 1220 youths, 799 of whom had an ADHD diagnosis, and 421 were typically developing (TD) youths without ADHD, recruited in Northern Taiwan. Participants were assessed using the Conners’ Continuous Performance Test, the Chinese versions of the Conners’ rating scale-revised: short form for parent and teacher reports, and the Swanson, Nolan, and Pelham, version IV scale for parent and teacher reports. We used deep learning, with information from the original complete dataset (referred to as the reference dataset), to perform missing data imputation and generate an imputation order according to the imputed accuracy of each question. We evaluated the effectiveness of imputation using support vector machine to classify the ADHD and TD groups in the imputed dataset. The imputed dataset can classify ADHD vs. TD up to 89% accuracy, which did not differ from the classification accuracy (89%) using the reference dataset. Most of the behaviors related to oppositional behaviors rated by teachers and hyperactivity/impulsivity rated by both parents and teachers showed high discriminatory accuracy to distinguish ADHD from non-ADHD. Our findings support a deep learning solution for missing data imputation without introducing bias to the data.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 to 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Facebook
TwitterA common problem in clinical trials is the missing data that occurs when patients do not complete the study and drop out without further measurements. Missing data cause the usual statistical analysis of complete or all available data to be subject to bias. There are no universally applicable methods for handling missing data. We recommend the following: (1) Report reasons for dropouts and proportions for each treatment group; (2) Conduct sensitivity analyses to encompass different scenarios of assumptions and discuss consistency or discrepancy among them; (3) Pay attention to minimize the chance of dropouts at the design stage and during trial monitoring; (4) Collect post-dropout data on the primary endpoints, if at all possible; and (5) Consider the dropout event itself an important endpoint in studies with many.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When dealing with missing data in clinical trials, it is often convenient to work under simplifying assumptions, such as missing at random (MAR), and follow up with sensitivity analyses to address unverifiable missing data assumptions. One such sensitivity analysis, routinely requested by regulatory agencies, is the so-called tipping point analysis, in which the treatment effect is re-evaluated after adding a successively more extreme shift parameter to the predicted values among subjects with missing data. If the shift parameter needed to overturn the conclusion is so extreme that it is considered clinically implausible, then this indicates robustness to missing data assumptions. Tipping point analyses are frequently used in the context of continuous outcome data under multiple imputation. While simple to implement, computation can be cumbersome in the two-way setting where both comparator and active arms are shifted, essentially requiring the evaluation of a two-dimensional grid of models. We describe a computationally efficient approach to performing two-way tipping point analysis in the setting of continuous outcome data with multiple imputation. We show how geometric properties can lead to further simplification when exploring the impact of missing data. Lastly, we propose a novel extension to a multi-way setting which yields simple and general sufficient conditions for robustness to missing data assumptions.
Facebook
Twitterhttps://www.icpsr.umich.edu/web/ICPSR/studies/39526/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/39526/terms
Health registries record data about patients with a specific health problem. These data may include age, weight, blood pressure, health problems, medical test results, and treatments received. But data in some patient records may be missing. For example, some patients may not report their weight or all of their health problems. Research studies can use data from health registries to learn how well treatments work. But missing data can lead to incorrect results. To address the problem, researchers often exclude patient records with missing data from their studies. But doing this can also lead to incorrect results. The fewer records that researchers use, the greater the chance for incorrect results. Missing data also lead to another problem: it is harder for researchers to find patient traits that could affect diagnosis and treatment. For example, patients who are overweight may get heart disease. But if data are missing, it is hard for researchers to be sure that trait could affect diagnosis and treatment of heart disease. In this study, the research team developed new statistical methods to fill in missing data in large studies. The team also developed methods to use when data are missing to help find patient traits that could affect diagnosis and treatment. To access the methods, software, and R package, please visit the Long Research Group website.
Facebook
TwitterIn this Datasets i simply showed the handling of missing values in your data with help of python libraries such as NumPy and pandas. You can also see the use of Nan and Non values. Detecting, dropping and filling of null values.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.
It includes 500 rows of simulated employee data with intentional errors such as:
Missing values in Age and Salary
Typos in email addresses (@gamil.com)
Inconsistent city name casing (e.g., lahore, Karachi)
Extra spaces in department names (e.g., " HR ")
✅ Skills You Can Practice:
Detecting and handling missing data
String cleaning and formatting
Removing duplicates
Validating email formats
Standardizing categorical data
You can use this dataset to build your own data cleaning notebook, or use it in interviews, assessments, and tutorials.
Facebook
Twitterhttps://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/29606https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/29606
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” See also: Missing Data
Facebook
Twitterhttps://www.icpsr.umich.edu/web/ICPSR/studies/39492/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/39492/terms
Clinical trials study the effects of medical treatments, like how safe they are and how well they work. But most clinical trials don't get all the data they need from patients. Patients may not answer all questions on a survey, or they may drop out of a study after it has started. The missing data can affect researchers' ability to detect the effects of treatments. To address the problem of missing data, researchers can make different guesses based on why and how data are missing. Then they can look at results for each guess. If results based on different guesses are similar, researchers can have more confidence that the study results are accurate. In this study, the research team created new methods to do these tests and developed software that runs these tests. To access the sensitivity analysis methods and software, please visit the MissingDataMatters website.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Welcome to the Zenodo repository for Publication Benchmarking imputation methods for categorical biological data, a comprehensive collection of datasets and scripts utilized in our research endeavors. This repository serves as a vital resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.
Contents:
empirical_analysis:
simulation_analysis:
TDIP_package:
Purpose:
This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.
Citation:
When using the datasets or scripts from this repository, we kindly request citing Publication Benchmarking imputation methods for categorical biological data and acknowledging the use of this Zenodo repository.
Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GENERAL INFORMATION
Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation
Date of data collection: January to March 2022
Collection instrument: SurveyMonkey
Funding: Alfred P. Sloan Foundation
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license
Links to publications that cite or use the data:
Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437
Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data:
A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266
DATA & FILE OVERVIEW
File List
Additional related data collected that was not included in the current data package: Open ended questions asked to respondents
METHODOLOGICAL INFORMATION
Description of methods used for collection/generation of data:
The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.
Received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).
Methods for processing the data:
Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.
Instrument- or software-specific information needed to interpret the data:
The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format in CSV. The Codebook is required to interpret to values.
DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata
Number of variables: 94
Number of cases/rows: 2,492
Missing data codes: 999 Not asked
Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
Facebook
TwitterThis datasets about Netflix Movies & TV Shows. Datasets have 12 columns with some null values. To analysis of dataset are used Pandas, plotly.express and Datetime libraries. Analysis process I divided into several parts for step wise analysis and to find out trending questions on social media for Bollywood actors and actress.
There are many representations of missing data. They are Null values, missing values. I used some of methods used in data analysis process to clean missing values.
There I used some string method on column such as 'cast', 'Lested_in' to extract data
Converting an object type into datatype objects with the to_datetime function then we have a datatime object, can extract various part of data such as year, month and day
Here, I find out several eye catching question. the following questions are like as- - Show the all Movies & TV Shows released by month - Count the all types of unique rating & which rating are with most number - Salman, Shah Rukh and Akshay Kumar all movie - Find out the Movies & Series have Maximum time length - Year on Year show added on Netflix by its type - Akshay Kumar all comedies movies, Shah Rukh movies with Kajol and Salman-Akshay Movies - Who Director has made the most TV Shows - Actors and Actress who have given most Number of Movies - Find out which types of genre has most movies and TV Shows
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Water is vital for life and local water pollution can damage the environment and affect human health. Governments and private institutions monitor and regulate water quality to protect the environment and populations. The consequences of pollution can reach far and wide, costing companies significant amounts in cleanup costs and loss of reputation. Most countries have official accredited laboratories and sampling teams that use varied technology, global expertise and local knowledge to provide water quality monitoring for different types of water and different and varied sampling locations. However, one of the main problems associated with monitoring and assessing water quality and meeting minimum standards of potability or usability is the analysis of samples based on local data. The problem lies in the fact that in many cases the data, due to the methodology or technique used or the expertise of the human resource that handles the samples, ends up configured in sets that have a large amount of missing information or data without information. This implies a problem depending on the analysis to be carried out. If you want to estimate a water quality index based on the samples, then you may have biased calculations due to the loss of information.
This dataset has been used for the generation of the manuscript: Efficient improvement for water quality analysis with large amount of missing data. D. Sierra-Porta,M. Tobón-Ospino. This manuscript is being submitted to Sustainable Production and Consumption (2022 Elsevier), Publication of the Institution of Chemical Engineers.
Facebook
Twitterhttps://www.icpsr.umich.edu/web/ICPSR/studies/39528/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/39528/terms
Researchers can use data from health registries or electronic health records to compare two or more treatments. Registries store data about patients with a specific health problem. These data include how well those patients respond to treatments and information about patient traits, such as age, weight, or blood pressure. But sometimes data about patient traits are missing. Missing data about patient traits can lead to incorrect study results, especially when traits change over time. For example, weight can change over time, and the patient may not report their weight at some points along the way. Researchers use statistical methods to fill in these missing data. In this study, the research team compared a new statistical method to fill in missing data with traditional methods. Traditional methods remove patients with missing data or fill in each missing number with a single estimate. The new method creates multiple possible estimates to fill in each missing number. To access the methods, software, and R package, please visit the SimulateCER GitHub and SimTimeVar CRAN website.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318