100+ datasets found

Vehicle Trips Dataset with Missing Values
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 23, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso (2021). Vehicle Trips Dataset with Missing Values [Dataset]. http://doi.org/10.5281/zenodo.5122535
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5122535
Dataset updated
Jul 23, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 329 daily time series representing the number of trips and vehicles belonging to a set of for-hire vehicle (FHV) companies. The dataset was originally extracted from https://github.com/fivethirtyeight/uber-tlc-foil-response.
f
Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...
plos.figshare.com
figshare.com
pdf
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0164464
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Nikolaj Bak; Lars K. Hansen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.
Water-quality data imputation with a high percentage of missing values: a...
zenodo.org
explore.openaire.eu
+1more
csv
Updated Jun 8, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
h
Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...
datahub.hku.hk
Updated Aug 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
Explore at:
Unique identifier
https://doi.org/10.25442/hku.12752549.v1
Dataset updated
Aug 13, 2020
Dataset provided by
HKU Data Repository
Authors
Wen Ma
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
NCCTG Lung cancer datasetSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.2.CNV measurements of CNV of GBM This dataset records the information about copy number variation of Glioblastoma (GBM).Abstract:In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values. Existing imputation methods work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. These are not cases found in biomedical datasets, such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns), or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: The Last Observation Carried Forward (LOCF) and K-nearest Algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset with fewer columns, except when the dataset contains 45% missing data. The quality of the imputed values using existing methods is bad because they do not meet the two assumptions.In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and the high percentage of the missing values. RBM is an undirected, probabilistic and parameterized two-layer neural network model, which is often used for extracting abstract information from data, especially for high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG, and 2) TCGA. The running time, root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with 4.64 RMSE lower than the best KNN. For the TCGA dataset, our method achieved 0.78 RMSE lower than the best KNN.In addition to imputation, RBM can achieve simultaneous predictions. We compared the RBM model with four traditional prediction methods. The running time and area under the curve (AUC) were measured to evaluate the performance. Our RBM-based approach outperformed traditional methods. Specifically, the AUC was up to 19.8% higher than the multivariate logistic regression model in the NCCTG lung cancer dataset, and the AUC was higher than the Cox proportional hazard regression model, with 28.1% in the TCGA dataset.Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing the reconstruction of all the inputs in the visible layer with in a single backward pass. Our results show that RBM models have achieved higher precision and recall on detecting outliers than other methods.
o
Data from: Identifying Missing Data Handling Methods with Text Mining
openicpsr.org
delimited
Updated Mar 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
Explore at:
delimitedAvailable download formats
Unique identifier
https://doi.org/10.3886/E185961V1
Dataset updated
Mar 8, 2023
Dataset provided by
Hungarian Academy of Sciences
Authors
Krisztián Boros; Zoltán Kmetty
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1999 - Dec 31, 2016
Description
Missing data is an inevitable aspect of every empirical research. Researchers developed several techniques to handle missing data to avoid information loss and biases. Over the past 50 years, these methods have become more and more efficient and also more complex. Building on previous review studies, this paper aims to analyze what kind of missing data handling methods are used among various scientific disciplines. For the analysis, we used nearly 50.000 scientific articles that were published between 1999 and 2016. JSTOR provided the data in text format. Furthermore, we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation is steadily growing in the examination period. Additionally, simpler methods, like listwise and pairwise deletion, are still in widespread use.
Missing Data in the Uniform Crime Reports (UCR), 1977-2000 [United States]
icpsr.umich.edu
catalog.data.gov
ascii, delimited +4
Updated Nov 26, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Targonski, Joseph (2012). Missing Data in the Uniform Crime Reports (UCR), 1977-2000 [United States] [Dataset]. http://doi.org/10.3886/ICPSR32061.v1
Explore at:
spss, ascii, stata, sas, delimited, excelAvailable download formats
Unique identifier
https://doi.org/10.3886/ICPSR32061.v1
Dataset updated
Nov 26, 2012
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
Authors
Targonski, Joseph
License
https://www.icpsr.umich.edu/web/ICPSR/studies/32061/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/32061/terms
Time period covered
1977 - 2000
Area covered
United States
Description
This study reexamined and recoded missing data in the Uniform Crime Reports (UCR) for the years 1977 to 2000 for all police agencies in the United States. The principal investigator conducted a data cleaning of 20,067 Originating Agency Identifiers (ORIs) contained within the Offenses-Known UCR data from 1977 to 2000. Data cleaning involved performing agency name checks and creating new numerical codes for different types of missing data including missing data codes that identify whether a record was aggregated to a particular month, whether no data were reported (true missing), if more than one index crime was missing, if a particular index crime (motor vehicle theft, larceny, burglary, assault, robbery, rape, murder) was missing, researcher assigned missing value codes according to the "rule of 20", outlier values, whether an ORI was covered by another agency, and whether an agency did not exist during a particular time period.
Z
Missing data in the analysis of multilevel and dependent data (Examples)
data.niaid.nih.gov
zenodo.org
Updated Jul 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oliver Lüdtke (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
Explore at:
Dataset updated
Jul 20, 2023
Dataset provided by
Oliver Lüdtke
Alexander Robitzsch
Simon Grund
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

ID = group identifier (1-2000) x = numeric (Level 1) y = numeric (Level 1) w = binary (Level 2)

In all data sets, missing values are coded as "NA".
d
Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...
search.dataone.org
dataverse.harvard.edu
Updated Nov 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/UPL4TT
Dataset updated
Nov 23, 2023
Dataset provided by
Harvard Dataverse
Authors
Lall, Ranjit; Robinson, Thomas
Description
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
f
Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...
frontiersin.figshare.com
docx
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fpubh.2021.680054.s001
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An adequate imputation of missing data would significantly preserve the statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a great tool to infer the missing values. The root means square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics to evaluate imputation accuracy. However, the Cox proportional hazards model using various types requires deliberate study, and the validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors according to different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest” based on the unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, other methods are not valid to test when the missing pattern is informative. Statistical analysis, which is improperly conducted, with missing data may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazard model with machine learning-based imputations.
Handle Missing Values
kaggle.com
Updated Oct 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Safacan Metin (2020). Handle Missing Values [Dataset]. https://www.kaggle.com/safacanmetin/handle-missing-values/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 24, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Safacan Metin
Description
Dataset

This dataset was created by Safacan Metin

Contents
f
Results of the ML models using KNN imputer.
plos.figshare.com
xls
Updated Jan 3, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Turki Aljrees (2024). Results of the ML models using KNN imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0295632.t005
Dataset updated
Jan 3, 2024
Dataset provided by
PLOS ONE
Authors
Turki Aljrees
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification—handling missing data in datasets. The study present a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
f
A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...
acs.figshare.com
xlsx
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s004
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jproteome.1c00070.s004
Dataset updated
Jun 4, 2023
Dataset provided by
ACS Publications
Authors
Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification levelfragment levelimproved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
n
Data from: Using multiple imputation to estimate missing data in...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Nov 25, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray (2015). Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.m2v4m
Dataset updated
Nov 25, 2015
Dataset provided by
Trent University
University of Prince Edward Island
Authors
E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data is a serious problem for any synthetic efforts and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified. 2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets. 3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, was dependent on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data which was then used to calculate effect size and the weighting variable. 4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.
Z
Rideshare Dataset without Missing Values
data.niaid.nih.gov
zenodo.org
Updated Jul 23, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Webb, Geoff (2021). Rideshare Dataset without Missing Values [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5122231
Explore at:
Dataset updated
Jul 23, 2021
Dataset provided by
Montero-Manso, Pablo
Bergmeir, Christoph
Webb, Geoff
Hyndman, Rob
Godahewa, Rakshitha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains various hourly time series representations of attributes related to Uber and Lyft rideshare services for various locations in New York between 26/11/2018 and 18/12/2018.

For a given starting location, provider and service, the following types are represented: 'price_min', 'price_mean', 'price_max', 'distance_min', 'distance_mean', 'distance_max', 'surge_min', 'surge_mean', 'surge_max', 'api_calls', 'temp', 'rain', 'humidity', 'clouds' and 'wind'.

The original dataset contains missing values and they have been simply replaced by zeros.
Missing Values Data
kaggle.com
Updated Aug 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ahmed F. ElTantawy (2024). Missing Values Data [Dataset]. https://www.kaggle.com/datasets/ahmedgt/missing-values-data/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 23, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ahmed F. ElTantawy
License
https://www.licenses.ai/ai-licenseshttps://www.licenses.ai/ai-licenses
Description
Dataset

This dataset was created by Ahmed F. ElTantawy

Released under RAIL (specified in description)

Contents
Superstore Sales Dataset Cleaned
kaggle.com
Updated Jan 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shubham Jagtap (2023). Superstore Sales Dataset Cleaned [Dataset]. https://www.kaggle.com/datasets/shubhamcjagtap4/superstore-sales-dataset-cleaned
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 13, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Shubham Jagtap
Description
Dataset

This dataset was created by Shubham Jagtap

Contents
i
Three geological sensor data with missing values
ieee-dataport.org
Updated May 17, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
jinsheng yang (2022). Three geological sensor data with missing values [Dataset]. https://ieee-dataport.org/documents/three-geological-sensor-data-missing-values
Explore at:
Dataset updated
May 17, 2022
Authors
jinsheng yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
45710421 x
f
MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...
tandf.figshare.com
pdf
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche (2023). MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as Cellwise and Rowwise Outliers [Dataset]. http://doi.org/10.6084/m9.figshare.7624424.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7624424.v2
Dataset updated
Jun 2, 2023
Dataset provided by
Taylor & Francis
Authors
Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, that is, rows that deviate from the majority of the rows in the data (e.g., they might belong to a different population). In recent years also cellwise outliers are receiving attention. These are suspicious cells (entries) that can occur anywhere in the table. Even a relatively small proportion of outlying cells can contaminate over half the rows, which causes rowwise robust methods to break down. In this article, a new PCA method is constructed which combines the strengths of two existing robust methods to be robust against both cellwise and rowwise outliers. At the same time, the algorithm can cope with missing values. As of yet it is the only PCA method that can deal with all three problems simultaneously. Its name MacroPCA stands for PCA allowing for Missingness And Cellwise & Rowwise Outliers. Several simulations and real datasets illustrate its robustness. New residual maps are introduced, which help to determine which variables are responsible for the outlying behavior. The method is well-suited for online process control.
Temperature Rain Dataset without Missing Values
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 24, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso (2021). Temperature Rain Dataset without Missing Values [Dataset]. http://doi.org/10.5281/zenodo.5129091
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5129091
Dataset updated
Jul 24, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 32072 daily time series showing the temperature observations and rain forecasts, gathered by the Australian Bureau of Meteorology for 422 weather stations across Australia, between 02/05/2015 and 26/04/2017.

The original dataset contains missing values and they have been simply replaced by zeros.
Quarterly Labour Force Survey Household Dataset, April - June, 2021
beta.ukdataservice.ac.uk
Updated 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office For National Statistics (2023). Quarterly Labour Force Survey Household Dataset, April - June, 2021 [Dataset]. http://doi.org/10.5255/ukda-sn-8852-3
Explore at:
Unique identifier
https://doi.org/10.5255/ukda-sn-8852-3
Dataset updated
2023
Dataset provided by
DataCitehttps://www.datacite.org/
UK Data Servicehttps://ukdataservice.ac.uk/
Authors
Office For National Statistics
Description
Background
The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

Household datasets
Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

Change to coding of missing values for household series
From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.

LFS Documentation
The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance page before commencing analysis.

Additional data derived from the QLFS
The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

End User Licence and Secure Access QLFS Household datasets
Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

Changes to variables in QLFS Household EUL datasets
In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

Review of imputation methods for LFS Household data - changes to missing values
A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused to ensure the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/-9') will be present in the data for these personal characteristic variables than before. Therefore if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, then it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.

Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022" style="background-color: rgb(255, 255, 255);">Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022.

Latest edition information
For the third edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN, SOC20M and SOC20O have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022" style="background-color: rgb(255, 255, 255);">Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022.

Facebook

Twitter

Click to copy link

Link copied

Cite

Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso (2021). Vehicle Trips Dataset with Missing Values [Dataset]. http://doi.org/10.5281/zenodo.5122535

Vehicle Trips Dataset with Missing Values

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.5122535

Dataset updated

Jul 23, 2021

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset contains 329 daily time series representing the number of trips and vehicles belonging to a set of for-hire vehicle (FHV) companies. The dataset was originally extracted from https://github.com/fivethirtyeight/uber-tlc-foil-response.

Clear search

Close search

Google apps

Main menu

Vehicle Trips Dataset with Missing Values

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...

Water-quality data imputation with a high percentage of missing values: a...

Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

Data from: Identifying Missing Data Handling Methods with Text Mining

Missing Data in the Uniform Crime Reports (UCR), 1977-2000 [United States]

Missing data in the analysis of multilevel and dependent data (Examples)

Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...

Handle Missing Values

Dataset

Contents

Results of the ML models using KNN imputer.

A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

Data from: Using multiple imputation to estimate missing data in...

Rideshare Dataset without Missing Values

Missing Values Data

Dataset

Contents

Superstore Sales Dataset Cleaned

Dataset

Contents

Three geological sensor data with missing values

MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...

Temperature Rain Dataset without Missing Values

Quarterly Labour Force Survey Household Dataset, April - June, 2021

Vehicle Trips Dataset with Missing ValuesSee More Versions

Vehicle Trips Dataset with Missing Values