Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data are an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data in order to avoid information loss and bias. Over the past 50 years, these methods have become increasingly efficient but also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across various scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format, and we used a text-mining approach to extract the necessary information from our corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, such as listwise and pairwise deletion, remain in widespread use.
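As a rough illustration of the kind of text mining described (not the authors' actual pipeline; the keyword patterns below are assumptions), one can flag which missing-data methods an article mentions with simple regex matching:

```python
# Hypothetical sketch: detect mentions of missing-data handling methods in
# article full texts via keyword/regex matching, then count their frequency.
import re
from collections import Counter

METHOD_PATTERNS = {  # illustrative keyword list, not the paper's codebook
    "multiple imputation": r"\bmultiple imputation\b",
    "FIML": r"\bfull information maximum likelihood\b",
    "listwise deletion": r"\b(listwise|casewise) deletion\b",
    "pairwise deletion": r"\bpairwise deletion\b",
    "mean substitution": r"\bmean (substitution|imputation)\b",
}

def detect_methods(text: str) -> set[str]:
    """Return the set of missing-data methods mentioned in one article."""
    text = text.lower()
    return {name for name, pat in METHOD_PATTERNS.items() if re.search(pat, text)}

def method_frequencies(articles: list[str]) -> Counter:
    """Count how many articles mention each method at least once."""
    counts = Counter()
    for article in articles:
        counts.update(detect_methods(article))
    return counts
```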
The monitoring of surface-water quality, followed by water-quality modeling and analysis, is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables, and these deficiencies are particularly noticeable in developing countries. This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges. To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms belonged to both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR). IDW outperformed the others, achieving very good performance (NSE greater than 0.8) in most cases. In this dataset, we include the original and imputed values for the following variables: water temperature (Tw), dissolved oxygen (DO), electrical conductivity (EC), pH, turbidity (Turb), nitrite (NO2-), nitrate (NO3-), and total nitrogen (TN). Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]. More details about the study area, the original datasets, and the methodology adopted can be found in our paper at https://www.mdpi.com/2071-1050/13/11/6318. If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
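The following is an illustrative sketch (not the authors' exact pipeline) of how several of the listed scikit-learn regressors can be compared for imputing one station's variable from the other stations' concurrent measurements, scored with NSE:

```python
# Sketch under stated assumptions: `df` holds one column per station for a
# single water-quality variable, with NaN where values are missing.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, HuberRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

def nse(obs, sim):
    """Nash-Sutcliffe efficiency (R^2 computed against the observed mean)."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def compare_imputers(df: pd.DataFrame, target: str):
    """Fit each candidate regressor on rows where `target` is observed."""
    known = df.dropna(subset=[target])
    X = known.drop(columns=[target]).fillna(known.mean(numeric_only=True))
    y = known[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    models = {
        "RFR": RandomForestRegressor(random_state=0),
        "BR": BayesianRidge(),
        "HR": HuberRegressor(),
        "KNNR": KNeighborsRegressor(n_neighbors=5),
    }
    return {name: nse(y_te, m.fit(X_tr, y_tr).predict(X_te)) for name, m in models.items()}
```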
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
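To convey the core idea behind MIDAS, here is a minimal denoising-autoencoder imputer written in PyTorch. This is a sketch for intuition only, not the authors' MIDAS software, and it omits the multiple-imputation draws and dropout-based uncertainty that the method relies on:

```python
# Minimal sketch: corrupt observed entries, train to reconstruct them, then
# read off predictions for the originally missing cells.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def impute(data: torch.Tensor, observed: torch.Tensor, epochs: int = 200):
    """data: (n, p) float tensor with missing cells set to 0; observed: bool mask."""
    model = DenoisingAE(data.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        drop = torch.rand_like(data) > 0.2           # corrupt ~20% of cells at random
        recon = model(data * drop.float())
        # reconstruction loss only on cells that were actually observed
        loss = ((recon - data) ** 2)[observed].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        filled = model(data)
    return torch.where(observed, data, filled)       # keep observed values, fill the rest
```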
The Missing Person Information Clearinghouse was established on July 1, 1985, within the Department of Public Safety as a program for compiling, coordinating, and disseminating information on missing persons and unidentified bodies/persons. Housed within the Division of Criminal Investigation, the clearinghouse helps locate missing persons through public awareness and cooperation, and educates law enforcement officers and the general public about missing person issues.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the NN5 forecasting competition. It contains 111 time series from the banking domain. The goal is to predict the daily cash withdrawals from ATMs in the UK.
The original dataset contains missing values. A missing value on a particular day is replaced by the median of the values observed on the same day of the week across the whole series.
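A short pandas sketch of the gap-filling rule described above (an assumed implementation, not the competition organizers' code):

```python
# Replace each missing daily value with the median of all values falling on
# the same weekday anywhere in the series.
import pandas as pd

def fill_with_weekday_median(series: pd.Series) -> pd.Series:
    """series: daily values indexed by a DatetimeIndex, with NaN for gaps."""
    weekday_median = series.groupby(series.index.dayofweek).transform("median")
    return series.fillna(weekday_median)

# Example usage:
# idx = pd.date_range("1996-03-18", periods=14, freq="D")
# s = pd.Series([1.0, None, 3, 4, 5, 6, 7, 8, 9, 10, None, 12, 13, 14], index=idx)
# filled = fill_with_weekday_median(s)
```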
We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
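A toy sketch of the extreme value bounds idea with a binary outcome (variable names are illustrative; the paper's imputation step, which narrows these bounds, is not shown):

```python
# Fill each unit's unobserved potential outcome with the worst and best
# possible values (0 or 1) to bound the average treatment effect.
import numpy as np

def ate_bounds(y, treated):
    """y: observed binary outcomes; treated: boolean treatment indicator."""
    y, treated = np.asarray(y, float), np.asarray(treated, bool)
    # Y(1): observed for treated units, bounded by [0, 1] for control units
    y1_low, y1_high = np.where(treated, y, 0.0), np.where(treated, y, 1.0)
    # Y(0): observed for control units, bounded by [0, 1] for treated units
    y0_low, y0_high = np.where(~treated, y, 0.0), np.where(~treated, y, 1.0)
    lower = y1_low.mean() - y0_high.mean()
    upper = y1_high.mean() - y0_low.mean()
    # before any imputation the width is always 100 percentage points
    return lower, upper
```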
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
1. The analysis of morphological diversity frequently relies on the use of multivariate methods for characterizing biological shape. However, many of these methods are intolerant of missing data, which can limit the use of rare taxa and hinder the study of broad patterns of ecological diversity and morphological evolution. This study applied a multi-dataset approach to compare variation in missing data estimation and its effect on geometric morphometric analysis across taxonomically variable groups, landmark positions and sample sizes. 2. Missing morphometric landmark data were simulated from five real, complete datasets, including modern fish, primates and extinct theropod dinosaurs. Missing landmarks were then estimated using several standard approaches and a geometric-morphometric-specific method. The accuracy of missing data estimation was determined for each estimation method, landmark position, and morphological dataset. Procrustes superimposition was used to compare the eigenvectors and principal component scores of a geometric morphometric analysis of the original landmark data to datasets with (A) missing values estimated, or (B) simulated incomplete specimens excluded, for varying levels of specimen incompleteness and sample sizes. 3. Standard estimation techniques were more reliable estimators and had lower impacts on morphometric analysis than a geometric-morphometric-specific estimator. For most datasets and estimation techniques, estimating missing data produced a better fit to the structure of the original data than excluding incomplete specimens, and this was maintained even at considerably reduced sample sizes. The impact of missing data on geometric morphometric analysis was disproportionately affected by the most fragmentary specimens. 4. Missing data estimation was influenced by the variability of specific anatomical features, and may be improved by a better understanding of the shape variation present in a dataset. Our results suggest that the inclusion of incomplete specimens through the use of effective missing data estimators better reflects the patterns of shape variation within a dataset than using only complete specimens; however, the effectiveness of missing data estimation can be maximized by excluding only the most incomplete specimens. It is advised that missing data estimators be evaluated for each dataset and landmark independently, as the effectiveness of estimators can vary strongly and unpredictably between different taxa and structures.
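For intuition, here is a sketch of one standard estimator of the kind evaluated above, regression imputation of a missing landmark from the remaining landmarks of complete specimens. It is illustrative only, not the paper's code, and assumes the specimen being estimated has all of its other landmarks observed:

```python
# Estimate a missing landmark's (x, y) coordinates by regressing on the
# other landmarks across specimens that are completely scored.
import numpy as np
from sklearn.linear_model import LinearRegression

def estimate_missing_landmark(coords: np.ndarray, specimen: int, landmark: int) -> np.ndarray:
    """coords: (n_specimens, n_landmarks, 2) array with np.nan for missing landmarks."""
    n, k, d = coords.shape
    flat = coords.reshape(n, k * d)
    target_cols = [landmark * d, landmark * d + 1]
    other_cols = [c for c in range(k * d) if c not in target_cols]
    complete = ~np.isnan(flat).any(axis=1)          # fully scored specimens only
    model = LinearRegression().fit(flat[complete][:, other_cols],
                                   flat[complete][:, target_cols])
    pred = model.predict(flat[[specimen]][:, other_cols])
    return pred.ravel()                             # estimated (x, y)
```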
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To fully understand macroevolutionary patterns and processes, we need to include both extant and extinct species in our models. This requires phylogenetic trees with both living and fossil taxa at the tips. One way to infer such phylogenies is the Total Evidence approach, which uses molecular data from living taxa and morphological data from living and fossil taxa. Although the Total Evidence approach is very promising, it requires a great deal of data that can be hard to collect. Therefore this method is likely to suffer from missing data issues that may affect its ability to infer correct phylogenies. Here we use simulations to assess the effects of missing data on tree topologies inferred from Total Evidence matrices. We investigate three major factors that directly affect the completeness and the size of the morphological part of the matrix: the proportion of living taxa with no morphological data, the amount of missing data in the fossil record, and the overall number of morphological characters in the matrix. We infer phylogenies from complete matrices and from matrices with various amounts of missing data, and then compare missing data topologies to the "best" tree topology inferred using the complete matrix. We find that the number of living taxa with morphological characters and the overall number of morphological characters in the matrix are more important than the amount of missing data in the fossil record for recovering the "best" tree topology. Therefore, we suggest that sampling effort should be focused on morphological data collection for living species to increase the accuracy of topological inference in a Total Evidence framework. Additionally, we find that Bayesian methods consistently outperform other tree inference methods. We therefore recommend using Bayesian consensus trees to fix the tree topology prior to further analyses.
NamUs is the only national repository for missing, unidentified, and unclaimed persons cases. The program provides a singular resource hub for law enforcement, medical examiners, coroners, and investigating professionals. It is the only national database for missing, unidentified, and unclaimed persons that allows limited access to the public, empowering family members to take a more proactive role in the search for their missing loved ones.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the KDD Cup 2018 forecasting competition. It contains long hourly time series representing air quality levels at 59 stations in 2 cities, Beijing (35 stations) and London (24 stations), from 01/01/2017 to 31/03/2018. The air quality level is represented by multiple measurements such as PM2.5, PM10, NO2, CO, O3 and SO2.
The dataset uploaded here contains 282 hourly time series, which have been categorized by city, station name and air quality measurement.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built from Overbuff data with the help of Python and Selenium. Development environment: Jupyter Notebook.
The tables contain data for competitive seasons 1-4 and for quick play, for each hero and rank, along with the standard statistics (those common to all heroes as well as information specific to each hero).
Note: data for some columns are missing on the Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, and Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion no longer has this property in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned out.
Attention: Overbuff doesn't contain information about OW 1 competitive seasons (when you change skill tier, the data isn't changed). If you know a site where it's possible to get this data, please leave a comment. Thank you!
The code is available on GitHub.
The whole procedure is done in five stages:
Data are retrieved directly from HTML elements on the page using Selenium in Python.
After scraping, the data were cleaned: 1) the comma thousands separator was removed (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') were translated to seconds (1*60 + 23 => 83); 3) names were normalized to ASCII (Lúcio => Lucio, Torbjörn => Torbjorn). A sketch of these transformations is shown after this list of stages.
Data were arranged into a table and saved to CSV.
Columns that are supposed to contain only numeric values are checked, and all non-numeric values are dropped. This stage helps find missing values recorded as '—' and delete them.
Additional missing values are searched for and dealt with, either by renaming a column (when the program cannot infer the correct column name for missing values) or by dropping it. This stage ensures all incorrect data are truly fixed.
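A minimal sketch of the cleaning transformations mentioned in the second stage (illustrative only, not the repository's exact code):

```python
import unicodedata

def remove_thousands_separator(value: str) -> str:
    """'1,009' -> '1009'"""
    return value.replace(",", "")

def time_to_seconds(value: str) -> int:
    """'01:23' -> 83"""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

def to_ascii(name: str) -> str:
    """'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'"""
    return unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
```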
The procedure to fetch the data takes 7 minutes on average.
This project and its code were based on existing GitHub code.
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.”
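A generic sketch of the two-step multiple-imputation workflow the abstract refers to (illustrative only, not the authors' software): run the intended analysis on each of the m completed datasets, then combine the estimates with Rubin's rules.

```python
import numpy as np

def pool_rubins_rules(estimates, variances):
    """estimates, variances: length-m arrays of point estimates and squared
    standard errors from the m completed-data analyses."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    q_bar = estimates.mean()                   # pooled point estimate
    u_bar = variances.mean()                   # within-imputation variance
    b = estimates.var(ddof=1)                  # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b        # Rubin's total variance
    return q_bar, np.sqrt(total_var)

# Usage (hypothetical numbers): pool a coefficient estimated on 3 imputed datasets.
# est, se = pool_rubins_rules([0.42, 0.40, 0.45], [0.010, 0.012, 0.011])
```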
No description is available. Visit https://dataone.org/datasets/doi%3A10.5063%2FAA%2Ftao.12069.1 for complete metadata about this dataset.
Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.
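As a hedged illustration of the kind of comparison the article describes (not the author's replication code; the column names and model formula below are hypothetical), listwise deletion and multiple imputation can be contrasted on the same linear model using statsmodels:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.imputation import mice

def compare_listwise_vs_mi(df, formula="democracy ~ oil_wealth + gdp_pc"):
    """df: pandas DataFrame with NaNs in some columns."""
    # 1) Listwise deletion: the formula interface drops incomplete rows by default.
    listwise_fit = smf.ols(formula, data=df).fit()

    # 2) Multiple imputation via chained equations, with pooled estimates.
    imp = mice.MICEData(df)
    mi = mice.MICE(formula, sm.OLS, imp)
    mi_fit = mi.fit(n_burnin=10, n_imputations=10)
    return listwise_fit, mi_fit
```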
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a basic web map for showing the Yosemite Search and Rescue missing person dataset, displaying the initial planning point, the point found, and the direct line path between the two. For more information, please see Jared Doke's MS thesis: Analysis of Search Incidents and Lost Person Behavior in Yosemite National Park. Study of wilderness search and rescue (WiSAR) incidents suggests a dependency on demographics as well as physical geography in relation to decisions made before/after becoming lost and the subsequent locations in which subjects are found. Thus, an understanding of the complex relationship between demographics and physical geography could enhance the responders' ability to locate the subject in a timely manner. Various global datasets have been organized to provide general distance- and feature-based geostatistical methods for describing this relationship. However, there is some question as to the applicability of these generalized datasets to local incidents that are dominated by a specific physical geography. This study consists of two primary objectives related to the allocation of geographic probability intended to manage the overall size of the search area. The first objective considers the applicability of a global dataset of lost person incidents to a localized environment with limited geographic diversity. This is followed by a comparison between a commonly used Euclidean distance statistic and an alternative travel-cost model that accounts for the influence of anthropogenic and landscape features on subject mobility and travel time. In both instances, lost person incident data from the years 2000 to 2010 for Yosemite National Park are used and compared to a large pool of internationally compiled cases consisting of similar subject profiles.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Generalized Euclidean Distance (GED) has been extensively used to conduct morphological disparity analyses based on palaeontological matrices of discrete characters. This is in part because some implementations allow the use of morphological matrices with high percentages of missing data without needing to prune taxa for a subsequent ordination of the data set. Previous studies have suggested that this way of using the GED may generate a bias in the resulting morphospace, but a detailed study of this possible effect was still lacking. Here, we test if the percentage of missing data for a taxon artificially influences its position in the morphospace, and if missing data affects pre- and post-ordination disparity measures. We find that this use of the GED creates a systematic bias, whereby taxa with higher percentages of missing data are placed closer to the centre of the morphospace than those with more complete scorings. This bias extends into pre- and post-ordination calculations of disparity measures and can lead to erroneous interpretations of disparity patterns, especially if specimens present in a particular time interval or clade have distinct proportions of missing information. We suggest that this implementation of the GED should be used with caution, especially in cases with high percentages of missing data. Results recovered using an alternative distance measure, Maximum Observed Rescaled Distance (MORD), are more robust to missing data. As a consequence, we suggest that MORD is a more appropriate distance measure than GED when analysing data sets with high amounts of missing data.
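For intuition, here is a rough sketch of the two inter-taxon distances discussed above, for a matrix of discrete characters with missing entries. The exact weighting used in published implementations (for example in the Claddis R package) may differ; this is a simplified formulation for illustration only.

```python
# a, b: character scores for two taxa (np.nan = missing);
# char_ranges: per-character range (max - min) across all taxa.
import numpy as np

def pairwise_distance(a, b, char_ranges, kind="MORD"):
    comparable = ~np.isnan(a) & ~np.isnan(b)
    diffs = np.abs(a[comparable] - b[comparable])
    if kind == "MORD":
        # mean difference over comparable characters, each rescaled to at most 1
        ranges = np.where(char_ranges[comparable] == 0, 1, char_ranges[comparable])
        return (diffs / ranges).mean()
    if kind == "GED":
        # fill incomparable characters with the mean observed difference,
        # then take a Euclidean-style distance over all characters; this is
        # what pulls poorly scored taxa toward the centre of the morphospace
        filled = np.full(len(a), diffs.mean())
        filled[comparable] = diffs
        return np.sqrt(np.sum(filled ** 2))
    raise ValueError(kind)
```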
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
On this page, you will find the imputed tax loss carryforward (TLCF) values based on the algorithm presented in Max, Wielhouwer & Wiersma (2023), Estimating and imputing missing tax loss carryforward data to reduce measurement error, European Accounting Review 32(1), 55-84, https://doi.org/10.1080/09638180.2021.1924812. Note that the dataset contains only the imputations for the values missing on Compustat; if the Compustat TLCF is available, it is not included in the dataset here. We download all observations from Compustat from 1982 up until the most recent year. Missing values on the input variables for the Shevlin (1990) taxable income measure are replaced by zero, and we drop firms from the sample if years are missing in their time series. Note: we seek to update the dataset each year with the latest imputations, so pay attention to the 'Versions' tab of this dataset. Each update will be posted as a new version.
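A small pandas sketch of the two sample-construction rules described above (column names such as 'gvkey' and 'fyear' are assumptions about a Compustat-style panel; the paper's actual variable construction is more involved):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, input_cols: list[str]) -> pd.DataFrame:
    """df: firm-year panel with firm id 'gvkey' and fiscal year 'fyear'."""
    out = df.copy()
    # 1) replace missing input variables for the taxable-income measure by zero
    out[input_cols] = out[input_cols].fillna(0)

    # 2) drop firms whose yearly time series has gaps
    def has_no_gaps(group: pd.DataFrame) -> bool:
        years = group["fyear"].sort_values()
        return (years.diff().dropna() == 1).all()

    return out.groupby("gvkey").filter(has_no_gaps)
```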
Data and code for "Missing Data in Asset Pricing Panels"