Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques for handling missing data in order to avoid information loss and bias. Over the past 50 years, these methods have become increasingly efficient but also more complex. Building on previous review studies, this paper analyzes which missing-data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016; JSTOR provided the data in text format, and we applied a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing-data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, like listwise and pairwise deletion, remain in widespread use.
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
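As a rough illustration of the approach described in the abstract (not the authors' released MIDAS software), the sketch below trains a small denoising autoencoder in PyTorch, treats missing cells as additional corruption, computes the reconstruction loss only on originally observed entries, and draws multiple imputations from the trained network. All names, layer sizes, and hyperparameters are assumptions for demonstration only.

```python
# Illustrative denoising-autoencoder imputation in the spirit of MIDAS.
# Not the authors' implementation; sizes and names are assumptions.
import numpy as np
import torch
import torch.nn as nn

def dae_impute(X, m=5, epochs=200, hidden=64, corrupt_p=0.2, lr=1e-3):
    """X: 2-D numpy array with np.nan for missing entries. Returns m completed copies."""
    X = np.asarray(X, dtype=np.float32)
    obs_mask = ~np.isnan(X)                                        # observed-entry indicator
    x = torch.tensor(np.where(obs_mask, X, 0.0).astype(np.float32))  # zero-fill missing cells
    mask = torch.tensor(obs_mask.astype(np.float32))

    d = X.shape[1]
    net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                        nn.Linear(hidden, hidden), nn.ReLU(),
                        nn.Linear(hidden, d))
    opt = torch.optim.Adam(net.parameters(), lr=lr)

    for _ in range(epochs):
        # Corrupt a random subset of *observed* entries, mimicking extra missingness.
        keep = (torch.rand_like(x) > corrupt_p).float() * mask
        recon = net(x * keep)
        # Reconstruction error is evaluated only on originally observed entries.
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum()
        opt.zero_grad(); loss.backward(); opt.step()

    # Draw m imputations by passing independently corrupted inputs through the net.
    completed = []
    with torch.no_grad():
        for _ in range(m):
            keep = (torch.rand_like(x) > corrupt_p).float() * mask
            recon = net(x * keep).numpy()
            completed.append(np.where(obs_mask, X, recon))
    return completed
```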
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients is a major target of clinical data science. However, achieving large sample sizes can be a challenge because of difficulties in patient recruitment and long follow-up times, which increases the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are crucial. The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset in a statistically legitimate way, without the need for specific hypotheses or regression models. The cardinality enhancement was validated against an established random repeated-measures method with respect to the correctness of predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method enhanced the HF dataset cardinality (711 patients before dataset preprocessing) roughly 10-fold, and roughly 21-fold when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and narrow dataset cardinality are an issue.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
- ID = group identifier (1-2000)
- x = numeric (Level 1)
- y = numeric (Level 1)
- w = binary (Level 2)
In all data sets, missing values are coded as "NA".
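A minimal sketch for reading the plain-text version of these example data in Python is shown below; the file name and whitespace delimiter are assumptions (in R, the ".rda" files can be loaded directly with load()).

```python
# Read the plain-text example data; "NA" codes missing values.
# The file name and delimiter are assumptions; adjust to the actual files.
import pandas as pd

df = pd.read_csv("example1.dat", sep=r"\s+", na_values="NA")
print(df.columns.tolist())                          # expected variables: ID, x, y, w
print(df.isna().mean())                             # share of missing values per variable
print(df.groupby("ID")[["x", "y"]].mean().head())   # group-level (Level 2) summaries
```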
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as the test example. The empirical results show not only that the positive impact of Internet penetration on public services is robust, but also that the machine-learning imputation method outperforms random and multiple imputation, greatly improving the model's explanatory power. The panel data after machine-learning imputation, with better continuity in the time trend, can feasibly be analyzed, including with a dynamic panel model. The long-term effects of the Internet on public services are found to be significantly stronger than the short-term effects. Finally, some mechanisms behind the empirical results are discussed.
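The article's own imputation model is not reproduced here; as a generic, hedged illustration of machine-learning-based imputation, the sketch below uses scikit-learn's IterativeImputer with a random-forest estimator on made-up panel variables. Variable names and data are placeholders, not the study's data.

```python
# Generic machine-learning imputation sketch (not the authors' exact method).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "internet_penetration": rng.uniform(0, 1, 200),     # hypothetical panel variables
    "gdp_per_capita": rng.normal(10, 2, 200),
    "public_service_index": rng.normal(50, 10, 200),
})
panel.loc[rng.choice(200, 40, replace=False), "public_service_index"] = np.nan  # inject missingness

imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=200, random_state=0),
                           max_iter=10, random_state=0)
panel_imputed = pd.DataFrame(imputer.fit_transform(panel), columns=panel.columns)
print(panel_imputed.isna().sum())   # no missing values remain after imputation
```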
This dataset contains the data used in the tutorial article "Handling Planned and Unplanned Missing Data in a Longitudinal Study", in press at "The Quantitative Methods for Psychology". It contains a subset of longitudinal data collected within the context of a survey about COVID-19 (data on sleep and emotions). This dataset is intended for tutorial purposes only. With the observations and variables in this dataset, the analyses presented in the tutorial can be reproduced. For more information, see de la Sablonnière et al. (2020).
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” See also: Missing Data
We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
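To make the bounding logic concrete, the toy sketch below computes extreme value bounds on the average treatment effect for a binary outcome by filling each unit's unobserved potential outcome with its best- and worst-case value. The data are invented for illustration, not the paper's 63 cases; for a binary outcome the pre-imputation bounds are always 100 percentage points wide, which the example reproduces.

```python
# Toy extreme value (best/worst-case) bounds for a binary outcome.
# D = treatment indicator (e.g., transitional truth commission), Y = observed outcome.
import numpy as np

D = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # 2 treated cases, 8 untreated (illustrative)
Y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])   # realized outcome for each case

def extreme_value_bounds(D, Y):
    y1 = np.where(D == 1, Y, np.nan)            # Y(1) observed only for treated cases
    y0 = np.where(D == 0, Y, np.nan)            # Y(0) observed only for untreated cases
    fill = lambda v, value: np.where(np.isnan(v), value, v)
    lower = fill(y1, 0.0).mean() - fill(y0, 1.0).mean()   # smallest possible ATE
    upper = fill(y1, 1.0).mean() - fill(y0, 0.0).mean()   # largest possible ATE
    return lower, upper

lo, hi = extreme_value_bounds(D, Y)
print(f"ATE bounds with no imputation: [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
# Counterfactual imputation replaces some of these 0/1 fills with inferred values,
# which is what shrinks the width of the bounds in the article.
```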
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IMAGIC-500 is a large-scale, fully synthetic benchmark dataset designed to evaluate missing data imputation methods on hierarchical, real-world-like socio-economic survey data. It is derived from the World Bank's Synthetic Data for an Imaginary Country (SDIC, 2023), an openly available synthetic census-like dataset simulating a fictional middle-income country. This dataset combines the individual-level and household-level components of SDIC by joining them on household ID, preserving the nested structure of real survey data (individual → household → district → province). From this joined population, we sample 500,000 individuals across approximately 136,476 households, ensuring broad geographic and demographic diversity. For downstream tasks, we select 19 mixed-type variables from the SDIC attributes, covering both household-level and individual-level information: household-level features include geographic and socioeconomic variables, while individual-level features include demographic and socioeconomic attributes. We also select the individual's highest educational attainment ("cat_educ_attain") as the target variable for downstream tasks.
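A minimal sketch of the join-and-sample construction described above is given below; file names and the household-ID column name are assumptions, not the actual SDIC schema.

```python
# Join SDIC individual- and household-level tables on the household ID, then
# sample individuals. File and column names (e.g. "hh_id") are assumptions.
import pandas as pd

individuals = pd.read_csv("sdic_individuals.csv")
households = pd.read_csv("sdic_households.csv")

joined = individuals.merge(households, on="hh_id", how="left")  # individual -> household nesting
sample = joined.sample(n=500_000, random_state=42)              # draw the benchmark population
print(sample["hh_id"].nunique())                                # ~136,476 households expected
```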
Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors
Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning models. A regression-based missing data imputation method using the light gradient boosting machine (LightGBM) algorithm was employed to impute over 60% of the missing data.
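The exact model configuration is not described here; the sketch below illustrates one plausible form of regression-based imputation with LightGBM, training on complete rows and predicting the missing entries of a single column. Column names are hypothetical.

```python
# Regression-based imputation of one numeric column with LightGBM (illustrative only).
import pandas as pd
from lightgbm import LGBMRegressor

def impute_column(df, target, features):
    """Fill missing values in `target` by regressing on `features` over complete rows."""
    known = df[target].notna()
    model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(df.loc[known, features], df.loc[known, target])
    df.loc[~known, target] = model.predict(df.loc[~known, features])
    return df

# Hypothetical usage with made-up radionuclide-diffusion variables:
# df = impute_column(df, "diffusion_coefficient", ["temperature", "ionic_strength", "ph"])
```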
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a minimal pandas sketch of these steps is shown below):
- Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: unified inconsistent column naming (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
- Outliers: detected and handled based on domain logic and distribution analysis.
- Categorization: converted numeric ages into grouped age categories for comparative analysis.
- Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
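The following pandas sketch mirrors the steps listed above; the input file name and some column names (such as 'Age' and 'Employment Status') are assumptions based on the attribute list.

```python
# Minimal cleaning pipeline mirroring the transformations described above.
import pandas as pd

df = pd.read_csv("messy_employment_india.csv")   # file name is an assumption

# Inconsistent formatting: unify column names, capitalization, and string spacing.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# Incorrect data types: salary stored as text -> numeric (unparseable values become NaN).
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Duplicate records: drop exact duplicates.
df = df.drop_duplicates()

# Missing values: drop rows missing critical fields, impute the rest.
df = df.dropna(subset=["Employment Status"])
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())

# Categorization: numeric age -> grouped age categories.
df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 35, 50, 100],
                         labels=["18-25", "26-35", "36-50", "50+"])
```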
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding data into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GENERAL INFORMATION
Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation
Date of data collection: January to March 2022
Collection instrument: SurveyMonkey
Funding: Alfred P. Sloan Foundation
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license
Links to publications that cite or use the data:
Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437
Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266
DATA & FILE OVERVIEW
File List
Additional related data collected but not included in the current data package: open-ended questions asked of respondents
METHODOLOGICAL INFORMATION
Description of methods used for collection/generation of data:
The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.
We received 3,632 responses, 2,509 of which were complete, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final dataset contains 2,492 complete responses, an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produces a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).
Methods for processing the data:
Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.
Instrument- or software-specific information needed to interpret the data:
The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format as CSV. The Codebook is required to interpret the values.
DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata
Number of variables: 95
Number of cases/rows: 2,492
Missing data codes: 999 = Not asked
Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
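When working with the coded CSV export, the "999 = Not asked" code can be converted to missing values at load time; a minimal sketch is below (the CSV file name is an assumption).

```python
# Treat the 999 "Not asked" code as missing when loading the coded CSV export.
import pandas as pd

survey = pd.read_csv("MDCDataCitationReuse2021surveydata.csv", na_values=[999])
print(survey.shape)                                              # expected: 2,492 rows, 95 variables
print(survey.isna().sum().sort_values(ascending=False).head())   # items most often "Not asked"
```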
https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets with synthetically generated missingness structures, based on the publicly available "BreastCancerCoimbra" dataset (M. Patrício, J. Pereira, J. Crisóstomo, P. Matafome, M. Gomes, R. Seiça, and F. Caramelo. "Using resistin, glucose, age and bmi to predict the presence of breast cancer". BMC Cancer, 18(1):29, 2018). The datasets are part of the supplemental material for: Johansson Fernstad, S., Alsufyani, S., Del-Din, S., Yarnall, A., & Rochester, L. (2025). "To Measure What Isn’t There — Visual Exploration of Missingness Structures Using Quality Metrics", which is under review. The generation of the synthetic missingness structures is described in that paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data publication provides access to (1) an archive of maps and statistics on MISR L1B2 GRP data products updated as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-2019-210), (2) a user manual describing this archive, (3) a large archive of standard (unprocessed) MISR data files that can be used in conjunction with the IDL software repository published on GitHub and available from https://github.com/mmverstraete (Verstraete et al., 2019, https://doi.org/10.5281/zenodo.3519989), (4) an additional archive of maps and statistics on MISR L1B2 GRP data products updated as described for eight additional Blocks of MISR data, spanning a broader range of climatic and environmental conditions (between Iraq and Namibia), and (5) a user manual describing this second archive. The authors also make a self-contained, stand-alone version of that processing software available to all users, using the IDL Virtual Machine technology (which does not require an IDL license) from Verstraete et al., 2020: http://doi.org/10.5880/fidgeo.2020.011. (1) The compressed archive 'L1B2_Out.zip' contains all outputs produced in the course of generating the various Figures of the manuscript Verstraete et al. (2020b). Once this archive is installed and uncompressed, 9 subdirectories named Fig-fff-Ttt_Pxxx-Oyyyyyy-Bzzz are created, where fff, tt, xxx, yyyyyy and zzz stand for the Figure number, an optional Table number, Path, Orbit and Block numbers, respectively. These directories contain collections of text, graphics (maps and scatterplots) and binary data files relative to the intermediary, final and ancillary results generated while preparing those Figures. Maps and scatterplots are provided as graphics files in PNG format. Map legends are plain text files with the same names as the maps themselves, but with a file extension '.txt'. Log files are also plain text files. They are generated by the software that creates those graphics files and provide additional details on the intermediary and final results. The processing of MISR L1B2 GRP data product files requires access to cloud masks for the same geographical areas (one for each of the 9 cameras). Since those masks are themselves derived from the L1B2 GRP data and therefore also contain missing data, the outcomes from updating the RCCM data products, as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-12-611-2020), are also included in this archive. The last 2 subdirectories contain the outcomes from the normal processing of the indicated data files, as well as those generated when additional missing data are artificially inserted in the input files for the purpose of assessing the performance of the algorithms. (2) The document 'L1B2_Out.pdf' provides the User Manual to install and explore the compressed archive 'L1B2_Out.zip'. (3) The compressed archive 'L1B2_input_68050.zip' contains MISR L1B2 GRP and RCCM data for the full Orbit 68050, acquired on 3 October 2012, as well as the corresponding AGP file, which is required by the processing system to update the radiance product. This archive includes data for a wide range of locations, from Russia to north-west Iran, central and eastern Iraq, Saudi Arabia, and many more countries along the eastern coast of the African continent. It is provided to allow users to analyze actual data with the software package mentioned above, without needing to download MISR data from the NASA ASDC web site. 
(4) The compressed archive 'L1B2_Suppl.zip' contains a set of results similar to the archive 'L1B2_Out.zip' mentioned above, for four additional sites spanning a much wider range of geographical, climatic and ecological conditions: these cover areas in Iraq (marsh and arid lands), Kenya (agriculture and tropical forests), South Sudan (grasslands) and Namibia (coastal desert and Atlantic Ocean). Two of them involve largely clear scenes, and the other two include clouds. The last case also includes a test in which missing data are artificially introduced over deep water and clouds, to demonstrate the performance of the procedure on targets other than continental areas. Once uncompressed, this new archive expands into 8 subdirectories and takes up 1.8 GB of disk space, providing access to about 2,900 files. (5) The companion user manual 'L1B2_Suppl.pdf' describes how to install, uncompress and explore those additional files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information about appeal cases heard at the Supreme Court of Nigeria (SCN) between 1962 and 2022. The dataset was extracted from case files provided by The Prison Law Pavillion, a data archiving firm in Nigeria. It originally consisted of documentation of the various appeal cases alongside the outcome of the SCN judgment. Feature extraction techniques were used to generate a structured dataset containing a number of annotated features. The dataset consists of 14 features, including the outcome of the judgment; the remaining 13 features are input variables, of which 4 are stored as strings and 9 as numeric values. Missing values among the numeric features are represented by the value -1. Unsupervised and supervised machine learning algorithms can be applied to the dataset to extract information for a better understanding of the relationships among the features and for predicting the target class, the outcome of the SCN judgment.
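Before modelling, the -1 sentinel in the numeric features can be recoded to proper missing values; a minimal sketch is below (file and column handling are assumptions).

```python
# Recode the -1 sentinel used for missing numeric values to NaN before modelling.
import numpy as np
import pandas as pd

scn = pd.read_csv("scn_appeal_cases.csv")                  # file name is an assumption
numeric_cols = scn.select_dtypes(include="number").columns
scn[numeric_cols] = scn[numeric_cols].replace(-1, np.nan)  # -1 codes missing numeric values
print(scn[numeric_cols].isna().mean())                     # proportion missing per numeric feature
```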
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.
https://www.archivemarketresearch.com/privacy-policy
The Big Data Analysis Platform market is experiencing robust growth, projected to reach $121.07 billion in 2025. While the provided CAGR is missing, considering the rapid advancements in data analytics technologies and the increasing adoption across diverse sectors like computer, electronics, energy, machinery, and chemicals, a conservative estimate of a 15% Compound Annual Growth Rate (CAGR) from 2025 to 2033 seems plausible. This would indicate substantial market expansion, driven by the exponential growth of data volume, the need for improved business intelligence, and the rise of advanced analytics techniques like machine learning and AI. Key drivers include the increasing demand for real-time data insights, the need for better decision-making, and the growing adoption of cloud-based solutions. Trends such as the integration of big data with IoT devices, the increasing use of data visualization tools, and the focus on data security are further shaping the market landscape. Despite the opportunities, challenges such as the complexity of big data implementation, the need for skilled professionals, and data privacy concerns represent significant restraints. The market is segmented by application and geography, with North America and Europe currently dominating, but Asia-Pacific is expected to show significant growth in the coming years due to increasing digitalization and investment in technology. The competitive landscape is highly dynamic, with established players like IBM, Microsoft, and Google competing alongside specialized analytics companies such as Alteryx and Splunk, and numerous emerging firms. The success of individual companies will depend on factors including the breadth and depth of their analytical capabilities, the ease of use of their platforms, the strength of their integrations with existing systems, and their capacity to address industry-specific needs. The forecast period from 2025-2033 presents immense opportunities for both established and emerging companies that can effectively innovate and address the evolving demands of the Big Data Analysis Platform market. The ability to offer scalable, secure, and insightful solutions will be crucial for gaining market share and achieving sustainable growth.
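The compound-growth arithmetic behind that projection, under the assumed 15% CAGR stated above, is simple to verify:

```python
# Project the 2025 base forward over the 2025-2033 window at the assumed 15% CAGR.
base_2025_usd_bn = 121.07
cagr = 0.15
years = 2033 - 2025
projected_2033 = base_2025_usd_bn * (1 + cagr) ** years
print(round(projected_2033, 1))   # roughly 370 (USD billions) under these assumptions
```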