Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques for handling missing data in order to avoid information loss and bias. Over the past 50 years, these methods have become increasingly efficient but also more complex. Building on previous review studies, this paper analyzes which missing-data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016; JSTOR provided the data in text format, and we applied a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing-data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, like listwise and pairwise deletion, remain in widespread use.
Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
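As a rough illustration of the approach described in the abstract (not the authors' released MIDAS software), the sketch below trains a small denoising autoencoder in PyTorch, treats missing cells as additional corruption, computes the reconstruction loss only on originally observed entries, and draws multiple imputations from the trained network. All names, layer sizes, and hyperparameters are assumptions for demonstration only.

```python
# Illustrative denoising-autoencoder imputation in the spirit of MIDAS.
# Not the authors' implementation; sizes and names are assumptions.
import numpy as np
import torch
import torch.nn as nn

def dae_impute(X, m=5, epochs=200, hidden=64, corrupt_p=0.2, lr=1e-3):
    """X: 2-D numpy array with np.nan for missing entries. Returns m completed copies."""
    X = np.asarray(X, dtype=np.float32)
    obs_mask = ~np.isnan(X)                                        # observed-entry indicator
    x = torch.tensor(np.where(obs_mask, X, 0.0).astype(np.float32))  # zero-fill missing cells
    mask = torch.tensor(obs_mask.astype(np.float32))

    d = X.shape[1]
    net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                        nn.Linear(hidden, hidden), nn.ReLU(),
                        nn.Linear(hidden, d))
    opt = torch.optim.Adam(net.parameters(), lr=lr)

    for _ in range(epochs):
        # Corrupt a random subset of *observed* entries, mimicking extra missingness.
        keep = (torch.rand_like(x) > corrupt_p).float() * mask
        recon = net(x * keep)
        # Reconstruction error is evaluated only on originally observed entries.
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum()
        opt.zero_grad(); loss.backward(); opt.step()

    # Draw m imputations by passing independently corrupted inputs through the net.
    completed = []
    with torch.no_grad():
        for _ in range(m):
            keep = (torch.rand_like(x) > corrupt_p).float() * mask
            recon = net(x * keep).numpy()
            completed.append(np.where(obs_mask, X, recon))
    return completed
```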
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients is a major target of clinical data science. However, achieving large sample sizes can be a challenge because of difficulties in patient recruitment and long follow-up times, which increases the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are crucial. The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset in a statistically legitimate way, without the need for specific hypotheses or regression models. The cardinality enhancement was validated against an established random repeated-measures method with respect to the correctness of predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method enhanced the HF dataset cardinality (711 patients before dataset preprocessing) roughly 10-fold, and roughly 21-fold when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and narrow dataset cardinality are an issue.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
- ID = group identifier (1-2000)
- x = numeric (Level 1)
- y = numeric (Level 1)
- w = binary (Level 2)
In all data sets, missing values are coded as "NA".
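A minimal sketch for reading the plain-text version of these example data in Python is shown below; the file name and whitespace delimiter are assumptions (in R, the ".rda" files can be loaded directly with load()).

```python
# Read the plain-text example data; "NA" codes missing values.
# The file name and delimiter are assumptions; adjust to the actual files.
import pandas as pd

df = pd.read_csv("example1.dat", sep=r"\s+", na_values="NA")
print(df.columns.tolist())                          # expected variables: ID, x, y, w
print(df.isna().mean())                             # share of missing values per variable
print(df.groupby("ID")[["x", "y"]].mean().head())   # group-level (Level 2) summaries
```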
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as the test example. The empirical results show not only that the positive impact of Internet penetration on public services is robust, but also that the machine-learning imputation method outperforms random and multiple imputation, greatly improving the model's explanatory power. The panel data after machine-learning imputation, with better continuity in the time trend, can feasibly be analyzed, including with a dynamic panel model. The long-term effects of the Internet on public services are found to be significantly stronger than the short-term effects. Finally, some mechanisms behind the empirical results are discussed.
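The article's own imputation model is not reproduced here; as a generic, hedged illustration of machine-learning-based imputation, the sketch below uses scikit-learn's IterativeImputer with a random-forest estimator on made-up panel variables. Variable names and data are placeholders, not the study's data.

```python
# Generic machine-learning imputation sketch (not the authors' exact method).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "internet_penetration": rng.uniform(0, 1, 200),     # hypothetical panel variables
    "gdp_per_capita": rng.normal(10, 2, 200),
    "public_service_index": rng.normal(50, 10, 200),
})
panel.loc[rng.choice(200, 40, replace=False), "public_service_index"] = np.nan  # inject missingness

imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=200, random_state=0),
                           max_iter=10, random_state=0)
panel_imputed = pd.DataFrame(imputer.fit_transform(panel), columns=panel.columns)
print(panel_imputed.isna().sum())   # no missing values remain after imputation
```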
This dataset contains the data used in the tutorial article "Handling Planned and Unplanned Missing Data in a Longitudinal Study", in press at "The Quantitative Methods for Psychology". It contains a subset of longitudinal data collected within the context of a survey about COVID-19 (data on sleep and emotions). This dataset is intended for tutorial purposes only. With the observations and variables in this dataset, the analyses presented in the tutorial can be reproduced. For more information, see de la Sablonnière et al. (2020).
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” See also: Missing Data
We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
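To make the bounding logic concrete, the toy sketch below computes extreme value bounds on the average treatment effect for a binary outcome by filling each unit's unobserved potential outcome with its best- and worst-case value. The data are invented for illustration, not the paper's 63 cases; for a binary outcome the pre-imputation bounds are always 100 percentage points wide, which the example reproduces.

```python
# Toy extreme value (best/worst-case) bounds for a binary outcome.
# D = treatment indicator (e.g., transitional truth commission), Y = observed outcome.
import numpy as np

D = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # 2 treated cases, 8 untreated (illustrative)
Y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])   # realized outcome for each case

def extreme_value_bounds(D, Y):
    y1 = np.where(D == 1, Y, np.nan)            # Y(1) observed only for treated cases
    y0 = np.where(D == 0, Y, np.nan)            # Y(0) observed only for untreated cases
    fill = lambda v, value: np.where(np.isnan(v), value, v)
    lower = fill(y1, 0.0).mean() - fill(y0, 1.0).mean()   # smallest possible ATE
    upper = fill(y1, 1.0).mean() - fill(y0, 0.0).mean()   # largest possible ATE
    return lower, upper

lo, hi = extreme_value_bounds(D, Y)
print(f"ATE bounds with no imputation: [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
# Counterfactual imputation replaces some of these 0/1 fills with inferred values,
# which is what shrinks the width of the bounds in the article.
```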
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IMAGIC-500 is a large-scale, fully synthetic benchmark dataset designed to evaluate missing data imputation methods on hierarchical, real-world-like socio-economic survey data. It is derived from the World Bank's Synthetic Data for an Imaginary Country (SDIC, 2023), an openly available synthetic census-like dataset simulating a fictional middle-income country. This dataset combines the individual-level and household-level components of SDIC by joining them on household ID, preserving the nested structure of real survey data (individual → household → district → province). From this joined population, we sample 500,000 individuals across approximately 136,476 households, ensuring broad geographic and demographic diversity. For downstream tasks, we select 19 mixed-type variables from the SDIC attributes, covering both household-level and individual-level information: household-level features include geographic and socioeconomic variables, while individual-level features include demographic and socioeconomic attributes. We also select the individual's highest educational attainment ("cat_educ_attain") as the target variable for downstream tasks.
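A minimal sketch of the join-and-sample construction described above is given below; file names and the household-ID column name are assumptions, not the actual SDIC schema.

```python
# Join SDIC individual- and household-level tables on the household ID, then
# sample individuals. File and column names (e.g. "hh_id") are assumptions.
import pandas as pd

individuals = pd.read_csv("sdic_individuals.csv")
households = pd.read_csv("sdic_households.csv")

joined = individuals.merge(households, on="hh_id", how="left")  # individual -> household nesting
sample = joined.sample(n=500_000, random_state=42)              # draw the benchmark population
print(sample["hh_id"].nunique())                                # ~136,476 households expected
```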
Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors
Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning models. A regression-based missing data imputation method using the light gradient boosting machine (LightGBM) algorithm was employed to impute over 60% of the missing data.
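The exact model configuration is not described here; the sketch below illustrates one plausible form of regression-based imputation with LightGBM, training on complete rows and predicting the missing entries of a single column. Column names are hypothetical.

```python
# Regression-based imputation of one numeric column with LightGBM (illustrative only).
import pandas as pd
from lightgbm import LGBMRegressor

def impute_column(df, target, features):
    """Fill missing values in `target` by regressing on `features` over complete rows."""
    known = df[target].notna()
    model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(df.loc[known, features], df.loc[known, target])
    df.loc[~known, target] = model.predict(df.loc[~known, features])
    return df

# Hypothetical usage with made-up radionuclide-diffusion variables:
# df = impute_column(df, "diffusion_coefficient", ["temperature", "ionic_strength", "ph"])
```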
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a minimal pandas sketch of these steps is shown below):
- Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: unified inconsistent column naming (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
- Outliers: detected and handled based on domain logic and distribution analysis.
- Categorization: converted numeric ages into grouped age categories for comparative analysis.
- Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
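The following pandas sketch mirrors the steps listed above; the input file name and some column names (such as 'Age' and 'Employment Status') are assumptions based on the attribute list.

```python
# Minimal cleaning pipeline mirroring the transformations described above.
import pandas as pd

df = pd.read_csv("messy_employment_india.csv")   # file name is an assumption

# Inconsistent formatting: unify column names, capitalization, and string spacing.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# Incorrect data types: salary stored as text -> numeric (unparseable values become NaN).
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Duplicate records: drop exact duplicates.
df = df.drop_duplicates()

# Missing values: drop rows missing critical fields, impute the rest.
df = df.dropna(subset=["Employment Status"])
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())

# Categorization: numeric age -> grouped age categories.
df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 35, 50, 100],
                         labels=["18-25", "26-35", "36-50", "50+"])
```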
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding data into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GENERAL INFORMATION
Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation
Date of data collection: January to March 2022
Collection instrument: SurveyMonkey
Funding: Alfred P. Sloan Foundation
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license
Links to publications that cite or use the data:
Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437
Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266
DATA & FILE OVERVIEW
File List
Additional related data collected but not included in the current data package: open-ended questions asked of respondents
METHODOLOGICAL INFORMATION
Description of methods used for collection/generation of data:
The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.
We received 3,632 responses, 2,509 of which were complete, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final dataset contains 2,492 complete responses, an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produces a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).
Methods for processing the data:
Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.
Instrument- or software-specific information needed to interpret the data:
The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format as CSV. The Codebook is required to interpret the values.
DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata
Number of variables: 95
Number of cases/rows: 2,492
Missing data codes: 999 = Not asked
Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
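When working with the coded CSV export, the "999 = Not asked" code can be converted to missing values at load time; a minimal sketch is below (the CSV file name is an assumption).

```python
# Treat the 999 "Not asked" code as missing when loading the coded CSV export.
import pandas as pd

survey = pd.read_csv("MDCDataCitationReuse2021surveydata.csv", na_values=[999])
print(survey.shape)                                              # expected: 2,492 rows, 95 variables
print(survey.isna().sum().sort_values(ascending=False).head())   # items most often "Not asked"
```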
https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets with synthetically generated missingness structures, based on the publicly available "BreastCancerCoimbra" dataset (M. Patrício, J. Pereira, J. Crisóstomo, P. Matafome, M. Gomes, R. Seiça, and F. Caramelo. "Using resistin, glucose, age and bmi to predict the presence of breast cancer". BMC Cancer, 18(1):29, 2018). The datasets are part of the supplemental material for: Johansson Fernstad, S., Alsufyani, S., Del-Din, S., Yarnall, A., & Rochester, L. (2025). "To Measure What Isn’t There — Visual Exploration of Missingness Structures Using Quality Metrics", which is under review. The generation of the synthetic missingness structures is described in that paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data publication provides access to (1) an archive of maps and statistics on MISR L1B2 GRP data products updated as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-2019-210), (2) a user manual describing this archive, (3) a large archive of standard (unprocessed) MISR data files that can be used in conjunction with the IDL software repository published on GitHub and available from https://github.com/mmverstraete (Verstraete et al., 2019, https://doi.org/10.5281/zenodo.3519989), (4) an additional archive of maps and statistics on MISR L1B2 GRP data products updated as described for eight additional Blocks of MISR data, spanning a broader range of climatic and environmental conditions (between Iraq and Namibia), and (5) a user manual describing this second archive. The authors also make a self-contained, stand-alone version of that processing software available to all users, using the IDL Virtual Machine technology (which does not require an IDL license) from Verstraete et al., 2020: http://doi.org/10.5880/fidgeo.2020.011. (1) The compressed archive 'L1B2_Out.zip' contains all outputs produced in the course of generating the various Figures of the manuscript Verstraete et al. (2020b). Once this archive is installed and uncompressed, 9 subdirectories named Fig-fff-Ttt_Pxxx-Oyyyyyy-Bzzz are created, where fff, tt, xxx, yyyyyy and zzz stand for the Figure number, an optional Table number, Path, Orbit and Block numbers, respectively. These directories contain collections of text, graphics (maps and scatterplots) and binary data files relative to the intermediary, final and ancillary results generated while preparing those Figures. Maps and scatterplots are provided as graphics files in PNG format. Map legends are plain text files with the same names as the maps themselves, but with a file extension '.txt'. Log files are also plain text files. They are generated by the software that creates those graphics files and provide additional details on the intermediary and final results. The processing of MISR L1B2 GRP data product files requires access to cloud masks for the same geographical areas (one for each of the 9 cameras). Since those masks are themselves derived from the L1B2 GRP data and therefore also contain missing data, the outcomes from updating the RCCM data products, as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-12-611-2020), are also included in this archive. The last 2 subdirectories contain the outcomes from the normal processing of the indicated data files, as well as those generated when additional missing data are artificially inserted in the input files for the purpose of assessing the performance of the algorithms. (2) The document 'L1B2_Out.pdf' provides the User Manual to install and explore the compressed archive 'L1B2_Out.zip'. (3) The compressed archive 'L1B2_input_68050.zip' contains MISR L1B2 GRP and RCCM data for the full Orbit 68050, acquired on 3 October 2012, as well as the corresponding AGP file, which is required by the processing system to update the radiance product. This archive includes data for a wide range of locations, from Russia to north-west Iran, central and eastern Iraq, Saudi Arabia, and many more countries along the eastern coast of the African continent. It is provided to allow users to analyze actual data with the software package mentioned above, without needing to download MISR data from the NASA ASDC web site. 
(4) The compressed archive 'L1B2_Suppl.zip' contains a set of results similar to the archive 'L1B2_Out.zip' mentioned above, for four additional sites spanning a much wider range of geographical, climatic and ecological conditions: these cover areas in Iraq (marsh and arid lands), Kenya (agriculture and tropical forests), South Sudan (grasslands) and Namibia (coastal desert and Atlantic Ocean). Two of them involve largely clear scenes, and the other two include clouds. The last case also includes a test in which missing data are artificially introduced over deep water and clouds, to demonstrate the performance of the procedure on targets other than continental areas. Once uncompressed, this new archive expands into 8 subdirectories and takes up 1.8 GB of disk space, providing access to about 2,900 files. (5) The companion user manual 'L1B2_Suppl.pdf' describes how to install, uncompress and explore those additional files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information about appeal cases heard at the Supreme Court of Nigeria (SCN) between 1962 and 2022. The dataset was extracted from case files provided by The Prison Law Pavillion, a data archiving firm in Nigeria. It originally consisted of documentation of the various appeal cases alongside the outcome of the SCN judgment. Feature extraction techniques were used to generate a structured dataset containing a number of annotated features. The dataset consists of 14 features, including the outcome of the judgment; the remaining 13 features are input variables, of which 4 are stored as strings and 9 as numeric values. Missing values among the numeric features are represented by the value -1. Unsupervised and supervised machine learning algorithms can be applied to the dataset to extract information for a better understanding of the relationships among the features and for predicting the target class, the outcome of the SCN judgment.
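Before modelling, the -1 sentinel in the numeric features can be recoded to proper missing values; a minimal sketch is below (file and column handling are assumptions).

```python
# Recode the -1 sentinel used for missing numeric values to NaN before modelling.
import numpy as np
import pandas as pd

scn = pd.read_csv("scn_appeal_cases.csv")                  # file name is an assumption
numeric_cols = scn.select_dtypes(include="number").columns
scn[numeric_cols] = scn[numeric_cols].replace(-1, np.nan)  # -1 codes missing numeric values
print(scn[numeric_cols].isna().mean())                     # proportion missing per numeric feature
```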
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.
https://www.archivemarketresearch.com/privacy-policy
The Big Data Analysis Platform market is experiencing robust growth, projected to reach $121.07 billion in 2025. While the provided CAGR is missing, considering the rapid advancements in data analytics technologies and the increasing adoption across diverse sectors like computer, electronics, energy, machinery, and chemicals, a conservative estimate of a 15% Compound Annual Growth Rate (CAGR) from 2025 to 2033 seems plausible. This would indicate substantial market expansion, driven by the exponential growth of data volume, the need for improved business intelligence, and the rise of advanced analytics techniques like machine learning and AI. Key drivers include the increasing demand for real-time data insights, the need for better decision-making, and the growing adoption of cloud-based solutions. Trends such as the integration of big data with IoT devices, the increasing use of data visualization tools, and the focus on data security are further shaping the market landscape. Despite the opportunities, challenges such as the complexity of big data implementation, the need for skilled professionals, and data privacy concerns represent significant restraints. The market is segmented by application and geography, with North America and Europe currently dominating, but Asia-Pacific is expected to show significant growth in the coming years due to increasing digitalization and investment in technology. The competitive landscape is highly dynamic, with established players like IBM, Microsoft, and Google competing alongside specialized analytics companies such as Alteryx and Splunk, and numerous emerging firms. The success of individual companies will depend on factors including the breadth and depth of their analytical capabilities, the ease of use of their platforms, the strength of their integrations with existing systems, and their capacity to address industry-specific needs. The forecast period from 2025-2033 presents immense opportunities for both established and emerging companies that can effectively innovate and address the evolving demands of the Big Data Analysis Platform market. The ability to offer scalable, secure, and insightful solutions will be crucial for gaining market share and achieving sustainable growth.
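The compound-growth arithmetic behind that projection, under the assumed 15% CAGR stated above, is simple to verify:

```python
# Project the 2025 base forward over the 2025-2033 window at the assumed 15% CAGR.
base_2025_usd_bn = 121.07
cagr = 0.15
years = 2033 - 2025
projected_2033 = base_2025_usd_bn * (1 + cagr) ** years
print(round(projected_2033, 1))   # roughly 370 (USD billions) under these assumptions
```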