100+ datasets found
  1. Data from: Identifying Missing Data Handling Methods with Text Mining

    • openicpsr.org
    delimited
    Updated Mar 8, 2023
    Cite
    Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
    Available download formats: delimited
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Hungarian Academy of Sciences
    Authors
    Krisztián Boros; Zoltán Kmetty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2016
    Description

    Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques for handling missing data to avoid information loss and bias, and over the past 50 years these methods have become both more efficient and more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016, provided by JSTOR in text format, and applied a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period, while simpler methods, such as listwise and pairwise deletion, remain in widespread use.
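    A hedged sketch of the kind of keyword extraction such a text-mining approach implies (the pattern dictionary below is invented for illustration and is not the authors' actual coding scheme):

```python
# Scan an article's full text for mentions of missing-data handling methods.
import re

METHOD_PATTERNS = {
    "multiple_imputation": re.compile(r"multiple imputation", re.I),
    "fiml": re.compile(r"full[- ]information maximum likelihood", re.I),
    "listwise_deletion": re.compile(r"listwise deletion", re.I),
    "pairwise_deletion": re.compile(r"pairwise deletion", re.I),
}

def detect_methods(text: str) -> set[str]:
    """Return the set of methods mentioned in one article's text."""
    return {name for name, pattern in METHOD_PATTERNS.items() if pattern.search(text)}
```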

  2. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
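    A minimal sketch of the denoising-autoencoder idea described above (an illustration only, not the authors' released MIDAS software): dropout corrupts the zero-filled input, the loss is computed only on originally observed cells, and repeated stochastic forward passes yield multiple imputations.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype("float32")
X[rng.random(X.shape) < 0.2] = np.nan            # inject 20% missingness

mask = ~np.isnan(X)                              # True where observed
X_filled = np.where(mask, X, 0.0)                # zero-fill network input

inputs = tf.keras.Input(shape=(8,))
h = tf.keras.layers.Dropout(0.5)(inputs)         # corruption step
h = tf.keras.layers.Dense(32, activation="relu")(h)
h = tf.keras.layers.Dense(16, activation="relu")(h)
h = tf.keras.layers.Dense(32, activation="relu")(h)
outputs = tf.keras.layers.Dense(8)(h)
model = tf.keras.Model(inputs, outputs)

def masked_mse(y_true, y_pred):
    # Reconstruction error only over originally observed cells.
    m = tf.cast(~tf.math.is_nan(y_true), tf.float32)
    y = tf.where(tf.math.is_nan(y_true), 0.0, y_true)
    return tf.reduce_sum(m * (y - y_pred) ** 2) / tf.reduce_sum(m)

model.compile(optimizer="adam", loss=masked_mse)
model.fit(X_filled, X, epochs=20, batch_size=64, verbose=0)

# Multiple imputation: dropout stays active at prediction time, so each
# forward pass draws a different completed dataset.
imputations = [
    np.where(mask, X_filled, model(X_filled, training=True).numpy())
    for _ in range(5)
]
```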

  3. Data_Sheet_2_A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example.PDF

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Lorenzo Fassina; Alessandro Faragli; Francesco Paolo Lo Muzio; Sebastian Kelle; Carlo Campana; Burkert Pieske; Frank Edelmann; Alessio Alogna (2023). Data_Sheet_2_A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example.PDF [Dataset]. http://doi.org/10.3389/fcvm.2020.599923.s001
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Lorenzo Fassina; Alessandro Faragli; Francesco Paolo Lo Muzio; Sebastian Kelle; Carlo Campana; Burkert Pieske; Frank Edelmann; Alessio Alogna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruiting and long follow-up times, increasing the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are crucial. The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset in a statistically legitimate way, without the need for specific hypotheses or regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to correctness in predicting clinical conditions and endpoints; in particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method enhanced the HF dataset cardinality (711 patients before dataset preprocessing) roughly 10-fold, and roughly 21-fold when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and narrow dataset cardinality are an issue.

  4. Missing data in the analysis of multilevel and dependent data (Example data sets)

    • zenodo.org
    bin
    Updated Jul 20, 2023
    + more versions
    Cite
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Example data sets) [Dataset]. http://doi.org/10.5281/zenodo.7773614
    Available download formats: bin
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Springer
    Authors
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    ID = group identifier (1-2000)
    x = numeric (Level 1)
    y = numeric (Level 1)
    w = binary (Level 2)

    In all data sets, missing values are coded as "NA".
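    A minimal sketch of reading one of the plain-text ".dat" files (assuming whitespace-delimited columns; the file name is illustrative) so that the "NA" codes become proper missing values:

```python
import pandas as pd

# "NA" strings become NaN; columns are ID, x, y, w as described above.
df = pd.read_csv("example1.dat", sep=r"\s+", na_values=["NA"])
print(df.isna().mean())  # fraction of missing values per variable
```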

  5. Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results

    • dataverse.harvard.edu
    Updated Jul 24, 2020
    Cite
    LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    LEIZHEN ZANG; Feng XIONG
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public service, but also ensured that the machine-learning imputation method outperformed random and multiple imputation, greatly improving the model's explanatory power. The panel data after machine-learning imputation show better continuity in the time trend and can also be analyzed with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.

  6. Tutorial data for the article "Handling Planned and Unplanned Missing Data in a Longitudinal Study" [2020, Canada]

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Caron-Diotte, Mathieu; Pelletier-Dumas, Mathieu; Lacourse, Éric; Dorfman, Anna; Stolle, Dietlind; Lina, Jean-Marc; de la Sablonnière, Roxane (2023). Tutorial data for the article "Handling Planned and Unplanned Missing Data in a Longitudinal Study" [2020, Canada] [Dataset]. http://doi.org/10.5683/SP3/P8OUOT
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Caron-Diotte, Mathieu; Pelletier-Dumas, Mathieu; Lacourse, Éric; Dorfman, Anna; Stolle, Dietlind; Lina, Jean-Marc; de la Sablonnière, Roxane
    Time period covered
    Apr 6, 2020 - Jun 10, 2020
    Area covered
    Canada
    Description

    This dataset contains the data used in the tutorial article "Handling Planned and Unplanned Missing Data in a Longitudinal Study", in press at The Quantitative Methods for Psychology. It contains a subset of longitudinal data collected within the context of a survey about COVID-19 (data on sleep and emotions). This dataset is intended for tutorial purposes only. With the observations and variables in this dataset, the analyses presented in the tutorial can be reproduced. For more information, see de la Sablonnière et al. (2020).

  7. Replication data for: A Unified Approach To Measurement Error And Missing Data: Overview

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Blackwell, Matthew; Honaker, James; King, Gary (2023). Replication data for: A Unified Approach To Measurement Error And Missing Data: Overview [Dataset]. http://doi.org/10.7910/DVN/29606
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Blackwell, Matthew; Honaker, James; King, Gary
    Description

    Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” See also: Missing Data
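    A hedged sketch of the generic two-step multiple-imputation workflow the abstract describes, using scikit-learn's IterativeImputer purely as a stand-in imputation engine (this is not the authors' software):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def two_step_mi(X, y, m=5):
    """Step 1: create m completed datasets. Step 2: run the intended
    analysis on each and pool the estimates (Rubin's rules point estimate;
    the variance combination is omitted here for brevity)."""
    estimates = []
    for seed in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        X_complete = imputer.fit_transform(X)
        estimates.append(LinearRegression().fit(X_complete, y).coef_)
    return np.asarray(estimates).mean(axis=0)
```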

  8. Replication Data for: Qualitative Imputation of Missing Potential Outcomes

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 9, 2023
    Cite
    Coppock, Alexander; Kaur, Dipin (2023). Replication Data for: Qualitative Imputation of Missing Potential Outcomes [Dataset]. http://doi.org/10.7910/DVN/2IVKXD
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Coppock, Alexander; Kaur, Dipin
    Description

    We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
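    The bounding logic can be made concrete with a small sketch (data and variable names below are invented for illustration, not taken from the replication archive). With a binary outcome, each unit's unobserved potential outcome is filled in at its best and worst case, which brackets the average treatment effect:

```python
import numpy as np

y = np.array([0, 1, 0, 0, 1, 0, 0, 0])    # observed binary outcomes
t = np.array([1, 1, 0, 0, 0, 0, 1, 0])    # observed treatment indicator

def ate_bounds(y, t):
    # Unobserved potential outcomes are filled with 0 (best) or 1 (worst).
    y1_lo = np.where(t == 1, y, 0.0)       # Y(1): observed for treated, else 0
    y1_hi = np.where(t == 1, y, 1.0)
    y0_lo = np.where(t == 0, y, 0.0)       # Y(0): observed for control, else 0
    y0_hi = np.where(t == 0, y, 1.0)
    return (y1_lo - y0_hi).mean(), (y1_hi - y0_lo).mean()

lo, hi = ate_bounds(y, t)
print(f"ATE bounds: [{lo:.2f}, {hi:.2f}]")  # width is always 1.0 (100 points)
                                            # before any imputation
```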

  9. IMAGIC-500: A Benchmark Dataset for Missing Data Imputation in Hierarchical Socio-Economic Surveys

    • dataverse.harvard.edu
    Updated May 10, 2025
    Cite
    Siyi Sun (2025). IMAGIC-500: A Benchmark Dataset for Missing Data Imputation in Hierarchical Socio-Economic Surveys [Dataset]. http://doi.org/10.7910/DVN/7GMPBH
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    May 10, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Siyi Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IMAGIC-500 is a large-scale, fully synthetic benchmark dataset designed to evaluate missing data imputation methods on hierarchical, real-world-like socio-economic survey data. It is derived from the World Bank's Synthetic Data for an Imaginary Country (SDIC, 2023), an openly available synthetic census-like dataset simulating a fictional middle-income country. IMAGIC-500 combines the individual-level and household-level components of SDIC by joining them on household ID, preserving the nested structure of real survey data (individual → household → district → province). From this joined population, we sample 500,000 individuals across approximately 136,476 households, ensuring broad geographic and demographic diversity. For downstream tasks, we select 19 mixed-type variables from the SDIC attributes, covering both household-level variables (geographic and socioeconomic) and individual-level variables (demographics and socioeconomics). We also select the individual's highest educational attainment ("cat_educ_attain") as a target variable for downstream tasks.
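    A minimal sketch of the construction described above (file and column names are assumptions, not the released schema):

```python
import pandas as pd

individuals = pd.read_csv("sdic_individuals.csv")   # hypothetical file name
households = pd.read_csv("sdic_households.csv")     # hypothetical file name

# Left join: every individual keeps its household's geographic and
# socio-economic attributes (individual -> household -> district -> province).
joined = individuals.merge(households, on="household_id", how="left")

# Sample 500,000 individuals for the benchmark population.
benchmark = joined.sample(n=500_000, random_state=42)
```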

  10. Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Donald, Stephen; Abrevaya, Jason (2023). Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors [Dataset]. http://doi.org/10.7910/DVN/JMWMWW
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Donald, Stephen; Abrevaya, Jason
    Description

    Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors

  11. Data from: Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning

    • scidb.cn
    Updated May 6, 2025
    Cite
    Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao (2025). Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00710
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao
    Description

    Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning models. A regression-based missing data imputation method using the light gradient boosting machine (LightGBM) algorithm was employed to impute over 60% of the missing data.
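    A minimal sketch of such regression-based imputation with LightGBM, assuming a pandas DataFrame of numeric features and one incomplete column (names are illustrative; this is not the authors' code):

```python
import lightgbm as lgb
import pandas as pd

def impute_with_lgbm(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill NaNs in `target` by regressing it on all other columns.
    LightGBM tolerates NaNs in the predictors natively."""
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(known.drop(columns=[target]), known[target])
    out = df.copy()
    out.loc[missing.index, target] = model.predict(missing.drop(columns=[target]))
    return out
```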

  12. Data_Sheet_3_A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example.ZIP

    • frontiersin.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Lorenzo Fassina; Alessandro Faragli; Francesco Paolo Lo Muzio; Sebastian Kelle; Carlo Campana; Burkert Pieske; Frank Edelmann; Alessio Alogna (2023). Data_Sheet_3_A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example.ZIP [Dataset]. http://doi.org/10.3389/fcvm.2020.599923.s002
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Lorenzo Fassina; Alessandro Faragli; Francesco Paolo Lo Muzio; Sebastian Kelle; Carlo Campana; Burkert Pieske; Frank Edelmann; Alessio Alogna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruiting and long follow-up times, increasing the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are crucial. The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset in a statistically legitimate way, without the need for specific hypotheses or regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to correctness in predicting clinical conditions and endpoints; in particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method enhanced the HF dataset cardinality (711 patients before dataset preprocessing) roughly 10-fold, and roughly 21-fold when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and narrow dataset cardinality are an issue.

  13. Employment Of India CLeaned and Messy Data

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SONIA SHINDE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:
    1. Messy Dataset (Raw) – a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including:
    - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a condensed pandas sketch follows this list):
    - Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: unified inconsistent column naming (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
    - Outliers: detected and handled based on domain logic and distribution analysis.
    - Categorization: converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
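    A condensed pandas sketch of these steps (column and file names mirror the description above but are hypothetical and may not match the released files):

```python
import pandas as pd

df = pd.read_csv("employment_india_messy.csv")                 # hypothetical file
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df = df.drop_duplicates()                                      # duplicate records
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"],
                                           errors="coerce")    # fix data types
df = df.dropna(subset=["Employment Status"])                   # critical field
df["Age Group"] = pd.cut(df["Age"],                            # assumed raw column
                         bins=[0, 25, 40, 60, 100],
                         labels=["<25", "25-39", "40-59", "60+"])
df["Employment Status"] = df["Employment Status"].str.strip().str.title()
```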

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand:
    - The impact of messy data on visualization and insights
    - How transformation steps can dramatically improve data interpretation
    - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for: - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  14. A dataset from a survey investigating disciplinary differences in data citation

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, csv, pdf, txt
    Updated Jul 12, 2024
    Cite
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein (2024). A dataset from a survey investigating disciplinary differences in data citation [Dataset]. http://doi.org/10.5281/zenodo.7853477
    Available download formats: txt, pdf, bin, csv
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GENERAL INFORMATION

    Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation

    Date of data collection: January to March 2022

    Collection instrument: SurveyMonkey

    Funding: Alfred P. Sloan Foundation


    SHARING/ACCESS INFORMATION

    Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license

    Links to publications that cite or use the data:

    Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437

    Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data:
    A survey investigating disciplinary differences in data citation.
    Zenodo. https://doi.org/10.5281/zenodo.7555266


    DATA & FILE OVERVIEW

    File List

    • Filename: MDCDatacitationReuse2021Codebookv2.pdf
      Codebook
    • Filename: MDCDataCitationReuse2021surveydatav2.csv
      Dataset format in csv
    • Filename: MDCDataCitationReuse2021surveydatav2.sav
      Dataset format in SPSS
    • Filename: MDCDataCitationReuseSurvey2021QNR.pdf
      Questionnaire

    Additional related data collected that was not included in the current data package: Open ended questions asked to respondents


    METHODOLOGICAL INFORMATION

    Description of methods used for collection/generation of data:

    The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.

    We received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses, an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).

    Methods for processing the data:

    Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.

    Instrument- or software-specific information needed to interpret the data:

    The dataset is provided in SPSS format, which requires IBM SPSS Statistics; it is also available in a coded CSV format. The Codebook is required to interpret the values.


    DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata

    Number of variables: 95

    Number of cases/rows: 2,492

    Missing data codes: 999 Not asked

    Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
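    A minimal sketch of loading the CSV listed above so that the documented missing-data code is read as missing:

```python
import pandas as pd

# 999 = "Not asked" per the codebook; treat it as missing on load.
survey = pd.read_csv("MDCDataCitationReuse2021surveydatav2.csv", na_values=[999])
print(survey.shape)  # expected: (2492, 95) per the counts above
```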

  15. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Available download formats: Croissant (see mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data (see the sketch after this list)
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
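    As a small taste of the missing-data material in Chapter 25, using only standard pandas calls:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.isna().sum())        # count missing values -> 2
print(s.dropna())            # listwise deletion
print(s.fillna(s.mean()))    # mean imputation
print(s.interpolate())       # linear interpolation -> 1, 2, 3, 4, 5
```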
  16. Datasets with synthetically generated missingness structures

    • data.ncl.ac.uk
    csv
    Updated Apr 1, 2025
    Cite
    Sara Johansson Fernstad (2025). Datasets with synthetically generated missingness structures [Dataset]. http://doi.org/10.25405/data.ncl.28680893.v1
    Available download formats: csv
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Newcastle University
    Authors
    Sara Johansson Fernstad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets with synthetically generated missingness structures, based on the publicly available BreastCancerCoimbra dataset (M. Patrício, J. Pereira, J. Crisóstomo, P. Matafome, M. Gomes, R. Seiça, and F. Caramelo, "Using resistin, glucose, age and BMI to predict the presence of breast cancer", BMC Cancer, 18(1):29, 2018). The datasets are part of the supplemental material for: Johansson Fernstad, S., Alsufyani, S., Del-Din, S., Yarnall, A., & Rochester, L. (2025), "To Measure What Isn’t There — Visual Exploration of Missingness Structures Using Quality Metrics", which is under review. The generation of the synthetic missingness structures is described in that paper.
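    The exact generation procedure is described in the paper under review; purely as an illustration, the sketch below injects the simplest structure, values missing completely at random (MCAR), into a complete table:

```python
import numpy as np
import pandas as pd

def add_mcar(df: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with a random fraction of cells set to NaN (MCAR)."""
    rng = np.random.default_rng(seed)
    return df.mask(rng.random(df.shape) < frac)

# Example: hide 10% of the cells of a small complete table.
complete = pd.DataFrame({"glucose": [92.0, 105.0, 88.0], "bmi": [23.1, 30.5, 27.4]})
print(add_mcar(complete, frac=0.1))
```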

  17. L1B2.out: Samples of MISR L1B2 GRP data to explore the missing data replacement process

    • dataservices.gfz-potsdam.de
    Updated Feb 27, 2020
    Cite
    datacite (2020). L1B2.out: Samples of MISR L1B2 GRP data to explore the missing data replacement process [Dataset]. http://doi.org/10.5880/fidgeo.2020.012
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    DataCite (https://www.datacite.org/)
    GFZ Data Services
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data publication provides access to: (1) an archive of maps and statistics on MISR L1B2 GRP data products updated as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-2019-210); (2) a user manual describing this archive; (3) a large archive of standard (unprocessed) MISR data files that can be used in conjunction with the IDL software repository published on GitHub and available from https://github.com/mmverstraete (Verstraete et al., 2019, https://doi.org/10.5281/zenodo.3519989); (4) an additional archive of maps and statistics on MISR L1B2 GRP data products, updated as described for eight additional Blocks of MISR data spanning a broader range of climatic and environmental conditions (between Iraq and Namibia); and (5) a user manual describing this second archive. The authors also make a self-contained, stand-alone version of the processing software available to all users, using the IDL Virtual Machine technology (which does not require an IDL license), from Verstraete et al., 2020: http://doi.org/10.5880/fidgeo.2020.011.

    (1) The compressed archive 'L1B2_Out.zip' contains all outputs produced in the course of generating the various Figures of the manuscript Verstraete et al. (2020b). Once this archive is installed and uncompressed, 9 subdirectories named Fig-fff-Ttt_Pxxx-Oyyyyyy-Bzzz are created, where fff, tt, xxx, yyyyyy and zzz stand for the Figure number, an optional Table number, and the Path, Orbit and Block numbers, respectively. These directories contain collections of text, graphics (maps and scatterplots) and binary data files relative to the intermediary, final and ancillary results generated while preparing those Figures. Maps and scatterplots are provided as graphics files in PNG format. Map legends are plain text files with the same names as the maps themselves, but with a '.txt' file extension. Log files are also plain text files; they are generated by the software that creates the graphics files and provide additional details on the intermediary and final results. The processing of MISR L1B2 GRP data product files requires access to cloud masks for the same geographical areas (one for each of the 9 cameras). Since those masks are themselves derived from the L1B2 GRP data and therefore also contain missing data, the outcomes of updating the RCCM data products, as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-12-611-2020), are also included in this archive. The last 2 subdirectories contain the outcomes of the normal processing of the indicated data files, as well as those generated when additional missing data are artificially inserted in the input files for the purpose of assessing the performance of the algorithms.

    (2) The document 'L1B2_Out.pdf' provides the User Manual to install and explore the compressed archive 'L1B2_Out.zip'.

    (3) The compressed archive 'L1B2_input_68050.zip' contains MISR L1B2 GRP and RCCM data for the full Orbit 68050, acquired on 3 October 2012, as well as the corresponding AGP file, which is required by the processing system to update the radiance product. This archive includes data for a wide range of locations, from Russia to north-west Iran, central and eastern Iraq, Saudi Arabia, and many more countries along the eastern coast of the African continent. It is provided to allow users to analyze actual data with the software package mentioned above, without needing to download MISR data from the NASA ASDC web site.

    (4) The compressed archive 'L1B2_Suppl.zip' contains a set of results similar to the archive 'L1B2_Out.zip' mentioned above, for four additional sites spanning a much wider range of geographical, climatic and ecological conditions: areas in Iraq (marsh and arid lands), Kenya (agriculture and tropical forests), South Sudan (grasslands) and Namibia (coastal desert and Atlantic Ocean). Two of them involve largely clear scenes, and the other two include clouds. The last case also includes a test that artificially introduces missing data over deep water and clouds, to demonstrate the performance of the procedure on targets other than continental areas. Once uncompressed, this new archive expands into 8 subdirectories, takes up 1.8 GB of disk space, and provides access to about 2,900 files.

    (5) The companion user manual 'L1B2_Suppl.pdf' describes how to install, uncompress and explore those additional files.

  18. Appeal Cases heard at the Supreme Court of Nigeria Dataset

    • data.mendeley.com
    Updated Jun 9, 2023
    + more versions
    Cite
    Jeremiah Balogun (2023). Appeal Cases heard at the Supreme Court of Nigeria Dataset [Dataset]. http://doi.org/10.17632/ky6zfyf669.1
    Dataset updated
    Jun 9, 2023
    Authors
    Jeremiah Balogun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nigeria
    Description

    The dataset contains information about appeal cases heard at the Supreme Court of Nigeria (SCN) between 1962 and 2022. It was extracted from case files provided by The Prison Law Pavillion, a data archiving firm in Nigeria, and originally consisted of documentation of the various appeal cases alongside the outcome of the SCN's judgment. Feature extraction techniques were used to generate a structured dataset of annotated features, some stored as string values and some as numeric values. The dataset consists of 14 features, including the outcome of the judgment; the remaining 13 features are input variables, of which 4 are stored as strings and 9 as numeric values. Missing values among the numeric features are represented by the value -1. Unsupervised and supervised machine learning algorithms can be applied to the dataset to better understand the relationships among the features and to predict the target class, the outcome of the SCN judgment.
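    A minimal sketch of recoding the documented -1 sentinel to proper missing values before modeling (the file name is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("scn_appeal_cases.csv")  # hypothetical file name

# Replace the -1 missing-value code with NaN in the numeric columns only,
# leaving the string-valued features untouched.
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].replace(-1, np.nan)
```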

  19. Replication Data for: "The Missing Dimension of the Political Resource Curse Debate" (Comparative Political Studies)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jun 24, 2017
    Cite
    Harvard Dataverse (2017). Replication Data for: "The Missing Dimension of the Political Resource Curse Debate" (Comparative Political Studies) [Dataset]. http://doi.org/10.7910/DVN/UHABC6
    Available download formats: tsv(52146209), tsv(377788), tsv(5022842), tsv(1409885), tsv(229670197), tsv(1890549), tsv(208382761), csv(1831600), tsv(5923932), application/x-stata-syntax(24825), tsv(128753671), tsv(351459690), tsv(3317975)
    Dataset updated
    Jun 24, 2017
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.

  20. Big Data Analysis Platform Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 12, 2025
    Cite
    Archive Market Research (2025). Big Data Analysis Platform Report [Dataset]. https://www.archivemarketresearch.com/reports/big-data-analysis-platform-56417
    Available download formats: doc, pdf, ppt
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Big Data Analysis Platform market is experiencing robust growth, projected to reach $121.07 billion in 2025. While the report's CAGR figure is missing, considering the rapid advancements in data analytics technologies and the increasing adoption across diverse sectors such as computing, electronics, energy, machinery, and chemicals, a conservative estimate of a 15% compound annual growth rate (CAGR) from 2025 to 2033 seems plausible (see the illustrative calculation below). Such growth would indicate substantial market expansion, driven by the exponential growth of data volume, the need for improved business intelligence, and the rise of advanced analytics techniques like machine learning and AI. Key drivers include the increasing demand for real-time data insights, the need for better decision-making, and the growing adoption of cloud-based solutions. Trends such as the integration of big data with IoT devices, the increasing use of data visualization tools, and the focus on data security are further shaping the market landscape. Despite these opportunities, challenges such as the complexity of big data implementation, the need for skilled professionals, and data privacy concerns represent significant restraints. The market is segmented by application and geography; North America and Europe currently dominate, but Asia-Pacific is expected to show significant growth in the coming years due to increasing digitalization and investment in technology. The competitive landscape is highly dynamic, with established players like IBM, Microsoft, and Google competing alongside specialized analytics companies such as Alteryx and Splunk, as well as numerous emerging firms. The success of individual companies will depend on the breadth and depth of their analytical capabilities, the ease of use of their platforms, the strength of their integrations with existing systems, and their capacity to address industry-specific needs. The 2025-2033 forecast period presents immense opportunities for both established and emerging companies that can innovate effectively and address the evolving demands of the Big Data Analysis Platform market; the ability to offer scalable, secure, and insightful solutions will be crucial for gaining market share and achieving sustainable growth.
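    An illustrative calculation only (the 15% CAGR is this summary's assumption, not a figure from the report):

```python
base_2025 = 121.07            # market size in 2025, USD billions
cagr = 0.15                   # assumed compound annual growth rate
years = 2033 - 2025           # 8 compounding years in the forecast window
projection_2033 = base_2025 * (1 + cagr) ** years
print(f"Implied 2033 market size: ${projection_2033:.1f}B")  # about $370.4B
```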
