100+ datasets found
  1. Data from: Problems in dealing with missing data and informative censoring...

    • catalog.data.gov
    • data.virginia.gov
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). Problems in dealing with missing data and informative censoring in clinical trials [Dataset]. https://catalog.data.gov/dataset/problems-in-dealing-with-missing-data-and-informative-censoring-in-clinical-trials
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    A common problem in clinical trials is the missing data that occurs when patients do not complete the study and drop out without further measurements. Missing data cause the usual statistical analysis of complete or all available data to be subject to bias. There are no universally applicable methods for handling missing data. We recommend the following: (1) Report reasons for dropouts and proportions for each treatment group; (2) Conduct sensitivity analyses to encompass different scenarios of assumptions and discuss consistency or discrepancy among them; (3) Pay attention to minimize the chance of dropouts at the design stage and during trial monitoring; (4) Collect post-dropout data on the primary endpoints, if at all possible; and (5) Consider the dropout event itself an important endpoint in studies with many.

  2. Methods for Handling Missing Item Values in Regression Models Using the...

    • catalog.data.gov
    • data.virginia.gov
    • +1 more
    Updated Sep 7, 2025
    + more versions
    Cite
    Substance Abuse and Mental Health Services Administration (2025). Methods for Handling Missing Item Values in Regression Models Using the National Survey on Drug Use and Health (NSDUH) [Dataset]. https://catalog.data.gov/dataset/methods-for-handling-missing-item-values-in-regression-models-using-the-national-survey-on
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    Substance Abuse and Mental Health Services Administration (https://www.samhsa.gov/)
    Description

    The purpose of this report is to guide analysts interested in fitting regression models using data from the National Survey on Drug Use and Health (NSDUH) by providing them with methods for handling missing item values in regression analyses (MIVRA). The report includes a theoretical review of existing MIVRA methods, a simulation study that evaluates several of the more promising methods using existing NSDUH datasets, and a final chapter where the results of both the theoretical review and the simulation study are synthesized into guidance for analysts via decision trees.

  3. Statistical Methods for Missing Data in Large Observational Studies [Methods...

    • icpsr.umich.edu
    Updated Oct 27, 2025
    Cite
    Long, Qi (2025). Statistical Methods for Missing Data in Large Observational Studies [Methods Study], Georgia, 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39526.v1
    Dataset updated
    Oct 27, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Long, Qi
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39526/terms

    Time period covered
    2013 - 2018
    Area covered
    Georgia, United States
    Description

    Health registries record data about patients with a specific health problem. These data may include age, weight, blood pressure, health problems, medical test results, and treatments received. But data in some patient records may be missing. For example, some patients may not report their weight or all of their health problems. Research studies can use data from health registries to learn how well treatments work. But missing data can lead to incorrect results. To address the problem, researchers often exclude patient records with missing data from their studies. But doing this can also lead to incorrect results. The fewer records that researchers use, the greater the chance for incorrect results. Missing data also lead to another problem: it is harder for researchers to find patient traits that could affect diagnosis and treatment. For example, patients who are overweight may get heart disease. But if data are missing, it is hard for researchers to be sure that trait could affect diagnosis and treatment of heart disease. In this study, the research team developed new statistical methods to fill in missing data in large studies. The team also developed methods to use when data are missing to help find patient traits that could affect diagnosis and treatment. To access the methods, software, and R package, please visit the Long Research Group website.

  4. Missing data in the analysis of multilevel and dependent data (Examples)

    • data.niaid.nih.gov
    Updated Jul 20, 2023
    + more versions
    Cite
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    University of Hamburg
    IPN - Leibniz Institute for Science and Mathematics Education
    Authors
    Simon Grund; Oliver Lüdtke; Alexander Robitzsch
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    • ID = group identifier (1-2000)
    • x = numeric (Level 1)
    • y = numeric (Level 1)
    • w = binary (Level 2)

    In all data sets, missing values are coded as "NA".
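
    As a rough illustration of the layout described above, the following Python sketch parses a tiny stand-in table with the four variables (ID, x, y, w) and "NA" as the missing-value code. The values are made up, and the whitespace delimiter and header row are assumptions; the real ".dat" files may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the plain-text ".dat" files; the real files
# may use a different delimiter or header layout.
raw = """ID x y w
1 0.52 NA 1
1 NA 1.30 1
2 -0.41 0.07 0
2 0.88 NA 0
"""

# "NA" is declared as the missing-value code, as stated in the description.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values=["NA"])

# Share of missing values per variable.
missing_share = df.isna().mean()

# Number of Level-1 rows per group (ID is the group identifier).
group_sizes = df.groupby("ID").size()
```

    Checking the per-variable missing share and the group sizes first is a cheap way to see how unbalanced the multilevel structure is before any imputation is attempted.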

  5. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.

  6. A dataset from a survey investigating disciplinary differences in data...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    bin, csv, pdf, txt
    Updated Jul 12, 2024
    Cite
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein (2024). A dataset from a survey investigating disciplinary differences in data citation [Dataset]. http://doi.org/10.5281/zenodo.7555363
    Available download formats: csv, txt, pdf, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GENERAL INFORMATION

    Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation

    Date of data collection: January to March 2022

    Collection instrument: SurveyMonkey

    Funding: Alfred P. Sloan Foundation


    SHARING/ACCESS INFORMATION

    Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license

    Links to publications that cite or use the data:

    Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437

    Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266


    DATA & FILE OVERVIEW

    File List

    • Filename: MDCDatacitationReuse2021Codebook.pdf
      Codebook
    • Filename: MDCDataCitationReuse2021surveydata.csv
      Dataset format in csv
    • Filename: MDCDataCitationReuse2021surveydata.sav
      Dataset format in SPSS
    • Filename: MDCDataCitationReuseSurvey2021QNR.pdf
      Questionnaire

    Additional related data collected that was not included in the current data package: Open ended questions asked to respondents


    METHODOLOGICAL INFORMATION

    Description of methods used for collection/generation of data:

    The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.

    Received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).

    Methods for processing the data:

    Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.

    Instrument- or software-specific information needed to interpret the data:

    The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format in CSV. The Codebook is required to interpret the values.


    DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata

    Number of variables: 94

    Number of cases/rows: 2,492

    Missing data codes: 999 Not asked

    Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
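
    A minimal sketch of how the "999 Not asked" code might be handled when reading the CSV with pandas. The column names (respondent, q1, q2) and the values are invented for illustration and do not come from the codebook; the actual variable names should be taken from MDCDatacitationReuse2021Codebook.pdf.

```python
import io
import pandas as pd

# Hypothetical excerpt; the column names are placeholders, not codebook names.
raw = """respondent,q1,q2
1,3,999
2,999,2
3,4,5
"""

# Map the documented code 999 ("Not asked") to NaN, but only in the question
# columns, so that an identifier that happens to equal 999 is left alone.
# The code is given as a string because it is matched against the raw text.
df = pd.read_csv(io.StringIO(raw), na_values={"q1": ["999"], "q2": ["999"]})

# Number of respondents who were not asked each question.
not_asked = df[["q1", "q2"]].isna().sum()
```

    Scoping the code to specific columns is a design choice: a global na_values setting would also blank out any legitimate 999 in unrelated columns.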

  7. New Approach to Evaluating Supplementary Homicide Report (SHR) Data...

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Nov 14, 2025
    + more versions
    Cite
    National Institute of Justice (2025). New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 [Dataset]. https://catalog.data.gov/dataset/new-approach-to-evaluating-supplementary-homicide-report-shr-data-imputation-1990-1995-ff769
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Description

    The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.

  8. Data from: Multiple Imputation for the Supplementary Homicide Reports:...

    • icpsr.umich.edu
    • datasets.ai
    • +3 more
    Updated Mar 31, 2016
    + more versions
    Cite
    Roberts, John; Roberts, Aki (2016). Multiple Imputation for the Supplementary Homicide Reports: Evaluation in Unique Test Data, 1990-1995, Chicago, Philadelphia, Phoenix and St. Louis [Dataset]. http://doi.org/10.3886/ICPSR36379.v1
    Dataset updated
    Mar 31, 2016
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Roberts, John; Roberts, Aki
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/36379/terms

    Time period covered
    1990 - 1995
    Area covered
    Philadelphia, Illinois, United States, Chicago, Arizona, Missouri, Pennsylvania, Phoenix, St. Louis
    Description

    This study was an evaluation of multiple imputation strategies to address missing data using the New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 (ICPSR 20060) dataset.

  9. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Available download formats: zip (47826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 to 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
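
    For a dataset shaped like the one described, a first pass might profile the missing percentage per column and then apply a simple type-aware fill. The sketch below uses a four-row stand-in with the same five column names; the values are invented, and median/mode filling is just one common baseline, not a method prescribed by the dataset author.

```python
import numpy as np
import pandas as pd

# Tiny invented stand-in with the five columns named in the description.
df = pd.DataFrame({
    "Category": ["A", None, "C", "D"],
    "Price": [10.0, np.nan, 30.0, np.nan],
    "Rating": [4.0, 5.0, np.nan, 3.0],
    "Stock": ["In Stock", "Out of Stock", None, "In Stock"],
    "Discount": [5.0, 10.0, 15.0, 20.0],
})

# Percentage of missing values in each column.
missing_pct = df.isna().mean().mul(100).round(1)

# Baseline fill: median for numeric columns, most frequent value otherwise.
filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```

    With missing rates as high as those listed for the real data, a column-wise baseline like this is best treated as a starting point for comparison rather than a final answer.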

  10. Handling of Missing Data Induced by Time-Varying Covariates in Comparative...

    • icpsr.umich.edu
    Updated Oct 9, 2025
    Cite
    Desai, Manisha (2025). Handling of Missing Data Induced by Time-Varying Covariates in Comparative Effectiveness Research HIV Patients [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39528.v1
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Desai, Manisha
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39528/terms

    Time period covered
    2013 - 2018
    Description

    Researchers can use data from health registries or electronic health records to compare two or more treatments. Registries store data about patients with a specific health problem. These data include how well those patients respond to treatments and information about patient traits, such as age, weight, or blood pressure. But sometimes data about patient traits are missing. Missing data about patient traits can lead to incorrect study results, especially when traits change over time. For example, weight can change over time, and the patient may not report their weight at some points along the way. Researchers use statistical methods to fill in these missing data. In this study, the research team compared a new statistical method to fill in missing data with traditional methods. Traditional methods remove patients with missing data or fill in each missing number with a single estimate. The new method creates multiple possible estimates to fill in each missing number. To access the methods, software, and R package, please visit the SimulateCER GitHub and SimTimeVar CRAN website.
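
    The contrast drawn above between removing records, filling in a single estimate, and creating multiple estimates can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the study's method: the weights are invented, the imputation model is a plain normal distribution fitted to the observed values, and time-varying structure is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented weights (kg) with two unreported values.
weight = np.array([70.0, np.nan, 82.0, 65.0, np.nan, 90.0])
is_missing = np.isnan(weight)
observed = weight[~is_missing]

# Traditional option 1: drop records with missing data.
complete_case_mean = observed.mean()

# Traditional option 2: fill each gap with a single estimate (the observed mean).
single = np.where(is_missing, observed.mean(), weight)

# Multiple estimates: build m completed copies, each with fresh random draws,
# then pool the per-copy results.
m = 5
estimates = []
for _ in range(m):
    completed = weight.copy()
    completed[is_missing] = rng.normal(
        observed.mean(), observed.std(ddof=1), size=is_missing.sum()
    )
    estimates.append(completed.mean())

pooled_mean = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))  # spread across the m copies
```

    The spread across copies (between_var here) is what a single filled-in value cannot express: it reflects the extra uncertainty that comes from not knowing the missing numbers.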

  11. Data_Sheet_1_Three Sample Estimates of Fraction of Missing Information From...

    • frontiersin.figshare.com
    pdf
    Updated Jun 2, 2023
    + more versions
    Cite
    Lihan Chen; Victoria Savalei (2023). Data_Sheet_1_Three Sample Estimates of Fraction of Missing Information From Full Information Maximum Likelihood.PDF [Dataset]. http://doi.org/10.3389/fpsyg.2021.667802.s001
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Lihan Chen; Victoria Savalei
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In missing data analysis, the reporting of missing rates is insufficient for the readers to determine the impact of missing data on the efficiency of parameter estimates. A more diagnostic measure, the fraction of missing information (FMI), shows how the standard errors of parameter estimates increase from the information loss due to ignorable missing data. FMI is well-known in the multiple imputation literature (Rubin, 1987), but it has only been more recently developed for full information maximum likelihood (Savalei and Rhemtulla, 2012). Sample FMI estimates using this approach have since then been made accessible as part of the lavaan package (Rosseel, 2012) in the R statistical programming language. However, the properties of FMI estimates at finite sample sizes have not been the subject of comprehensive investigation. In this paper, we present a simulation study on the properties of three sample FMI estimates from FIML in two common models in psychology, regression and two-factor analysis. We summarize the performance of these FMI estimates and make recommendations on their application.

  12. Additional file 2 of The reporting and handling of missing data in...

    • springernature.figshare.com
    xlsx
    Updated Feb 15, 2024
    Cite
    Chinenye Okpara; Chidozie Edokwe; George Ioannidis; Alexandra Papaioannou; Jonathan D. Adachi; Lehana Thabane (2024). Additional file 2 of The reporting and handling of missing data in longitudinal studies of older adults is suboptimal: a methodological survey of geriatric journals [Dataset]. http://doi.org/10.6084/m9.figshare.19663862.v1
    Available download formats: xlsx
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chinenye Okpara; Chidozie Edokwe; George Ioannidis; Alexandra Papaioannou; Jonathan D. Adachi; Lehana Thabane
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2.

  13. Percentage (%) and number (n) of missing values in the outcome (maximum grip...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t003
    Available download formats: xls
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.

  14. Public Reporting of Missing Digital Contact Information

    • catalog.data.gov
    • data.virginia.gov
    Updated Oct 7, 2025
    Cite
    Centers for Medicare & Medicaid Services (2025). Public Reporting of Missing Digital Contact Information [Dataset]. https://catalog.data.gov/dataset/public-reporting-of-missing-digital-contact-information-ff7e7
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Centers for Medicare & Medicaid Services
    Description

    In the May 2020 CMS Interoperability and Patient Access final rule, CMS finalized the policy to publicly report the names and NPIs of those providers who do not have digital contact information included in the NPPES system (85 FR 25584). This data includes the NPI and provider name of providers and clinicians without digital contact information in NPPES.

  15. Additional file 1 of Working with missing data in large-scale assessments

    • springernature.figshare.com
    zip
    Updated Apr 24, 2025
    Cite
    Francis Huang; Brian Keller (2025). Additional file 1 of Working with missing data in large-scale assessments [Dataset]. http://doi.org/10.6084/m9.figshare.28853685.v1
    Available download formats: zip
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Francis Huang; Brian Keller
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material 1.

  16. Indicator Zero or Missing Value Report

    • mi-treasury.data.socrata.com
    csv, xlsx, xml
    Updated Mar 24, 2017
    Cite
    (2017). Indicator Zero or Missing Value Report [Dataset]. https://mi-treasury.data.socrata.com/dataset/Indicator-Zero-or-Missing-Value-Report/p38z-ybgt
    Available download formats: xml, xlsx, csv
    Dataset updated
    Mar 24, 2017
    Description

    Latest database - 1/5/2017

  17. Data from: Benchmarking imputation methods for categorical biological data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 10, 2024
    Cite
    Matthieu Gendre; Torsten Hauffe; Catalina Pimiento; Daniele Silvestro (2024). Benchmarking imputation methods for categorical biological data [Dataset]. http://doi.org/10.5281/zenodo.10800016
    Available download formats: zip
    Dataset updated
    Mar 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matthieu Gendre; Torsten Hauffe; Catalina Pimiento; Daniele Silvestro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 9, 2024
    Description

    Description:

    Welcome to the Zenodo repository for Publication Benchmarking imputation methods for categorical biological data, a comprehensive collection of datasets and scripts utilized in our research endeavors. This repository serves as a vital resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.

    Contents:

    1. empirical_analysis:

      • Trait Dataset of Elasmobranchs: A collection of trait data for elasmobranch species obtained from FishBase, stored as an RDS file.
      • Phylogenetic Tree: A phylogenetic tree stored as a TRE file.
      • Imputations Replicates (Imputation): Replicated imputations of missing data in the trait dataset, stored as RData files.
      • Error Calculation (Results): Error calculation results derived from imputed datasets, stored as RData files.
      • Scripts: Collection of R scripts used for the implementation of empirical analysis.
    2. simulation_analysis:

      • Input Files: Input files utilized for simulation analyses as CSV files
      • Data Distribution PDFs: PDF files displaying the distribution of simulated data and the missingness.
      • Output Files: Simulated trait datasets, trait datasets with missing data, and trait imputed datasets with imputation errors calculated as RData files.
      • Scripts: Collection of R scripts used for the simulation analysis.
    3. TDIP_package:

      • Scripts of the TDIP Package: All scripts related to the Trait Data Imputation with Phylogeny (TDIP) R package used in the analyses.

    Purpose:

    This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.

    Citation:

    When using the datasets or scripts from this repository, we kindly request citing Publication Benchmarking imputation methods for categorical biological data and acknowledging the use of this Zenodo repository.

    Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.

  18. Data from: Bias and sensitivity in the placement of fossil taxa resulting...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +2 more
    zip
    Updated Nov 21, 2014
    Cite
    Robert S. Sansom (2014). Bias and sensitivity in the placement of fossil taxa resulting from interpretations of missing data [Dataset]. http://doi.org/10.5061/dryad.7tq20
    Available download formats: zip
    Dataset updated
    Nov 21, 2014
    Dataset provided by
    University of Manchester
    Authors
    Robert S. Sansom
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The utility of fossils in evolutionary contexts is dependent on their accurate placement in phylogenetic frameworks, yet intrinsic and widespread missing data make this problematic. The complex taphonomic processes occurring during fossilization can make it difficult to distinguish absence from non-preservation, especially in the case of exceptionally preserved soft-tissue fossils: is a particular morphological character (e.g. appendage, tentacle or nerve) missing from a fossil because it was never there (phylogenetic absence), or just happened to not be preserved (taphonomic loss)? Missing data has not been tested in the context of interpretation of non-present anatomy nor in the context of directional shifts and biases in affinity. Here, complete taxa, both simulated and empirical, are subjected to data loss through the replacement of present entries (1s) with either missing (?s) or absent (0s) entries. Both cause taxa to drift down trees, from their original position, toward the root. Absolute thresholds at which downshift is significant are extremely low for introduced absences (2 entries replaced, 6% of present characters). The opposite threshold in empirical fossil taxa is also found to be low; two absent entries replaced with presences cause fossil taxa to drift up trees. As such, only a few instances of non-preserved characters interpreted as absences will cause fossil organisms to be erroneously interpreted as more primitive than they were in life. This observed sensitivity to coding non-present morphology presents a problem for all evolutionary studies that attempt to use fossils to reconstruct rates of evolution or unlock sequences of morphological change. Stem-ward slippage, whereby fossilization processes cause organisms to appear artificially primitive, appears to be a ubiquitous and problematic phenomenon inherent to missing data, even when no decay biases exist. Absent characters therefore require explicit justification and taphonomic frameworks to support their interpretation.

  19. Handling of missing values in python

    • kaggle.com
    zip
    Updated Jul 3, 2022
    xodeum (2022). Handling of missing values in python [Dataset]. https://www.kaggle.com/datasets/xodeum/handling-of-missing-values-in-python
    Explore at:
    zip (2634 bytes)
    Dataset updated
    Jul 3, 2022
    Authors
    xodeum
    Description

    This dataset shows how to handle missing values in your data with the help of Python libraries such as NumPy and pandas. It also covers the use of NaN and None values, and the detecting, dropping, and filling of null values.
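
    The three operations named in this description (detecting, dropping, and filling null values) can be sketched roughly as follows. This is a minimal illustration and is not taken from the dataset itself: the DataFrame, its column names, and the fill choices (column mean, an "unknown" label) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; not part of the Kaggle dataset.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0],
    "city": ["Lahore", None, "Karachi", None],
})

# Detect: count missing entries per column (pandas treats NaN and None as null).
missing_per_column = df.isna().sum()

# Drop: keep only the rows that have no missing values.
complete_rows = df.dropna()

# Fill: replace missing ages with the column mean (an assumed strategy)
# and missing cities with a placeholder label.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})
```

    Dropping rows is the simplest option but discards data, while filling keeps every row at the cost of introducing assumed values; which is appropriate depends on why the values are missing.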

  20. Data from: Additional file 2 of Accommodating heterogeneous missing data...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Jul 22, 2022
    Herkommer, Kathleen; Leach, Robin J.; Freedland, Stephen J.; Cooperberg, Matthew R.; Ankerst, Donna P.; Poyet, Cedric; Meissner, Valentin H.; De Hoedt, Amanda M.; Vickers, Andrew J.; Guerrios-Rivera, Lourdes; Neumair, Matthias; Haese, Alexander; Saba, Karim; Boorjian, Stephen A.; Kattan, Michael W.; Liss, Michael A. (2022). Additional file 2 of Accommodating heterogeneous missing data patterns for prostate cancer risk prediction [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000323195
    Explore at:
    Dataset updated
    Jul 22, 2022
    Authors
    Herkommer, Kathleen; Leach, Robin J.; Freedland, Stephen J.; Cooperberg, Matthew R.; Ankerst, Donna P.; Poyet, Cedric; Meissner, Valentin H.; De Hoedt, Amanda M.; Vickers, Andrew J.; Guerrios-Rivera, Lourdes; Neumair, Matthias; Haese, Alexander; Saba, Karim; Boorjian, Stephen A.; Kattan, Michael W.; Liss, Michael A.
    Description

    Additional file 2: R code for all 1,024 models available at https://riskcalc.org/ExtendedPBCG/.
