47 datasets found
  1. Additional file 5 of Heckman imputation models for binary or continuous MNAR...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated May 30, 2023
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 5 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038107.v1
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code to impute continuous outcome. (R 1 kb)

  2. Data from: Matrix Completion When Missing Is Not at Random and Its...

    • tandf.figshare.com
    zip
    Updated Sep 20, 2024
    Jungjun Choi; Ming Yuan (2024). Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models [Dataset]. http://doi.org/10.6084/m9.figshare.26319010.v2
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jungjun Choi; Ming Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article develops an inferential framework for matrix completion when missing is not at random and without the requirement of strong signals. Our development is based on the observation that if the number of missing entries is small enough compared to the panel size, then they can be estimated well even when missing is not at random. Taking advantage of this fact, we divide the missing entries into smaller groups and estimate each group via nuclear norm regularization. In addition, we show that with appropriate debiasing, our proposed estimate is asymptotically normal even for fairly weak signals. Our work is motivated by recent research on the Tick Size Pilot Program, an experiment conducted by the Securities and Exchange Commission (SEC) to evaluate the impact of widening the tick size on the market quality of stocks from 2016 to 2018. While previous studies were based on traditional regression or difference-in-differences methods, assuming that the treatment effect is invariant with respect to time and unit, our analyses suggest significant heterogeneity across units and intriguing dynamics over time during the pilot program. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

  3. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.

    Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
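    The two strategy families the guide contrasts, deletion and simple imputation, can be sketched in a few lines of pandas. This is a minimal illustration with hypothetical toy data, not material from the document itself:

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical toy survey: one respondent did not report income
    df = pd.DataFrame({
        "age": [25, 32, 41, 29],
        "income": [40_000.0, np.nan, 55_000.0, 48_000.0],
    })

    # Deletion: drop every row containing a missing value (loses 25% of rows here)
    deleted = df.dropna()

    # Mean imputation: fill the gap with the observed column mean
    # (simple, but known to shrink the variance of the imputed column)
    imputed = df.fillna({"income": df["income"].mean()})
    ```

    Under MCAR, deletion is unbiased but wasteful; under MAR or MNAR, both of these naive approaches can bias downstream estimates, which is why the guide also covers regression and stochastic methods.
    
    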

  4. Water-quality data imputation with a high percentage of missing values: a...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 8, 2021
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
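    Inverse distance weighting, the method that performed best above, fills a station's gap with a distance-weighted mean of the other stations' readings. The sketch below is a hypothetical toy layout for illustration, not the authors' code:

    ```python
    import numpy as np

    def idw_impute(values, coords, power=2):
        """Fill NaNs with an inverse-distance-weighted mean of observed stations.

        values: 1-D array of one variable at one time step, NaN where missing.
        coords: (n, 2) array of station coordinates.
        """
        values = np.asarray(values, dtype=float)
        out = values.copy()
        observed = ~np.isnan(values)
        for i in np.flatnonzero(np.isnan(values)):
            d = np.linalg.norm(coords[observed] - coords[i], axis=1)
            w = 1.0 / d ** power  # closer stations weigh more
            out[i] = np.sum(w * values[observed]) / np.sum(w)
        return out

    # Hypothetical layout: the missing station 1 sits midway between 0 and 2,
    # so it receives the equal-weight average of their readings (15.0)
    coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
    vals = np.array([10.0, np.nan, 20.0])
    filled = idw_impute(vals, coords)
    ```

    The `power` exponent controls how quickly influence decays with distance; the paper linked above describes the actual configuration used for the Santa Lucía Chico stations.
    
    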

  5. Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...

    • figshare.com
    bin
    Updated May 31, 2023
    Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter (2023). Sensitivity analysis for missing data in cost-effectiveness analysis: Stata code [Dataset]. http://doi.org/10.6084/m9.figshare.6714206.v1
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Stata do-files and data to support the tutorial "Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis" (Leurent, B. et al., PharmacoEconomics (2018) 36: 889). The do-files should be similar to the code provided in the article's supplementary material. The dataset is based on the 10 Top Tips trial, but modified to preserve confidentiality; results will differ from those published.

  6. Data from: A multiple imputation method using population information

    • tandf.figshare.com
    pdf
    Updated Apr 30, 2025
    Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Tadayoshi Fushiki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple imputation (MI) is effective for dealing with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, a multiple imputation method using population information, which provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of its variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is shown to include the calibrated-δ adjustment method as a special case, and it provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method, and an estimator of that variance, are also given.

  7. Replication Data for: Strategic Binary Choice Models with Partial...

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Nieman, Mark (2023). Replication Data for: Strategic Binary Choice Models with Partial Observability [Dataset]. http://doi.org/10.7910/DVN/JANZHM
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Nieman, Mark
    Description

    Strategic interactions among rational, self-interested actors are commonly theorized in the behavioral, economic, and social sciences. The theorized strategic processes have traditionally been modeled with multi-stage structural estimators, which improve parameter estimates at one stage by using the information from other stages. Multi-stage approaches, however, impose rather strict demands on data availability: data must be available for the actions of each strategic actor at every stage of the interaction. Observational data are not always structured in a manner that is conducive to these approaches. Moreover, the theorized strategic process implies that these data are missing not at random. In this paper, I derive a strategic logistic regression model with partial observability that probabilistically estimates unobserved actor choices related to earlier stages of strategic interactions. I compare the estimator to traditional logit and split-population logit estimators using Monte Carlo simulations and a substantive example of the strategic firm–regulator interaction associated with pollution and environmental sanctions.

  8. Jane Street is_missing feature labels

    • kaggle.com
    zip
    Updated Dec 3, 2020
    Tom M (2020). Jane Street is_missing feature labels [Dataset]. https://www.kaggle.com/tpmeli/jane-street-is-missing-feature-labels
    Available download format: zip (10,396,120 bytes)
    Dataset updated
    Dec 3, 2020
    Authors
    Tom M
    Description

    The purpose of this dataset is to avoid inconsistent processing times: one can add it to train.csv and experiment with a very simple df.join(this.csv). Since the missing rows seem to include values that are missing not at random, the missingness itself should carry some meaning.
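    The `df.join` pattern the description mentions looks like this in pandas. The frames below are hypothetical stand-ins: in the competition you would read train.csv and this dataset's label file from disk instead:

    ```python
    import pandas as pd

    # Stand-in for the competition's train.csv (None marks a missing feature value)
    train = pd.DataFrame({"feature_0": [1.2, None, 0.7], "resp": [0.1, 0.3, -0.2]})

    # Stand-in for this dataset's precomputed is_missing indicator columns
    is_missing = pd.DataFrame({"feature_0_is_missing": [0, 1, 0]})

    # The "very simple df.join": align the indicators on the shared row index
    train_plus = train.join(is_missing)
    ```

    Because the missingness is plausibly not at random, models can treat the indicator columns as features in their own right rather than discarding them after imputation.
    
    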

  9. Data from: A hierarchical Bayesian approach for handling missing...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 22, 2019
    Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs (2019). A hierarchical Bayesian approach for handling missing classification data [Dataset]. http://doi.org/10.5061/dryad.8h36t01
    Dataset updated
    Mar 22, 2019
    Dataset provided by
    National Park Service
    Colorado State University
    Authors
    Alison C. Ketz; Therese L. Johnson; Mevin B. Hooten; M. Thompson Hobbs
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Area covered
    Southwest US
    Description

    Ecologists use classifications of individuals in categories to understand composition of populations and communities. These categories might be defined by demographics, functional traits, or species. Assignment of categories is often imperfect, but frequently treated as observations without error. When individuals are observed but not classified, these “partial” observations must be modified to include the missing data mechanism to avoid spurious inference.

    We developed two hierarchical Bayesian models to overcome the assumption of perfect assignment to mutually exclusive categories in the multinomial distribution of categorical counts, when classifications are missing. These models incorporate auxiliary information to adjust the posterior distributions of the proportions of membership in categories. In one model, we use an empirical Bayes approach, where a subset of data from one year serves as a prior for the missing data the next. In the other approach, we use a small random sample of data within a year to inform the distribution of the missing data.

    We performed a simulation to show the bias that occurs when partial observations were ignored and demonstrated the altered inference for the estimation of demographic ratios. We applied our models to demographic classifications of elk (Cervus elaphus nelsoni) to demonstrate improved inference for the proportions of sex and stage classes.

    We developed multiple modeling approaches using a generalizable nested multinomial structure to account for partially observed data that were missing not at random for classification counts. Accounting for classification uncertainty is important to accurately understand the composition of populations and communities in ecological studies.
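    The core idea, auxiliary classification counts informing the posterior on class proportions, can be caricatured with a conjugate Dirichlet-multinomial update. This is a deliberately simplified sketch with synthetic numbers, not the paper's nested hierarchical model:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical classified counts for three demographic classes
    counts = np.array([30, 50, 20])

    # A small random subsample of otherwise-unclassified individuals, as in the
    # within-year variant described above, supplies auxiliary information
    aux_counts = np.array([4, 5, 1])

    # Conjugate update: draw posterior class proportions from a Dirichlet whose
    # concentration combines a flat prior, the classified counts, and the subsample
    posterior = rng.dirichlet(1 + counts + aux_counts, size=10_000)
    posterior_means = posterior.mean(axis=0)
    ```

    Ignoring the unclassified individuals entirely (using `counts` alone) is the biased shortcut the simulation in the abstract warns against when the missing classifications are not at random.
    
    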

  10. ssa-breast-missing-data-patterns

    • huggingface.co
    Updated Nov 26, 2025
    Electric Sheep (2025). ssa-breast-missing-data-patterns [Dataset]. https://huggingface.co/datasets/electricsheepafrica/ssa-breast-missing-data-patterns
    Dataset updated
    Nov 26, 2025
    Dataset authored and provided by
    Electric Sheep
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    SSA Breast Missing Data Patterns (Synthetic)

      Dataset summary
    

    This module provides a synthetic missing-data sandbox for oncology care in African healthcare contexts, focusing on:

    • Realistic loss-to-follow-up (LTFU) and retention patterns over 0–24 months.

    • Incomplete diagnostic and laboratory test results (ordered vs. completed vs. available in records).

    • Non-random missingness driven by facility type, distance, socioeconomic status (SES), and insurance.

    The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/electricsheepafrica/ssa-breast-missing-data-patterns.

  11. ComBat HarmonizR enables the integrated analysis of independently generated...

    • ebi.ac.uk
    Updated May 23, 2022
    Hannah Voß (2022). ComBat HarmonizR enables the integrated analysis of independently generated proteomic datasets through data harmonization with appropriate handling of missing values [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD027467
    Dataset updated
    May 23, 2022
    Authors
    Hannah Voß
    Variables measured
    Proteomics
    Description

    The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations of statistically underpowered sample cohorts but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. Removal of these technical batch effects requires setup-specific normalization and strategies that can handle missing at random (MAR) and missing not at random (MNAR) values at the same time. Algorithms for batch effect removal, such as the ComBat algorithm commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups, and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT) plexes, compared to the commonly used internal reference scaling (iRS). Because the matrix dissection approach requires no data imputation, the HarmonizR algorithm can be applied to any type of omics data while assuring minimal data loss.

  12. Data from: A Bayesian hybrid method for the analysis of generalized linear...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2025
    Sezgin Ciftci; Zeynep Kalaylioglu (2025). A Bayesian hybrid method for the analysis of generalized linear models with missing-not-at-random covariates [Dataset]. http://doi.org/10.6084/m9.figshare.27244867.v1
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Sezgin Ciftci; Zeynep Kalaylioglu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data handling is one of the main problems in modelling, particularly if the missingness is of the missing-not-at-random (MNAR) type, where missingness occurs due to the actual value of the observation. The focus of the current article is generalized linear modelling of fully observed binary response variables depending on at least one MNAR covariate. In the traditional analysis of such models, an individual model for the probability of missingness is assumed and incorporated in the model framework. However, this probability model is untestable, as the missingness of MNAR data depends on the actual values that would otherwise have been observed. In this article, we consider creating a model space that consists of all possible and plausible models for the probability of missingness and develop a hybrid method in which a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm is combined with Bayesian Model Averaging (BMA). RJMCMC is adopted to obtain posterior estimates of model parameters as well as the probability of each model in the model space. BMA is used to synthesize coefficient estimates from all models in the model space while accounting for model uncertainty. Through a validation study with a simulated data set and a real data application, the performance of the proposed methodology is found to be satisfactory in the accuracy and efficiency of estimates.

  13. Replication Data for: Comparative investigation of time series missing data...

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 24, 2020
    LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
    Available download format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    LEIZHEN ZANG; Feng XIONG
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data. The authors used Internet and public-service data as the test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public services, but also ensured that the machine-learning imputation method outperformed random and multiple imputation, greatly improving the model's explanatory power. The panel data after machine-learning imputation, with better continuity in the time trend, can feasibly be analyzed, including with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.

  14. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated May 3, 2021
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J. (2021). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000907442
    Dataset updated
    May 3, 2021
    Authors
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J.
    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate the different imputation strategies available in the literature, we established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution-series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution-series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set, and it provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.

  15. Data from: A new method for handling missing species in diversification...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 6, 2012
    Natalie Cusimano; Tanja Stadler; Susanne S. Renner (2012). A new method for handling missing species in diversification analysis applicable to randomly or non-randomly sampled phylogenies [Dataset]. http://doi.org/10.5061/dryad.r8f04fk2
    Dataset updated
    Jan 6, 2012
    Dataset provided by
    Ludwig-Maximilians-Universität München
    Authors
    Natalie Cusimano; Tanja Stadler; Susanne S. Renner
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that moreover is usually non-random. While the widely used γ statistic with the MCCR test or the birth-death likelihood analysis with the ∆AICrc test statistic are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective, automated method has been developed for fitting diversification models to non-randomly sampled phylogenies. Here we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or non-random. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a non-random sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate while classic approaches prefer a constant rate model, whereas in the former clade, results do not differ among methods (as indeed expected since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.

  16. Data from: Validity of using multiple imputation for "unknown" stage at...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 27, 2017
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L. (2017). Validity of using multiple imputation for "unknown" stage at diagnosis in population-based cancer registry data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001781541
    Dataset updated
    Jun 27, 2017
    Authors
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L.
    Description

    Background: The multiple imputation approach to missing data has been validated by a number of simulation studies that artificially induce missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.

    Methods: Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption, adjusted for variables from two imputation models: a basic model including NSWCR variables only, and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.

    Results: Of the 1864 prostate cancer cases, 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables in the enhanced model, but not when it included those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios, but with wider confidence intervals, compared with those derived from analysis of the data with PCOS stage. However, the complete-case analysis provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.

    Conclusions: Using MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.
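    After fitting the Cox model on each imputed data set, MI analyses like this pool the per-imputation results with Rubin's rules; the between-imputation term is what produces the wider confidence intervals the Results describe. A minimal sketch with synthetic stand-in numbers, not the study's data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical stand-ins: m point estimates (e.g. a log hazard ratio) and
    # their variances, one per completed (imputed) data set
    m = 5
    estimates = rng.normal(0.4, 0.05, size=m)
    variances = np.full(m, 0.01)

    # Rubin's rules: pooled estimate, then total variance = average
    # within-imputation variance plus an inflated between-imputation term
    q_bar = estimates.mean()
    u_bar = variances.mean()      # within-imputation variance
    b = estimates.var(ddof=1)     # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    ```

    Because `total_var` always exceeds the average within-imputation variance when the imputations disagree, pooled intervals are wider than any single completed-data analysis would suggest.
    
    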

  17. OceanVerse

    • huggingface.co
    Updated May 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    JingWei (2025). OceanVerse [Dataset]. https://huggingface.co/datasets/jingwei-sjtu/OceanVerse
    Explore at:
    Dataset updated
    May 13, 2025
    Authors
    JingWei
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    OceanVerse Dataset

    OceanVerse is a comprehensive dataset designed to address the challenge of reconstructing sparse ocean observation data. It integrates nearly 2 million real-world profile data points since 1900 and three sets of Earth system numerical simulation data. OceanVerse provides a novel large-scale (∼100× nodes vs. existing datasets) dataset that meets the MNAR (Missing Not at Random) condition, supporting more effective model comparison, generalization evaluation and… See the full description on the dataset page: https://huggingface.co/datasets/jingwei-sjtu/OceanVerse.
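    The MNAR condition means the probability that a value is missing depends on the value itself. A minimal illustration of why that matters (the mechanism, variable names, and numbers below are invented for illustration, not OceanVerse's actual construction):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative MNAR masking: the chance that an (invented) oxygen
# measurement is observed depends on the value itself, e.g. low-oxygen
# readings are dropped more often.
oxygen = rng.normal(loc=200.0, scale=50.0, size=10_000)  # fake profiles
p_obs = 1 / (1 + np.exp(-(oxygen - 150.0) / 25.0))       # depends on value
observed = rng.random(oxygen.size) < p_obs

# Under MNAR, the observed sample is biased: its mean overestimates the
# true mean because low values are preferentially missing.
print(oxygen.mean(), oxygen[observed].mean())
```

    A reconstruction model evaluated only against a missing-completely-at-random mask would not be penalized for this bias, which is why an MNAR benchmark is a harder and more realistic test.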

  18. Data from: Non-ignorable missing data, single index propensity score and...

    • tandf.figshare.com
    • figshare.com
    txt
    Updated May 25, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xuerong Chen; Denis Heng-Yan Leung; Jing Qin (2021). Non-ignorable missing data, single index propensity score and profile synthetic distribution function [Dataset]. http://doi.org/10.6084/m9.figshare.13341851.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 25, 2021
    Dataset provided by
    Taylor & Francis
    Authors
    Xuerong Chen; Denis Heng-Yan Leung; Jing Qin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In missing data problems, data that are missing not at random are difficult to handle, since the response probability, or propensity score, is confounded with the outcome data model in the likelihood. Existing works often assume the propensity score is known up to a finite-dimensional parameter. We relax this assumption and consider an unspecified single-index model for the propensity score. A pseudo-likelihood based on the complete data is constructed by profiling out a synthetic distribution function that involves the unknown propensity score. The pseudo-likelihood yields asymptotically normal estimates. Simulations show the method compares favourably with existing methods.
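    A toy sketch of the setting: the response probability follows a single-index model G(α + βy) that depends on the outcome itself, so the complete-case mean is biased. The article estimates the index without specifying G; here the true propensity is used purely to illustrate the bias and its correction (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(7)

# Outcome with true mean 1.0; response probability is a single-index
# (logistic) function of the outcome itself -> non-ignorable missingness.
n = 100_000
y = rng.normal(loc=1.0, scale=1.0, size=n)
prop = 1 / (1 + np.exp(-(0.5 + 1.0 * y)))  # G = logistic, index 0.5 + 1.0*y
resp = rng.random(n) < prop                # response (non-missing) indicator

cc_mean = y[resp].mean()                   # complete-case mean, biased upward
w = 1.0 / prop[resp]                       # inverse-propensity weights
ipw_mean = np.sum(w * y[resp]) / np.sum(w) # Hajek estimator with true weights

print(cc_mean, ipw_mean)  # cc_mean overshoots 1.0; ipw_mean is close to 1.0
```

    The hard part the article addresses is that in practice the propensity is unknown and cannot be estimated from the missingness alone, since the outcome enters the index.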

  19. Numpy , pandas and matplot lib practice

    • kaggle.com
    zip
    Updated Jul 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
    Explore at:
    zip(385020 bytes)Available download formats
    Dataset updated
    Jul 16, 2023
    Authors
    pratham saraf
    License

    https://cdla.io/permissive-1-0/https://cdla.io/permissive-1-0/

    Description

    The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

    Specifics of the Dataset:

    The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

    One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

    - Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data.
    - The proportion of missing values in each column varies randomly between 1% and 70%.
    - Statistical noise has been introduced: for numerical values in some features, the noise follows a distribution with mean 0 and standard deviation 0.1.
    - Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
    - Outliers have also been embedded in the dataset, identifiable using the Interquartile Range (IQR) rule.
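    Challenges of this kind can be injected with a few lines of NumPy and pandas. A minimal sketch (column names, sizes, and category labels are illustrative, not the dataset's actual schema):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A small synthetic frame in the spirit of the dataset.
n = 1_000
df = pd.DataFrame({
    "num": rng.normal(size=n),
    "cat": rng.choice(["a", "b", "c"], size=n),
})

# Missing data: blank out a random fraction drawn from 1%-70%.
frac = rng.uniform(0.01, 0.70)
df.loc[rng.random(n) < frac, "num"] = np.nan

# Statistical noise: mean 0, standard deviation 0.1 on numeric values.
df["num"] += rng.normal(0.0, 0.1, size=n)

# Categorical noise: relabel ~1% of category values at random.
flip = rng.random(n) < 0.01
df.loc[flip, "cat"] = rng.choice(["a", "b", "c"], size=flip.sum())

print(df["num"].isna().mean())  # observed missing proportion
```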

    Context of the Dataset:

    The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

    Sources of the Dataset:

    The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.

  20. Data from: Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 -...

    • search.dataone.org
    Updated Nov 9, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kirsten E. Ironside; David Mattson; David Choate; David Stoner; Terence R. Arundel; Jered Hansen; Tad Theimer; Brandon Holton; Brian Jansen; Joseph O. Sexton; Kathleen Longshore; Thomas C. Edwards, Jr.; Michael Peters (2017). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://search.dataone.org/view/b4619339-45e5-4768-aba3-2bae04693510
    Explore at:
    Dataset updated
    Nov 9, 2017
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Authors
    Kirsten E. Ironside; David Mattson; David Choate; David Stoner; Terence R. Arundel; Jered Hansen; Tad Theimer; Brandon Holton; Brian Jansen; Joseph O. Sexton; Kathleen Longshore; Thomas C. Edwards, Jr.; Michael Peters
    Time period covered
    Jan 1, 2003 - Jan 1, 2016
    Area covered
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified for small study areas, for a small range of data loss, or for a single species, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large-mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), with a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

    Part 3, Probability Raster (raster dataset): We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home ranges to the observed FSR of downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home ranges and observed FSRs of downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

    Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry data-sets requires a strong understanding of the m... Visit https://dataone.org/datasets/b4619339-45e5-4768-aba3-2bae04693510 for complete metadata about this dataset.
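    The core quantities here are simple: a collar's observed FSR is the fraction of fix attempts that succeed, and model agreement is summarized by correlating predicted against observed rates. A minimal sketch with simulated collars (all numbers invented; the predictions are stand-ins, not the study's model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical collar records: 1 = successful fix, 0 = missed fix.
attempts = rng.binomial(1, 0.85, size=(20, 500))  # 20 collars, 500 attempts each
observed_fsr = attempts.mean(axis=1)              # observed FSR per collar

# Stand-in model predictions per collar site; the correlation between
# predicted and observed FSR summarizes agreement (the study reports
# a correlation of 0.924 for its environmental model).
predicted_fsr = observed_fsr + rng.normal(0, 0.01, size=20)
r = np.corrcoef(observed_fsr, predicted_fsr)[0, 1]
print(r)
```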
