100+ datasets found
  1. Data from: Identifying Missing Data Handling Methods with Text Mining

    • openicpsr.org
    delimited
    Updated Mar 8, 2023
    Cite
    Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
    Explore at:
    delimited. Available download formats
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Hungarian Academy of Sciences
    Authors
    Krisztián Boros; Zoltán Kmetty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2016
    Description

    Missing data are an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data and so avoid information loss and bias. Over the past 50 years, these methods have become more efficient but also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format, and we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, like listwise and pairwise deletion, remain in widespread use.
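
    As a toy illustration of the text-mining step described above (the authors' actual pipeline and keyword dictionary are not part of this record, so every pattern below is an illustrative assumption), method mentions can be flagged with simple keyword matching:

    ```python
    # Sketch: count which missing-data methods are mentioned in a corpus.
    import re
    from collections import Counter

    # Hypothetical keyword map; the paper's dictionary is surely richer.
    METHOD_PATTERNS = {
        "multiple imputation": r"multiple imputation",
        "FIML": r"full[- ]information maximum likelihood",
        "listwise deletion": r"listwise deletion|complete[- ]case analysis",
        "pairwise deletion": r"pairwise deletion",
    }

    def detect_methods(text):
        """Return the set of method labels mentioned in one article."""
        return {label for label, pat in METHOD_PATTERNS.items()
                if re.search(pat, text, flags=re.IGNORECASE)}

    def method_frequencies(corpus):
        """Count how many articles in the corpus mention each method."""
        counts = Counter()
        for article in corpus:
            counts.update(detect_methods(article))
        return counts

    print(method_frequencies(["We applied multiple imputation ...",
                              "Cases were dropped via listwise deletion."]))
    ```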

  2. Data from: Water-quality data imputation with a high percentage of missing...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    csv
    Updated Jun 8, 2021
    + more versions
    Cite
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Explore at:
    csv. Available download formats
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms spanned both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
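
    A minimal sketch of the winning technique, inverse distance weighting across stations (station coordinates, the power parameter, and the data layout are illustrative assumptions, not the authors' code):

    ```python
    # Sketch: IDW imputation of a station-by-time water-quality table.
    import numpy as np
    import pandas as pd

    def idw_impute(df, coords, p=2.0):
        """df: rows = timestamps, columns = station ids, NaN = missing.
        coords: station id -> (x, y) location."""
        out = df.copy()
        stations = list(df.columns)
        xy = np.array([coords[s] for s in stations])
        # Pairwise distances between stations.
        dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
        for t, row in df.iterrows():
            if row.notna().all() or row.notna().sum() == 0:
                continue
            observed = row.index[row.notna()]
            obs_idx = [stations.index(s) for s in observed]
            for s in row.index[row.isna()]:
                w = 1.0 / (dist[stations.index(s), obs_idx] ** p + 1e-12)
                out.loc[t, s] = np.average(row[observed], weights=w)
        return out
    ```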

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  3. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • dataverse.harvard.edu
    • search.dataone.org
    csv, pdf, svg, tex +7
    Updated Sep 29, 2022
    Cite
    Harvard Dataverse (2022). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Explore at:
    tsv, csv, pdf, svg, tex, zip, markdown, plain text, and R and Python scripts. Available download formats
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
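
    A minimal PyTorch sketch of the denoising-autoencoder idea summarized above: treat missing entries as an extra portion of corrupted data, train to reconstruct the observed portion, then draw several imputations. Layer sizes, the corruption rate, and the prediction-time re-corruption used to obtain distinct draws are illustrative assumptions; the published MIDAS software adds many refinements.

    ```python
    # Sketch of MIDAS-style multiple imputation with a denoising autoencoder.
    import torch
    import torch.nn as nn

    def midas_sketch(X, mask, m=5, epochs=200, corrupt=0.2):
        """X: (n, d) float tensor, zeros at missing entries; mask: 1 = observed."""
        d = X.shape[1]
        net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(),
                            nn.Linear(64, 64), nn.ReLU(),
                            nn.Linear(64, d))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for _ in range(epochs):
            # Corrupt an extra share of the observed entries each epoch ...
            drop = (torch.rand_like(X) < corrupt) & mask.bool()
            x_in = X * mask * (~drop)
            # ... and train to reconstruct the originally observed values.
            loss = (((net(x_in) - X) ** 2) * mask).sum() / mask.sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Draw m imputations by re-corrupting the inputs at prediction time.
        draws = []
        with torch.no_grad():
            for _ in range(m):
                drop = (torch.rand_like(X) < corrupt) & mask.bool()
                pred = net(X * mask * (~drop))
                draws.append(torch.where(mask.bool(), X, pred))
        return draws
    ```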

  4. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Explore at:
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    HKU Data Repository
    Authors
    Wen Ma
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    1. NCCTG Lung cancer dataset: survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

    2. CNV measurements of GBM: this dataset records information about copy number variation (CNV) of glioblastoma (GBM).

    Abstract: In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called "imputation". Existing imputation methods establish a model based on the mechanism behind the missing values, and they work well under two assumptions: 1) the data are missing completely at random, and 2) the percentage of missing values is not high. Neither assumption typically holds for biomedical datasets such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group Lung Cancer dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: Last Observation Carried Forward (LOCF) and K-Nearest Neighbors (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset, which has fewer columns, and failed when the dataset contained 45% missing data. The quality of the values imputed by existing methods is poor because these datasets do not meet the two assumptions.

    In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values. An RBM is an undirected, probabilistic and parameterized two-layer neural network model, often used for extracting abstract information from data, especially high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG and 2) TCGA. The running time and root mean squared error (RMSE) of the different methods were gauged. The benchmarks on the NCCTG dataset show that our method performs better than the other methods when there is 5% missing data in the dataset, with an RMSE 4.64 lower than the best KNN result. For the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN result.

    In addition to imputation, RBMs can make predictions simultaneously. We compared the RBM model with four traditional prediction methods, measuring running time and area under the curve (AUC). Our RBM-based approach outperformed the traditional methods: the AUC was up to 19.8% higher than the multivariate logistic regression model on the NCCTG lung cancer dataset, and up to 28.1% higher than the Cox proportional hazards regression model on the TCGA dataset.

    Apart from imputation and prediction, RBM models can detect outliers in one pass, by reconstructing all the inputs in the visible layer in a single backward pass. Our results show that RBM models achieve higher precision and recall in detecting outliers than the other methods.
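
    Since the description above centers on the RBM mechanism, here is a toy sketch of the general technique (one-step contrastive divergence for training, then imputation by clamping observed units and Gibbs-sampling the missing ones). It assumes binarized data and is not the thesis' actual model or preprocessing.

    ```python
    # Sketch: Bernoulli RBM with CD-1, plus Gibbs-sampling imputation.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_rbm(V, n_hidden=32, lr=0.05, epochs=200):
        """V: (n, d) binary data matrix with no missing entries."""
        n, d = V.shape
        W = 0.01 * rng.standard_normal((d, n_hidden))
        b, c = np.zeros(d), np.zeros(n_hidden)
        for _ in range(epochs):
            ph = sigmoid(V @ W + c)                 # positive phase
            h = (rng.random(ph.shape) < ph).astype(float)
            pv = sigmoid(h @ W.T + b)               # reconstruction
            ph2 = sigmoid(pv @ W + c)               # negative phase
            W += lr * (V.T @ ph - pv.T @ ph2) / n   # CD-1 update
            b += lr * (V - pv).mean(axis=0)
            c += lr * (ph - ph2).mean(axis=0)
        return W, b, c

    def impute(v, observed, W, b, c, steps=50):
        """v: one sample (d,); observed: boolean mask of known entries."""
        v = v.copy()
        for _ in range(steps):
            h = (rng.random(c.shape) < sigmoid(v @ W + c)).astype(float)
            v_new = sigmoid(h @ W.T + b)
            v[~observed] = v_new[~observed]         # keep observed clamped
        return v
    ```
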
  5. Data from: Variable Selection with Multiply-Imputed Datasets: Choosing...

    • tandf.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee (2023). Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked and Grouped Methods [Dataset]. http://doi.org/10.6084/m9.figshare.19111441.v2
    Explore at:
    pdf. Available download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
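
    A minimal Python sketch of the "stacked" formulation (the full methodology, with adaptive shrinkage and the grouped variant, lives in the R package miselect): concatenate the M imputed datasets, weight each row by 1/M so every subject contributes one observation in total, and fit a single penalized model, which forces one common set of selected variables.

    ```python
    # Sketch: lasso on stacked multiply-imputed data with 1/M row weights.
    import numpy as np
    from sklearn.linear_model import Lasso

    def stacked_lasso(imputed_Xs, imputed_ys, alpha=0.1):
        """imputed_Xs, imputed_ys: length-M lists of imputed (X, y) arrays."""
        M = len(imputed_Xs)
        X = np.vstack(imputed_Xs)
        y = np.concatenate(imputed_ys)
        w = np.full(len(y), 1.0 / M)   # each subject has total weight 1
        model = Lasso(alpha=alpha).fit(X, y, sample_weight=w)
        return model.coef_             # one sparsity pattern, all imputations
    ```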

  6. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • acs.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s004
    Explore at:
    xlsx. Available download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
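
    A minimal sketch of the evaluation idea used throughout the study: hide a fraction of the observed entries, impute, and score against the held-out truth. KNNImputer stands in for the paper's candidate methods (LLS, RF, BPCA, and so on), and the masking fraction is an arbitrary assumption.

    ```python
    # Sketch: mask-and-score evaluation of an imputation method.
    import numpy as np
    from sklearn.impute import KNNImputer

    def mask_and_score(X, frac=0.1, seed=0):
        """X: (n, d) array with np.nan marking truly missing entries."""
        rng = np.random.default_rng(seed)
        observed = ~np.isnan(X)
        hide = observed & (rng.random(X.shape) < frac)  # hold-out cells
        X_masked = X.copy()
        X_masked[hide] = np.nan
        X_imp = KNNImputer(n_neighbors=5).fit_transform(X_masked)
        err = X_imp[hide] - X[hide]
        return np.sqrt(np.mean(err ** 2)) / np.nanstd(X)  # NRMSE
    ```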

  7. Data from: Fast tipping point sensitivity analyses in clinical trials with...

    • tandf.figshare.com
    application/gzip
    Updated Jun 1, 2023
    Cite
    Anders Gorst-Rasmussen; Mads Jeppe Tarp-Johansen (2023). Fast tipping point sensitivity analyses in clinical trials with missing continuous outcomes under multiple imputation [Dataset]. http://doi.org/10.6084/m9.figshare.19967496.v1
    Explore at:
    application/gzip. Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Anders Gorst-Rasmussen; Mads Jeppe Tarp-Johansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When dealing with missing data in clinical trials, it is often convenient to work under simplifying assumptions, such as missing at random (MAR), and follow up with sensitivity analyses to address unverifiable missing data assumptions. One such sensitivity analysis, routinely requested by regulatory agencies, is the so-called tipping point analysis, in which the treatment effect is re-evaluated after adding a successively more extreme shift parameter to the predicted values among subjects with missing data. If the shift parameter needed to overturn the conclusion is so extreme that it is considered clinically implausible, then this indicates robustness to missing data assumptions. Tipping point analyses are frequently used in the context of continuous outcome data under multiple imputation. While simple to implement, computation can be cumbersome in the two-way setting where both comparator and active arms are shifted, essentially requiring the evaluation of a two-dimensional grid of models. We describe a computationally efficient approach to performing two-way tipping point analysis in the setting of continuous outcome data with multiple imputation. We show how geometric properties can lead to further simplification when exploring the impact of missing data. Lastly, we propose a novel extension to a multi-way setting which yields simple and general sufficient conditions for robustness to missing data assumptions.
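
    A minimal sketch of the two-way grid evaluation the paper accelerates: shift the imputed outcomes in the comparator and active arms by every pair of deltas and re-test. For brevity this analyzes a single completed dataset with a t-test rather than pooling across imputations with Rubin's rules, so it illustrates the grid, not the paper's fast method.

    ```python
    # Sketch: brute-force two-way tipping point scan over shift parameters.
    import numpy as np
    from scipy import stats

    def tipping_grid(y_active, y_comp, imp_active, imp_comp, deltas):
        """y_*: observed outcomes; imp_*: imputed outcomes; deltas: 1-D grid."""
        pvals = np.empty((len(deltas), len(deltas)))
        for i, d_c in enumerate(deltas):        # shift for comparator arm
            for j, d_a in enumerate(deltas):    # shift for active arm
                comp = np.concatenate([y_comp, imp_comp + d_c])
                act = np.concatenate([y_active, imp_active + d_a])
                pvals[i, j] = stats.ttest_ind(act, comp).pvalue
        return pvals  # the 0.05 contour traces the tipping frontier
    ```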

  8. Data from: Missing data estimation in morphometrics: how much is too much?

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    Updated Jun 1, 2022
    Cite
    Julien Clavel; Gildas Merceron; Gilles Escarguel (2022). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Explore at:
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien Clavel; Gildas Merceron; Gilles Escarguel
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. In recent years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no means generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
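
    A minimal Python sketch of the proposed visualization (the authors provide an R function; this analogue is illustrative): run PCA on each imputed dataset and Procrustes-align the score matrices so that the scatter of each specimen across imputations can be plotted on one ordination.

    ```python
    # Sketch: Procrustes superimposition of PCA scores across imputations.
    from scipy.spatial import procrustes
    from sklearn.decomposition import PCA

    def aligned_scores(imputed_datasets, n_components=2):
        """imputed_datasets: list of (n, p) completed data arrays."""
        ref = PCA(n_components).fit_transform(imputed_datasets[0])
        aligned = []
        for X in imputed_datasets:
            scores = PCA(n_components).fit_transform(X)
            _, mtx2, _ = procrustes(ref, scores)  # rotate/scale onto reference
            aligned.append(mtx2)
        return aligned  # plot per-specimen clouds to see imputation effects
    ```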

  9. Replication data for: What To Do about Missing Data in Time-Series...

    • dataverse.harvard.edu
    Updated Nov 17, 2016
    Cite
    Harvard Dataverse (2016). Replication data for: What To Do about Missing Data in Time-Series Cross-Sectional Data [Dataset]. http://doi.org/10.7910/DVN/GGUR0P
    Explore at:
    application/x-rlang-transport, bin, tsv, Stata syntax, plain text, pdf. Available download formats
    Dataset updated
    Nov 17, 2016
    Dataset provided by
    Harvard Dataverse
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/5.1/customlicense?persistentId=doi:10.7910/DVN/GGUR0P

    Description

    Applications of modern methods for analyzing data with missing values, based primarily on multiple imputation, have in the last half-decade become common in American politics and political behavior. Scholars in these fields have thus increasingly avoided the biases and inefficiencies caused by ad hoc methods like listwise deletion and best-guess imputation. However, researchers in much of comparative politics and international relations, and others with similar data, have been unable to do the same because the best available imputation methods work poorly with the time-series cross-section data structures common in these fields. We attempt to rectify this situation. First, we build a multiple imputation model that allows smooth time trends, shifts across cross-sectional units, and correlations over time and space, resulting in far more accurate imputations. Second, we build nonignorable missingness models by enabling analysts to incorporate knowledge from area studies experts via priors on individual missing cell values, rather than on difficult-to-interpret model parameters. Third, since these tasks could not be accomplished within existing imputation algorithms, in that they cannot handle as many variables as needed even in the simpler cross-sectional data for which they were designed, we also develop a new algorithm that substantially expands the range of computationally feasible data types and sizes for which multiple imputation can be used. These developments also made it possible to implement the methods introduced here in freely available open source software, Amelia II: A Program for Missing Data, that is considerably more reliable than existing strategies. See also: Missing Data
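
    A rough Python analogue of one ingredient above, imputing with smooth time trends and unit shifts by augmenting a generic multivariate imputer with polynomial time terms and unit dummies. Amelia II itself is R software built around a bootstrap-EM algorithm with cell-level priors; the column names and settings below are hypothetical.

    ```python
    # Sketch: TSCS-flavored imputation via time polynomials and unit dummies.
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def tscs_impute(df, unit_col="country", time_col="year", degree=3):
        """df: panel data, numeric except for the unit identifier column."""
        t = df[time_col] - df[time_col].min()
        aug = df.drop(columns=[unit_col]).copy()
        for k in range(1, degree + 1):
            aug[f"_t{k}"] = t ** k                       # smooth time trends
        aug = aug.join(pd.get_dummies(df[unit_col], prefix="_u"))  # unit shifts
        filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(aug)
        out = pd.DataFrame(filled, columns=aug.columns, index=df.index)
        return out[df.drop(columns=[unit_col]).columns]  # drop helper terms
    ```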

  10. Replication Data for: A GMM Approach for Dealing with Missing Data on...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Donald, Stephen; Abrevaya, Jason (2023). Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors [Dataset]. http://doi.org/10.7910/DVN/JMWMWW
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Donald, Stephen; Abrevaya, Jason
    Description

    Replication Data for: A GMM Approach for Dealing with Missing Data on Regressors

  11. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • figshare.com
    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    pdf. Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mainly require scripting skills and are implemented using various packages and syntax, so the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered a separate exercise from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is built on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.

  12. Data for "How to use scale invariant properties of imperviousness in urban...

    • zenodo.org
    bin, csv
    Updated Jan 24, 2020
    Cite
    Gires Auguste; Ioulia Tchiguirinskaia; Daniel Schertzer (2020). Data for "How to use scale invariant properties of imperviousness in urban areas to handle missing data ?" [Dataset]. http://doi.org/10.5281/zenodo.3465905
    Explore at:
    csv, bin. Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gires Auguste; Ioulia Tchiguirinskaia; Daniel Schertzer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data set corresponds to the data presented in the data paper "How to use scale invariant properties of imperviousness in urban areas to handle missing data ?", which has been submitted to Water Resources Research (https://agupubs.onlinelibrary.wiley.com/journal/19447973).

    It corresponds to:

    - the rainfall data collected on 2019-06-02 at 5 min and 30 s time steps by a disdrometer installed on the roof of the Ecole des Ponts ParisTech building;

    - the land use distribution for the Jouy-en-Josas catchment (1 = forest, 2 = road, 3 = grass, 4 = building, 5 = gully, 6 = missing data), with pixel sizes of 10 m and 2 m.

    More details can be found in the file and in the paper.

  13. Replication data for: Modeling global health indicators: missing data...

    • data.niaid.nih.gov
    • dataverse.harvard.edu
    Updated May 5, 2014
    Cite
    Jamie mie & Gerstein Bethany (2014). Replication data for: Modeling global health indicators: missing data imputation and accounting for ‘double uncertainty’ [Dataset]. http://doi.org/10.7910/DVN/25683
    Explore at:
    Dataset updated
    May 5, 2014
    Dataset provided by
    Harvard University
    Authors
    Jamie mie & Gerstein Bethany
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    World
    Description

    Global health indicators such as infant and maternal mortality are important for informing priorities for health research, policy development, and resource allocation. However, due to inconsistent reporting within and across nations, construction of comparable indicators often requires extensive data imputation and complex modeling from limited observed data. We draw on Ahmed et al.’s 2012 paper – an analysis of maternal deaths averted by contraceptive use for 172 countries in 2008 – as an exemplary case of the challenge of building reliable models with scarce observations. The authors employ a counterfactual modeling approach using regression imputation on the independent variable, which assumes no estimation uncertainty in the final model and does not address the potential for scattered missingness in the predictor variables. We replicate their results and test the sensitivity of their published estimates to the use of an alternative method for imputing missing data, multiple imputation. We also calculate alternative estimates of standard errors for the model estimates that more appropriately account for the uncertainty introduced through data imputation of multiple predictor variables. Based on our results, we discuss the risks associated with the missing data practices employed and evaluate the appropriateness of multiple imputation as an alternative for data imputation and uncertainty estimation for models of global health indicators.
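
    A minimal sketch of Rubin's rules, the standard way to propagate the "double uncertainty" discussed above into standard errors when pooling a quantity estimated on each of m imputed datasets:

    ```python
    # Sketch: pool m imputed-data estimates with Rubin's rules.
    import numpy as np

    def rubin_pool(estimates, variances):
        """estimates, variances: length-m sequences from m imputed-data fits."""
        estimates = np.asarray(estimates, dtype=float)
        variances = np.asarray(variances, dtype=float)
        m = len(estimates)
        qbar = estimates.mean()            # pooled point estimate
        W = variances.mean()               # within-imputation variance
        B = estimates.var(ddof=1)          # between-imputation variance
        T = W + (1 + 1 / m) * B            # total variance
        return qbar, np.sqrt(T)            # estimate and pooled std. error
    ```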

  14. Quarterly Labour Force Survey Household Dataset, April - June, 2021

    • beta.ukdataservice.ac.uk
    Updated 2023
    + more versions
    Cite
    Office For National Statistics (2023). Quarterly Labour Force Survey Household Dataset, April - June, 2021 [Dataset]. http://doi.org/10.5255/ukda-sn-8852-3
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    DataCite (https://www.datacite.org/)
    Authors
    Office For National Statistics
    Description
    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Household datasets
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

    Change to coding of missing values for household series
    From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance page before commencing analysis.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    End User Licence and Secure Access QLFS Household datasets
    Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

    Changes to variables in QLFS Household EUL datasets
    In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger numbers of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, it is advised to filter out 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
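
    A minimal pandas sketch of the two adjustments described above when assembling a household time series: treat the pre-2013 combined '-10' code and the '-8'/'-9' codes as missing, and drop ioutcome=3 cases for consistency across periods. The function and variable list are hypothetical; ioutcome and the codes come from the documentation above.

    ```python
    # Sketch: recode LFS household missing-value codes, filter non-responders.
    import pandas as pd

    def clean_lfs_household(df, personal_vars):
        """df: one quarter's household dataset; personal_vars: column names."""
        out = df.copy()
        out[personal_vars] = out[personal_vars].replace([-8, -9, -10], pd.NA)
        return out[out["ioutcome"] != 3]   # drop inconsistently treated cases
    ```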

    Occupation data for 2021 and 2022 data files

    The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

    Latest edition information

    For the third edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN, SOC20M and SOC20O have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

  15. Models and predictions for "How to deal w_ missing input data"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 15, 2025
    Cite
    Gauch, Martin (2025). Models and predictions for "How to deal w_ missing input data" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15008460
    Explore at:
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Gauch, Martin
    Description

    How to deal w_ missing input data

    This repository contains the models, configs, and results files for the paper Gauch et al., "How to deal w_ missing input data".

    The corresponding analysis code is available on GitHub: https://github.com/gauchm/missing-inputs.

    Contents of this repository

    missing-inputs.ipynb -- Jupyter notebook to reproduce figures from the paper.

    results/ -- Folder with model weights, configs, and predictions used in missing-inputs.ipynb.

    patches/ -- Contains patches for local modifications to reproduce experiments from the paper.

    Required setup

    Clone neuralhydrology: git clone https://github.com/neuralhydrology/neuralhydrology.git.

    Install an editable version of neuralhydrology: cd neuralhydrology && pip install -e .

    Download the following data:

    the CAMELS US dataset (CAMELS time series meteorology, observed flow, meta data, version 1.2) from NCAR into some data directory (has to match data_dir in the config files).

    the extended Maurer and NLDAS forcings set available on HydroShare: Maurer, NLDAS.

    the models, results, and config files from this paper, available in this Zenodo repository.

    Note that to reproduce the experiments, local modifications to NeuralHydrology are necessary. To do so, apply the patches in the patches/ directory: git apply patches/experiment-N.patch.

  16. L1B2.out: Samples of MISR L1B2 GRP data to explore the missing data...

    • dataservices.gfz-potsdam.de
    Updated Feb 27, 2020
    Cite
    GFZ Data Services (2020). L1B2.out: Samples of MISR L1B2 GRP data to explore the missing data replacement process [Dataset]. http://doi.org/10.5880/fidgeo.2020.012
    Explore at:
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    DataCite (https://www.datacite.org/)
    GFZ Data Services
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data publication provides access to (1) an archive of maps and statistics on MISR L1B2 GRP data products updated as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-2019-210), (2) a user manual describing this archive, (3) a large archive of standard (unprocessed) MISR data files that can be used in conjunction with the IDL software repository published on GitHub and available from https://github.com/mmverstraete (Verstraete et al., 2019, https://doi.org/10.5281/zenodo.3519989), (4) an additional archive of maps and statistics on MISR L1B2 GRP data products updated as described for eight additional Blocks of MISR data, spanning a broader range of climatic and environmental conditions (between Iraq and Namibia), and (5) a user manual describing this second archive. The authors also make a self-contained, stand-alone version of that processing software available to all users, using the IDL Virtual Machine technology (which does not require an IDL license) from Verstraete et al., 2020: http://doi.org/10.5880/fidgeo.2020.011.

    (1) The compressed archive 'L1B2_Out.zip' contains all outputs produced in the course of generating the various Figures of the manuscript Verstraete et al. (2020b). Once this archive is installed and uncompressed, 9 subdirectories named Fig-fff-Ttt_Pxxx-Oyyyyyy-Bzzz are created, where fff, tt, xxx, yyyyyy and zzz stand for the Figure number, an optional Table number, Path, Orbit and Block numbers, respectively. These directories contain collections of text, graphics (maps and scatterplots) and binary data files relative to the intermediary, final and ancillary results generated while preparing those Figures. Maps and scatterplots are provided as graphics files in PNG format. Map legends are plain text files with the same names as the maps themselves, but with a file extension '.txt'. Log files are also plain text files. They are generated by the software that creates those graphics files and provide additional details on the intermediary and final results. The processing of MISR L1B2 GRP data product files requires access to cloud masks for the same geographical areas (one for each of the 9 cameras). Since those masks are themselves derived from the L1B2 GRP data and therefore also contain missing data, the outcomes from updating the RCCM data products, as described in Verstraete et al. (2020, https://doi.org/10.5194/essd-12-611-2020), are also included in this archive. The last 2 subdirectories contain the outcomes from the normal processing of the indicated data files, as well as those generated when additional missing data are artificially inserted in the input files for the purpose of assessing the performance of the algorithms.

    (2) The document 'L1B2_Out.pdf' provides the User Manual to install and explore the compressed archive 'L1B2_Out.zip'.

    (3) The compressed archive 'L1B2_input_68050.zip' contains MISR L1B2 GRP and RCCM data for the full Orbit 68050, acquired on 3 October 2012, as well as the corresponding AGP file, which is required by the processing system to update the radiance product. This archive includes data for a wide range of locations, from Russia to north-west Iran, central and eastern Iraq, Saudi Arabia, and many more countries along the eastern coast of the African continent. It is provided to allow users to analyze actual data with the software package mentioned above, without needing to download MISR data from the NASA ASDC web site.

    (4) The compressed archive 'L1B2_Suppl.zip' contains a set of results similar to the archive 'L1B2_Out.zip' mentioned above, for four additional sites, spanning a much wider range of geographical, climatic and ecological conditions: these cover areas in Iraq (marsh and arid lands), Kenya (agriculture and tropical forests), South Sudan (grasslands) and Namibia (coastal desert and Atlantic Ocean). Two of them involve largely clear scenes, and the other two include clouds. The last case also includes a test to artificially introduce missing data over deep water and clouds, to demonstrate the performance of the procedure on targets other than continental areas. Once uncompressed, this new archive expands into 8 subdirectories and takes up 1.8 GB of disk space, providing access to about 2,900 files.

    (5) The companion user manual L1B2_Suppl.pdf, describing how to install, uncompress and explore those additional files.

  17. Replication Data for: "The Missing Dimension of the Political Resource...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Lall, Ranjit (2023). Replication Data for: "The Missing Dimension of the Political Resource Curse Debate" (Comparative Political Studies) [Dataset]. http://doi.org/10.7910/DVN/UHABC6
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit
    Description

    Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.

  18. Data for: Filling the data gaps within GRACE missions using Singular...

    • darus.uni-stuttgart.de
    Updated May 14, 2021
    Cite
    Shuang Yi; Nico Sneeuw (2021). Data for: Filling the data gaps within GRACE missions using Singular Spectrum Analysis [Dataset]. http://doi.org/10.18419/DARUS-807
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    May 14, 2021
    Dataset provided by
    DaRUS
    Authors
    Shuang Yi; Nico Sneeuw
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dozens of missing epochs in the monthly gravity product of the satellite mission Gravity Recovery and Climate Experiment (GRACE) and its follow-on (GRACE-FO) mission greatly inhibit the complete analysis and full utilization of the data. Despite previous attempts to handle this problem, a general all-purpose gap-filling solution is still lacking. Here we propose a non-parametric, data-adaptive and easy-to-implement approach - composed of the Singular Spectrum Analysis (SSA) gap-filling technique, cross-validation, and spectral testing for significant components - to produce reasonable gap-filling results in the form of spherical harmonic coefficients (SHCs). We demonstrate that this approach is adept at inferring missing data from long-term and oscillatory changes extracted from available observations. A comparison in the spectral domain reveals that the gap-filling result resembles the product of GRACE missions below spherical harmonic degree 30 very well. As the degree increases above 30, the amplitude per degree of the gap-filling result decreases more rapidly than that of GRACE/GRACE-FO SHCs, showing effective suppression of noise. As a result, our approach can reduce noise in the oceans without sacrificing resolutions on land.

    The gap-filling dataset is stored in the "SSA_filing/" folder. Each file represents a monthly result in the form of spherical harmonics. The data format follows the convention of the site ftp://isdcftp.gfz-potsdam.de/grace/. Low degree corrections (degree-1, C20, C30) have been made. The code to generate the dataset is located in the "code_share/" folder, with an example for C30. The model-based Greenland mass balance result for data validation (results given in the paper) is provided in the "Greenland_SMB-D.txt" file.
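
    A minimal sketch of iterative SSA gap filling on a single series, the core of the approach described above (window length, rank, and iteration count are illustrative assumptions; the published method adds cross-validation and spectral significance testing and operates on spherical harmonic coefficients):

    ```python
    # Sketch: fill gaps in a series by iterating low-rank SSA reconstruction.
    import numpy as np

    def ssa_reconstruct(x, L, r):
        """Rank-r SSA reconstruction of series x with window length L."""
        N = len(x)                                   # requires N > L
        K = N - L + 1
        X = np.column_stack([x[i:i + L] for i in range(K)])  # trajectory matrix
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Xr = (U[:, :r] * s[:r]) @ Vt[:r]             # keep r leading components
        rec, counts = np.zeros(N), np.zeros(N)
        for k in range(K):                           # diagonal averaging
            rec[k:k + L] += Xr[:, k]
            counts[k:k + L] += 1
        return rec / counts

    def ssa_fill(x, L=36, r=4, iters=50):
        """x: series with np.nan gaps; returns a gap-filled copy."""
        gaps = np.isnan(x)
        y = np.where(gaps, np.nanmean(x), x)
        for _ in range(iters):
            y[gaps] = ssa_reconstruct(y, L, r)[gaps]  # update gaps only
        return y
    ```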

  19. f

    MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as Cellwise and Rowwise Outliers

    • tandf.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche (2023). MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as Cellwise and Rowwise Outliers [Dataset]. http://doi.org/10.6084/m9.figshare.7624424.v2
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, that is, rows that deviate from the majority of the rows in the data (e.g., they might belong to a different population). In recent years also cellwise outliers are receiving attention. These are suspicious cells (entries) that can occur anywhere in the table. Even a relatively small proportion of outlying cells can contaminate over half the rows, which causes rowwise robust methods to break down. In this article, a new PCA method is constructed which combines the strengths of two existing robust methods to be robust against both cellwise and rowwise outliers. At the same time, the algorithm can cope with missing values. As of yet it is the only PCA method that can deal with all three problems simultaneously. Its name MacroPCA stands for PCA allowing for Missingness And Cellwise & Rowwise Outliers. Several simulations and real datasets illustrate its robustness. New residual maps are introduced, which help to determine which variables are responsible for the outlying behavior. The method is well-suited for online process control.
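
A reference implementation of MacroPCA is available in the R package cellWise. The toy Python sketch below covers only the missing-value side of the problem, via plain iterative PCA imputation, and makes no attempt at MacroPCA's cellwise or rowwise robustness; all parameter choices are illustrative.

```python
# Toy iterative PCA imputation: only the "Missingness" part of the
# problem. MacroPCA additionally downweights cellwise and rowwise
# outliers, which this plain SVD-based loop does not attempt.
import numpy as np

def pca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = u[:, :n_components] * s[:n_components] @ vt[:n_components] + mu
        shift = np.abs(filled[miss] - approx[miss]).max() if miss.any() else 0.0
        filled[miss] = approx[miss]  # refresh only the missing cells
        if shift < tol:
            break
    return filled

# Low-rank data with 20% of cells missing at random:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(100, 6))
X[rng.random(X.shape) < 0.2] = np.nan
print("remaining NaNs after imputation:", np.isnan(pca_impute(X)).sum())
```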

  20. J

    MAXIMUM LIKELIHOOD ESTIMATION OF FACTOR MODELS ON DATASETS WITH ARBITRARY PATTERN OF MISSING DATA (replication data)

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    txt
    Updated Dec 7, 2022
    Cite
    Marta Banbura; Michele Modugno (2022). MAXIMUM LIKELIHOOD ESTIMATION OF FACTOR MODELS ON DATASETS WITH ARBITRARY PATTERN OF MISSING DATA (replication data) [Dataset]. http://doi.org/10.15456/jae.2022321.0712228351
    Available download formats: txt(94822), txt(2114), txt(6719)
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Marta Banbura; Michele Modugno
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper we modify the expectation-maximization algorithm in order to estimate the parameters of a dynamic factor model on a dataset with an arbitrary pattern of missing data. We also extend the model to the case of a serially correlated idiosyncratic component. The framework allows us to handle, efficiently and automatically, sets of indicators characterized by different publication delays, frequencies, and sample lengths. This can be relevant, for example, for young economies for which many indicators have been compiled only recently. We evaluate the methodology in a Monte Carlo experiment and apply it to nowcasting of euro-area gross domestic product.
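
To make the structure of such an EM step concrete, the sketch below implements a heavily simplified static analogue: a factor model with diagonal idiosyncratic noise, no dynamics, and no serial correlation. The E-step conditions each period's factor posterior only on that period's observed entries; the M-step updates each loading row and variance using only the periods where that variable is observed. Names and simplifications are illustrative, not the paper's replication code.

```python
# Much-simplified EM for a static factor model x_t = Lambda f_t + e_t,
# f_t ~ N(0, I), e_t ~ N(0, diag(Psi)), with arbitrary missing entries.
# Dynamics and serially correlated idiosyncratic terms are omitted.
import numpy as np

def em_factor_missing(X, r=1, n_iter=50):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    obs = ~np.isnan(X)
    Lam = np.random.default_rng(0).normal(size=(p, r)) * 0.1
    Psi = np.ones(p)                  # diagonal idiosyncratic variances
    M = np.zeros((n, r))              # posterior factor means E[f_t | x_obs]
    S = np.zeros((n, r, r))           # posterior second moments E[f_t f_t']
    for _ in range(n_iter):
        # E-step: condition each period's factor only on its observed entries.
        for t in range(n):
            o = obs[t]
            Lo, Po = Lam[o], Psi[o]
            V = np.linalg.inv(np.eye(r) + Lo.T @ (Lo / Po[:, None]))
            M[t] = V @ Lo.T @ (X[t, o] / Po)
            S[t] = V + np.outer(M[t], M[t])
        # M-step: update loadings and variances variable by variable,
        # using only the periods where that variable is observed.
        for j in range(p):
            tj = obs[:, j]
            Lam[j] = np.linalg.solve(S[tj].sum(axis=0),
                                     (X[tj, j][:, None] * M[tj]).sum(axis=0))
            Psi[j] = max((X[tj, j] ** 2 - X[tj, j] * (M[tj] @ Lam[j])).mean(), 1e-8)
    return Lam, Psi, M

# Example: one-factor data with 30% of entries missing at random.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(200, 8))
X[rng.random(X.shape) < 0.3] = np.nan
Lam, Psi, M = em_factor_missing(X, r=1)
print("estimated loadings:", Lam.ravel().round(2))
```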
