44 datasets found
  1. Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...

    • frontiersin.figshare.com
    docx
    Updated Jun 3, 2023
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s001
    Available download formats: docx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An adequate imputation of missing data would significantly preserve statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a great tool to infer missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics to evaluate imputation accuracy. However, the Cox proportional hazards model with various types of imputation requires deliberate study, and the validity of these methods under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors associated with different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest”, based on unsupervised imputation, is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Statistical analysis conducted improperly with missing data may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.
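    As a rough illustration of the workflow this abstract describes, the sketch below imputes a covariate with a random-forest-based imputer (a Python analogue of R's missForest, here approximated with scikit-learn's IterativeImputer) and then fits a Cox proportional hazards model with lifelines. The simulated data, column names, and parameter choices are assumptions for the example, not the authors' code.

    ```python
    # Hypothetical sketch: random-forest imputation followed by a Cox PH fit.
    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    time = rng.exponential(scale=np.exp(-0.5 * x1), size=n)  # hazard depends on x1
    event = rng.binomial(1, 0.8, size=n)                     # ~20% censoring
    df = pd.DataFrame({"time": time, "event": event, "x1": x1, "x2": x2})

    # Introduce 20% missingness completely at random (MCAR) in x1.
    df.loc[rng.random(n) < 0.2, "x1"] = np.nan

    # Unsupervised imputation: the outcome (time, event) is NOT used as a predictor.
    imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                               random_state=0, max_iter=10)
    df[["x1", "x2"]] = imputer.fit_transform(df[["x1", "x2"]])

    # Fit the Cox proportional hazards model on the imputed data.
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    cph.print_summary()
    ```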

  2. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    figshare
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
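    A minimal sketch of the three mechanisms named above, assuming a toy two-column table; the variable names and cut-offs are invented for illustration. It simulates MCAR, MAR, and MNAR missingness and applies mean imputation, one of the strategies the guide discusses.

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 1000
    age = rng.normal(40, 10, n)
    income = 1000 + 50 * age + rng.normal(0, 200, n)
    df = pd.DataFrame({"age": age, "income": income})

    # MCAR: every income value has the same 20% chance of being missing.
    mcar = df.copy()
    mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

    # MAR: the chance that income is missing depends on the observed age.
    mar = df.copy()
    p = 1 / (1 + np.exp(-(age - 45) / 5))      # older respondents skip more often
    mar.loc[rng.random(n) < p, "income"] = np.nan

    # MNAR: the chance that income is missing depends on income itself.
    mnar = df.copy()
    p = 1 / (1 + np.exp(-(income - income.mean()) / income.std()))
    mnar.loc[rng.random(n) < p, "income"] = np.nan

    # Mean imputation (simple, but biased under MAR/MNAR): replace NaN with the mean.
    imputed = mar.fillna({"income": mar["income"].mean()})
    print(imputed["income"].isna().sum())  # 0 missing values remain
    ```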

  3. Using a repeated measure mixed model for handling missing cost and outcome...

    • data.mendeley.com
    Updated Jan 21, 2021
    Modou Diop (2021). Using a repeated measure mixed model for handling missing cost and outcome data in clinical trial-based cost-effectiveness analysis [Dataset]. http://doi.org/10.17632/j8fmdwd4jp.1
    Dataset updated
    Jan 21, 2021
    Authors
    Modou Diop
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OBJECTIVES: Where trial recruitment is staggered over time, patients may have different lengths of follow-up, meaning that the dataset is an unbalanced panel with a considerable amount of missing data. This study presents a method for estimating the difference in total costs and total Quality-Adjusted Life Years (QALYs) over a given time horizon using a repeated measures mixed model (RMMM). To the authors’ knowledge this is the first time this method has been exploited in the context of economic evaluation within clinical trials. METHODS: An example (EVLA trial, NIHR HTA project 11/129/197) is used where patients have between 1 and 5 years of follow-up. Early treatment is compared with delayed treatment. Coefficients at each time point from the repeated measures mixed model were aggregated to estimate total mean cost and total mean QALYs over 3 years. Results were compared with other methods for handling missing data: complete-case analysis (CCA), multiple imputation using linear regression (MILR) and using predictive mean matching (MIPMM), and a Bayesian parametric approach (BPA). RESULTS: Mean differences in costs varied among the approaches. CCA, MIPMM, and MILR recorded greater mean costs in delayed treatment: £216 (95% CI -£1413 to £1845), £36 (95% CI -£581 to £652), and £30 (95% CI -£617 to £679), respectively. RMM and BPA showed greater costs in early intervention: -£67 (95% CI -£1069 to £855) and -£162 (95% CI -£728 to £402), respectively. Early intervention was associated with greater QALYs under all methods at year 3, with RMM showing the highest QALYs: 0.073 (95% CI -0.06 to 0.2). CONCLUSION: MIPMM showed the most efficient results in our cost-effectiveness analysis. By contrast, when the percentage of missingness is high, RMM shows similar results to MIPMM. Hence, we conclude that RMM is a flexible and robust alternative for modelling continuous outcome data that can be considered missing at random.
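    A loose sketch of a repeated-measures mixed model on simulated long-format cost data, using statsmodels; this only illustrates the general approach (random intercept per patient, time treated as categorical and interacted with treatment arm), not the trial's actual analysis code, and all names and values are assumptions.

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n_patients, n_years = 100, 3
    df = pd.DataFrame({
        "id": np.repeat(np.arange(n_patients), n_years),
        "year": np.tile(np.arange(1, n_years + 1), n_patients),
        "arm": np.repeat(rng.integers(0, 2, n_patients), n_years),  # 0=delayed, 1=early
    })
    df["cost"] = (500 + 100 * df["year"] - 50 * df["arm"]
                  + rng.normal(0, 80, len(df)))

    # Unbalanced panel: drop some year-3 observations to mimic staggered follow-up.
    df = df[~((df["year"] == 3) & (rng.random(len(df)) < 0.3))]

    # Random intercept per patient; year as categorical, interacted with treatment arm.
    model = smf.mixedlm("cost ~ C(year) * arm", data=df, groups=df["id"])
    result = model.fit()
    print(result.summary())
    # Total mean cost per arm over 3 years can then be built by aggregating the
    # fitted yearly means, mirroring the aggregation described above.
    ```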

  4. MAPE and PB statistics for IBFI compared with other imputation methods...

    • plos.figshare.com
    xls
    Updated Jun 6, 2023
    + more versions
    Adil Aslam Mir; Kimberlee Jane Kearfott; Fatih Vehbi Çelebi; Muhammad Rafique (2023). MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR). [Dataset]. http://doi.org/10.1371/journal.pone.0262131.t003
    Available download formats: xls
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Adil Aslam Mir; Kimberlee Jane Kearfott; Fatih Vehbi Çelebi; Muhammad Rafique
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR).
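    For orientation, one common way to compute MAPE and percent bias (PB) between true and imputed values is sketched below; the exact definitions used in the associated paper may differ.

    ```python
    import numpy as np

    def mape(true, imputed):
        """Mean absolute percentage error, in percent."""
        true, imputed = np.asarray(true, float), np.asarray(imputed, float)
        return 100.0 * np.mean(np.abs((true - imputed) / true))

    def percent_bias(true, imputed):
        """Percent bias: average signed deviation relative to the true values."""
        true, imputed = np.asarray(true, float), np.asarray(imputed, float)
        return 100.0 * np.sum(imputed - true) / np.sum(true)

    true_vals = [10.0, 12.5, 9.8, 11.2]
    imputed_vals = [10.4, 12.1, 10.0, 11.5]
    print(mape(true_vals, imputed_vals), percent_bias(true_vals, imputed_vals))
    ```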

  5. Longitudinal Quasi-Experimental Assessment Data: Kazakhstan

    • data.mendeley.com
    Updated Aug 17, 2023
    + more versions
    Matthew Courtney (2023). Longitudinal Quasi-Experimental Assessment Data: Kazakhstan [Dataset]. http://doi.org/10.17632/sxpdsszps8.1
    Dataset updated
    Aug 17, 2023
    Authors
    Matthew Courtney
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Kazakhstan
    Description

    The data in this paper pertain to those analyzed as part of a pilot monitoring study as described by Courtney et al. [1]. While the study reported results for three languages, the paper at hand references the results for just one study involving the Kazakh medium-of-instruction. The first variable in the data is ‘Student_id’, which provides a unique anonymized student ID number for each student. The next six variables in the data vary between schools: (1) ‘experiment’, for which each school is defined as either a pilot (experiment) or control school; (2) ‘Region’, defining the region of the country in which each school is situated; (3) ‘School_name’, the name assigned to each unique school; (4) ‘Language’, defining the medium-of-instruction used in each school; (5) ‘School_type’, where schools are defined as either (i) small, (ii) gymnasium, (iii) lyceum, or (iv) common (common used as baseline in modelling). Thereafter, the within-school variable ‘Gender’ is given, for which each child is defined as either male or female. In addition, columns 9-92 contain the item-response data (for all elements, 0 = incorrect, 1 = correct) for each child on each item in the five tests.

    • 2015 Test: Items OG1 to OG5.4 (columns 9-19; link items to 2016 are OG2.2A, OG2.3A, OG23A)
    • 2016 Test: Items OG2.2 to OG11.2 (20-35; link items to 2017 are )
    • 2017 Test: Items OG11.1 to K3.6 (36-47)
    • 2018 Test: Items K4.1 to K3.3 (48-68)
    • 2019 Test: Items X19K1 to K6 (69-92)

    The data is filtered: only students who completed all five literacy assessments (2015 to 2019) were included. Alternative approaches for dealing with missing data (attrition) are considered in the analysis described in this article. The original data are in Cyrillic, which is used to showcase the R software’s capacity to manage and recode such data.

  6. Science Education Research Topic Modeling Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, html +2
    Updated Oct 9, 2024
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
    Available download formats: bin, txt, html, text/x-python
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

    The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

    • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
    • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
    • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)
    • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
    • We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.
    • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
    • We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

    After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
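    A minimal sketch, assuming the pickled token lists described above, of how the dataset could be loaded and analysed with LDA via gensim; the number of topics, filtering thresholds, and passes are placeholder values, and the accompanying notebook's actual settings may differ.

    ```python
    import pickle
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    with open("scied_words_bigrams_V5.pkl", "rb") as f:
        documents = pickle.load(f)          # list of token lists, one per article

    dictionary = Dictionary(documents)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # trim rare/ubiquitous terms
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=10, passes=10, random_state=0)
    for topic_id, words in lda.print_topics(num_words=8):
        print(topic_id, words)
    ```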

    In addition to this file, we have also included the following files:

    1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data
    2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
    3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

    This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

  7. Data from: Macaques preferentially attend to intermediately surprising...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Apr 26, 2022
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
    Available download formats: zip
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Yale University
    University of Minnesota
    University of California, Berkeley
    Klaviyo
    Authors
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

    Methods: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

    Dataset:

    "csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

    • subj: subject_ID = {"B":104, "C":102, "H":101, "J":103, "K":203}
    • trialtime: start time of current trial in seconds
    • trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
    • seq: current sequence number (one of 80 sequences)
    • seq_item: current item number in a sequence (in order)
    • active_item: pop-up item (active box)
    • pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence / no active object before the currently active object in the sequence"}
    • next_active: next pop-up item (active box) {-1: "the last active object in the sequence / no active object after the currently active object in the sequence"}
    • firstappear: {0: "not first", 1: "first appear in the seq"}
    • looks_blank: csv: total amount of time looking at blank space for the current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
    • looks_offscreen: csv: total amount of time looking offscreen for the current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
    • time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"}
    • looks target: csv: time spent looking at the target object (ms); csv_timestamp: look at the target or not at current timestamp (1 or 0)
    • look1,2,3: time spent looking at each object (ms)
    • location 123X, 123Y: location of each box (locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
    • item123id: pop-up item ID (remained static throughout a sequence)
    • event time: total time spent for the whole event (pop-up and go back) (ms)
    • eyeposX,Y: eye position at current timestamp

    "csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

    • rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
    • already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and for which already_there is not NA.
    • looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
    • prob: the probability of the occurrence of the object
    • surprisal: unigram surprisal value
    • bisurprisal: transitional surprisal value
    • std_surprisal: standardized unigram surprisal value
    • std_bisurprisal: standardized transitional surprisal value
    • binned_surprisal_means: the means of unigram surprisal values binned into three groups of evenly spaced intervals according to surprisal values
    • binned_bisurprisal_means: the means of transitional surprisal values binned into three groups of evenly spaced intervals according to surprisal values
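    For clarity on the surprisal columns above: unigram surprisal is the negative log probability of an item occurring, and transitional surprisal conditions on the preceding item. The toy sketch below illustrates both on a made-up pop-up sequence; the study's actual probability estimates may be computed differently.

    ```python
    import numpy as np
    from collections import Counter

    sequence = ["A", "B", "A", "C", "A", "B", "B", "C", "A"]   # toy pop-up sequence

    # Unigram surprisal: -log2 P(item)
    counts = Counter(sequence)
    total = len(sequence)
    unigram_surprisal = {item: -np.log2(c / total) for item, c in counts.items()}

    # Transitional surprisal: -log2 P(item | previous item); undefined (NA) for the
    # first event in a sequence, matching the missing values noted below.
    pair_counts = Counter(zip(sequence[:-1], sequence[1:]))
    prev_counts = Counter(sequence[:-1])
    bisurprisal = [None]  # first event has no transitional surprisal
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        p = pair_counts[(prev, cur)] / prev_counts[prev]
        bisurprisal.append(-np.log2(p))

    print(unigram_surprisal)
    print(bisurprisal)
    ```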

    "csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences

    Empty Values in Datasets:

    There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA in the datasets) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in the columns "prev_active" and "next_active" mean that the currently active object is the first or last active object in the sequence, i.e., there is no active object before or after it. When we analyzed the variable "already_there", we eliminated data whose "prev_active" variable is NA. NAs in the column "already_there" mean that the subject never looked at the target object in the current event. When we analyzed the variable "already_there", we eliminated data whose "already_there" variable is NA. Missing values occur in the columns "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when the event is the first event in the sequence and its transitional probability cannot be computed because no event precedes it in the sequence. When we fitted models for transitional statistics, we eliminated data whose "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" values are NA.

    Codes:

    In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd

  8. Missing completely at random test.

    • plos.figshare.com
    xls
    Updated Apr 30, 2025
    Ayman Omar Baniamer (2025). Missing completely at random test. [Dataset]. http://doi.org/10.1371/journal.pone.0321344.t001
    Available download formats: xls
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ayman Omar Baniamer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical models are essential tools in data analysis. However, missing data plays a pivotal role in impacting the assumptions and effectiveness of statistical models, especially when there is a significant amount of missing data. This study addresses one of the core assumptions supporting many statistical models, the assumption of unidimensionality. It examines the impact of missing data rates and imputation methods on fulfilling this assumption. The study employs three imputation methods: Corrected Item Mean, multiple imputation, and expectation maximization, assessing their performance across nineteen levels of missing data rates, and examining their impact on the assumption of unidimensionality using several indicators (Cronbach’s alpha, corrected correlation coefficients, and factor analysis (eigenvalues, cumulative variance, and communalities)). The study concluded that all imputation methods used effectively provided data that maintained the unidimensionality assumption, regardless of missing data rates. Additionally, it was found that most of the unidimensionality indicators increased in value as missing data rates rose.
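    As a pointer for one of the indicators named above, a small sketch of Cronbach's alpha for a complete (or already-imputed) item-response matrix follows; the synthetic data are illustrative only.

    ```python
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    rng = np.random.default_rng(7)
    latent = rng.normal(size=(200, 1))
    responses = latent + rng.normal(scale=0.8, size=(200, 6))   # 6 correlated items
    print(round(cronbach_alpha(responses), 3))
    ```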

  9. Uncleaned Laptop Price dataset

    • kaggle.com
    Updated Feb 16, 2023
    EHTISHAM SADIQ (2023). Uncleaned Laptop Price dataset [Dataset]. https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset/suggestions
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    EHTISHAM SADIQ
    Description

    The Uncleaned Laptop Price dataset is a collection of laptop product listings scraped from an online e-commerce website. The dataset includes information about various laptop models, such as their brand, screen size, processor, memory, storage capacity, operating system, and price. However, the dataset is uncleaned, meaning that it contains missing values, inconsistent formatting, and other errors that need to be addressed before the data can be used for analysis or modeling.

    The dataset contains both categorical and numerical variables, with the majority of variables being categorical, including brand, model name, screen resolution, processor type, and operating system. Some numerical variables include screen size, memory, and storage capacity. The target variable in the dataset is the price, which is a continuous variable.

    The dataset contains over 1,300 laptop listings, making it a reasonably sized dataset for analysis and modeling. The dataset may be useful for machine learning projects related to predicting the price of a laptop based on its specifications. However, before using the dataset, it would be necessary to clean and preprocess the data to address the inconsistencies and missing values.
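    A small illustrative cleaning sketch with pandas is shown below; the column names ("Ram", "Weight", "Price") and unit formats are hypothetical examples of the kinds of inconsistencies described, not a guaranteed match to the actual file.

    ```python
    import numpy as np
    import pandas as pd

    # Toy rows mimicking inconsistently formatted fields and missing values.
    df = pd.DataFrame({
        "Ram": ["8GB", "16GB", None, "4GB"],
        "Weight": ["1.37kg", "2.1kg", "1.8kg", None],
        "Price": [71378.68, 135195.34, np.nan, 30636.0],
    })

    # Strip units and convert to numeric types.
    df["Ram"] = pd.to_numeric(df["Ram"].str.replace("GB", "", regex=False))
    df["Weight"] = pd.to_numeric(df["Weight"].str.replace("kg", "", regex=False))

    # Handle missing values: fill numeric columns with the median, then check.
    for col in ["Ram", "Weight", "Price"]:
        df[col] = df[col].fillna(df[col].median())
    print(df.isna().sum())
    ```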

  10. Crop Yield Prediction using Soil and Weather

    • kaggle.com
    Updated Nov 8, 2024
    Gurudath g (2024). Crop Yield Prediction using Soil and Weather [Dataset]. https://www.kaggle.com/datasets/gurudathg/crop-yield-prediction-using-soil-and-weather/discussion?sort=undefined
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gurudath g
    Description

    This dataset is designed to help researchers and data scientists predict crop yield based on key agricultural factors. It contains columns representing the amount of fertilizer used, temperature, and soil nutrients (nitrogen, phosphorus, and potassium), along with the corresponding crop yield. This dataset can be used for machine learning projects focusing on agricultural optimization, yield forecasting, and resource management. The data is structured and ready for analysis, making it ideal for training regression models or conducting exploratory data analysis.

    File Information:
    • Number of Rows: 2,596
    • Number of Columns: 6

    Data Types:
    • Fertilizer: float64
    • temp: float64
    • N (Nitrogen): float64
    • P (Phosphorus): float64
    • K (Potassium): float64
    • yeild (Yield): float64

    Missing Values: None (0 missing values in all columns)

    Summary Statistics:
    • Fertilizer: Mean: 66.49, Min: 49.75, Max: 80.22
    • Temperature: Mean: 33.85, Min: 23.77, Max: 40.27
    • Nitrogen (N): Mean: 69.52, Min: 58.84, Max: 80.22
    • Phosphorus (P): Mean: 20.71, Min: 17.72, Max: 25.16
    • Potassium (K): Mean: 17.81, Min: 14.70, Max: 22.06
    • Yield: Mean: 8.53, Min: 5.15, Max: 12.34

    This dataset is complete, with no missing values, and provides diverse statistics for various features important for crop yield prediction.
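    A minimal modelling sketch using the columns listed above; the CSV file name is an assumption, and the column spellings (including the "yeild" target) follow the listing.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    df = pd.read_csv("crop_yield.csv")            # hypothetical file name
    X = df[["Fertilizer", "temp", "N", "P", "K"]]
    y = df["yeild"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
    ```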

  11. Data from: To Adjust or not to Adjust? Estimating the Average Treatment...

    • tandf.figshare.com
    pdf
    Updated Jun 6, 2023
    Anqi Zhao; Peng Ding (2023). To Adjust or not to Adjust? Estimating the Average Treatment Effect in Randomized Experiments with Missing Covariates [Dataset]. http://doi.org/10.6084/m9.figshare.21082244.v2
    Available download formats: pdf
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Anqi Zhao; Peng Ding
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Randomized experiments allow for consistent estimation of the average treatment effect based on the difference in mean outcomes without strong modeling assumptions. Appropriate use of pretreatment covariates can further improve the estimation efficiency. Missingness in covariates is nevertheless common in practice, and raises an important question: should we adjust for covariates subject to missingness, and if so, how? The unadjusted difference in means is always unbiased. The complete-covariate analysis adjusts for all completely observed covariates, and is asymptotically more efficient than the difference in means if at least one completely observed covariate is predictive of the outcome. Then what is the additional gain of adjusting for covariates subject to missingness? To reconcile the conflicting recommendations in the literature, we analyze and compare five strategies for handling missing covariates in randomized experiments under the design-based framework, and recommend the missingness-indicator method, as a known but not so popular strategy in the literature, due to its multiple advantages. First, it removes the dependence of the regression-adjusted estimators on the imputed values for the missing covariates. Second, it does not require modeling the missingness mechanism, and yields consistent estimators even when the missingness mechanism is related to the missing covariates and unobservable potential outcomes. Third, it ensures large-sample efficiency over the complete-covariate analysis and the analysis based on only the imputed covariates. Lastly, it is easy to implement via least squares. We also propose modifications to it based on asymptotic and finite sample considerations. Importantly, our theory views randomization as the basis for inference, and does not impose any modeling assumptions on the data-generating process or missingness mechanism. Supplementary materials for this article are available online.
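    A schematic sketch of the missingness-indicator construction described above: fill each missing covariate with a fixed value, add a 0/1 indicator of missingness, and adjust by least squares. The paper's estimators and proposed modifications are more refined; all data here are simulated for illustration.

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 400
    z = rng.binomial(1, 0.5, n)                 # randomized treatment
    x = rng.normal(size=n)                      # covariate, partly missing
    y = 1.0 + 0.5 * z + 0.8 * x + rng.normal(size=n)
    x_obs = np.where(rng.random(n) < 0.3, np.nan, x)   # ~30% missing

    df = pd.DataFrame({"y": y, "z": z, "x": x_obs})
    df["x_missing"] = df["x"].isna().astype(int)        # missingness indicator
    df["x"] = df["x"].fillna(0.0)  # fixed fill value; the estimator does not depend on it

    X = sm.add_constant(df[["z", "x", "x_missing"]])
    fit = sm.OLS(df["y"], X).fit()
    print(fit.params["z"])   # regression-adjusted treatment effect estimate
    ```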

  12. 2023 Census main means of travel to work by statistical area 3

    • datafinder.stats.govt.nz
    csv, dbf (dbase iii) +4
    Updated Jun 11, 2025
    + more versions
    Stats NZ (2025). 2023 Census main means of travel to work by statistical area 3 [Dataset]. https://datafinder.stats.govt.nz/table/122496-2023-census-main-means-of-travel-to-work-by-statistical-area-3/
    Available download formats: mapinfo mif, csv, dbf (dbase iii), geodatabase, mapinfo tab, geopackage / sqlite
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Statistics New Zealand (http://www.stats.govt.nz/)
    Authors
    Stats NZ
    License

    Attribution 4.0 International (https://datafinder.stats.govt.nz/license/attribution-4-0-international/)

    Description

    Dataset shows an individual’s statistical area 3 (SA3) of usual residence and the SA3 of their workplace address, for the employed census usually resident population count aged 15 years and over, by main means of travel to work from the 2018 and 2023 Censuses.

    The main means of travel to work categories are:

    • Work at home
    • Drive a private car, truck, or van
    • Drive a company car, truck, or van
    • Passenger in a car, truck, van, or company bus
    • Public bus
    • Train
    • Bicycle
    • Walk or jog
    • Ferry
    • Other.

    Main means of travel to work is the usual method which an employed person aged 15 years and over used to travel the longest distance to their place of work.

    Workplace address refers to where someone usually works in their main job, that is the job in which they worked the most hours. For people who work at home, this is the same address as their usual residence address. For people who do not work at home, this could be the address of the business they work for or another address, such as a building site.

    Workplace address is coded to the most detailed geography possible from the available information. This dataset only includes travel to work information for individuals whose workplace address is available at SA3 level. The sum of the counts for each region in this dataset may not equal the total employed census usually resident population count aged 15 years and over for that region. Workplace address – 2023 Census: Information by concept has more information.

    This dataset can be used in conjunction with the following spatial files by joining on the SA3 code values:

    Download data table using the instructions in the Koordinates help guide.

    Footnotes

    Geographical boundaries

    Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

    Subnational census usually resident population

    The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city. 

    Population counts

    Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts. 

    Caution using time series

    Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data).

    Workplace address time series

    Workplace address time series data should be interpreted with care at lower geographic levels, such as statistical area 2 (SA2). Methodological improvements in 2023 Census resulted in greater data accuracy, including a greater proportion of people being counted at lower geographic areas compared to the 2018 Census. Workplace address – 2023 Census: Information by concept has more information.

    Working at home

    In the census, working at home captures both remote work, and people whose business is at their home address (e.g. farmers or small business owners operating from their home). The census asks respondents whether they ‘mostly’ work at home or away from home. It does not capture whether someone does both, or how frequently they do one or the other.

    Rows excluded from the dataset

    Rows show SA3 of usual residence by SA3 of workplace address. Rows with a total population count of less than six have been removed to reduce the size of the dataset, given only a small proportion of SA3-SA3 combinations have commuter flows.

    About the 2023 Census dataset

    For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

    Data quality

    The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

    Quality rating of a variable

    The quality rating of a variable provides an overall evaluation of data quality for that variable, usually at the highest levels of classification. The quality ratings shown are for the 2023 Census unless stated. There is variability in the quality of data at smaller geographies. Data quality may also vary between censuses, for subpopulations, or when cross tabulated with other variables or at lower levels of the classification. Data quality ratings for 2023 Census variables has more information on quality ratings by variable.

    Main means of travel to work quality rating

    Main means of travel to work is rated as moderate quality.

    Main means of travel to work – 2023 Census: Information by concept has more information, for example, definitions and data quality.

    Workplace address quality rating

    Workplace address is rated as moderate quality.

    Workplace address – 2023 Census: Information by concept has more information, for example, definitions and data quality.

    Using data for good

    Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.

    Confidentiality

    The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.

    Percentages

    To calculate percentages, divide the figure for the category of interest by the figure for ‘Total stated’ where this applies.

    Symbol

    -999 Confidential

    Inconsistencies in definitions

    Please note that there may be differences in definitions between census classifications and those used for other data collections.

  13. Absolute difference between missing rates as compared with ZERO rate of...

    • figshare.com
    • plos.figshare.com
    xls
    Updated Apr 30, 2025
    Ayman Omar Baniamer (2025). Absolute difference between missing rates as compared with ZERO rate of missing due to imputation method. [Dataset]. http://doi.org/10.1371/journal.pone.0321344.t002
    Available download formats: xls
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ayman Omar Baniamer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Absolute difference between missing rates as compared with ZERO rate of missing due to imputation method.

  14. Dengue incidence and climatic variables in Cali from 2015 to 2021

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 6, 2024
    Mauricio Frieri (2024). Dengue incidence and climatic variables in Cali from 2015 to 2021 [Dataset]. http://doi.org/10.5061/dryad.0zpc8675h
    Available download formats: zip
    Dataset updated
    May 6, 2024
    Dataset provided by
    Universidad Autónoma de Occidente
    Authors
    Mauricio Frieri
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    In this work we studied the relationship between dengue incidence in Cali and the climatic variables that are known to have an impact on the mosquito and were available (precipitation, relative humidity, and minimum, mean, and maximum temperature). Since the natural processes of the mosquito imply that any changes in climatic variables need some time to become visible in the dengue incidence, a lagged correlation analysis was done in order to choose the predictor variables of count regression models. A Principal Component Analysis was done to reduce dimensionality and study the correlation among the climatic variables. Finally, aiming to predict the monthly dengue incidence, three different regression models were constructed and compared using the Akaike information criterion. The best model was the negative binomial regression model, and the predictor variables were mean temperature with a 3-month lag and mean temperature with a 5-month lag, as well as their interaction. The other variables were not significant in the models. An interesting conclusion was that, according to the coefficients of the regression model, a 1°C increase in the monthly mean temperature is reflected as a 45% increase in dengue incidence after 3 months. This rises to a 64% increase after 5 months.

    Methods: Monthly dengue incidence data were provided by the Public Health Department of Cali. The climatic data were collected from the Hydrology, Meteorology and Environmental Studies Institute (IDEAM) webpage. Missing data were imputed using a random forest regression model.
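    A rough sketch of the final model described above (negative binomial regression of monthly dengue counts on 3- and 5-month-lagged mean temperature and their interaction) using statsmodels; the input file and column names are assumptions.

    ```python
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("dengue_cali_monthly.csv")        # hypothetical file name
    df["temp_lag3"] = df["mean_temp"].shift(3)
    df["temp_lag5"] = df["mean_temp"].shift(5)
    df = df.dropna(subset=["temp_lag3", "temp_lag5"])

    model = smf.glm("cases ~ temp_lag3 * temp_lag5", data=df,
                    family=sm.families.NegativeBinomial())
    result = model.fit()
    print(result.summary())

    # With a log link, a 1 degree C rise in temp_lag3 multiplies expected incidence
    # by exp(coefficient), which is how the percentage increases quoted above are read.
    ```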

  15. Cronbach’s Alpha for Imputation Methods.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Apr 30, 2025
    Ayman Omar Baniamer (2025). Cronbach’s Alpha for Imputation Methods. [Dataset]. http://doi.org/10.1371/journal.pone.0321344.t003
    Available download formats: xls
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ayman Omar Baniamer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical models are essential tools in data analysis. However, missing data plays a pivotal role in impacting the assumptions and effectiveness of statistical models, especially when there is a significant amount of missing data. This study addresses one of the core assumptions supporting many statistical models, the assumption of unidimensionality. It examines the impact of missing data rates and imputation methods on fulfilling this assumption. The study employs three imputation methods: Corrected Item Mean, multiple imputation, and expectation maximization, assessing their performance across nineteen levels of missing data rates, and examining their impact on the assumption of unidimensionality using several indicators (Cronbach’s alpha, corrected correlation coefficients, and factor analysis (eigenvalues, cumulative variance, and communalities)). The study concluded that all imputation methods used effectively provided data that maintained the unidimensionality assumption, regardless of missing data rates. Additionally, it was found that most of the unidimensionality indicators increased in value as missing data rates rose.

  16. The variables having missing value are preprocessed.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang (2023). The variables having missing value are preprocessed. [Dataset]. http://doi.org/10.1371/journal.pone.0276835.t004
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The variables having missing value are preprocessed.

  17. Statistical details of the SRGC time series dataset.

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Adil Aslam Mir; Kimberlee Jane Kearfott; Fatih Vehbi Çelebi; Muhammad Rafique (2023). Statistical details of the SRGC time series dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0262131.t001
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Adil Aslam Mir; Kimberlee Jane Kearfott; Fatih Vehbi Çelebi; Muhammad Rafique
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical details of the SRGC time series dataset.

  18. Cumulative total variance for components of imputation methods.

    • plos.figshare.com
    xls
    Updated Apr 30, 2025
    Ayman Omar Baniamer (2025). Cumulative total variance for components of imputation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0321344.t005
    Available download formats: xls
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ayman Omar Baniamer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cumulative total variance for components of imputation methods.

  19. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Available download formats: xlsx
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes the following columns:
    A. a sequential student ID
    B. an ID that defines a random group label and the notation
    C. the used notation: user story or use cases
    D. the case they were assigned to: IFA, Sim, or Hos
    E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
    F. a categorical representation of the grade L/M/H, where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
    G. the total number of classes in the student's conceptual model
    H. the total number of relationships in the student's conceptual model
    I. the total number of classes in the expert's conceptual model
    J. the total number of relationships in the expert's conceptual model
    K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
    P. the researchers' judgement on how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present.

    Tagging scheme:
    • Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
    • Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
    • System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
    • Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
    • Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

    All the calculations and information provided in the following sheets originate from that raw data.

    Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

    Sheet 3 (Size-Ratio): The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes. However, we also provide the size ratio for the number of relationships between student and expert model.

    Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
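    The correctness and completeness ratios defined above can be computed directly from the per-subject AL/WR/SO/OM counts; a small sketch (with made-up counts) follows.

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "AL": [12, 9, 15],
        "WR": [3, 4, 2],
        "SO": [1, 0, 2],
        "OM": [4, 6, 1],
    })

    # Correctness: aligned classes over aligned + omitted + system-oriented + wrong.
    df["correctness"] = df["AL"] / (df["AL"] + df["OM"] + df["SO"] + df["WR"])
    # Completeness: correctly or incorrectly represented classes over expert-model classes.
    df["completeness"] = (df["AL"] + df["WR"]) / (df["AL"] + df["WR"] + df["OM"])
    print(df[["correctness", "completeness"]])
    ```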

    For sheet 4 as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by the exam grades, converted to categorical values High, Low, and Medium.

  20. R scripts used for Monte Carlo simulations and data analyses.

    • plos.figshare.com
    zip
    Updated Jan 19, 2024
    Lateef Babatunde Amusa; Twinomurinzi Hossana (2024). R scripts used for Monte Carlo simulations and data analyses. [Dataset]. http://doi.org/10.1371/journal.pone.0297037.s001
    Available download formats: zip
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lateef Babatunde Amusa; Twinomurinzi Hossana
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R scripts used for Monte Carlo simulations and data analyses.
