Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An adequate imputation of missing data can substantially preserve statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a useful tool for inferring missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, the performance of the Cox proportional hazards model with various types of imputed data requires deliberate study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and missing mechanism. The results reveal the type-I error rates associated with the different imputation techniques in survival data. The simulation results show that the non-parametric "missForest", based on unsupervised imputation, is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Improperly conducted statistical analysis with missing data may lead to erroneous conclusions. This research provides a clear guideline for valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.
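For a concrete sense of the workflow evaluated here, the following is a minimal R sketch (simulated data, not the authors' code) that imputes a covariate with missForest and then tests it in a Cox model; in a full simulation this would be repeated over many replicates to estimate the type-I error rate.

```r
# Illustrative sketch (not the authors' code): impute a covariate with
# missForest, then test it in a Cox model. Assumes the missForest and
# survival packages are installed; data are simulated here.
library(missForest)
library(survival)

set.seed(1)
n  <- 300
x1 <- rnorm(n)                      # covariate under the null (no effect)
x2 <- rnorm(n)                      # fully observed auxiliary covariate
time   <- rexp(n, rate = 0.1)       # survival times independent of x1
status <- rbinom(n, 1, 0.7)         # crude censoring indicator

dat <- data.frame(x1, x2)
dat$x1[sample(n, n * 0.3)] <- NA    # 30% missingness in x1 (MCAR here)

imp <- missForest(dat)              # unsupervised, non-parametric imputation
dat_imp <- imp$ximp

fit <- coxph(Surv(time, status) ~ x1 + x2, data = dat_imp)
summary(fit)$coefficients["x1", ]   # the p-value feeds a type-I error check
```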
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
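To make the imputation strategies mentioned above concrete, here is a minimal base R sketch (toy simulated data, not taken from the guide) contrasting mean, regression, and stochastic regression imputation:

```r
# Minimal sketch of three single-imputation strategies: mean, regression,
# and stochastic regression imputation, using simulated data in base R.
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
y_obs <- y
y_obs[sample(n, 50)] <- NA          # make 25% of y missing

# 1) Mean imputation
y_mean <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

# 2) Regression imputation: predict missing y from x
fit <- lm(y_obs ~ x)
y_reg <- y_obs
y_reg[is.na(y_obs)] <- predict(fit, newdata = data.frame(x = x[is.na(y_obs)]))

# 3) Stochastic regression imputation: add residual noise to the predictions
sigma_hat <- summary(fit)$sigma
y_sto <- y_reg
y_sto[is.na(y_obs)] <- y_reg[is.na(y_obs)] + rnorm(sum(is.na(y_obs)), 0, sigma_hat)

# Mean imputation shrinks the variance; stochastic imputation preserves it better
c(var(y), var(y_mean), var(y_reg), var(y_sto))
```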
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OBJECTIVES: Where trial recruitment is staggered over time, patients may have different lengths of follow-up, meaning that the dataset is an unbalanced panel with a considerable amount of missing data. This study presents a method for estimating the difference in total costs and total Quality-Adjusted Life Years (QALYs) over a given time horizon using a repeated measures mixed model (RMM). To the authors' knowledge, this is the first time this method has been exploited in the context of economic evaluation within clinical trials. METHODS: An example (EVLA trial, NIHR HTA project 11/129/197) is used where patients have between 1 and 5 years of follow-up. Early treatment is compared with delayed treatment. Coefficients at each time point from the repeated measures mixed model were aggregated to estimate total mean cost and total mean QALYs over 3 years. Results were compared with other methods for handling missing data: complete-case analysis (CCA), multiple imputation using linear regression (MILR) and using predictive mean matching (MIPMM), and a Bayesian parametric approach (BPA). RESULTS: Mean differences in costs varied among the approaches; CCA, MIPMM and MILR recorded greater mean costs in delayed treatment: £216 (95% CI -£1413 to £1845), £36 (95% CI -£581 to £652), and £30 (95% CI -£617 to £679), respectively. In contrast, RMM and BPA showed greater costs in early intervention: -£67 (95% CI -£1069 to £855) and -£162 (95% CI -£728 to £402), respectively. Early intervention was associated with greater QALYs for all methods at year 3; RMM showed the highest QALYs, 0.073 (95% CI -0.06 to 0.2). CONCLUSION: MIPMM showed the most efficient results in our cost-effectiveness analysis. By contrast, when the percentage of missing data is high, RMM shows results similar to MIPMM. Hence, we conclude that RMM is a flexible and robust alternative for modelling continuous outcome data that can be considered missing at random.
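As a rough sketch of the repeated measures mixed model idea, the following R code (lme4, with hypothetical variable names id, arm, year, and cost; far simpler than the trial's actual specification) fits a per-patient random-intercept model and aggregates the fitted per-year arm differences over 3 years:

```r
# Hedged sketch of a repeated measures mixed model for trial costs.
# Variable names (id, arm, year, cost) are hypothetical; data are simulated.
library(lme4)

set.seed(3)
d <- expand.grid(id = 1:100, year = 1:3)
d$arm  <- ifelse(d$id <= 50, "early", "delayed")
d$cost <- 1000 + 50 * (d$arm == "early") + 20 * d$year +
          rnorm(nrow(d), sd = 200) +
          rep(rnorm(100, sd = 100), times = 3)   # per-patient random effect

# Random intercept per patient; treatment effect allowed to vary by year
fit <- lmer(cost ~ arm * factor(year) + (1 | id), data = d)

# Aggregate the per-year fitted arm differences into a total over 3 years
newdat <- expand.grid(arm = c("early", "delayed"), year = 1:3)
newdat$pred <- predict(fit, newdata = newdat, re.form = NA)
with(newdat, sum(pred[arm == "early"]) - sum(pred[arm == "delayed"]))
```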
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR).
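For reference, MAPE and percent bias (PB) are commonly defined as below; the paper's exact formulas may differ slightly. A small R illustration:

```r
# Common definitions of MAPE and percent bias (PB) for imputation accuracy;
# the exact formulas used in the paper may differ slightly.
mape <- function(true, imputed) {
  100 * mean(abs((true - imputed) / true))
}
pb <- function(true, imputed) {
  100 * abs((mean(imputed) - mean(true)) / mean(true))
}

# Example: evaluate mean imputation of an artificially masked series
set.seed(4)
x <- rnorm(100, mean = 20, sd = 5)
miss <- sample(100, 20)                       # 20% missingness
x_imp <- x
x_imp[miss] <- mean(x[-miss])                 # mean imputation
c(MAPE = mape(x[miss], x_imp[miss]), PB = pb(x[miss], x_imp[miss]))
```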
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data in this paper pertain to those analyzed as part of a pilot monitoring study as described by Courtney et al. [1]. While the study reported on the results for three languages, the paper at hand references the results for just one study involving the Kazakh medium-of-instruction. The first variable in the data is 'Student_id', which provides a unique anonymized student ID number for each student. The next six variables in the data vary between schools: (1) 'experiment', for which each school is defined as either a pilot (experiment) or control school; (2) 'Region', defining the region of the country in which each school is situated; (3) 'School_name', the name assigned to each unique school; (4) 'Language', defining the medium-of-instruction used in each school; (5) 'School_type', where schools are defined as either (i) small, (ii) gymnasium, (iii) lyceum, or (iv) common (common used as baseline in modelling). Thereafter, the within-school variable 'Gender' is given, for which each child is defined as either male or female. In addition, columns 9-92 pertain to the item-response data (for all elements, 0 = incorrect, 1 = correct) for each child on each item in the five tests.
2015 Test: Items OG1 to OG5.4 (columns 9-19; link items to 2016 are OG2.2A, OG2.3A, OG23A)
2016 Test: Items OG2.2 to OG11.2 (20-35; link items to 2017 are )
2017 Test: Items OG11.1 to K3.6 (36-47)
2018 Test: Items K4.1 to K3.3 (48-68)
2019 Test: Items X19K1 to K6 (69-92)
The data are filtered: only students who completed all five literacy assessments (2015 to 2019) were included. Alternative approaches for dealing with missing data (attrition) are covered in the analysis described in this article. The original data in Cyrillic are used to showcase the R software's capacity to manage and recode such data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
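The included notebook is a Python/Jupyter workflow; purely as a hypothetical illustration of the same kind of topic-modeling analysis in R (tm and topicmodels on a toy corpus, not the actual scied_words_bigrams_V5.pkl tokens), one might write:

```r
# Hypothetical R analogue of the LDA workflow described above (the dataset's
# own notebook is Python/Jupyter); shown here on a tiny toy corpus.
library(tm)
library(topicmodels)

docs <- c("energy motion force physics laboratory",
          "students inquiry learning classroom discussion",
          "curriculum standards reform policy teachers")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

lda <- LDA(dtm, k = 2, control = list(seed = 42))
terms(lda, 5)            # top terms per topic
posterior(lda)$topics    # per-document topic proportions
```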
https://spdx.org/licenses/CC0-1.0.html
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning. Methods (how the data were collected): In this project, we collected gaze data from 5 macaques as they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system. Dataset:
"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:
subj: subject_ID = {"B":104, "C":102,"H":101,"J":103,"K":203} trialtime: start time of current trial in second trial: current trial number (each trial featured one of 80 possible visual-event sequences)(in order) seq current: sequence number (one of 80 sequences) seq_item: current item number in a seq (in order) active_item: pop-up item (active box) pre_active: prior pop-up item (actve box) {-1: "the first active object in the sequence/ no active object before the currently active object in the sequence"} next_active: next pop-up item (active box) {-1: "the last active object in the sequence/ no active object after the currently active object in the sequence"} firstappear: {0: "not first", 1: "first appear in the seq"} looks_blank: csv: total amount of time look at blank space for current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"} looks_offscreen: csv: total amount of time look offscreen for current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"} time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"} looks target: csv: time spent to look at the target object (ms);csv_timestamp: look at the target or not at current timestamp (1 or 0) look1,2,3: time spent look at each object (ms) location 123X, 123Y: location of each box (location of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence) item123id: pop-up item ID (remained static throughout a sequence) event time: total time spent for the whole event (pop-up and go back) (ms) eyeposX,Y: eye position at current timestamp
"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:
rt: time till target (ms) {-1: "never looked at the target"}; in data analysis, we included data with rt > 0.
already_there: {NA: "never looked at the target object"}; in data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and whose already_there is not NA.
looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
prob: the probability of the occurrence of the object
surprisal: unigram surprisal value
bisurprisal: transitional surprisal value
std_surprisal: standardized unigram surprisal value
std_bisurprisal: standardized transitional surprisal value
binned_surprisal_means: the means of unigram surprisal values binned into three groups of evenly spaced intervals according to surprisal values.
binned_bisurprisal_means: the means of transitional surprisal values binned into three groups of evenly spaced intervals according to surprisal values.
"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences
Empty Values in Datasets:
There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA in the datasets) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in the columns "prev_active" and "next_active" mean that the currently active object is the first or the last active object in the sequence, i.e., there is no active object before or after it. When we analyzed the variable "already_there", we eliminated rows whose "prev_active" value is NA. NAs in the column "already_there" mean that the subject never looked at the target object in the current event. When we analyzed the variable "already_there", we eliminated rows whose "already_there" value is NA. Missing values occur in the columns "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when the event is the first event in the sequence, because no event precedes it and its transitional probability cannot be computed. When we fitted models for transitional statistics, we eliminated rows whose "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" values are NA.
Codes:
In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical models are essential tools in data analysis. However, missing data plays a pivotal role in impacting the assumptions and effectiveness of statistical models, especially when the amount of missing data is large. This study addresses one of the core assumptions supporting many statistical models, the assumption of unidimensionality. It examines the impact of missing data rates and imputation methods on fulfilling this assumption. The study employs three imputation methods: Corrected Item Mean, multiple imputation, and expectation maximization, assessing their performance across nineteen levels of missing data rates and examining their impact on the assumption of unidimensionality using several indicators (Cronbach's alpha, corrected correlation coefficients, and factor analysis: eigenvalues, cumulative variance, and communalities). The study concluded that all imputation methods used effectively provided data that maintained the unidimensionality assumption, regardless of missing data rates. Additionally, it was found that most of the unidimensionality indicators increased in value as missing data rates rose.
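As an illustration of checking unidimensionality indicators after imputation (simulated data; mice with predictive mean matching is used here as a simple stand-in for the study's three imputation methods):

```r
# Illustrative sketch, not the study's procedure: impute item responses,
# then check unidimensionality via Cronbach's alpha and the eigenvalues
# of the item correlation matrix.
library(mice)

set.seed(6)
n <- 300; k <- 10
f <- rnorm(n)                                       # single latent factor
items <- sapply(1:k, function(j) 0.7 * f + rnorm(n, sd = 0.7))
items[sample(length(items), 0.15 * length(items))] <- NA   # 15% MCAR

imp  <- mice(as.data.frame(items), m = 5, method = "pmm", printFlag = FALSE)
comp <- complete(imp, 1)                            # first completed dataset

# Cronbach's alpha computed directly from its definition
item_var  <- sum(apply(comp, 2, var))
total_var <- var(rowSums(comp))
alpha <- (k / (k - 1)) * (1 - item_var / total_var)

# A dominant first eigenvalue supports the unidimensionality assumption
ev <- eigen(cor(comp))$values
c(alpha = alpha, first_eigenvalue = ev[1], second_eigenvalue = ev[2])
```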
The Uncleaned Laptop Price dataset is a collection of laptop product listings scraped from an online e-commerce website. The dataset includes information about various laptop models, such as their brand, screen size, processor, memory, storage capacity, operating system, and price. However, the dataset is uncleaned, meaning that it contains missing values, inconsistent formatting, and other errors that need to be addressed before the data can be used for analysis or modeling.
The dataset contains both categorical and numerical variables, with the majority of variables being categorical, including brand, model name, screen resolution, processor type, and operating system. Some numerical variables include screen size, memory, and storage capacity. The target variable in the dataset is the price, which is a continuous variable.
The dataset contains over 1,300 laptop listings, making it a reasonably sized dataset for analysis and modeling. The dataset may be useful for machine learning projects related to predicting the price of a laptop based on its specifications. However, before using the dataset, it would be necessary to clean and preprocess the data to address the inconsistencies and missing values.
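A hypothetical example of such cleaning steps in base R (the file name and the column names Ram, Weight, Company, and Price are assumptions, not the dataset's actual headers):

```r
# Hypothetical cleaning steps for this kind of scraped laptop data in base R;
# file name and column names are assumptions.
laptops <- read.csv("laptop_data.csv", stringsAsFactors = FALSE)

# Strip units and coerce to numeric, e.g. "8GB" -> 8, "1.37kg" -> 1.37
laptops$Ram    <- as.numeric(gsub("GB", "", laptops$Ram,    ignore.case = TRUE))
laptops$Weight <- as.numeric(gsub("kg", "", laptops$Weight, ignore.case = TRUE))

# Normalize inconsistent text categories and handle missing values
laptops$Company <- trimws(toupper(laptops$Company))
laptops$Price[laptops$Price <= 0] <- NA        # treat impossible prices as missing
laptops <- laptops[!is.na(laptops$Price), ]    # drop rows with no target value

summary(laptops)
```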
This dataset is designed to help researchers and data scientists predict crop yield based on key agricultural factors. It contains columns representing the amount of fertilizer used, temperature, and soil nutrients (nitrogen, phosphorus, and potassium), along with the corresponding crop yield. This dataset can be used for machine learning projects focusing on agricultural optimization, yield forecasting, and resource management. The data is structured and ready for analysis, making it ideal for training regression models or conducting exploratory data analysis.
File Information:
Number of Rows: 2,596
Number of Columns: 6
Data Types:
Fertilizer: float64
temp: float64
N (Nitrogen): float64
P (Phosphorus): float64
K (Potassium): float64
yeild (Yield): float64
Missing Values: None (0 missing values in all columns)
Summary Statistics:
Fertilizer: Mean: 66.49, Min: 49.75, Max: 80.22
Temperature: Mean: 33.85, Min: 23.77, Max: 40.27
Nitrogen (N): Mean: 69.52, Min: 58.84, Max: 80.22
Phosphorus (P): Mean: 20.71, Min: 17.72, Max: 25.16
Potassium (K): Mean: 17.81, Min: 14.70, Max: 22.06
Yield: Mean: 8.53, Min: 5.15, Max: 12.34
This dataset is complete, with no missing values, and provides diverse statistics for various features important for crop yield prediction.
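A quick regression sketch using the column names listed above (the CSV file name is an assumption):

```r
# Simple regression sketch using the columns listed above (Fertilizer, temp,
# N, P, K, yeild); the file name is an assumption.
crop <- read.csv("crop_yield.csv")

# Linear model predicting yield from fertilizer, temperature, and N-P-K levels
fit <- lm(yeild ~ Fertilizer + temp + N + P + K, data = crop)
summary(fit)

# Predict yield for a hypothetical new field
newfield <- data.frame(Fertilizer = 70, temp = 34, N = 70, P = 21, K = 18)
predict(fit, newdata = newfield)
```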
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Randomized experiments allow for consistent estimation of the average treatment effect based on the difference in mean outcomes without strong modeling assumptions. Appropriate use of pretreatment covariates can further improve the estimation efficiency. Missingness in covariates is nevertheless common in practice, and raises an important question: should we adjust for covariates subject to missingness, and if so, how? The unadjusted difference in means is always unbiased. The complete-covariate analysis adjusts for all completely observed covariates, and is asymptotically more efficient than the difference in means if at least one completely observed covariate is predictive of the outcome. Then what is the additional gain of adjusting for covariates subject to missingness? To reconcile the conflicting recommendations in the literature, we analyze and compare five strategies for handling missing covariates in randomized experiments under the design-based framework, and recommend the missingness-indicator method, as a known but not so popular strategy in the literature, due to its multiple advantages. First, it removes the dependence of the regression-adjusted estimators on the imputed values for the missing covariates. Second, it does not require modeling the missingness mechanism, and yields consistent estimators even when the missingness mechanism is related to the missing covariates and unobservable potential outcomes. Third, it ensures large-sample efficiency over the complete-covariate analysis and the analysis based on only the imputed covariates. Lastly, it is easy to implement via least squares. We also propose modifications to it based on asymptotic and finite sample considerations. Importantly, our theory views randomization as the basis for inference, and does not impose any modeling assumptions on the data-generating process or missingness mechanism. Supplementary materials for this article are available online.
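A minimal sketch of the missingness-indicator idea in R (simulated data; one common variant with centered covariates interacted with treatment is shown, and details such as robust standard errors are omitted):

```r
# Minimal sketch of the missingness-indicator method: replace missing covariate
# values with a fixed constant, add an indicator for missingness, and adjust
# by least squares. Simulated data; details are simplified.
set.seed(7)
n <- 500
z <- rbinom(n, 1, 0.5)                 # randomized treatment
x <- rnorm(n)                          # covariate, later partly missing
y <- 1 + 0.8 * x + 2 * z + rnorm(n)    # outcome; true treatment effect = 2
x[rbinom(n, 1, plogis(x)) == 1] <- NA  # missingness may depend on x itself

m  <- as.numeric(is.na(x))             # missingness indicator
x0 <- ifelse(is.na(x), 0, x)           # impute any fixed constant (here 0)

# Center the adjusted covariates and interact them with treatment; the
# coefficient on z estimates the average treatment effect.
x0c <- x0 - mean(x0)
mc  <- m - mean(m)
fit <- lm(y ~ z * (x0c + mc))
coef(fit)["z"]
```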
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset shows an individual’s statistical area 3 (SA3) of usual residence and the SA3 of their workplace address, for the employed census usually resident population count aged 15 years and over, by main means of travel to work from the 2018 and 2023 Censuses.
The main means of travel to work categories are:
Main means of travel to work is the usual method which an employed person aged 15 years and over used to travel the longest distance to their place of work.
Workplace address refers to where someone usually works in their main job, that is the job in which they worked the most hours. For people who work at home, this is the same address as their usual residence address. For people who do not work at home, this could be the address of the business they work for or another address, such as a building site.
Workplace address is coded to the most detailed geography possible from the available information. This dataset only includes travel to work information for individuals whose workplace address is available at SA3 level. The sum of the counts for each region in this dataset may not equal the total employed census usually resident population count aged 15 years and over for that region. Workplace address – 2023 Census: Information by concept has more information.
This dataset can be used in conjunction with the following spatial files by joining on the SA3 code values:
Download data table using the instructions in the Koordinates help guide.
Footnotes
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Subnational census usually resident population
The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.
Population counts
Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data).
Workplace address time series
Workplace address time series data should be interpreted with care at lower geographic levels, such as statistical area 2 (SA2). Methodological improvements in 2023 Census resulted in greater data accuracy, including a greater proportion of people being counted at lower geographic areas compared to the 2018 Census. Workplace address – 2023 Census: Information by concept has more information.
Working at home
In the census, working at home captures both remote work, and people whose business is at their home address (e.g. farmers or small business owners operating from their home). The census asks respondents whether they ‘mostly’ work at home or away from home. It does not capture whether someone does both, or how frequently they do one or the other.
Rows excluded from the dataset
Rows show SA3 of usual residence by SA3 of workplace address. Rows with a total population count of less than six have been removed to reduce the size of the dataset, given only a small proportion of SA3-SA3 combinations have commuter flows.
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Quality rating of a variable
The quality rating of a variable provides an overall evaluation of data quality for that variable, usually at the highest levels of classification. The quality ratings shown are for the 2023 Census unless stated. There is variability in the quality of data at smaller geographies. Data quality may also vary between censuses, for subpopulations, or when cross tabulated with other variables or at lower levels of the classification. Data quality ratings for 2023 Census variables has more information on quality ratings by variable.
Main means of travel to work quality rating
Main means of travel to work is rated as moderate quality.
Main means of travel to work – 2023 Census: Information by concept has more information, for example, definitions and data quality.
Workplace address quality rating
Workplace address is rated as moderate quality.
Workplace address – 2023 Census: Information by concept has more information, for example, definitions and data quality.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
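As an illustrative, unofficial sketch of random rounding to base 3: counts already divisible by 3 are unchanged, while other counts move to the nearer multiple of 3 with probability 2/3 and to the farther one with probability 1/3; the fixed variant additionally keeps the rounding decision constant for a given cell.

```r
# Illustrative (unofficial) implementation of random rounding to base 3.
# Stats NZ's fixed variant (FRR3) also fixes the rounding decision per cell
# so repeated outputs are consistent.
rr3 <- function(n) {
  r <- n %% 3
  if (r == 0) return(n)
  nearest <- if (r == 1) n - 1 else n + 1
  other   <- if (r == 1) n + 2 else n - 2
  if (runif(1) < 2 / 3) nearest else other
}

set.seed(8)
sapply(c(10, 11, 12, 47), rr3)   # e.g. 47 becomes 48 (prob 2/3) or 45 (prob 1/3)
```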
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for ‘Total stated’ where this applies.
Symbol
-999 Confidential
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Absolute difference between missing rates as compared with ZERO rate of missing due to imputation method.
https://spdx.org/licenses/CC0-1.0.html
In this work we studied the relationship between dengue incidence in Cali and the climatic variables that are known to have an impact on the mosquito and were available (precipitation, relative humidity, and minimum, mean, and maximum temperature). Since the natural processes of the mosquito imply that changes in climatic variables take some time to become visible in dengue incidence, a lagged correlation analysis was done in order to choose the predictor variables of count regression models. A principal component analysis was done to reduce dimensionality and study the correlation among the climatic variables. Finally, aiming to predict the monthly dengue incidence, three different regression models were constructed and compared using the Akaike information criterion. The best model was the negative binomial regression model, and the predictor variables were mean temperature with a 3-month lag and mean temperature with a 5-month lag, as well as their interaction. The other variables were not significant in the models. An interesting conclusion was that, according to the coefficients of the regression model, a 1°C increase in the monthly mean temperature is reflected as a 45% increase in dengue incidence after 3 months, rising to a 64% increase after 5 months. Methods: Monthly dengue incidence data were provided by the Public Health Department of Cali. The climatic data were collected from the Hydrology, Meteorology and Environmental Studies Institute (IDEAM) webpage. Missing data were imputed using a random forest regression model.
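A sketch of the final model described above, fitted to simulated data in R (variable names are hypothetical):

```r
# Sketch of the final model described above: negative binomial regression of
# monthly dengue cases on 3- and 5-month-lagged mean temperature and their
# interaction. Data are simulated; variable names are hypothetical.
library(MASS)

set.seed(9)
months <- 120
temp   <- 24 + 3 * sin(2 * pi * (1:months) / 12) + rnorm(months, sd = 0.5)

lagk <- function(x, k) c(rep(NA, k), head(x, -k))   # simple k-month lag
d <- na.omit(data.frame(temp_lag3 = lagk(temp, 3),
                        temp_lag5 = lagk(temp, 5)))
# ~45% more expected cases per +1 degree C at lag 3 (exp(0.37) is about 1.45)
d$cases <- rnbinom(nrow(d), size = 5, mu = exp(-6 + 0.37 * d$temp_lag3))

fit <- glm.nb(cases ~ temp_lag3 * temp_lag5, data = d)
summary(fit)

# Approximate percent change in incidence implied by the lag-3 coefficient
100 * (exp(coef(fit)["temp_lag3"]) - 1)
```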
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The variables with missing values are preprocessed.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical details of the SRGC time series dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cumulative total variance for components of imputation methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sheet 1 (Raw-Data): The raw data of the study are provided, presenting the tagging results for the measures described in the paper. For each subject, the sheet includes multiple columns:
A. a sequential student ID
B. an ID that defines a random group label and the notation
C. the notation used: User Story or Use Case
D. the case they were assigned to: IFA, Sim, or Hos
E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
G. the total number of classes in the student's conceptual model
H. the total number of relationships in the student's conceptual model
I. the total number of classes in the expert's conceptual model
J. the total number of relationships in the expert's conceptual model
K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
P. the researchers' judgement of how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present.
Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.
All the calculations and information provided in the following sheets originate from that raw data.
Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.
Sheet 3 (Size-Ratio):
The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes. However, we also provide the size ratio for the number of relationships between the student and expert models.
Sheet 4 (Overall):
Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
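The two ratios defined above can be written directly as small R functions over the per-subject counts:

```r
# The correctness and completeness ratios defined above, as simple R functions
# over the per-subject counts of aligned (AL), wrongly represented (WR),
# system-oriented (SO), and omitted (OM) classes.
correctness  <- function(AL, WR, SO, OM) AL / (AL + OM + SO + WR)
completeness <- function(AL, WR, OM)     (AL + WR) / (AL + WR + OM)

# Example: a student model with 8 aligned, 2 wrong, 1 system-oriented,
# and 3 omitted classes
correctness(AL = 8, WR = 2, SO = 1, OM = 3)   # 8 / 14 = 0.571
completeness(AL = 8, WR = 2, OM = 3)          # 10 / 13 = 0.769
```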
For sheet 4 as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:
Sheet 5 (By-Notation):
Model correctness and model completeness are compared by notation - UC, US.
Sheet 6 (By-Case):
Model correctness and model completeness are compared by case - SIM, HOS, IFA.
Sheet 7 (By-Process):
Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.
Sheet 8 (By-Grade):
Model correctness and model completeness are compared by the exam grades, converted to the categorical values High, Low, and Medium.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R scripts used for Monte Carlo simulations and data analyses.