40 datasets found
  1. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
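
    As a small companion to the strategies listed above, here is a minimal pandas/scikit-learn sketch contrasting listwise deletion, mean imputation, and regression imputation on a toy table; the column names and the MCAR mechanism are illustrative, not taken from the guide:

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(0)
      df = pd.DataFrame({"age": rng.normal(40, 10, 200)})
      df["income"] = 2000 + 50 * df["age"] + rng.normal(0, 300, 200)
      # MCAR: knock out 30% of income values completely at random
      df.loc[rng.random(200) < 0.3, "income"] = np.nan

      deleted = df.dropna()                                   # listwise deletion
      mean_imp = df.fillna({"income": df["income"].mean()})   # mean imputation

      # Regression imputation: predict income from age using the observed rows
      obs, miss = df.dropna(), df[df["income"].isna()]
      model = LinearRegression().fit(obs[["age"]], obs["income"])
      reg_imp = df.copy()
      reg_imp.loc[miss.index, "income"] = model.predict(miss[["age"]])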

  2. Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter (2023). Sensitivity analysis for missing data in cost-effectiveness analysis: Stata code [Dataset]. http://doi.org/10.6084/m9.figshare.6714206.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Stata do-files and data to support the tutorial "Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis" (Leurent, B. et al. PharmacoEconomics (2018) 36: 889). The do-files should be similar to the code provided in the article's supplementary material. The dataset is based on the 10 Top Tips trial, but modified to preserve confidentiality. Results will differ from those published.
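
    The supporting code is in Stata; purely to illustrate the general idea of a not-at-random sensitivity analysis, the Python sketch below shifts imputed costs by a range of departures (a delta-adjustment style check). The variable names, simple mean imputation, and delta values are assumptions, not the authors' procedure:

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(1)
      df = pd.DataFrame({"arm": rng.integers(0, 2, 300),
                         "cost": rng.gamma(2.0, 500.0, 300)})
      df.loc[rng.random(300) < 0.25, "cost"] = np.nan          # some costs missing

      # Fill missing costs with the arm-specific mean (stand-in for full MI)
      imputed = df["cost"].fillna(df.groupby("arm")["cost"].transform("mean"))

      for delta in [0, 250, 500, 1000]:                        # assumed MNAR departures
          cost = imputed.copy()
          cost[df["cost"].isna()] += delta                     # shift imputed values only
          inc_cost = cost[df["arm"] == 1].mean() - cost[df["arm"] == 0].mean()
          print(f"delta={delta}: incremental cost = {inc_cost:.1f}")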

  3. Water-quality data imputation with a high percentage of missing values: a...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 8, 2021
    + more versions
    Cite
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
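
    A minimal sketch of inverse distance weighting (IDW) imputation across stations, the best-performing method reported above; the station coordinates, values, and helper function are hypothetical, not the authors' implementation:

      import numpy as np
      import pandas as pd

      coords = {"S1": (0.0, 0.0), "S2": (3.0, 1.0), "S3": (1.0, 4.0)}   # assumed locations
      tw = pd.DataFrame({"S1": [18.2, np.nan, 21.0],
                         "S2": [17.9, 19.5, np.nan],
                         "S3": [np.nan, 19.1, 20.4]})                   # one variable, wide format

      def idw_impute(row, power=2.0):
          out = row.copy()
          for target in row[row.isna()].index:
              donors = row.dropna()
              if donors.empty:
                  continue
              d = np.array([np.hypot(coords[target][0] - coords[s][0],
                                     coords[target][1] - coords[s][1]) for s in donors.index])
              w = 1.0 / d ** power                       # closer stations get more weight
              out[target] = np.sum(w * donors.values) / w.sum()
          return out

      tw_imputed = tw.apply(idw_impute, axis=1)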

    In this dataset, we include the original and imputed values for the following variables:

    • Water temperature (Tw)

    • Dissolved oxygen (DO)

    • Electrical conductivity (EC)

    • pH

    • Turbidity (Turb)

    • Nitrite (NO2-)

    • Nitrate (NO3-)

    • Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper:
    Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  4. Data from: A Bayesian hybrid method for the analysis of generalized linear...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2025
    Cite
    Sezgin Ciftci; Zeynep Kalaylioglu (2025). A Bayesian hybrid method for the analysis of generalized linear models with missing-not-at-random covariates [Dataset]. http://doi.org/10.6084/m9.figshare.27244867.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Sezgin Ciftci; Zeynep Kalaylioglu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data handling is one of the main problems in modelling, particularly if the missingness is of type missing-not-at-random (MNAR), where missingness occurs due to the actual value of the observation. The focus of the current article is generalized linear modelling of fully observed binary response variables depending on at least one MNAR covariate. For the traditional analysis of such models, an individual model for the probability of missingness is assumed and incorporated in the model framework. However, this probability model is untestable, as the missingness of MNAR data depends on the actual values that would otherwise have been observed. In this article, we consider creating a model space that consists of all possible and plausible models for the probability of missingness and develop a hybrid method in which a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm is combined with Bayesian Model Averaging (BMA). RJMCMC is adopted to obtain posterior estimates of model parameters as well as the probability of each model in the model space. BMA is used to synthesize coefficient estimates from all models in the model space while accounting for model uncertainty. Through a validation study with a simulated data set and a real data application, the performance of the proposed methodology is found to be satisfactory in terms of accuracy and efficiency of the estimates.
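
    The RJMCMC sampler itself is beyond a short example, but the BMA step described above reduces to a posterior-probability-weighted combination of per-model estimates. A minimal sketch, with illustrative numbers rather than values from the paper:

      import numpy as np

      post_prob = np.array([0.55, 0.30, 0.15])   # P(model_m | data), summing to 1
      beta_hat = np.array([0.82, 0.74, 0.95])    # estimate of one coefficient per model
      var_hat = np.array([0.04, 0.05, 0.06])     # within-model posterior variance

      beta_bma = np.sum(post_prob * beta_hat)
      # Total variance = averaged within-model variance + between-model variance
      var_bma = np.sum(post_prob * (var_hat + (beta_hat - beta_bma) ** 2))
      print(beta_bma, np.sqrt(var_bma))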

  5. Data from: Performance of standard imputation methods for missing quality of...

    • tandf.figshare.com
    docx
    Updated Jun 3, 2023
    Cite
    Marion Procter; Chris Robertson (2023). Performance of standard imputation methods for missing quality of life data as covariate in survival analysis based on simulations from the International Breast Cancer Study Group Trials VI and VII* [Dataset]. http://doi.org/10.6084/m9.figshare.6960167.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Marion Procter; Chris Robertson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imputation methods for missing data on a time-dependent variable within time-dependent Cox models are investigated in a simulation study. Quality of life (QoL) assessments were removed from the complete simulated datasets, which have a positive relationship between QoL and disease-free survival (DFS) and between delayed chemotherapy and DFS, by missing at random (MAR) and missing not at random (MNAR) mechanisms. Standard imputation methods were applied before analysis. Method performance was influenced by the missing data mechanism, with one exception for simple imputation. The greatest bias occurred under MNAR and large effect sizes. It is important to carefully investigate the missing data mechanism.
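
    As one concrete example of a "standard" single-imputation method in this setting, a last-observation-carried-forward (LOCF) fill of a time-dependent QoL covariate in long format might look as follows; the column names are hypothetical and this is not the simulation code behind the paper:

      import numpy as np
      import pandas as pd

      long = pd.DataFrame({"patient": [1, 1, 1, 2, 2, 2],
                           "visit":   [0, 1, 2, 0, 1, 2],
                           "qol":     [70.0, np.nan, 65.0, 80.0, 78.0, np.nan]})
      long = long.sort_values(["patient", "visit"])
      # Carry the last observed QoL value forward within each patient
      long["qol_locf"] = long.groupby("patient")["qol"].ffill()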

  6. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    • +1more
    Updated May 3, 2021
    Cite
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J. (2021). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000907442
    Explore at:
    Dataset updated
    May 3, 2021
    Authors
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J.
    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
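
    A minimal sketch of the general masking-and-scoring idea used to compare imputation methods (not the authors' pipeline): hide a fraction of observed entries, impute, and compute the RMSE on the hidden cells. The KNN imputer stands in for any candidate method, and the matrix dimensions are made up:

      import numpy as np
      from sklearn.impute import KNNImputer

      rng = np.random.default_rng(0)
      X = rng.normal(20, 2, size=(500, 12))                 # "complete" log intensities (proteins x samples)
      mask = rng.random(X.shape) < 0.10                     # hide 10% of entries
      X_missing = X.copy()
      X_missing[mask] = np.nan

      X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
      rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
      print(f"RMSE on masked entries: {rmse:.3f}")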

  7. Replication Data for: Comparative investigation of time series missing data...

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 24, 2020
    Cite
    LEIZHEN ZANG; Feng XIONG (2020). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    LEIZHEN ZANG; Feng XIONG
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data. The authors used Internet and public service data as the test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public services, but also ensured that the machine-learning imputation method was better than random and multiple imputation, greatly improving the model’s explanatory power. The panel data after machine-learning imputation show better continuity in the time trend and can also be analyzed using a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms in the empirical analysis are discussed.

  8. ComBat HarmonizR enables the integrated analysis of independently generated...

    • ebi.ac.uk
    Updated May 23, 2022
    Cite
    Hannah Voß (2022). ComBat HarmonizR enables the integrated analysis of independently generated proteomic datasets through data harmonization with appropriate handling of missing values [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD027467
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Hannah Voß
    Variables measured
    Proteomics
    Description

    The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations in statistically underpowered sample cohorts but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) type values at the same time. Algorithms for batch effect removal, such as the ComBat algorithm commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT)-plexes, compared to commonly used internal reference scaling (iRS). Due to the matrix dissection approach, without the need for data imputation, the HarmonizR algorithm can be applied to any type of -omics data while assuring minimal data loss.
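
    For orientation only, the snippet below shows a plain per-batch location/scale adjustment on a toy matrix; it is not the ComBat or HarmonizR algorithm (ComBat adds empirical-Bayes shrinkage, and HarmonizR's matrix dissection is precisely what avoids imputing or dropping MNAR values), and all names are made up:

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(0)
      values = pd.DataFrame(rng.normal(0, 1, (100, 6)),
                            columns=[f"s{i}" for i in range(6)])          # proteins x samples
      batch = pd.Series(["A", "A", "A", "B", "B", "B"], index=values.columns)
      values.loc[:, batch == "B"] += 1.5                                   # simulated batch shift

      def center_scale_per_batch(df, batch):
          out = df.copy()
          for b in batch.unique():
              cols = batch[batch == b].index
              sub = df[cols]
              # Center and scale each protein within this batch
              out[cols] = (sub.sub(sub.mean(axis=1), axis=0)
                              .div(sub.std(axis=1).replace(0, 1), axis=0))
          return out

      harmonized = center_scale_per_batch(values, batch)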

  9. Data from: A new method for handling missing species in diversification...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 6, 2012
    Cite
    Natalie Cusimano; Tanja Stadler; Susanne S. Renner (2012). A new method for handling missing species in diversification analysis applicable to randomly or non-randomly sampled phylogenies [Dataset]. http://doi.org/10.5061/dryad.r8f04fk2
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 6, 2012
    Dataset provided by
    Ludwig-Maximilians-Universität München
    Authors
    Natalie Cusimano; Tanja Stadler; Susanne S. Renner
    License

    CC0 1.0, https://spdx.org/licenses/CC0-1.0.html

    Description

    Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that moreover is usually non-random. While the widely used γ statistic with the MCCR test or the birth-death likelihood analysis with the ∆AICrc test statistic are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective, automated method has been developed for fitting diversification models to non-randomly sampled phylogenies. Here we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or non-random. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a non-random sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate while classic approaches prefer a constant rate model, whereas in the former clade, results do not differ among methods (as indeed expected since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.

  10. Numpy , pandas and matplot lib practice

    • kaggle.com
    zip
    Updated Jul 16, 2023
    Cite
    pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
    Explore at:
    Available download formats: zip (385020 bytes)
    Dataset updated
    Jul 16, 2023
    Authors
    pratham saraf
    License

    CDLA Permissive 1.0, https://cdla.io/permissive-1-0/

    Description

    The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

    Specifics of the Dataset:

    The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

    One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

    • Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data. The proportion of missing values in each column varies randomly between 1% and 70%.

    • Statistical noise has been introduced: for numerical values in some features, the noise follows a distribution with mean 0 and standard deviation 0.1.

    • Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.

    • Outliers have also been embedded in the dataset, identifiable with the Interquartile Range (IQR) rule.

    Context of the Dataset:

    The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

    Sources of the Dataset:

    The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
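
    A minimal sketch of how such a practice table can be generated with NumPy and pandas; the five columns, distributions, and rates below are illustrative, not the actual generation script:

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(42)
      n = 5000
      df = pd.DataFrame({
          "num_cont": rng.normal(50, 10, n),             # continuous numeric
          "num_disc": rng.poisson(3, n),                 # discrete numeric
          "category": rng.choice(list("ABCD"), n),       # categorical
          "flag": rng.integers(0, 2, n),                 # binary
          "ordinal": rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], n),  # ordinal
      })

      # Inject missing values at a random rate between 1% and 70% per column
      for col in ["num_cont", "category"]:
          rate = rng.uniform(0.01, 0.70)
          df.loc[rng.random(n) < rate, col] = np.nan

      # Add Gaussian noise (mean 0, sd 0.1) to one numeric feature
      df["num_cont"] += rng.normal(0, 0.1, n)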

  11. Data from: Application of sensitivity analysis to incomplete longitudinal...

    • tandf.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Abdul-Karim Iddrisu; Freedom Gumedze (2023). Application of sensitivity analysis to incomplete longitudinal CD4 count data [Dataset]. http://doi.org/10.6084/m9.figshare.6982298.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Abdul-Karim Iddrisu; Freedom Gumedze
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, we investigate the effect of tuberculosis pericarditis (TBP) treatment on CD4 count changes over time and draw inferences in the presence of missing data. We accounted for missing data and conducted sensitivity analyses to assess whether inferences under missing at random (MAR) assumption are sensitive to not missing at random (NMAR) assumptions using the selection model (SeM) framework. We conducted sensitivity analysis using the local influence approach and stress-testing analysis. Our analyses showed that the inferences from the MAR are robust to the NMAR assumption and influential subjects do not overturn the study conclusions about treatment effects and the dropout mechanism. Therefore, the missing CD4 count measurements are likely to be MAR. The results also revealed that TBP treatment does not interact with HIV/AIDS treatment and that TBP treatment has no significant effect on CD4 count changes over time. Although the methods considered were applied to data in the IMPI trial setting, the methods can also be applied to clinical trials with similar settings.

  12. Data from: Validity of using multiple imputation for "unknown" stage at...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 27, 2017
    Cite
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L. (2017). Validity of using multiple imputation for "unknown" stage at diagnosis in population-based cancer registry data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001781541
    Explore at:
    Dataset updated
    Jun 27, 2017
    Authors
    Luo, Qingwei; Egger, Sam; Yu, Xue Qin; Smith, David P.; O’Connell, Dianne L.
    Description

    Background: The multiple imputation approach to missing data has been validated by a number of simulation studies by artificially inducing missingness on fully observed stage data under a pre-specified missing data mechanism. However, the validity of multiple imputation has not yet been assessed using real data. The objective of this study was to assess the validity of using multiple imputation for “unknown” prostate cancer stage recorded in the New South Wales Cancer Registry (NSWCR) in real-world conditions.

    Methods: Data from the population-based cohort study NSW Prostate Cancer Care and Outcomes Study (PCOS) were linked to 2000–2002 NSWCR data. For cases with “unknown” NSWCR stage, PCOS-stage was extracted from clinical notes. Logistic regression was used to evaluate the missing at random assumption adjusted for variables from two imputation models: a basic model including NSWCR variables only and an enhanced model including the same NSWCR variables together with PCOS primary treatment. Cox regression was used to evaluate the performance of MI.

    Results: Of the 1864 prostate cancer cases, 32.7% were recorded as having “unknown” NSWCR stage. The missing at random assumption was satisfied when the logistic regression included the variables in the enhanced model, but not those in the basic model only. The Cox models using data with imputed stage from either imputation model provided generally similar estimated hazard ratios but with wider confidence intervals compared with those derived from analysis of the data with PCOS-stage. However, the complete-case analysis of the data provided a considerably higher estimated hazard ratio for the low socio-economic status group and rural areas in comparison with those obtained from all other datasets.

    Conclusions: Using MI to deal with “unknown” stage data recorded in a population-based cancer registry appears to provide valid estimates. We would recommend a cautious approach to the use of this method elsewhere.
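
    A minimal sketch of the kind of missingness check described in the Methods: fit a logistic regression of a "stage missing" indicator on observed covariates to see what predicts missingness. The column names and simulated data are hypothetical, not NSWCR variables:

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      df = pd.DataFrame({"age": rng.normal(68, 8, 1000),
                         "rural": rng.integers(0, 2, 1000)})
      # Missingness is made more likely for rural cases, i.e. not MCAR
      p_miss = 1 / (1 + np.exp(-(-1.5 + 0.8 * df["rural"])))
      df["stage_missing"] = (rng.random(1000) < p_miss).astype(int)

      X = sm.add_constant(df[["age", "rural"]])
      fit = sm.Logit(df["stage_missing"], X).fit(disp=0)
      print(fit.summary())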

  13. Data and code from: Coordinated distributed experiments in ecology do not...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 28, 2025
    Cite
    Julia Bebout; Jeremy Fox (2025). Data and code from: Coordinated distributed experiments in ecology do not consistently reduce heterogeneity in effect size [Dataset]. http://doi.org/10.5061/dryad.cz8w9gj8w
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Julia Bebout; Jeremy Fox
    Time period covered
    Jan 1, 2023
    Description

    Ecological meta-analyses usually exhibit high relative heterogeneity of effect size: most among-study variation in effect size represents true variation in mean effect size, rather than sampling error. This heterogeneity arises from both methodological and ecological sources. Methodological heterogeneity is a nuisance that complicates the interpretation of data syntheses. One way to reduce methodological heterogeneity is via coordinated distributed experiments, in which investigators conduct the same experiment at different sites, using the same methods. We tested whether coordinated distributed experiments in ecology exhibit a) low heterogeneity in effect size, and b) lower heterogeneity than meta-analyses, using data on 17 effects from eight coordinated distributed experiments, and 406 meta-analyses. Consistent with our expectations, among-site heterogeneity typically comprised <50% of the variance in effect size in distributed experiments. In contrast, heterogeneity within and amo...

    # Coordinated distributed experiments in ecology do not consistently reduce heterogeneity in effect size

    Included here is a data file for a distributed experiment, and code which analyses the heterogeneity of many coordinated distributed experiments and meta-analyses. The R code file, meta-analyses vs distd expts - R code for sharing v 2.R, reproduces the results of this study.

    ## Description of the data and file structure

    Data File:

    rousk et al 2013 table 3 data - INCREASE.csv: data from the INCREASE distributed experiment by Rousk et al. (2013)

    All other data used in code is automatically sourced from URLs, but relevant variables are still described below.

    Other variables in datasets were not used in our analysis, and so are not explained in this README file. Cells with missing data have "NA" values.

    Variables used in code:

    Costello & Fox variables:

    meta.analysis.id: Unique ID number for each meta-analysis

    eff.size: Effect size

    var. eff.size: Variance in e...

  14. Data_Sheet_2_The Optimal Machine Learning-Based Missing Data Imputation for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_2_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An adequate imputation of missing data would significantly preserve the statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a great tool to infer the missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics to evaluate imputation accuracy. However, the Cox proportional hazards model combined with various imputation types requires deliberate study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors according to different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest”, based on unsupervised imputation, is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, other methods do not yield valid tests when the missing pattern is informative. Statistical analysis with missing data that is improperly conducted may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazard model with machine learning-based imputations.
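
    For readers working in Python rather than R, a missForest-style imputation can be approximated with scikit-learn's IterativeImputer wrapped around a random forest; this mirrors, but is not identical to, the missForest procedure referenced above, and the data here are simulated:

      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer
      from sklearn.ensemble import RandomForestRegressor

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 5))
      X[rng.random(X.shape) < 0.2] = np.nan                  # 20% missing, MCAR in this toy case

      imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100,
                                                                 random_state=0),
                                 max_iter=10, random_state=0)
      X_imputed = imputer.fit_transform(X)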

  15. Field soil survey and analysis data in the upper reaches of Heihe River...

    • poles.tpdc.ac.cn
    • tpdc.ac.cn
    • +1more
    zip
    Updated Dec 12, 2014
    Cite
    Chansheng HE (2014). Field soil survey and analysis data in the upper reaches of Heihe River Basin (2013-2014) [Dataset]. http://doi.org/10.3972/westdc.x.2013.db
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 12, 2014
    Dataset provided by
    TPDC
    Authors
    Chansheng HE
    Area covered
    Heihe,
    Description

    The dataset contains the field soil measurement and analysis data from the upstream of the Heihe River Basin from 2013 to 2014, including soil particle analysis, water characteristic curves, saturated water conductivity, soil porosity, infiltration analysis, and soil bulk density.

    I. Soil particle analysis
    1. The soil particle size data were measured in the particle-size laboratory of the Key Laboratory of the Ministry of Education at Lanzhou University. The measuring instrument is a Malvern laser particle size analyzer (MS2000).
    2. Particle size data were measured with the laser particle size analyzer. As a result, sample points with large particles could not be measured; for example, D23 and D25 have no data. In addition, some samples are missing.

    II. Soil moisture characteristic curve
    1. Centrifuge method: the undisturbed ring-knife soil collected in the field was put into a centrifuge, and the rotor weight was measured at rotation speeds of 0, 310, 980, 1700, 2190, 2770, 3100, 5370, 6930, 8200 and 11600.
    2. The ring knives are numbered sequentially from 1. Since three groups were sampled at different places at the same time, to avoid repeated numbering the first group is numbered from 1, the second group from 500, and the third group from 1000, consistent with the numbers of the sampling points. The corresponding numbers can be found in the two Excel files.
    3. The soil bulk density data in 2013 are supplementary to the 2012 sampling, so data are not available at every point. The soil layer at some sample points is less than 70 cm thick, so data for all 5 layers could not be taken, and a large part of the data is missing due to transportation and recording problems. At some randomly chosen points only one layer was sampled.
    4. Weight after drying: the drying weight of some samples was not measured due to problems with the oven during the experiment.

    III. Saturated water conductivity of soil
    1. Measurement method: the measurement is based on the self-made constant-head instrument of Yiyanli (2009). A Mariotte bottle was used to keep a constant water head during the experiment. The measured Ks was converted to the Ks value at 10 °C for analysis and calculation; refer to the saturated conductivity measurement description for the detailed record table. K10℃ is the saturated water conductivity after conversion to 10 °C. Unit: cm/min.
    2. Data loss explanation: some saturated water conductivity data are missing due to a lack of soil samples and soil layers too shallow to obtain the 4th or 5th layer.
    3. Sampling time: July 2014.

    IV. Soil porosity
    1. Derived by the bulk density method, according to the relationship between soil bulk density and soil porosity.
    2. The data in 2014 are supplementary to the 2012 sampling, so data are not available at every point. The soil layer at some sample points is less than 70 cm thick, so data for all 5 layers could not be taken, and a large part of the data is missing due to transportation and recording problems. At some randomly chosen points only one layer was sampled.

    V. Soil infiltration analysis
    1. The infiltration data were measured with the MINI DISK portable tension infiltrometer; the approximate saturated water conductivity under a certain negative pressure is obtained. The instrument is detailed at http://www.decagon.com/products/hydrology/hydraulic-conductivity/mini-disk-portable-tension-infiltrometer/
    2. The D7 infiltration test was not measured at that time because of rain.

    VI. Soil bulk density
    1. The soil bulk density in 2014 refers to undisturbed soil taken by ring knife, on the basis of the 2012 sampling.
    2. The soil bulk density is dry bulk density, measured by the drying method: the undisturbed ring-knife soil samples collected in the field were kept in an oven at 105 °C for 24 hours, and the dry weight of the soil was divided by the soil volume (100 cubic centimeters).
    3. Unit: g/cm3
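
    As a small illustration of the bulk-density route to porosity mentioned under item IV, assuming the conventional mineral particle density of 2.65 g/cm3 (an assumption, not a value stated in the dataset):

      # Total porosity = 1 - (dry bulk density / particle density)
      bulk_density = [1.21, 1.35, 1.48]      # g/cm3, dry bulk density per sample (illustrative)
      PARTICLE_DENSITY = 2.65                # g/cm3, assumed mineral soil value

      porosity = [1 - bd / PARTICLE_DENSITY for bd in bulk_density]
      print([round(p, 3) for p in porosity])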

  16. Data from: Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 -...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://catalog.data.gov/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

    Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

    Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US.

    Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar as a means for evaluating if characteristics of an animal's home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method using the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing if the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

    Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only include direct GPS download datasets. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data.

    Part 7, Openness Python Script version 2.0: This python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
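
    A minimal sketch of deriving fix success rate (FSR) by hour from a table of individual fix attempts, as in Parts 2 and 6; the column names are assumptions, not the actual field names in the USGS files:

      import pandas as pd

      attempts = pd.DataFrame({"hour": [0, 0, 1, 1, 1, 2, 2],
                               "fix_acquired": [1, 0, 1, 1, 0, 1, 1]})
      # FSR per hour = proportion of attempts in that hour that acquired a fix
      fsr_by_hour = (attempts.groupby("hour")["fix_acquired"]
                              .mean()
                              .rename("fsr"))
      print(fsr_by_hour)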

  17. Airline-Delay-Prediction

    • kaggle.com
    zip
    Updated Apr 5, 2025
    + more versions
    Cite
    Ahmed Mostafa (2025). Airline-Delay-Prediction [Dataset]. https://www.kaggle.com/datasets/ahmed4mostafa/air-line
    Explore at:
    Available download formats: zip (22905 bytes)
    Dataset updated
    Apr 5, 2025
    Authors
    Ahmed Mostafa
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Airline Delay Prediction Dataset
    A Machine Learning-Ready Dataset for Flight Delay Analysis and Predictive Modeling

    📌 Dataset Overview

    This dataset provides historical flight data curated to analyze and predict airline delays using machine learning. It includes key features such as flight schedules, weather conditions, and delay causes, making it ideal for:

    🚀 ML model training (binary classification: delayed/not delayed).

    📈 Trend analysis (e.g., weather impact, airline performance).

    🎯 Academic research or industry applications.

    📂 Data Specifications

    Format: CSV (ready for pandas/scikit-learn).

    Size: [X] thousand records (covers [Year Range]).

    Variables:

    Flight details: Departure/arrival times, airline, aircraft type.

    Delay causes: Weather, technical issues, security, etc.

    Weather data: Temperature, visibility, wind speed.

    Target variable: Delay status (e.g., Delayed: Yes/No or Delay_minutes).

    🎯 Potential Use Cases

    1. Predictive Modeling:
       from sklearn.ensemble import RandomForestClassifier
       model = RandomForestClassifier().fit(X_train, y_train)

    2. Airline Performance Benchmarking.

    3. Weather-Delay Correlation Analysis.

    🔍 Why Use This Dataset?

    Clean & Preprocessed: Minimal missing values, outliers handled.

    Feature-Rich: Combines flight + weather data for robust analysis.

    Benchmark Ready: Compatible with Kaggle kernels for easy experimentation.
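
    Expanding the inline snippet under "Potential Use Cases" into a runnable end-to-end sketch; the toy rows and column names below are placeholders, not the dataset's actual schema:

      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score

      # In practice, load the downloaded CSV instead of this toy table
      df = pd.DataFrame({"dep_hour":   [6, 9, 17, 21, 7, 18, 12, 22],
                         "airline":    ["AA", "DL", "AA", "UA", "DL", "UA", "AA", "DL"],
                         "visibility": [10, 8, 3, 10, 9, 2, 10, 4],
                         "wind_speed": [5, 12, 25, 8, 6, 30, 4, 18],
                         "delayed":    [0, 0, 1, 0, 0, 1, 0, 1]})

      X = pd.get_dummies(df[["dep_hour", "airline", "visibility", "wind_speed"]])
      y = df["delayed"]
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                          random_state=0)
      model = RandomForestClassifier(n_estimators=200, random_state=0)
      model.fit(X_train, y_train)
      print(accuracy_score(y_test, model.predict(X_test)))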

  18. NBA Player Data (1996-2024)

    • kaggle.com
    Updated May 24, 2024
    Cite
    Damir Dizdarevic (2024). NBA Player Data (1996-2024) [Dataset]. https://www.kaggle.com/datasets/damirdizdarevic/nba-dataset-eda-and-ml-compatible
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 24, 2024
    Dataset provided by
    Kaggle
    Authors
    Damir Dizdarevic
    License

    Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    NBA data ranging from 1996 to 2024 contains physical attributes, bio information, (advanced) stats, and positions of players.

    No missing values; some data preprocessing will be needed depending on the task.

    Data was gathered from nba.com and Basketball Reference, starting with the 1996/97 season and up until the latest season, 2023/24.

    A lot of options for EDA & ML present - analyzing the change of physical attributes by position, how the number of 3-point shots changed throughout years, how the number of foreign players increased; using Machine Learning to predict player's points, rebounds and assists, predicting player's position, player clustering, etc.

    The issue with the data was that player height and weight were recorded in the Imperial system, so the scatterplot of heights and weights was not looking good (only around 20 distinct values for height and around 150 for weight, which is quite bad for a dataset of 13,000 players). I created a script in which I assign a random height to the player between 2 heights (let's say between 200.66 cm and 203.2 cm, which would be 6-7 and 6-8 in the Imperial system), but I did it in a way that 80% of values fall in the range of 5 to 35% increase, which still keeps the integrity of the data (the average height of the whole dataset increased by less than 1 cm). I did the same thing for the weight: since the difference between 2 pounds is around 0.44 kg, I would assign a random value for weight for each player that is either +/- 0.22 kg from his original weight. Here I observed a change in the average weight of the whole dataset of around 0.09 kg, which is insignificant.

    Unfortunately, the NBA doesn't provide the data in cm and kg, and although this is not the perfect approach regarding accuracy, it is still much better than assigning only 20 heights to a dataset of 13,000 players.
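
    A minimal sketch of this kind of unit conversion plus jittering (not the author's script; the uniform jitter and column names are assumptions):

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(0)
      df = pd.DataFrame({"height_in": [79, 80, 79, 81],     # whole inches (e.g. 6-7 = 79 in)
                         "weight_lb": [210, 225, 198, 240]})

      CM_PER_IN, KG_PER_LB = 2.54, 0.45359237
      # Height: place each player uniformly between his listed height and the next inch up
      df["height_cm"] = (df["height_in"] + rng.uniform(0, 1, len(df))) * CM_PER_IN
      # Weight: jitter by +/- half a pound (about +/- 0.22 kg) around the recorded value
      df["weight_kg"] = (df["weight_lb"] + rng.uniform(-0.5, 0.5, len(df))) * KG_PER_LB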

  19. Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on...

    • scidb.cn
    Updated Apr 30, 2025
    Cite
    Yang Luo (2025). Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on Trust: Evidence from Six Experimental Studies [Dataset]. http://doi.org/10.57760/sciencedb.psych.00565
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yang Luo
    Description

    This dataset originates from a series of experimental studies titled “Tough on People, Tolerant to AI? Differential Effects of Human vs. AI Unfairness on Trust”. The project investigates how individuals respond to unfair behavior (distributive, procedural, and interactional unfairness) enacted by artificial intelligence versus human agents, and how such behavior affects cognitive and affective trust.

    1. Experiment 1a: The Impact of AI vs. Human Distributive Unfairness on Trust
    Overview: This dataset comes from an experimental study examining how individuals respond, in terms of cognitive and affective trust, when distributive unfairness is enacted by either an artificial intelligence (AI) agent or a human decision-maker. Experiment 1a focuses on the main effect of the “type of decision-maker” on trust.
    Data Generation and Processing: The data were collected through Credamo, an online survey platform. Initially, 98 responses were gathered from students at a university in China, and additional student participants were recruited via Credamo to supplement the sample. Attention check items were embedded in the questionnaire, and participants who failed them were excluded in real time. Data collection continued until 202 valid responses were obtained. SPSS was used for data cleaning and analysis.
    Data Structure and Format: The data file is named “Experiment1a.sav” and is in SPSS format. It contains 28 columns and 202 rows, where each row corresponds to one participant. Columns represent measured variables, including: grouping and randomization variables, one manipulation check item, four items measuring distributive fairness perception, six items on cognitive trust, five items on affective trust, three items for honesty checks, and four demographic variables (gender, age, education, and grade level). The final three columns contain computed means for distributive fairness, cognitive trust, and affective trust.
    Additional Information: No missing data are present. All variable names are labeled with English abbreviations to facilitate further analysis. The dataset can be opened directly in SPSS or exported to other formats.

    2. Experiment 1b: The Mediating Role of Perceived Ability and Benevolence (Distributive Unfairness)
    Overview: This dataset originates from an experimental study designed to replicate the findings of Experiment 1a and to further examine the potential mediating roles of perceived ability and perceived benevolence.
    Data Generation and Processing: Participants were recruited via the Credamo online platform. Attention check items were embedded in the survey to ensure data quality. Data were collected on a rolling basis, with invalid responses removed in real time, until 228 valid responses were obtained.
    Data Structure and Format: The dataset is stored in the file Experiment1b.sav (SPSS format) and can be opened directly in SPSS. It consists of 228 rows and 40 columns. Each row represents one participant, and each column corresponds to a measured variable: random assignment and grouping variables; one manipulation check item; four items measuring perceived distributive fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The last five columns contain the computed mean scores for perceived distributive fairness, ability, benevolence, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be analyzed directly in SPSS or exported to other formats as needed.

    3. Experiment 2a: Differential Effects of AI vs. Human Procedural Unfairness on Trust
    Overview: This dataset originates from an experimental study examining whether individuals respond differently, in terms of cognitive and affective trust, when procedural unfairness is enacted by artificial intelligence versus human decision-makers. Experiment 2a focuses on the main effect of the decision agent on trust outcomes.
    Data Generation and Processing: Participants were recruited via the Credamo online survey platform from two universities located in different regions of China. A total of 227 responses were collected; after excluding those who failed the attention checks, 204 valid responses were retained. Data were processed and analyzed using SPSS.
    Data Structure and Format: The dataset is stored in the file Experiment2a.sav (SPSS format). It contains 204 rows and 30 columns. Each row represents one participant, and each column corresponds to a specific variable: random assignment and grouping; one manipulation check item; seven items measuring perceived procedural fairness; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for procedural fairness, cognitive trust, and affective trust.
    Additional Notes: The dataset contains no missing values. All variables are labeled using standardized English abbreviations. The file can be analyzed directly in SPSS or exported to other formats as needed.

    4. Experiment 2b: The Mediating Role of Perceived Ability and Benevolence (Procedural Unfairness)
    Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 2a and to further examine the potential mediating roles of perceived ability and perceived benevolence in shaping trust responses under procedural unfairness.
    Data Generation and Processing: Participants were working adults recruited through the Credamo online platform. A rolling data collection strategy was used, with responses failing attention checks excluded in real time. The final dataset includes 235 valid responses. All data were processed and analyzed using SPSS.
    Data Structure and Format: The dataset is stored in the file Experiment2b.sav (SPSS format). It contains 235 rows and 43 columns. Each row corresponds to a single participant, and each column represents a measured variable: random assignment and group labels; one manipulation check item; seven items measuring procedural fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final five columns contain the computed average scores for procedural fairness, perceived ability, perceived benevolence, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to support reuse and secondary analysis. The dataset can be analyzed directly in SPSS and converted into other formats if needed.

    5. Experiment 3a: Effects of AI vs. Human Interactional Unfairness on Trust
    Overview: This dataset comes from an experimental study investigating how interactional unfairness, when enacted by either artificial intelligence or human decision-makers, influences individuals’ cognitive and affective trust. Experiment 3a focuses on the main effect of the “decision-maker type” under interactional unfairness conditions.
    Data Generation and Processing: Participants were college students recruited from two universities in different regions of China through the Credamo survey platform. After excluding responses that failed attention checks, 203 valid cases were retained from an initial pool of 223 responses. All data were processed and analyzed using SPSS.
    Data Structure and Format: The dataset is stored in the file Experiment3a.sav (SPSS format). It contains 203 rows and 27 columns. Each row represents a single participant, and each column corresponds to a measured variable: random assignment and condition labels; one manipulation check item; four items measuring interactional fairness perception; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for interactional fairness, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variable names use standardized English abbreviations to facilitate secondary analysis. The data can be analyzed directly in SPSS and exported to other formats as needed.

    6. Experiment 3b: The Mediating Role of Perceived Ability and Benevolence (Interactional Unfairness)
    Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 3a and to further examine the potential mediating roles of perceived ability and perceived benevolence under conditions of interactional unfairness.
    Data Generation and Processing: Participants were working adults recruited via the Credamo platform. Attention check questions were embedded in the survey, and responses that failed these checks were excluded in real time. Data collection proceeded on a rolling basis until 227 valid responses were obtained. All data were processed and analyzed using SPSS.
    Data Structure and Format: The dataset is stored in the file Experiment3b.sav (SPSS format). It includes 227 rows and
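    Although the files above are distributed in SPSS format, they can also be inspected outside SPSS. The following is a minimal sketch, assuming the pyreadstat package is installed and using hypothetical placeholder names (ct1 to ct6) for the six cognitive-trust items; the actual variable labels are the English abbreviations described above.

    import pyreadstat

    # Load Experiment1a.sav (202 rows x 28 columns per the description above)
    df, meta = pyreadstat.read_sav("Experiment1a.sav")
    print(df.shape)
    print(meta.column_names)   # inspect the English-abbreviation variable labels

    # Recompute a scale mean from its items; "ct1".."ct6" are hypothetical
    # placeholders for the six cognitive-trust items, not the real column names.
    trust_items = ["ct1", "ct2", "ct3", "ct4", "ct5", "ct6"]
    if all(col in df.columns for col in trust_items):
        df["cognitive_trust_mean"] = df[trust_items].mean(axis=1)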

  20. Loan Dataset | Easy to Understand | yashaswi

    • kaggle.com
    zip
    Updated May 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayushman Yashaswi (2025). Loan Dataset | Easy to Understand | yashaswi [Dataset]. https://www.kaggle.com/datasets/ayushmanyashaswi/loan-dataset-easy-to-understand-yashaswi
    Explore at:
    zip(7973 bytes)Available download formats
    Dataset updated
    May 11, 2025
    Authors
    Ayushman Yashaswi
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


    📝 Dataset Description: Loan Approval Prediction

    This dataset is designed to help beginners understand and practice classification problems using machine learning. It includes real-world loan application data with features relevant to determining whether a loan should be approved or not.

    📂 File Information

    • Filename: loan.csv
    • Size: 38.01 KB
    • Total Records: 614
    • Columns: 13

    🔍 Features

    Column              Description
    Loan_ID             Unique identifier for each loan application
    Gender              Applicant's gender (Male/Female)
    Married             Applicant's marital status
    Dependents          Number of dependents (0, 1, 2, 3+)
    Education           Education level (Graduate/Not Graduate)
    Self_Employed       Self-employment status
    ApplicantIncome     Income of the applicant
    CoapplicantIncome   Income of the co-applicant
    LoanAmount          Loan amount in thousands
    Loan_Amount_Term    Term of the loan (in days)
    Credit_History      Credit history meets guidelines (1.0 = Yes, 0.0 = No)
    Property_Area       Urban/Semiurban/Rural
    Loan_Status         Loan approval status (Y = Approved, N = Not Approved)
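    As a minimal sketch of the data preparation this file supports (assuming loan.csv is in the working directory and pandas is installed; the imputation choices below are one common approach, not part of the dataset itself):

    import pandas as pd

    df = pd.read_csv("loan.csv")   # 614 rows x 13 columns per the file information above

    # Impute missing values: mode for categorical columns, median for numeric ones
    for col in ["Gender", "Married", "Dependents", "Self_Employed", "Credit_History"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    for col in ["LoanAmount", "Loan_Amount_Term"]:
        df[col] = df[col].fillna(df[col].median())

    # Encode the target and one-hot encode the remaining categorical features
    df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})
    X = pd.get_dummies(df.drop(columns=["Loan_ID", "Loan_Status"]), drop_first=True)
    y = df["Loan_Status"]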

    🧪 ML Model Performance

    The dataset was tested using various classification models. Below are the results:

    Model                           Training Accuracy    Testing Accuracy
    Logistic Regression             76.4%                76.7%
    Random Forest                   100%                 84.2%
    Decision Tree                   100%                 76.7%
    Support Vector Machine (SVM)    77.5%                82.5%

    📌 Observations:

    • Random Forest and Decision Tree show signs of overfitting: perfect training accuracy paired with noticeably lower testing accuracy.
    • SVM generalized better, with testing accuracy exceeding its training accuracy.
    • The dataset is therefore well suited for studying model evaluation and overfitting/underfitting; a minimal training sketch follows below.
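    The snippet below is a hedged sketch of the kind of train/test comparison summarized above, assuming X and y were prepared as in the earlier snippet and scikit-learn is installed; exact accuracies will differ with the random split and hyperparameters.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=42),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "SVM": SVC(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)   # scaling numeric features (not shown) often helps LR and SVM
        print(f"{name}: train={model.score(X_train, y_train):.3f}, "
              f"test={model.score(X_test, y_test):.3f}")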

    📊 Visualization & Analysis

    Model Comparison:

    • Bar graphs were used to compare training and testing accuracies of all models side by side.

    Confusion Matrix:

    • Individual confusion matrices were generated for each model to evaluate prediction performance, class-wise accuracy, false positives, and false negatives.

    These visualizations help in interpreting model strengths, weaknesses, and real-world applicability.
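    A hedged sketch of the two visualizations described above, reusing the fitted models dictionary from the previous snippet (matplotlib and scikit-learn are assumed to be installed):

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    # Side-by-side bars of training vs. testing accuracy
    names = list(models.keys())
    train_acc = [m.score(X_train, y_train) for m in models.values()]
    test_acc = [m.score(X_test, y_test) for m in models.values()]
    positions = range(len(names))
    plt.bar([p - 0.2 for p in positions], train_acc, width=0.4, label="Train")
    plt.bar([p + 0.2 for p in positions], test_acc, width=0.4, label="Test")
    plt.xticks(list(positions), names, rotation=20)
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()

    # One confusion matrix per model
    for name, model in models.items():
        ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
        plt.title(name)
        plt.show()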

    🎯 Use Cases

    • Classification modeling (predicting loan approval)
    • Data cleaning & preprocessing practice
    • Handling categorical and missing data
    • Exploratory Data Analysis (EDA)
    • Model evaluation techniques (accuracy, confusion matrix, visualization)

    Perfect For

    • Beginners in ML & Data Science
    • ML model comparison and overfitting/underfitting analysis
    • Kaggle Notebooks, Portfolio Projects, ML practice tasks