Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID
= group identifier (1-2000)
x
= numeric (Level 1)
y
= numeric (Level 1)
w
= binary (Level 2)
In all data sets, missing values are coded as "NA".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This example dataset is used to illustrate the usage of the R package survtd in the Supplementary Materials of the paper:Moreno-Betancur M, Carlin JB, Brilleman SL, Tanamas S, Peeters A, Wolfe R (2017). Survival analysis with time-dependent covariates subject to measurement error and missing data: Two-stage joint model using multiple imputation (submitted).The data was generated using the simjm function of the package, using the following code:dat
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code to impute binary outcome. (R 1 kb)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For each test, and each study, there are scores missing, although all test co-occur at least once.
Estimates of captives carried in the Atlantic slave trade by decade, 1650s to 1860s. Data: routes of voyages and recorded numbers of captives (10 variables and 33,345 cases of slave voyages). Data are organized into 40 routes linking African regions to overseas regions. Purpose: estimation of missing data and totals of captive flows. Method: techniques of Bayesian statistics to estimate missing data on routes and flows of captives. Also included is R-language code for simulating routes and populations
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annual hourly air quality and meteorological data by pollutant for the 2019 calendar year. For more information on air quality, including live air data, please visit www.qld.gov.au/environment/pollution/monitoring/air. \r \r Data resolution: One-hour average values \r Data row timestamp: Start of averaging period \r Missing data/not monitored: Blank cell \r Sampling height: Four metres above ground (unless otherwise indicated)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification levelfragment levelimproved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw data and R code for the paper "Evaluating the Effects of Randomness on Missing Data in Archaeological Networks"
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R scripts used for Monte Carlo simulations and data analyses.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Stata do-files and data to support tutorial "Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis" (Leurent, B. et al. PharmacoEconomics (2018) 36: 889).Do-files should be similar to the code provided in the article's supplementary material.Dataset based on 10 Top Tips trial, but modified to preserve confidentiality. Results will differ from those published.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Welcome to the Zenodo repository for Publication Benchmarking imputation methods for categorical biological data, a comprehensive collection of datasets and scripts utilized in our research endeavors. This repository serves as a vital resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.
Contents:
empirical_analysis:
simulation_analysis:
TDIP_package:
Purpose:
This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.
Citation:
When using the datasets or scripts from this repository, we kindly request citing Publication Benchmarking imputation methods for categorical biological data and acknowledging the use of this Zenodo repository.
Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
Spatial misalignment occurs when at least one of multiple outcome variables is missing at an observed location. For spatial data, prediction of these missing observations should be informed by within location association among outcomes and by proximate locations where measurements were recorded. This study details and illustrates a Bayesian regression framework for modelling spatially misaligned multivariate data. Particular attention is paid to developing valid probability models capable of estimating parameter posterior distributions and propagating uncertainty through to outcomes' predictive distributions at locations where some or all of the outcomes are not observed. Models and associated software are presented for both Gaussian and non-Gaussian outcomes. Model parameter and predictive inference within the proposed framework is illustrated using a synthetic and forest inventory data set. The proposed Markov chain Monte carlo samplers were written in c++ and leverage R's Foreign Lan...
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Data and coding scripts for Seddon et al. (2016) Nature (DOI 10.1038/nature16986). We derived monthly time-series of four key terrestrial ecosystem variables at 0.05 degree (~5km) resolution from observations by the MODIS sensor on Terra (AM) for the period February 2010-December 2013 inclusive, and developed a method to identify vegetation sensitivity to climate variability over this period (see Methods in main paper).
This ORA item contains all data and files required to run the analysis described in the paper. Data required to run the script are provided in six zip files evi.zip, temp.zip, aetpet.zip, cld.zip, stdev.zip, numpxl.zip, each containing 167 text files, one per month of available data, in addition to a supporting files folder. Details are as follows.
supporting_files.zip : This directory includes computer code and additional supporting files. Please see the 'read me.txt' file within this directory for more information.
evi.zip: ENHANCED VEGETATION INDEX (EVI). We used the MOD13C2 product (Huete et al 2002) which comprises monthly, global EVI at 0.05 degree resolution. In some cases where no clear-sky observations are available, the MOD13C2 version 5 product replaces no-data values with climatological monthly means, so we removed these values where appropriate.
EVI format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = dimensionless scale factor = 10000 (divide the value by 10000 to get EVI) filenames = yyyymmevi.txt
numpxl.zip - COUNTS OF THE NUMBER OF PIXELS USED IN EVI CALCULATION. The MOD13C2 product is the result of a spatially and temporally averaged mosaic of higher resolution (1km pixels). Data in this directory represent the number of 1km observations used to calculate the MODIS EVI product. See the online documentation for more details (Solano et al. 2010).
numpxl format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = counts filenames = yyyy_mm_numpxl_pt05deg.txt
stdev.zip - STANDARD DEVIATION OF EVI VALUES. Standard deviation of the monthly EVI observations. See discussion in numpxl.zip item (above) and the online documentation for more details (Solano et al. 2010).
stdev format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = dimensionless scale factor = 10000 (divide the value by 10000 to get EVI) filenames = yyyy_mm_stdev_pt05deg.txt
temp.zip: AIR TEMPERATURE. We used the MOD07_L2 Atmospheric Profile product (Seeman et al 2006) as a measure of air temperature. Five-minute swaths of Retrieved Temperature Profile were projected to geographic co-ordinates. Pixels from the highest available pressure level, corresponding to the temperature closest to the Earth's surface, were selected from each swath. Swaths were then mean-mosaicked into global daily grids, and the daily global grids were mean-composited to monthly grids of air temperature.
Air temperature format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = degrees C scale factor = 1 (divide the value by 1 to get Air temperature) filenames = yyyymmtemp.txt
aetpet.zip: WATER AVAILABILITY. We used the MOD16 Global Evapotranspiration product (Mu et al 2011) to calculate the monthly 0.05 degree ratio of Actual to Potential Evapotranspiration (AET/PET).
AET/PET format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = dimensionless scale factor = 10000 (divide the value by 10000 to get AET/PET) filenames = yyyymmaetpet.txt
cld.zip - CLOUDINESS. We used the MOD35_L2 Cloud Mask product (Ackerman et al 2010). This product provides daily records on the presence of cloudy vs cloudless skies, and we used this to make an index of the proportion of of cloudy to clear days in a given pixel. After conversion to geographic co-ordinates, five-minute swaths at 1-km resolution were reclassed as clear sky or cloudy, and these daily swaths were mean-mosaicked to global coverages, mean composited from daily to monthly, and mean-aggregated from 1km to 0.05 degree.
cld format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = percentage of days in the month which were cloudy scale factor = 100 (divide the value by 100 to get percentage cloudy days) filenames = yyyymmcld.txt
References
Ackerman, S. et al. (2010) Discriminating clear-sky from cloud with MODIS: Algorithm Theoretical Basis Document (MOD35), Version 6.1. (URL: ttp://modis- atmos.gsfc.nasa.gov/_docs/MOD35_A TBD_Collection6.pdf)
Huete, A. et al. (2002) Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sensing of Environment 83, 195–213.
Mu, Q., Zhao, M., Running, S.R. (2011) Improvements to a MODIS global terrestrial evapotranspiration algorithm. Remote Sensing of Environment 115, 1781-1800
Seeman, S. W., Borbas, E. E., Li, J., Menzel, W. P. & Gumley, L. E. (2006) MODIS Atmospheric Profile Retrieval Algorithm Theoretical Basis Document, Version 6 (URL: http://modis-atmos.gsfc.nasa.gov/_docs/MOD07_atbd_v7_April2011.pdf)
Solano, R. et al. (2010) MODIS Vegetation Index User’s Guide (MOD13 Series) Version 2.00, May 2010 (Collection 5) (URL: http://vip.arizona.edu/documents/MODIS/MODIS_VI_UsersGuide_01_2012.pdf) Seddon et al. (2016) Nature (DOI 10.1038/nature16986) ABSTRACT: Identification of properties that contribute to the persistence and resilience of ecosystems despite climate change constitutes a research priority of global significance. Here, we present a novel, empirical approach to assess the relative sensitivity of ecosystems to climate variability, one property of resilience that builds on theoretical modelling work recognising that systems closer to critical thresholds respond more sensitively to external perturbations. We develop a new metric, the Vegetation Sensitivity Index (VSI) which identifies areas sensitive to climate variability over the past 14 years. The metric uses time-series data of MODIS derived Enhanced Vegetation Index (EVI) and three climatic variables that drive vegetation productivity (air temperature, water availability and cloudiness). Underlying the analysis is an autoregressive modelling approach used to identify regions with memory effects and reduced response rates to external forcing. We find ecologically sensitive regions with amplified responses to climate variability in the arctic tundra, parts of the boreal forest belt, the tropical rainforest, alpine regions worldwide, steppe and prairie regions of central Asia and North and South America, the Caatinga deciduous forest in eastern South America, and eastern areas of Australia. Our study provides a quantitative methodology for assessing the relative response rate of ecosystems – be they natural or with a strong anthropogenic signature – to environmental variability, which is the first step to address why some regions appear to be more sensitive than others and what impact this has upon the resilience of ecosystem service provision and human wellbeing.
Druga vežba za predmet Tehnike obrade biomedicinskih signala na master akademskim studijama na Elektrotehničkom fakultetu Univerziteta u Beogradu. Repozitorijum sadrži i odgovarajuće .txt i .csv datoteke sa podacima koji se koriste za izradu zadataka.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code for publication: H. Rathmann, S. Lismann, M. Francken, A. Spatzier, Estimating inter-individual Mahalanobis distances from mixed incomplete high-dimensional data: Application to human skeletal remains from 3rd to 1st millennia BC Southwest Germany. Journal of Archaeological Science 156: 105802. https://doi.org/10.1016/j.jas.2023.105802
The repository contains:
“R code for FLEXDIST.txt”: R code for executing FLEXDIST, a tool to estimate inter-individual Mahalanobis-type distances, taking correlations among variables into account, applicable to multiple variable scales (nominal, ordinal, continuous, or any mixture thereof), accommodating missing values, and handling high-dimensional data. Please refer to the latest version of this repository for the most up-to-date R code.
“data.csv”: Pre-processed dataset comprising 85 dental morphological features collected from 64 archaeological human remains from Final Neolithic to Early Iron Age Southwest Germany used for analysis.
“complete dataset.xlsx”: Complete dataset comprising 199 dental morphological features collected from 144 archaeological human remains from Final Neolithic to Early Iron Age Southwest Germany.
Paleontological investigations into morphological diversity, or disparity, are often confronted with large amounts of missing data. We illustrate how missing discrete data effects disparity using a novel simulation for removing data based on parameters from published datasets that contain both extinct and extant taxa. We develop an algorithm that assesses the distribution of missing characters in extinct taxa, and simulates data loss by applying that distribution to extant taxa. We term this technique ‘linkage’. We compare differences in disparity metrics and ordination spaces produced by linkage and random character removal. When we incorporated linkage among characters, disparity metrics declined and ordination spaces shrank at a slower rate with increasing missing data, indicating that correlations among characters govern the sensitivity of disparity analysis. We also present and test a new disparity method that uses the linkage algorithm to correct for the bias caused by missing dat...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR).
This dataset was collected from K-12 teachers via online surveys (Qualtrics). The statistical analyses were conducted in R-programing.
In the present research, we tested whether a combination of getting perspective and exposure to relevant incremental theories can mitigate the consequences of bias on discipline decisions. We call this combination of approaches a “Bias-Consequence Alleviation” (BCA) intervention. The present research sought to determine how the following components can be integrated to reduce the process by which bias contributes to racial inequality in discipline decisions: (1) getting a misbehaving student’s perspective, “student-perspective”; (2) belief that others’ personalities can change, “student-growth”; and (3) belief that one’s own ability to sustain positive relationships can change, “relationship-growth.” Can a combination of these three components curb troublemaker-labeling and pattern-prediction responses to a Black student’s misbehavior (Exp...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID
= group identifier (1-2000)
x
= numeric (Level 1)
y
= numeric (Level 1)
w
= binary (Level 2)
In all data sets, missing values are coded as "NA".