License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables, together with the high rate of missing values (between 50% and 70%), raises significant challenges.
To deal with the missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms covered both univariate and multivariate imputation: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-Nearest Neighbors Regressor (KNNR).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
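The following R sketch illustrates the general idea of IDW imputation in this setting: a missing value at one station is estimated from the values observed at the other stations on the same date, weighted by inverse distance. It is a minimal illustration, not the code used in the paper; the observations, distances, and power parameter p are assumptions made for the example.

# Minimal IDW imputation sketch (illustrative only, not the authors' code).
# 'obs' holds the values measured at the other stations on the same date;
# 'dist' holds the distances (km) from those stations to the target station.
idw_impute <- function(obs, dist, p = 2) {
  ok <- !is.na(obs)                  # use only stations with an observation
  if (!any(ok)) return(NA_real_)     # nothing to interpolate from
  w <- 1 / dist[ok]^p                # inverse-distance weights
  sum(w * obs[ok]) / sum(w)          # weighted average
}

# Example: impute dissolved oxygen (DO) at a station with a missing value
idw_impute(obs = c(8.1, NA, 7.4, 8.9), dist = c(12.3, 5.0, 20.1, 7.8))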
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
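A minimal R sketch of loading the plain-text version of one of the example data sets; the file name "example1.dat" and the presence of a header row are assumptions made for illustration, and missing values are read as NA as stated above.

# Read the plain-text version of an example data set (file name assumed);
# missing values are coded as "NA" in the files.
dat <- read.table("example1.dat", header = TRUE, na.strings = "NA")
str(dat)              # ID, x, y, w
colSums(is.na(dat))   # count missing values per variable

# Alternatively, load the binary R version:
# load("example1.rda")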
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
We share a complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth product (MAIAC AOD) (https://lpdaac.usgs.gov/products/mcd19a2v006/), which has a similar spatiotemporal resolution and uses the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a single large AOD image covering the entire area of mainland China. Because of clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage per image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product.
We used the method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity, and wind speed), and a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of the interpolation.
Overall, our daily imputation models achieved an average training R^2 of 0.90 (range 0.75-0.97; average RMSE 0.075, range 0.026-0.32) and an average test R^2 of 0.90 (range 0.75-0.97; average RMSE 0.075, range 0.026-0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE indicate the reliability of the AOD imputation. In an evaluation against ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, further illustrating the reliability of the method.
This database contains four datasets:
- Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
- The table "CHN_AOD_INFO.xlsx" describes the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median, and maximum predicted AOD.
- The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx" describes the statistics of the performance metrics for the interpolation of the high-resolution meteorological dataset.
- The table "Evaluation_Using_AERONET_AOD.xlsx" shows the evaluation results against AERONET, including R^2, RMSE, and the monitoring information used in this study.
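As a minimal illustration of the R^2 and RMSE evaluation reported above (not the authors' evaluation script), the R sketch below computes both metrics between predicted AOD and ground-based AERONET AOD; the example vectors are assumptions.

# Illustrative computation of R^2 and RMSE between predicted and ground AOD
# (assumed example vectors; not the authors' evaluation code).
aod_pred   <- c(0.31, 0.45, 0.22, 0.58, 0.40)
aod_ground <- c(0.29, 0.50, 0.20, 0.61, 0.37)

r2   <- cor(aod_pred, aod_ground)^2               # coefficient of determination
rmse <- sqrt(mean((aod_pred - aod_ground)^2))     # root mean squared error
c(R2 = r2, RMSE = rmse)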
The original dataset shared on GitHub can be found here. These are hands-on practice datasets that were linked through the Coursera Guided Project Certificate Course for Handling Missing Values in R, part of the Coursera Project Network. The dataset links were shared by the original author and instructor of the course, Arimoro Olayinka Imisioluwa.
Things you could do with this dataset: As a beginner in R, these datasets helped me get the hang of making data clean and tidy and of handling missing values (numeric only) using R. They are good for anyone looking to build a beginner-to-intermediate understanding of these topics.
Here are my notebooks as kernels using these datasets, plus a few more preloaded datasets in R, as suggested by the instructor: TidY DatA Practice; MissinG DatA HandlinG - NumeriC.
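A minimal R sketch of the kind of numeric missing-value handling practiced with these datasets (my own illustration, not the course material); the example data frame and column names are assumptions.

# Count missing values per column, then mean-impute the numeric columns
# (assumed example data; illustration only).
df <- data.frame(age = c(23, NA, 31, 45), score = c(88, 92, NA, 79))
colSums(is.na(df))                           # how many NAs per column
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)       # replace NAs with the column mean
  x
})
df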
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDFs is widespread in statistics, and the applications are numerous, from abundance data in ecology to recommender systems. In many cases, an MDF simultaneously exhibits main effects, such as row, column, or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions existing methods do not allow one to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method that recovers the main effects and the interactions simultaneously. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly larger than the interactions and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.
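A plausible way to write the estimator described in this abstract, assuming that sparsity of the main effects is encouraged with an l1 penalty and low rank of the interactions with a nuclear-norm penalty (these penalty choices are an inference from the abstract, not a statement of the package's exact objective):

\hat{\alpha},\, \hat{\Theta} \;=\; \arg\min_{\alpha,\,\Theta} \; \mathcal{L}_{\mathrm{obs}}\!\left(Y;\; X\alpha + \Theta\right) \;+\; \lambda_1 \lVert \alpha \rVert_1 \;+\; \lambda_2 \lVert \Theta \rVert_*,

where \mathcal{L}_{\mathrm{obs}} is a (quasi-)log-likelihood loss evaluated over the observed entries only, with the loss for each column chosen according to its type (e.g., Gaussian for numerical, Bernoulli for categorical indicators, Poisson for counts); X\alpha gathers the main (row, column, or group) effects, \Theta is the interaction matrix, the l1 penalty promotes sparse main effects, and the nuclear norm promotes a low-rank interaction structure.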
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
R code to impute a continuous outcome. (R, 1 kb)
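The supplementary script itself is not reproduced here. As a hedged illustration of how a continuous outcome can be imputed in R, the sketch below uses the mice package with predictive mean matching on its bundled nhanes example data; this is an assumption about the general approach, not the authors' actual code.

# Illustrative multiple imputation of continuous outcomes with mice
# (not the authors' script; uses the nhanes example data shipped with mice).
library(mice)
data(nhanes)                                          # small data set with missing values
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123)  # 5 imputed data sets
complete(imp, 1)                                      # first completed data set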
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description:
Welcome to the Zenodo repository for the publication "Benchmarking imputation methods for categorical biological data", a comprehensive collection of the datasets and scripts used in our research. This repository serves as a resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.
Contents:
empirical_analysis:
simulation_analysis:
TDIP_package:
Purpose:
This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.
Citation:
When using the datasets or scripts from this repository, we kindly request citing the publication "Benchmarking imputation methods for categorical biological data" and acknowledging the use of this Zenodo repository.
Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions, such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
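As a rough illustration of the final step described above (not the authors' published function), the R sketch below superimposes the PCA ordination of an imputed data set onto the ordination of a reference data set using a Procrustes rotation from the vegan package; the example matrices are assumptions.

# Illustrative Procrustean superimposition of two PCA ordinations
# (assumed example data; not the authors' R function).
library(vegan)
set.seed(1)
ref     <- matrix(rnorm(100), ncol = 4)                   # reference morphometric data
imputed <- ref + matrix(rnorm(100, sd = 0.1), ncol = 4)   # e.g., data after imputation
pca_ref <- prcomp(ref)$x[, 1:2]                           # reference ordination
pca_imp <- prcomp(imputed)$x[, 1:2]                       # ordination after imputation
fit <- procrustes(pca_ref, pca_imp, symmetric = FALSE)
plot(fit)        # visualize how imputation shifts specimens in the ordinated space
residuals(fit)   # per-specimen displacement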
This dataset was collected from K-12 teachers via online surveys (Qualtrics). The statistical analyses were conducted in R.
In the present research, we tested whether a combination of getting perspective and exposure to relevant incremental theories can mitigate the consequences of bias on discipline decisions. We call this combination of approaches a “Bias-Consequence Alleviation” (BCA) intervention. The present research sought to determine how the following components can be integrated to reduce the process by which bias contributes to racial inequality in discipline decisions: (1) getting a misbehaving student’s perspective, “student-perspective”; (2) belief that others’ personalities can change, “student-growth”; and (3) belief that one’s own ability to sustain positive relationships can change, “relationship-growth.” Can a combination of these three components curb troublemaker-labeling and pattern-prediction responses to a Black student’s misbehavior (Exp...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product was available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments in which satellite-like gaps were introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004); they account for gap size and local vegetation conditions as factors that affect the gap-filling performance.
You can use command-line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), with each subdirectory containing one netCDF image file per day (DD) and month (MM) on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). File names follow the convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
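A minimal R sketch for reading one daily file with the ncdf4 package. The file name follows the convention above for an assumed date, and the coordinate and soil moisture variable names used here ("lon", "lat", "sm") are assumptions; check names(nc$var) and the netCDF attributes of the actual files.

# Read one daily gap-filled file with the ncdf4 package (variable names assumed).
library(ncdf4)
fname <- "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
nc  <- nc_open(fname)
print(names(nc$var))            # list the data variables contained in the file
lon <- ncvar_get(nc, "lon")     # assumed coordinate variable names
lat <- ncvar_get(nc, "lat")
sm  <- ncvar_get(nc, "sm")      # assumed surface soil moisture variable name
ncatt_get(nc, "sm", "units")    # read a variable attribute
nc_close(nc)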
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, such as:
The following records are all part of the ESA CCI Soil Moisture science data records community:
| 1 | ESA CCI SM MODELFREE Surface Soil Moisture Record | https://doi.org/10.48436/svr1r-27j77 |
Estimates of captives carried in the Atlantic slave trade by decade, 1650s to 1860s. Data: routes of voyages and recorded numbers of captives (10 variables and 33,345 cases of slave voyages). Data are organized into 40 routes linking African regions to overseas regions. Purpose: estimation of missing data and totals of captive flows. Method: techniques of Bayesian statistics to estimate missing data on routes and flows of captives. Also included is R-language code for simulating routes and populations.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
R scripts used for Monte Carlo simulations and data analyses.
STAD-R is a set of R programs that performs descriptive statistics and produces boxplots and histograms. STAD-R was designed because, before any other analysis, it is necessary to check whether the dataset has the expected number of repetitions, blocks, genotypes, and environments; whether there are missing values, and if so, where and how many; and to review the distributions and outliers. It is important to be sure that the dataset is complete and has the correct structure before carrying out other kinds of analysis.
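A generic R sketch of the kind of pre-analysis checks described above (illustration only, not STAD-R itself); the example data frame and column names are assumptions.

# Generic pre-analysis checks (illustration only; not STAD-R code).
# 'dat' is an assumed data frame with genotype, environment, block and a trait 'yield'.
dat <- data.frame(
  genotype    = rep(c("G1", "G2", "G3"), each = 4),
  environment = rep(c("E1", "E2"), times = 6),
  block       = rep(1:2, times = 6),
  yield       = c(5.1, 4.8, NA, 5.6, 6.0, NA, 5.4, 5.9, 4.7, 5.2, 5.0, NA)
)
table(dat$genotype, dat$environment)        # repetitions per genotype x environment
colSums(is.na(dat))                         # how many missing values, in which columns
which(is.na(dat$yield))                     # where the missing trait values are
boxplot(yield ~ environment, data = dat)    # distributions and outliers by environment
hist(dat$yield)                             # overall distribution of the trait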
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The objective behind this dataset was to understand the predictors that contribute to life expectancy around the world. I used Linear Regression, Decision Tree, and Random Forest for this purpose.
Steps involved:
1) Read the csv file.
2) Data cleaning:
- The variables Country and Status had character data types and had to be converted to factors.
- 2563 missing values were encountered, with the Population variable having the most missing values (652).
- Rows with missing values were dropped before running the analysis.
3) Run linear regression:
- Before running the linear regression, 3 variables were dropped because they did not have much of an effect on the dependent variable (Life Expectancy): Country, Year, and Status. This left 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION. We check for outliers using the IQR and find 54 outliers. These outliers are removed before running the regression once again; multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY. We check for multicollinearity using the VIF (Variance Inflation Factor). This is done because two or more independent variables may show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 variables with a VIF higher than 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths have strong collinearity, so we drop infant deaths (which has the higher VIF).
- When we run the linear regression model again, the VIF of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. The variable thinness1.19 is then dropped and the regression is run once more.
- The variable thinness5.9, whose absolute VIF value was 7.61, drops to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them because I consider them important independent variables.
4) Set the seed and split the data into train and test sets. We fit the model on the training data, obtaining a multiple R-squared of 86% and a p-value smaller than alpha, i.e., statistically significant. We use the trained model to predict on the test data and compute RMSE and MAPE with library(Metrics).
- In linear regression, RMSE (Root Mean Squared Error) is 3.2, meaning that on average the predicted values are off by 3.2 years compared with the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037, which corresponds to a prediction accuracy of about 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55, meaning that on average the predicted values deviate by approximately 2.55 years from the actual values.
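A condensed R sketch of the linear-regression part of this workflow (a sketch under assumptions about the file and column names, not the original notebook): drop missing rows, fit the model, check VIF with the car package, split the data, and score the predictions with the Metrics package.

# Condensed sketch of the workflow described above (not the original code).
# Assumptions: a CSV file "life_expectancy.csv" with a column "Life.expectancy".
library(car)       # vif()
library(Metrics)   # rmse(), mape(), mae()

life <- read.csv("life_expectancy.csv")
life$Country <- NULL; life$Year <- NULL; life$Status <- NULL   # drop variables as described
life <- na.omit(life)                                          # drop rows with missing values

set.seed(123)
idx   <- sample(nrow(life), 0.8 * nrow(life))                  # train/test split
train <- life[idx, ]
test  <- life[-idx, ]

fit <- lm(Life.expectancy ~ ., data = train)                   # linear regression
summary(fit)$r.squared                                         # multiple R-squared
vif(fit)                                                       # drop predictors with VIF > 5

pred <- predict(fit, newdata = test)
rmse(test$Life.expectancy, pred)                               # root mean squared error
mape(test$Life.expectancy, pred)                               # mean absolute percentage error
mae(test$Life.expectancy, pred)                                # mean absolute error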
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description of scenario analyses to test assumptions related to missing data and loss-to-follow-up.
A MOCK dataset used to show how to import Qualtrics metadata into the codebook R package.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
| name | label | n_missing |
|---|---|---|
| ResponseSet | NA | 0 |
| Q7 | NA | 0 |
| Q10 | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
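A rough sketch of how such a mock dataset might be summarised with the codebook package; the file name and the choice of codebook_table() as the helper are assumptions, so consult the codebook package documentation for the actual Qualtrics import workflow.

# Rough sketch (assumptions: the file name and that codebook_table() is the right helper).
library(codebook)
mock <- readRDS("mock_qualtrics.rds")   # assumed file name for the mock data
codebook_table(mock)                    # variable names, labels, and n_missing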
Paleontological investigations into morphological diversity, or disparity, are often confronted with large amounts of missing data. We illustrate how missing discrete data affects disparity using a novel simulation for removing data based on parameters from published datasets that contain both extinct and extant taxa. We develop an algorithm that assesses the distribution of missing characters in extinct taxa and simulates data loss by applying that distribution to extant taxa. We term this technique ‘linkage’. We compare differences in disparity metrics and ordination spaces produced by linkage and by random character removal. When we incorporated linkage among characters, disparity metrics declined and ordination spaces shrank at a slower rate with increasing missing data, indicating that correlations among characters govern the sensitivity of disparity analysis. We also present and test a new disparity method that uses the linkage algorithm to correct for the bias caused by missing data...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains 2674 intermittent monthly time series that represent car parts sales from January 1998 to March 2002. It was extracted from the R expsmooth package.
The original dataset contains missing values, which have been replaced by zeros.
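A minimal R sketch of how the source data can be pulled from the expsmooth package and how the zero-replacement described above can be reproduced; the carparts object is the dataset provided by expsmooth, but the replacement step shown here is an assumption about how this dataset was prepared.

# Load the car parts series from the expsmooth package and replace NAs with zeros,
# as described above (sketch; not necessarily the exact preparation script).
library(expsmooth)
data("carparts")              # 2674 monthly series, Jan 1998 - Mar 2002
sum(is.na(carparts))          # count missing values in the original data
carparts_filled <- carparts
carparts_filled[is.na(carparts_filled)] <- 0   # replace missing values with zeros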
These data describe the percent of cropland harvested as wheat, corn, and soybean within each basin (basins 1-8; see accompanying shapefiles). Data are available for other crops; however, these three were chosen because wheat is a traditional crop that has been grown in the Basin for a long time, and corn and soybeans have increased in recent times because of wetter conditions, the demand for biofuels, and advances in breeding short-season, drought-tolerant crops. The data come from the National Agricultural Statistics Service (NASS) Census of Agriculture (COA) and have estimates for 1974, 1978, 1982, 1986, 1992, 1997, 2002, 2007, and 2012. Years with missing data were estimated using multivariate imputation of missing values with principal components analysis (PCA) via the function imputePCA in the R (R Core Team, 2015) package missMDA (Husson and Josse, 2015). In the interest of dimension reduction, the scores of the first principal component of a principal component analysis, by basin, of the wheat, corn, and soy variables are included. Husson, F., and Josse, J., 2015, missMDA—Handling missing values with multivariate data analysis: R package version 1.9, https://CRAN.R-project.org/package=missMDA. R Core Team, 2015, R: A language and environment for statistical computing: R Foundation for Statistical Computing, Vienna, http://www.R-project.org.
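A small R sketch of the missMDA workflow referenced above (illustrative only; the crop-percentage matrix is simulated here rather than taken from the Census data, and the number of components is chosen with estim_ncpPCA rather than fixed):

# Illustrative missMDA workflow for imputing crop percentages (not the original script).
library(missMDA)
set.seed(1)
crops <- matrix(runif(40, 0, 100), nrow = 10,
                dimnames = list(NULL, c("wheat", "corn", "soy", "other")))
crops[sample(length(crops), 6)] <- NA            # introduce some missing census years
ncp <- estim_ncpPCA(crops, ncp.max = 2)$ncp      # choose the number of PCA dimensions
imp <- imputePCA(crops, ncp = ncp)               # impute the missing values
crops_complete <- imp$completeObs                # completed data matrix
pc1 <- prcomp(crops_complete, scale. = TRUE)$x[, 1]   # scores of the first principal component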
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The genotypic file requires a header row, which shows the names of the individuals. The first column shows the names of the molecular markers or SNPs. For each SNP, the genotypes (e.g., AA, AT, TT) must be transformed to integer format (0, 1, or 2) based on allelic frequency. For example, if A is the major allele, AA and TT are transformed to 0 and 2, and AT is transformed to 1. The value represents the copy number of the minor allele for a line at a SNP locus, relative to the major allele. The phenotype file must contain a header row, which gives the names of the traits. The first column shows the names of the individuals (Table 2). The trait values are numeric, and missing values are indicated as NA, following the R coding convention.
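A small R sketch of the 0/1/2 coding described above (illustration only; the input format, with one vector of letter genotypes per SNP, is an assumption):

# Convert letter genotypes (e.g., "AA", "AT", "TT") to minor-allele counts (0/1/2),
# relative to the major allele, as described above. Illustration only.
geno_to_count <- function(g) {
  alleles <- unlist(strsplit(g[!is.na(g)], ""))   # split genotypes into alleles
  major   <- names(which.max(table(alleles)))     # most frequent allele = major allele
  sapply(g, function(x) {
    if (is.na(x)) return(NA_integer_)
    sum(strsplit(x, "")[[1]] != major)            # count minor-allele copies
  })
}

geno_to_count(c("AA", "AT", "TT", NA, "AA"))   # -> 0 1 2 NA 0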