100+ datasets found

o
Water-quality data imputation with a high percentage of missing values: a...
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Apr 30, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Apr 30, 2021
Authors
Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries. This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges. To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)). IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases. In this dataset, we include the original and imputed values for the following variables: Water temperature (Tw) Dissolved oxygen (DO) Electrical conductivity (EC) pH Turbidity (Turb) Nitrite (NO2-) Nitrate (NO3-) Total Nitrogen (TN) Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]. More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318. If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318 {"references": ["Rodr\u00edguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318"]}
Z
Missing data in the analysis of multilevel and dependent data (Examples)
data.niaid.nih.gov
zenodo.org
Updated Jul 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
Explore at:
Dataset updated
Jul 20, 2023
Dataset provided by
Oliver Lüdtke
Alexander Robitzsch
Simon Grund
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

ID = group identifier (1-2000) x = numeric (Level 1) y = numeric (Level 1) w = binary (Level 2)

In all data sets, missing values are coded as "NA".
Data from: Missing data estimation in morphometrics: how much is too much?
zenodo.org
data.niaid.nih.gov
+2more
Updated Jun 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julien Clavel; Gildas Merceron; Gilles Escarguel; Julien Clavel; Gildas Merceron; Gilles Escarguel (2022). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.f0b50
Dataset updated
Jun 1, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julien Clavel; Gildas Merceron; Gilles Escarguel; Julien Clavel; Gildas Merceron; Gilles Escarguel
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
f
R scripts used for Monte Carlo simulations and data analyses.
plos.figshare.com
zip
Updated Jan 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lateef Babatunde Amusa; Twinomurinzi Hossana (2024). R scripts used for Monte Carlo simulations and data analyses. [Dataset]. http://doi.org/10.1371/journal.pone.0297037.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0297037.s001
Dataset updated
Jan 19, 2024
Dataset provided by
PLOS ONE
Authors
Lateef Babatunde Amusa; Twinomurinzi Hossana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R scripts used for Monte Carlo simulations and data analyses.
u
Example data simulated using the R package survtd
figshare.unimelb.edu.au
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Margarita Moreno-Betancur (2023). Example data simulated using the R package survtd [Dataset]. http://doi.org/10.4225/49/58e58a8dc39a6
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.4225/49/58e58a8dc39a6
Dataset updated
May 31, 2023
Dataset provided by
The University of Melbourne
Authors
Margarita Moreno-Betancur
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This example dataset is used to illustrate the usage of the R package survtd in the Supplementary Materials of the paper:Moreno-Betancur M, Carlin JB, Brilleman SL, Tanamas S, Peeters A, Wolfe R (2017). Survival analysis with time-dependent covariates subject to measurement error and missing data: Two-stage joint model using multiple imputation (submitted).The data was generated using the simjm function of the package, using the following code:dat
f
Additional file 4 of Heckman imputation models for binary or continuous MNAR...
springernature.figshare.com
txt
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 4 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038104.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7038104.v1
Dataset updated
Jun 1, 2023
Dataset provided by
figshare
Authors
Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R code to impute binary outcome. (R 1 kb)
z
Missing data in the analysis of multilevel and dependent data (Example data...
zenodo.org
bin
Updated Jul 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simon Grund; Simon Grund; Oliver Lüdtke; Oliver Lüdtke; Alexander Robitzsch; Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Example data sets) [Dataset]. http://doi.org/10.5281/zenodo.7773614
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7773614
Dataset updated
Jul 20, 2023
Dataset provided by
Springer
Authors
Simon Grund; Simon Grund; Oliver Lüdtke; Oliver Lüdtke; Alexander Robitzsch; Alexander Robitzsch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").

The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)

In all data sets, missing values are coded as "NA".
Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...
figshare.com
bin
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter (2023). Sensitivity analysis for missing data in cost-effectiveness analysis: Stata code [Dataset]. http://doi.org/10.6084/m9.figshare.6714206.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6714206.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Baptiste Leurent; Manuel Gomes; Rita Faria; Stephen Morris; Richard Grieve; James R Carpenter
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Stata do-files and data to support tutorial "Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis" (Leurent, B. et al. PharmacoEconomics (2018) 36: 889).Do-files should be similar to the code provided in the article's supplementary material.Dataset based on 10 Top Tips trial, but modified to preserve confidentiality. Results will differ from those published.
f
Example of a missing data pattern, where 1 = available, and 0 = missing.
plos.figshare.com
xls
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joost A. Agelink van Rentergem; Jaap M. J. Murre; Hilde M. Huizenga (2023). Example of a missing data pattern, where 1 = available, and 0 = missing. [Dataset]. http://doi.org/10.1371/journal.pone.0173218.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0173218.t001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Joost A. Agelink van Rentergem; Jaap M. J. Murre; Hilde M. Huizenga
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For each test, and each study, there are scores missing, although all test co-occur at least once.
Data from: Benchmarking imputation methods for categorical biological data
zenodo.org
data.niaid.nih.gov
zip
Updated Mar 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthieu Gendre; Torsten Hauffe; Torsten Hauffe; Catalina Pimiento; Catalina Pimiento; Daniele Silvestro; Daniele Silvestro; Matthieu Gendre (2024). Benchmarking imputation methods for categorical biological data [Dataset]. http://doi.org/10.5281/zenodo.10800016
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10800016
Dataset updated
Mar 10, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Matthieu Gendre; Torsten Hauffe; Torsten Hauffe; Catalina Pimiento; Catalina Pimiento; Daniele Silvestro; Daniele Silvestro; Matthieu Gendre
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 9, 2024
Description
Description:

Welcome to the Zenodo repository for Publication Benchmarking imputation methods for categorical biological data, a comprehensive collection of datasets and scripts utilized in our research endeavors. This repository serves as a vital resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.

Contents:

empirical_analysis:

Trait Dataset of Elasmobranchs: A collection of trait data for elasmobranch species obtained from FishBase , stored as RDS file.

Phylogenetic Tree: A phylogenetic tree stored as a TRE file.

Imputations Replicates (Imputation): Replicated imputations of missing data in the trait dataset, stored as RData files.

Error Calculation (Results): Error calculation results derived from imputed datasets, stored as RData files.

Scripts: Collection of R scripts used for the implementation of empirical analysis.

simulation_analysis:

Input Files: Input files utilized for simulation analyses as CSV files

Data Distribution PDFs: PDF files displaying the distribution of simulated data and the missingness.

Output Files: Simulated trait datasets, trait datasets with missing data, and trait imputed datasets with imputation errors calculated as RData files.

Scripts: Collection of R scripts used for the simulation analysis.

TDIP_package:

Scripts of the TDIP Package: All scripts related to the Trait Data Imputation with Phylogeny (TDIP) R package used in the analyses.

Purpose:

This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.

Citation:

When using the datasets or scripts from this repository, we kindly request citing Publication Benchmarking imputation methods for categorical biological data and acknowledging the use of this Zenodo repository.

Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
r
Air Quality Monitoring - 2019 (grouped by pollutant)
researchdata.edu.au
data.qld.gov.au
+1more
Updated Apr 20, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Environment, Tourism, Science and Innovation (2020). Air Quality Monitoring - 2019 (grouped by pollutant) [Dataset]. https://researchdata.edu.au/air-quality-monitoring-grouped-pollutant/1459199
Explore at:
Dataset updated
Apr 20, 2020
Dataset provided by
data.qld.gov.au
Authors
Environment, Tourism, Science and Innovation
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annual hourly air quality and meteorological data by pollutant for the 2019 calendar year. For more information on air quality, including live air data, please visit www.qld.gov.au/environment/pollution/monitoring/air. \r \r Data resolution: One-hour average values \r Data row timestamp: Start of averaging period \r Missing data/not monitored: Blank cell \r Sampling height: Four metres above ground (unless otherwise indicated)
Z
Car Parts Dataset (with Missing Values)
data.niaid.nih.gov
zenodo.org
Updated Apr 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Webb, Geoff (2021). Car Parts Dataset (with Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3994894
Explore at:
Dataset updated
Apr 1, 2021
Dataset provided by
Bergmeir, Christoph
Webb, Geoff
Godahewa, Rakshitha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 2674 intermittent monthly time series that represent car parts sales from January 1998 to March 2002. It was extracted from R expsmooth package.
d
Slave Routes Datasets, 1650s - 1860s
search.dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manning, Patrick; Liu, Yu (2023). Slave Routes Datasets, 1650s - 1860s [Dataset]. http://doi.org/10.7910/DVN/6HLXO3
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/6HLXO3
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Manning, Patrick; Liu, Yu
Time period covered
Jan 1, 1650 - Jan 1, 1870
Description
Estimates of captives carried in the Atlantic slave trade by decade, 1650s to 1860s. Data: routes of voyages and recorded numbers of captives (10 variables and 33,345 cases of slave voyages). Data are organized into 40 routes linking African regions to overseas regions. Purpose: estimation of missing data and totals of captive flows. Method: techniques of Bayesian statistics to estimate missing data on routes and flows of captives. Also included is R-language code for simulating routes and populations
f
MAPE and PB statistics for IBFI compared with other imputation methods...
plos.figshare.com
xls
Updated Jun 6, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adil Aslam Mir; Kimberlee Jane Kearfott; Fatih Vehbi Çelebi; Muhammad Rafique (2023). MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR). [Dataset]. http://doi.org/10.1371/journal.pone.0262131.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0262131.t003
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS ONE
Authors
Adil Aslam Mir; Kimberlee Jane Kearfott; Fatih Vehbi Çelebi; Muhammad Rafique
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR).
Supplemental Material for Evaluating the Effects of Randomness on Missing...
osf.io
Updated Nov 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Robert J. Bischoff; Cecilia Padilla-Iglesias; Claudine Gravel-Miguel (2021). Supplemental Material for Evaluating the Effects of Randomness on Missing Data in Archaeological Networks [Dataset]. http://doi.org/10.17605/OSF.IO/J36EB
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/J36EB
Dataset updated
Nov 15, 2021
Dataset provided by
Center for Open Sciencehttps://cos.io/
Authors
Robert J. Bischoff; Cecilia Padilla-Iglesias; Claudine Gravel-Miguel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw data and R code for the paper "Evaluating the Effects of Randomness on Missing Data in Archaeological Networks"
d
Data from: Using decision trees to understand structure in missing data
datamed.org
data.niaid.nih.gov
+1more
Updated Jun 2, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2015). Data from: Using decision trees to understand structure in missing data [Dataset]. https://datamed.org/display-item.php?repository=0010&id=5937ae305152c60a13865bb4&query=CARTPT
Explore at:
Dataset updated
Jun 2, 2015
Description
Objectives: Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data. Setting: Data taken from employees at 3 different industrial sites in Australia. Participants: 7915 observations were included. Materials and methods: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. Results: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Conclusions: Researchers are encouraged to use CART and BRT models to explore and understand missing data.
n
Datasets with synthetically generated missingess structures.
data.ncl.ac.uk
csv
Updated Apr 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sara Johansson Fernstad (2025). Datasets with synthetically generated missingess structures. [Dataset]. http://doi.org/10.25405/data.ncl.28680893.v1
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.25405/data.ncl.28680893.v1
Dataset updated
Apr 1, 2025
Dataset provided by
Newcastle University
Authors
Sara Johansson Fernstad
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets with synthetically generated missingess structures, based on the publicly available "BreastCancerCoimbra" dataset (M. Patrício, J. Pereira, J. Crisóstomo, P. Matafome, M. Gomes, R. Seiça, and F. Caramelo. "Using resistin, glucose, age and bmi to predict the presence of breast cancer". BMC cancer, 18(1):29, 2018.The datasets are part of the supplemental material for: Johansson Fernstad, S., Alsufyani, S., Del-Din, S., Yarnall, A., & Rochester, L. (2025). "To Measure What Isn’t There — Visual Exploration of Missingness Structures Using Quality Metrics", which is under review. The generation of the synthetic missingness structures are described in this paper.
Simulation Linear Regression
kaggle.com
Updated Dec 17, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ricky (2016). Simulation Linear Regression [Dataset]. https://www.kaggle.com/zurfer/simulation-linear-regression/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 17, 2016
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ricky
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

There's a story behind every dataset and here's your opportunity to share yours.

Content

What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?
No data
catalog.data.gov
datadiscoverystudio.org
+1more
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). No data [Dataset]. https://catalog.data.gov/dataset/no-data
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
Manuscript provides a look-up table to predict exposures from minimal information using ECETOC TRA software. This dataset is not publicly accessible because: There is no EPA-generated data. It can be accessed through the following means: Rosemarie Zaleski of ExxonMobile Biosciences created "look-up table" using freely available ECETOC TRA software *http://www.ecetoc.org/tools/targeted-risk-assessment-tra/download-integrated-tool/). Format: There is no EPA-generated data. This dataset is associated with the following publication: Dellarco, M., R. Zaleski , B. Gaborek , H. Qian, C. Bellin , P. Egeghy, N. Heard , O. Jolliet, D. Lander, N. Sunger , K. Stylianou , and J. Tanir. Using exposure bands for rapid decision making in the RISK21 tiered exposureassessment. CRITICAL REVIEWS IN TOXICOLOGY. CRC Press LLC, Boca Raton, FL, USA, online, (2017).
o
Sensitivity of global terrestrial ecosystems to climate variability: data...
ora.ox.ac.uk
zip
Updated Jan 1, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of Oxford (2016). Sensitivity of global terrestrial ecosystems to climate variability: data and R code [Dataset]. http://doi.org/10.5287/bodleian:VY2PeyGX4
Explore at:
zip(31188451), zip(2143203213), zip(2482430666), zip(2756988208), zip(1932114998), zip(510097482), zip(963288447)Available download formats
Unique identifier
https://doi.org/10.5287/bodleian:VY2PeyGX4
Dataset updated
Jan 1, 2016
Dataset provided by
University of Oxford
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Time period covered
2000 - 2013
Area covered
Global (-180 - 180), (-60 - 90)
Description
Data and coding scripts for Seddon et al. (2016) Nature (DOI 10.1038/nature16986). We derived monthly time-series of four key terrestrial ecosystem variables at 0.05 degree (~5km) resolution from observations by the MODIS sensor on Terra (AM) for the period February 2010-December 2013 inclusive, and developed a method to identify vegetation sensitivity to climate variability over this period (see Methods in main paper).

This ORA item contains all data and files required to run the analysis described in the paper. Data required to run the script are provided in six zip files evi.zip, temp.zip, aetpet.zip, cld.zip, stdev.zip, numpxl.zip, each containing 167 text files, one per month of available data, in addition to a supporting files folder. Details are as follows.

supporting_files.zip : This directory includes computer code and additional supporting files. Please see the 'read me.txt' file within this directory for more information.

evi.zip: ENHANCED VEGETATION INDEX (EVI). We used the MOD13C2 product (Huete et al 2002) which comprises monthly, global EVI at 0.05 degree resolution. In some cases where no clear-sky observations are available, the MOD13C2 version 5 product replaces no-data values with climatological monthly means, so we removed these values where appropriate.

EVI format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = dimensionless scale factor = 10000 (divide the value by 10000 to get EVI) filenames = yyyymmevi.txt

numpxl.zip - COUNTS OF THE NUMBER OF PIXELS USED IN EVI CALCULATION. The MOD13C2 product is the result of a spatially and temporally averaged mosaic of higher resolution (1km pixels). Data in this directory represent the number of 1km observations used to calculate the MODIS EVI product. See the online documentation for more details (Solano et al. 2010).

numpxl format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = counts filenames = yyyy_mm_numpxl_pt05deg.txt

stdev.zip - STANDARD DEVIATION OF EVI VALUES. Standard deviation of the monthly EVI observations. See discussion in numpxl.zip item (above) and the online documentation for more details (Solano et al. 2010).

stdev format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = dimensionless scale factor = 10000 (divide the value by 10000 to get EVI) filenames = yyyy_mm_stdev_pt05deg.txt

temp.zip: AIR TEMPERATURE. We used the MOD07_L2 Atmospheric Profile product (Seeman et al 2006) as a measure of air temperature. Five-minute swaths of Retrieved Temperature Profile were projected to geographic co-ordinates. Pixels from the highest available pressure level, corresponding to the temperature closest to the Earth's surface, were selected from each swath. Swaths were then mean-mosaicked into global daily grids, and the daily global grids were mean-composited to monthly grids of air temperature.

Air temperature format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = degrees C scale factor = 1 (divide the value by 1 to get Air temperature) filenames = yyyymmtemp.txt

aetpet.zip: WATER AVAILABILITY. We used the MOD16 Global Evapotranspiration product (Mu et al 2011) to calculate the monthly 0.05 degree ratio of Actual to Potential Evapotranspiration (AET/PET).

AET/PET format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = dimensionless scale factor = 10000 (divide the value by 10000 to get AET/PET) filenames = yyyymmaetpet.txt

cld.zip - CLOUDINESS. We used the MOD35_L2 Cloud Mask product (Ackerman et al 2010). This product provides daily records on the presence of cloudy vs cloudless skies, and we used this to make an index of the proportion of of cloudy to clear days in a given pixel. After conversion to geographic co-ordinates, five-minute swaths at 1-km resolution were reclassed as clear sky or cloudy, and these daily swaths were mean-mosaicked to global coverages, mean composited from daily to monthly, and mean-aggregated from 1km to 0.05 degree.

cld format = ascii text file projection = geographic projection spatial resolution = 0.05 degrees min x = -180 max x = 180 min y = -60 max x = 90 rows = 3000 cols = 7200 bit depth = 16 bit signed integer nodata (sea) = -9999 missing data (on land) = -999 units = percentage of days in the month which were cloudy scale factor = 100 (divide the value by 100 to get percentage cloudy days) filenames = yyyymmcld.txt

References

Ackerman, S. et al. (2010) Discriminating clear-sky from cloud with MODIS: Algorithm Theoretical Basis Document (MOD35), Version 6.1. (URL: ttp://modis- atmos.gsfc.nasa.gov/_docs/MOD35_A TBD_Collection6.pdf)

Huete, A. et al. (2002) Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sensing of Environment 83, 195–213.

Mu, Q., Zhao, M., Running, S.R. (2011) Improvements to a MODIS global terrestrial evapotranspiration algorithm. Remote Sensing of Environment 115, 1781-1800

Seeman, S. W., Borbas, E. E., Li, J., Menzel, W. P. & Gumley, L. E. (2006) MODIS Atmospheric Profile Retrieval Algorithm Theoretical Basis Document, Version 6 (URL: http://modis-atmos.gsfc.nasa.gov/_docs/MOD07_atbd_v7_April2011.pdf)

Solano, R. et al. (2010) MODIS Vegetation Index User’s Guide (MOD13 Series) Version 2.00, May 2010 (Collection 5) (URL: http://vip.arizona.edu/documents/MODIS/MODIS_VI_UsersGuide_01_2012.pdf) Seddon et al. (2016) Nature (DOI 10.1038/nature16986) ABSTRACT: Identification of properties that contribute to the persistence and resilience of ecosystems despite climate change constitutes a research priority of global significance. Here, we present a novel, empirical approach to assess the relative sensitivity of ecosystems to climate variability, one property of resilience that builds on theoretical modelling work recognising that systems closer to critical thresholds respond more sensitively to external perturbations. We develop a new metric, the Vegetation Sensitivity Index (VSI) which identifies areas sensitive to climate variability over the past 14 years. The metric uses time-series data of MODIS derived Enhanced Vegetation Index (EVI) and three climatic variables that drive vegetation productivity (air temperature, water availability and cloudiness). Underlying the analysis is an autoregressive modelling approach used to identify regions with memory effects and reduced response rates to external forcing. We find ecologically sensitive regions with amplified responses to climate variability in the arctic tundra, parts of the boreal forest belt, the tropical rainforest, alpine regions worldwide, steppe and prairie regions of central Asia and North and South America, the Caatinga deciduous forest in eastern South America, and eastern areas of Australia. Our study provides a quantitative methodology for assessing the relative response rate of ecosystems – be they natural or with a strong anthropogenic signature – to environmental variability, which is the first step to address why some regions appear to be more sensitive than others and what impact this has upon the resilience of ecosystem service provision and human wellbeing.

Facebook

Twitter

Click to copy link

Link copied

Cite

Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169

Water-quality data imputation with a high percentage of missing values: a machine learning approach

Explore at:

27 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.5281/zenodo.4731169

Dataset updated

Apr 30, 2021

Authors

Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione

Description

The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries. This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges. To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)). IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases. In this dataset, we include the original and imputed values for the following variables: Water temperature (Tw) Dissolved oxygen (DO) Electrical conductivity (EC) pH Turbidity (Turb) Nitrite (NO2-) Nitrate (NO3-) Total Nitrogen (TN) Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]. More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318. If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318 {"references": ["Rodr\u00edguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318"]}

Clear search

Close search

Google apps

Main menu

Water-quality data imputation with a high percentage of missing values: a...

Missing data in the analysis of multilevel and dependent data (Examples)

Data from: Missing data estimation in morphometrics: how much is too much?

R scripts used for Monte Carlo simulations and data analyses.

Example data simulated using the R package survtd

Additional file 4 of Heckman imputation models for binary or continuous MNAR...

Missing data in the analysis of multilevel and dependent data (Example data...

Sensitivity analysis for missing data in cost-effectiveness analysis: Stata...

Example of a missing data pattern, where 1 = available, and 0 = missing.

Data from: Benchmarking imputation methods for categorical biological data

Air Quality Monitoring - 2019 (grouped by pollutant)

Car Parts Dataset (with Missing Values)

Slave Routes Datasets, 1650s - 1860s

MAPE and PB statistics for IBFI compared with other imputation methods...

Supplemental Material for Evaluating the Effects of Randomness on Missing...

Data from: Using decision trees to understand structure in missing data

Datasets with synthetically generated missingess structures.

Simulation Linear Regression

Context

Content

Acknowledgements

Inspiration

No data

Sensitivity of global terrestrial ecosystems to climate variability: data...

Water-quality data imputation with a high percentage of missing values: a machine learning approachSee More Versions

Water-quality data imputation with a high percentage of missing values: a machine learning approach