This SOils DAta Harmonization (SoDaH) database is designed to bring together soil carbon data from diverse research networks into a harmonized dataset that can be used for synthesis activities and model development. The research network sources for SoDaH span different biomes and climates, encompass multiple ecosystem types, and have collected data across a range of spatial, temporal, and depth gradients. The rich data sets assembled in SoDaH consist of observations from monitoring efforts and long-term ecological experiments. The SoDaH database also incorporates related environmental covariate data pertaining to climate, vegetation, soil chemistry, and soil physical properties. The data are harmonized and aggregated using open-source code that enables a scripted, repeatable approach for soil data synthesis.
Public data used for data harmonization.
This dataset is associated with the following publication: Uhran, B., L. Windham-Myers, N. Bliss, A. Nahlik, E. Sundquist, and C. Stagg. Improved Wetland Soil Organic Carbon Stocks of the Conterminous U.S. Through Data Harmonization. Frontiers in Soil Science. Frontiers, Lausanne, SWITZERLAND, 1: 706701, (2021).
ST_LUCAS is a harmonized dataset derived from the LUCAS (Land Use and Coverage Area frame Survey) dataset. LUCAS is a Eurostat activity that has performed repeated in situ surveys over Europe every three years since 2006. Original LUCAS data (https://ec.europa.eu/eurostat/web/lucas/data), starting with the 2006 survey, were harmonized into a common nomenclature based on the 2018 survey. The ST_LUCAS dataset is provided in two versions:
lucas_points: each LUCAS survey is represented by a single record
lucas_st_points: each LUCAS point is represented by a single location calculated from multiple surveys and by a set of harmonized attributes for each survey year
Harmonization and space-aggregation of LUCAS data were performed by the ST_LUCAS system, available from https://geoforall.fsv.cvut.cz/st_lucas. The methodology is described in Landa, M.; Brodský, L.; Halounová, L.; Bouček, T.; Pešek, O. Open Geospatial System for LUCAS In Situ Data Harmonization and Distribution. ISPRS Int. J. Geo-Inf. 2022, 11, 361. https://doi.org/10.3390/ijgi11070361.
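The difference between the two versions can be illustrated with a minimal Python sketch. It is not the ST_LUCAS implementation: the field names (point_id, year, x, y, lc1), the sample values, and the rule of averaging coordinates to obtain the single location are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-survey records in the style of lucas_points:
# one row per point per survey year. Field names are illustrative only.
survey_records = [
    {"point_id": 1, "year": 2009, "x": 14.4201, "y": 50.0871, "lc1": "B11"},
    {"point_id": 1, "year": 2012, "x": 14.4203, "y": 50.0873, "lc1": "B11"},
    {"point_id": 1, "year": 2015, "x": 14.4202, "y": 50.0872, "lc1": "E20"},
    {"point_id": 2, "year": 2012, "x": 16.6070, "y": 49.1950, "lc1": "C10"},
]


def aggregate_points(records):
    """Collapse per-survey rows into one row per point.

    Assumption: the single location is the mean of the surveyed
    coordinates; the actual ST_LUCAS rule may differ.
    """
    by_point = defaultdict(list)
    for rec in records:
        by_point[rec["point_id"]].append(rec)

    aggregated = []
    for point_id, rows in sorted(by_point.items()):
        out = {
            "point_id": point_id,
            "x": round(mean(r["x"] for r in rows), 4),
            "y": round(mean(r["y"] for r in rows), 4),
        }
        # One attribute column per survey year, e.g. lc1_2009.
        for r in sorted(rows, key=lambda r: r["year"]):
            out[f"lc1_{r['year']}"] = r["lc1"]
        aggregated.append(out)
    return aggregated


# Rows in the style of lucas_st_points: one per point.
st_points = aggregate_points(survey_records)
```

In this sketch a point that was not visited in a given year simply has no column for that year; a real table would more likely carry an explicit null.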
List of harmonized LUCAS attributes: https://geoforall.fsv.cvut.cz/st_lucas/tables/list_of_attributes.html
The ST_LUCAS dataset is provided under the same conditions (“free of charge”) as the original LUCAS data (https://ec.europa.eu/eurostat/web/lucas/data).
The program PanTool was developed as a toolbox, like a Swiss Army knife, for data conversion and recalculation, written to harmonize individual data collections to the standard import format used by PANGAEA. PanTool requires input files in tabular form, saved as plain ASCII text. The user can create these files with a spreadsheet program such as MS Excel or with the system text editor. PanTool is distributed as freeware for the operating systems Microsoft Windows, Apple OS X and Linux.
The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations in statistically underpowered sample cohorts but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) type values at the same time. Algorithms for batch effect removal, such as the ComBat algorithm, commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, which performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT)-plexes, compared to commonly used internal reference scaling (iRS). Due to the matrix dissection approach, which needs no data imputation, the HarmonizR algorithm can be applied to any type of -omics data while assuring minimal data loss.
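One plausible reading of matrix dissection is sketched below in Python. It is not the HarmonizR code: the toy data, the MIN_PER_BATCH threshold, and the helper names are assumptions, and per-batch mean centering is swapped in for ComBat so that the example stays self-contained. The point it shows is that proteins are grouped by the set of batches in which they were observed, each group is corrected only across those batches, and the groups are then merged again, so no protein is dropped and no value is imputed.

```python
from collections import defaultdict
from statistics import mean

# Toy log-intensity matrix: protein -> one value per sample (None = missing).
samples = ["a1", "a2", "b1", "b2", "c1", "c2"]
batch_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
data = {
    "P1": [10.0, 12.0, 14.0, 16.0, 20.0, 22.0],  # observed in A, B and C
    "P2": [5.0, 7.0, None, None, 9.0, 11.0],     # missing in all of batch B
    "P3": [None, None, 8.0, 9.0, None, None],    # observed in batch B only
}
MIN_PER_BATCH = 2  # assumed minimum number of observed values per batch


def usable_batches(values):
    """Return the batches in which a protein has enough observed values."""
    counts = defaultdict(int)
    for s, v in zip(samples, values):
        if v is not None:
            counts[batch_of[s]] += 1
    return frozenset(b for b, n in counts.items() if n >= MIN_PER_BATCH)


def center_batches(values, batches):
    """Stand-in for ComBat: shift each batch mean to the protein's overall mean.

    Values in batches below the threshold are set to missing in this sketch.
    """
    keep = [(s, v) for s, v in zip(samples, values)
            if v is not None and batch_of[s] in batches]
    overall = mean(v for _, v in keep)
    batch_mean = {b: mean(v for s, v in keep if batch_of[s] == b)
                  for b in batches}
    return [
        v - batch_mean[batch_of[s]] + overall
        if v is not None and batch_of[s] in batches else None
        for s, v in zip(samples, values)
    ]


# 1. Dissect: group proteins by their pattern of usable batches.
sub_matrices = defaultdict(dict)
for protein, values in data.items():
    sub_matrices[usable_batches(values)][protein] = values

# 2. Correct each sub-matrix across its own batches, and 3. merge the results.
harmonized = {}
for batches, block in sub_matrices.items():
    for protein, values in block.items():
        if len(batches) < 2:
            harmonized[protein] = values  # single batch: nothing to correct
        else:
            harmonized[protein] = center_batches(values, batches)
```

P2, which is entirely missing in batch B, is still corrected between batches A and C, whereas an approach that requires complete data across batches would have to discard or impute it.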
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an updated version of the original study protocol under the title “Negative Affectivity Data Harmonization” that was pre-registered in OSF on September 4th, 2022 (osf.io/kqsn9).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description and harmonization strategy for the predictor variables.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Harmonized Income Dataset provides harmonized individual-level survey variables on personal and household income from 19 major cross-national survey projects, as well as technical variables necessary to match them to the Survey Data Recycling Master File version 1 (SDR v.1, DOI:10.7910/DVN/VWGF5Q), which contains harmonized survey items on political participation, political attitudes, as well as their selected correlates.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This document outlines the creation of a global inventory of reference samples and Earth Observation (EO) / gridded datasets for the Global Pasture Watch (GPW) initiative. This inventory supports the training and validation of machine-learning models for GPW grassland mapping. This documentation outlines methodology, data sources, workflow, and results.
Keywords: Grassland, Land Use, Land Cover, Gridded Datasets, Harmonization
Create a global inventory of existing reference samples for land use and land cover (LULC);
Compile global EO / gridded datasets that capture LULC classes and harmonize them to match the GPW classes;
Develop automated scripts for data harmonization and integration.
Datasets incorporated:
Datasets | Spatial distribution | Time period | Number of individual samples |
WorldCereal | Global | 2016-2021 | 38,267,911 |
Global Land Cover Mapping and Estimation (GLanCE) | Global | 1985-2021 | 31,061,694 |
EuroCrops | Europe | 2015-2022 | 14,742,648 |
GeoWiki G-GLOPS training dataset | Global | 2021 | 11,394,623 |
MapBiomas Brazil | Brazil | 1985-2018 | 3,234,370 |
Land Use/Land Cover Area Frame Survey (LUCAS) | Europe | 2006-2018 | 1,351,293 |
Dynamic World | Global | 2019-2020 | 1,249,983 |
Land Change Monitoring, Assessment, and Projection (LCMap) | U.S. (CONUS) | 1984-2018 | 874,836 |
GeoWiki 2012 | Global | 2011-2012 | 151,942 |
PREDICTS | Global | 1984-2013 | 16,627 |
CropHarvest | Global | 2018-2021 | 9,714 |
Total: 102,355,642 samples
We harmonized global reference samples and EO/gridded datasets to align with GPW classes, optimizing their integration into the GPW machine-learning workflow.
We considered reference samples derived by visual interpretation with spatial support of at least 30 m (Landsat and Sentinel), that could represent LULC classes for a point or region.
Each dataset was processed using automated Python scripts to download vector files and convert the original LULC classes into the following GPW classes:
0. Other land cover
1. Natural and Semi-natural grassland
2. Cultivated grassland
3. Crops and other related agricultural practices
We empirically assigned a weight to each sample based on the original dataset's class description, reflecting the level of mixture within the class. The weights range from 1 (Low) to 3 (High), with higher weights indicating greater mixture. Samples with low mixture levels are more accurate and effective for differentiating typologies and for validation purposes.
The harmonized dataset includes these columns:
Attribute Name | Definition |
dataset_name | Original dataset name |
reference_year | Reference year of samples from the original dataset |
original_lulc_class | LULC class from the original dataset |
gpw_lulc_class | Global Pasture Watch LULC class |
sample_weight | Sample's weight based on the mixture level within the original LULC class |
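The class conversion and weighting described above can be sketched in Python as follows. The lookup table, dataset names, and original class labels are hypothetical placeholders, not the mappings used by GPW; only the four GPW classes, the 1 to 3 weight range, and the output column names come from the text.

```python
# GPW classes as listed in the text.
GPW_CLASSES = {
    0: "Other land cover",
    1: "Natural and Semi-natural grassland",
    2: "Cultivated grassland",
    3: "Crops and other related agricultural practices",
}

# Hypothetical lookup: (dataset name, original class) -> (GPW class, weight).
# Weights follow the text: 1 = low mixture, 3 = high mixture.
CLASS_MAP = {
    ("dataset_a", "pasture"): (2, 1),
    ("dataset_a", "wheat"): (3, 1),
    ("dataset_b", "herbaceous vegetation"): (1, 2),
    ("dataset_b", "cropland/grassland mosaic"): (3, 3),
    ("dataset_b", "water"): (0, 1),
}


def harmonize_sample(dataset_name, reference_year, original_class):
    """Return one harmonized row, or None if the class has no mapping."""
    key = (dataset_name, original_class)
    if key not in CLASS_MAP:
        return None
    gpw_class, weight = CLASS_MAP[key]
    return {
        "dataset_name": dataset_name,
        "reference_year": reference_year,
        "original_lulc_class": original_class,
        "gpw_lulc_class": gpw_class,
        "sample_weight": weight,
    }


raw = [
    ("dataset_a", 2018, "pasture"),
    ("dataset_b", 2020, "cropland/grassland mosaic"),
    ("dataset_b", 2020, "bare rock"),  # unmapped, left for manual review
]
rows = [harmonize_sample(*r) for r in raw]
harmonized = [row for row in rows if row is not None]
unmapped = [r for r, row in zip(raw, rows) if row is None]
```

Keeping unmapped classes in a separate list, rather than silently assigning them to class 0, is a design choice of this sketch so that gaps in the lookup table stay visible.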
The development of this global inventory of reference samples and EO/gridded datasets relied on valuable contributions from various sources. We would like to express our sincere gratitude to the creators and maintainers of all datasets used in this project.
Brown, C.F., Brumby, S.P., Guzder-Williams, B. et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci Data 9, 251 (2022). https://doi.org/10.1038/s41597-022-01307-4
Van Tricht, K. et al. Worldcereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping. Earth Syst. Sci. Data 15, 5491–5515, 10.5194/essd-15-5491-2023 (2023)
Buchhorn, M.; Smets, B.; Bertels, L.; De Roo, B.; Lesiv, M.; Tsendbazar, N.E., Linlin, L., Tarko, A. (2020): Copernicus Global Land Service: Land Cover 100m: Version 3 Globe 2015-2019: Product User Manual; Zenodo, Geneve, Switzerland, September 2020; doi: 10.5281/zenodo.3938963
d’Andrimont, R. et al. Harmonised lucas in-situ land cover and use database for field surveys from 2006 to 2018 in the european union. Sci. data 7, 352, 10.1038/s41597-019-0340-y (2020)
Fritz, S. et al. Geo-Wiki: An online platform for improving global land cover, Environmental Modelling & Software, 31, https://doi.org/10.1016/j.envsoft.2011.11.015 (2012)
Fritz, S., See, L., Perger, C. et al. A global dataset of crowdsourced land cover and land use reference data. Sci Data 4, 170075 https://doi.org/10.1038/sdata.2017.75 (2017)
Schneider, M., Schelte, T., Schmitz, F. & Körner, M. Eurocrops: The largest harmonized open crop dataset across the european union. Sci. Data 10, 612, 10.1038/s41597-023-02517-0 (2023)
Souza, C. M. et al. Reconstructing Three Decades of Land Use and Land Cover Changes in Brazilian Biomes with Landsat Archive and Earth Engine. Remote. Sens. 12, 2735, 10.3390/rs12172735 (2020)
Stanimirova, R. et al. A global land cover training dataset from 1984 to 2020. Sci. Data 10, 879 (2023)
Tsendbazar, N. et al. Product validation report (d12-pvr) v 1.1 (2021).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This deposit contains the taxonomy maps and data we used to translate data on COVID-19 government responses from 7 different datasets into the taxonomy developed by the CoronaNet Research Project (CoronaNet; Cheng et al 2020). These taxonomy maps form the basis of our efforts to harmonize this data into the CoronaNet database. The following taxonomy maps are deposited in the 'Taxonomy' folder:
ACAPS COVID-19 Government Measures - CoronaNet Taxonomy Map
Canadian Data Set of COVID-19 Interventions from the Canadian Institute for Health Information (CIHI) - CoronaNet Taxonomy Map
COVID Analysis and Mapping of Policies (COVID AMP) - CoronaNet Taxonomy Map
Johns Hopkins Health Intervention Tracking for COVID-19 (HIT-COVID) - CoronaNet Taxonomy Map
Oxford Covid-19 Government Response Tracker (OxCGRT) - CoronaNet Taxonomy Map
World Health Organisation Public Health and Safety Measures (WHO PHSM) - CoronaNet Taxonomy Map
Meanwhile, the 'Data' folder contains the raw and mapped data for each external dataset (i.e. ACAPS, CIHI, COVID AMP, HIT-COVID, OxCGRT and WHO PHSM) as well as the combined external data for Steps 1 and 3 of the data harmonization process described in Cheng et al (2023) 'Harmonizing Government Responses to the COVID-19 Pandemic.'
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Meta-analysis sample size of harmonized variables for each study.
A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis
These data consist of five simulated datasets and a syntax file written in R. All files were created for use in the recorded COORDINATE Workshop 2 (https://www.youtube.com/watch?v=DeyBKxa894E). In this workshop, Scott Milligan, from the GESIS Leibniz Institute for the Social Sciences, leads participants through a complete data harmonisation exercise. The exercise examines the correlation between experiences with bullying and children’s happiness. Participants may run through the process parallel to the recorded workshop. More information on the project and the Harmonisation Toolbox developed in the project are available on the project’s webpage https://www.coordinate-network.eu/harmonisation or in COORDINATE Harmonisation Workshop 1 (https://www.youtube.com/watch?v=DeyBKxa894E).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predictor variables used in analysis and the methods used to harmonize to the categorical variables.
Sediment diatoms are widely used to track environmental histories of lakes and their watersheds, but merging datasets generated by different researchers for further large-scale studies is challenging because of the taxonomic discrepancies caused by rapidly evolving diatom nomenclature and taxonomic concepts. Here we collated five datasets of lake sediment diatoms from the northeastern USA using a harmonization process that included updating synonyms, tracking the identity of inconsistently identified taxa, and grouping those that could not be resolved taxonomically. The dataset consists of a Portable Document Format (.pdf) file of the Voucher Flora, six Microsoft Excel (.xlsx) data files, an R script, and five output Comma Separated Values (.csv) files.
The Voucher Flora documents the morphological species concepts in the dataset using diatom images compiled into plates (NE_Lakes_Voucher_Flora_102421.pdf) and the translation scheme of the OTU codes to diatom scientific or provisional names with identification sources, references, and notes (VoucherFloraTranslation_102421.xlsx).
The file Slide_accession_numbers_102421.xlsx has slide accession numbers in the ANS Diatom Herbarium.
The “DiatomHarmonization_032222_files for R.zip” archive contains four Excel input data files, the R code, and a subfolder “OUTPUT” with five .csv files. The file Counts_original_long_102421.xlsx contains original diatom count data in long format. The file Harmonization_102421.xlsx is the taxonomic harmonization scheme with notes and references. The file SiteInfo_031922.xlsx contains sampling site- and sample-level information. WaterQualityData_021822.xlsx is a supplementary file with water quality data. The R code (DiatomHarmonization_032222.R) was used to apply the harmonization scheme to the original diatom counts to produce the output files. The five output files comprise four wide-format files containing diatom count data at different harmonization steps (Counts_1327_wide.csv, Step1_1327_wide.csv, Step2_1327_wide.csv, Step3_1327_wide.csv) and the summary of the Indicator Species Analysis (INDVAL_RESULT.csv). The harmonization scheme (Harmonization_102421.xlsx) can be further modified based on additional taxonomic investigations, while the associated R code (DiatomHarmonization_032222.R) provides a straightforward mechanism for diatom data versioning.
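The general idea of applying a harmonization scheme to long-format counts and producing a wide table can be sketched as follows. The deposited code is written in R; this is an independent Python illustration of a single harmonization step, and the sample names, taxon names, and scheme entries are hypothetical placeholders rather than content from the deposited files.

```python
from collections import defaultdict

# Hypothetical long-format counts: (sample, original taxon name, count).
counts_long = [
    ("lake_01", "Taxon A", 12),
    ("lake_01", "Taxon A var. x", 3),
    ("lake_01", "Taxon B", 5),
    ("lake_01", "Taxon C cf.", 2),
    ("lake_02", "Taxon A var. x", 7),
]

# Hypothetical harmonization scheme: original name -> harmonized name.
scheme = {
    "Taxon A": "Taxon A",
    "Taxon A var. x": "Taxon A",       # treated here as a synonym
    "Taxon B": "Taxon B/C group",      # not separable, so grouped
    "Taxon C cf.": "Taxon B/C group",
}


def to_wide(counts, name_map):
    """Rename taxa, sum counts that merge, and pivot to sample -> taxon -> count."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample, name, n in counts:
        totals[sample][name_map[name]] += n
    taxa = sorted({t for row in totals.values() for t in row})
    # Fill absent taxa with 0 so every sample has the same columns.
    return {sample: {t: row.get(t, 0) for t in taxa}
            for sample, row in totals.items()}


wide = to_wide(counts_long, scheme)
```

Because the scheme lives in a plain lookup table, editing an entry and rerunning the script yields a new version of the wide table, which is the versioning property the description attributes to the R workflow.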
This dataset is associated with the following publication: Potapova, M., S. Lee, S. Spaulding, and N. Schulte. A harmonized dataset of sediment diatoms from hundreds of lakes in the northeastern United States. Scientific Data. Springer Nature, New York, NY, 9(540): 1-8, (2022).
This repository of QuestionLink harmonization scripts for many measures of political interest is best accessed via the QuestionLink homepage:
https://www.gesis.org/en/services/processing-and-analyzing-data/data-harmonization/question-link
There you will find general information on how to use QuestionLink to harmonize research data, on the method and technology behind QuestionLink, and an overview of other harmonized constructs.
Information on the specific construct, political interest, can be accessed here:
https://www.gesis.org/en/services/processing-and-analyzing-data/data-harmonization/question-link/political-interest
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering of barriers and facilitators to harmonized health data collection, sharing and linkage.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Eligible studies from the CureSCi Metadata Catalog and their available predictor variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A univariate analysis where hydroxyurea use was modeled as a function of each individual predictor.