A detailed overview of the results of the literature search, including the data extraction matrix, can be found in Additional file 1.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This deposit contains the taxonomy maps and data we used to translate data on COVID-19 government responses from 7 different datasets into the taxonomy developed by the CoronaNet Research Project (CoronaNet; Cheng et al. 2020). These taxonomy maps form the basis of our efforts to harmonize this data into the CoronaNet database. The following taxonomy maps are deposited in the 'Taxonomy' folder: ACAPS COVID-19 Government Measures - CoronaNet Taxonomy Map; Canadian Data Set of COVID-19 Interventions from the Canadian Institute for Health Information (CIHI) - CoronaNet Taxonomy Map; COVID Analysis and Mapping of Policies (COVID AMP) - CoronaNet Taxonomy Map; Johns Hopkins Health Intervention Tracking for COVID-19 (HIT-COVID) - CoronaNet Taxonomy Map; Oxford Covid-19 Government Response Tracker (OxCGRT) - CoronaNet Taxonomy Map; World Health Organisation Public Health and Safety Measures (WHO PHSM) - CoronaNet Taxonomy Map. Meanwhile, the 'Data' folder contains the raw and mapped data for each external dataset (i.e. ACAPS, CIHI, COVID AMP, HIT-COVID, OxCGRT and WHO PHSM) as well as the combined external data for Steps 1 and 3 of the data harmonization process described in Cheng et al. (2023), 'Harmonizing Government Responses to the COVID-19 Pandemic.'
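For illustration, a taxonomy map of this kind is essentially a lookup table from an external dataset's policy categories to CoronaNet policy types. A minimal pandas sketch of how such a map could be applied during harmonization, with invented column names and category labels (not the deposit's actual schema):

```python
import pandas as pd

# Hypothetical excerpt of a taxonomy map: external policy codes mapped to
# CoronaNet policy types. Column names and labels are illustrative only.
taxonomy_map = pd.DataFrame({
    "external_code": ["C1", "C2", "C6"],
    "coronanet_type": [
        "Closure and Regulation of Schools",
        "Restriction and Regulation of Businesses",
        "Lockdown",
    ],
})

# A few invented records from an external tracker.
external = pd.DataFrame({
    "country": ["DEU", "FRA"],
    "external_code": ["C1", "C6"],
    "date": ["2020-03-16", "2020-03-17"],
})

# Translate the external records into the CoronaNet taxonomy via a left join,
# keeping unmapped records visible as missing values for manual review.
mapped = external.merge(taxonomy_map, on="external_code", how="left")
print(mapped)
```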
Sediment diatoms are widely used to track the environmental histories of lakes and their watersheds, but merging datasets generated by different researchers for further large-scale studies is challenging because of taxonomic discrepancies caused by rapidly evolving diatom nomenclature and taxonomic concepts. Here we collated five datasets of lake sediment diatoms from the northeastern USA using a harmonization process that included updating synonyms, tracking the identity of inconsistently identified taxa, and grouping those that could not be resolved taxonomically. The dataset consists of a Portable Document Format (.pdf) file of the Voucher Flora, six Microsoft Excel (.xlsx) data files, an R script, and five output Comma Separated Values (.csv) files. The Voucher Flora documents the morphological species concepts in the dataset using diatom images compiled into plates (NE_Lakes_Voucher_Flora_102421.pdf) and the translation scheme of the OTU codes to diatom scientific or provisional names, with identification sources, references, and notes (VoucherFloraTranslation_102421.xlsx). The file Slide_accession_numbers_102421.xlsx has slide accession numbers in the ANS Diatom Herbarium. The "DiatomHarmonization_032222_files for R.zip" archive contains four Excel input data files, the R code, and a subfolder "OUTPUT" with five .csv files. The file Counts_original_long_102421.xlsx contains the original diatom count data in long format. The file Harmonization_102421.xlsx is the taxonomic harmonization scheme, with notes and references. The file SiteInfo_031922.xlsx contains sampling site- and sample-level information. WaterQualityData_021822.xlsx is a supplementary file with water quality data. The R code (DiatomHarmonization_032222.R) was used to apply the harmonization scheme to the original diatom counts to produce the five output files: four wide-format files containing diatom count data at successive harmonization steps (Counts_1327_wide.csv, Step1_1327_wide.csv, Step2_1327_wide.csv, Step3_1327_wide.csv) and a summary of the Indicator Species Analysis (INDVAL_RESULT.csv). The harmonization scheme (Harmonization_102421.xlsx) can be further modified based on additional taxonomic investigations, while the associated R code (DiatomHarmonization_032222.R) provides a straightforward mechanism for diatom data versioning. This dataset is associated with the following publication: Potapova, M., S. Lee, S. Spaulding, and N. Schulte. A harmonized dataset of sediment diatoms from hundreds of lakes in the northeastern United States. Scientific Data. Springer Nature, New York, NY, 9(540): 1-8, (2022).
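The deposit's actual harmonization logic lives in DiatomHarmonization_032222.R; purely as an illustration of the pattern (a synonym lookup applied to long-format counts, summed and pivoted wide), here is a Python/pandas sketch with invented taxa and counts:

```python
import pandas as pd

# Illustrative stand-in for the harmonization scheme
# (Harmonization_102421.xlsx): original name -> harmonized taxon.
scheme = pd.DataFrame({
    "original_taxon": ["Navicula minima", "Eolimna minima",
                       "Achnanthidium minutissimum"],
    "harmonized_taxon": ["Sellaphora nigri", "Sellaphora nigri",
                         "Achnanthidium minutissimum"],
})

# Illustrative stand-in for the long-format counts
# (Counts_original_long_102421.xlsx).
counts = pd.DataFrame({
    "sample_id": ["Lake01", "Lake01", "Lake02"],
    "original_taxon": ["Navicula minima", "Eolimna minima",
                       "Achnanthidium minutissimum"],
    "count": [12, 3, 40],
})

# Apply the scheme, sum counts that merge under one harmonized name,
# then pivot to the wide sample-by-taxon format used for the output files.
harmonized_wide = (
    counts.merge(scheme, on="original_taxon")
          .groupby(["sample_id", "harmonized_taxon"])["count"].sum()
          .unstack(fill_value=0)
)
print(harmonized_wide)
```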
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This document outlines the creation of a global inventory of reference samples and Earth Observation (EO) / gridded datasets for the Global Pasture Watch (GPW) initiative. This inventory supports the training and validation of machine-learning models for GPW grassland mapping. This documentation outlines methodology, data sources, workflow, and results.
Keywords: Grassland, Land Use, Land Cover, Gridded Datasets, Harmonization
Create a global inventory of existing reference samples for land use and land cover (LULC);
Compile global EO / gridded datasets that capture LULC classes and harmonize them to match the GPW classes;
Develop automated scripts for data harmonization and integration.
Datasets incorporated:
Dataset | Spatial distribution | Time period | Number of individual samples |
WorldCereal | Global | 2016-2021 | 38,267,911 |
Global Land Cover Mapping and Estimation (GLanCE) | Global | 1985-2021 | 31,061,694 |
EuroCrops | Europe | 2015-2022 | 14,742,648 |
GeoWiki G-GLOPS training dataset | Global | 2021 | 11,394,623 |
MapBiomas Brazil | Brazil | 1985-2018 | 3,234,370 |
Land Use/Land Cover Area Frame Survey (LUCAS) | Europe | 2006-2018 | 1,351,293 |
Dynamic World | Global | 2019-2020 | 1,249,983 |
Land Change Monitoring, Assessment, and Projection (LCMap) | U.S. (CONUS) | 1984-2018 | 874,836 |
GeoWiki 2012 | Global | 2011-2012 | 151,942 |
PREDICTS | Global | 1984-2013 | 16,627 |
CropHarvest | Global | 2018-2021 | 9,714 |
Total: 102,355,642 samples
We harmonized global reference samples and EO/gridded datasets to align with GPW classes, optimizing their integration into the GPW machine-learning workflow.
We considered reference samples derived by visual interpretation with spatial support of at least 30 m (Landsat and Sentinel) that could represent LULC classes for a point or a region.
Each dataset was processed using automated Python scripts to download vector files and convert the original LULC classes into the following GPW classes:
0. Other land cover
1. Natural and Semi-natural grassland
2. Cultivated grassland
3. Crops and other related agricultural practices
We empirically assigned a weight to each sample based on the original dataset's class description, reflecting the level of mixture within the class. The weights range from 1 (Low) to 3 (High), with higher weights indicating greater mixture. Samples with low mixture levels are more accurate and effective for differentiating typologies and for validation purposes.
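A minimal sketch of this class-and-weight assignment, assuming an invented source dataset and mapping table (the official GPW tables differ):

```python
import pandas as pd

# Invented mapping for one source dataset: original class -> (GPW class, weight).
# Class names and weights are examples only.
class_map = {
    "closed forest": (0, 1),
    "natural grassland": (1, 1),
    "pasture": (2, 2),
    "cropland": (3, 1),
    "mosaic cropland/grass": (3, 3),  # highly mixed class -> weight 3
}

samples = pd.DataFrame({
    "dataset_name": ["ExampleSource"] * 3,
    "reference_year": [2018, 2018, 2018],
    "original_lulc_class": ["pasture", "cropland", "mosaic cropland/grass"],
})

# Attach the harmonized class and the mixture weight to each sample.
mapped = samples["original_lulc_class"].map(class_map)
samples["gpw_lulc_class"] = mapped.str[0]
samples["sample_weight"] = mapped.str[1]
print(samples)
```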
The harmonized dataset includes these columns:
Attribute Name | Definition |
dataset_name | Original dataset name |
reference_year | Reference year of samples from the original dataset |
original_lulc_class | LULC class from the original dataset |
gpw_lulc_class | Global Pasture Watch LULC class |
sample_weight | Sample's weight based on the mixture level within the original LULC class |
The development of this global inventory of reference samples and EO/gridded datasets relied on valuable contributions from various sources. We would like to express our sincere gratitude to the creators and maintainers of all datasets used in this project.
Brown, C.F., Brumby, S.P., Guzder-Williams, B. et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 9, 251 (2022). https://doi.org/10.1038/s41597-022-01307-4
Van Tricht, K. et al. WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping. Earth Syst. Sci. Data 15, 5491–5515 (2023). https://doi.org/10.5194/essd-15-5491-2023
Buchhorn, M., Smets, B., Bertels, L., De Roo, B., Lesiv, M., Tsendbazar, N.E., Linlin, L. & Tarko, A. Copernicus Global Land Service: Land Cover 100m: Version 3 Globe 2015-2019: Product User Manual. Zenodo, Geneva, Switzerland (2020). https://doi.org/10.5281/zenodo.3938963
d’Andrimont, R. et al. Harmonised LUCAS in-situ land cover and use database for field surveys from 2006 to 2018 in the European Union. Sci. Data 7, 352 (2020). https://doi.org/10.1038/s41597-019-0340-y
Fritz, S. et al. Geo-Wiki: An online platform for improving global land cover. Environmental Modelling & Software 31 (2012). https://doi.org/10.1016/j.envsoft.2011.11.015
Fritz, S., See, L., Perger, C. et al. A global dataset of crowdsourced land cover and land use reference data. Sci. Data 4, 170075 (2017). https://doi.org/10.1038/sdata.2017.75
Schneider, M., Schelte, T., Schmitz, F. & Körner, M. EuroCrops: The largest harmonized open crop dataset across the European Union. Sci. Data 10, 612 (2023). https://doi.org/10.1038/s41597-023-02517-0
Souza, C.M. et al. Reconstructing Three Decades of Land Use and Land Cover Changes in Brazilian Biomes with Landsat Archive and Earth Engine. Remote Sens. 12, 2735 (2020). https://doi.org/10.3390/rs12172735
Stanimirova, R. et al. A global land cover training dataset from 1984 to 2020. Sci. Data 10, 879 (2023).
Tsendbazar, N. et al. Product validation report (D12-PVR) v1.1 (2021).
Data usage terms: https://www.gesis.org/en/institute/data-usage-terms
These data consist of five simulated datasets and a syntax file written in R. All files were created for use in the recorded COORDINATE Workshop 2 (https://www.youtube.com/watch?v=DeyBKxa894E). In this workshop, Scott Milligan, from the GESIS Leibniz Institute for the Social Sciences, leads participants through a complete data harmonisation exercise. The exercise examines the correlation between experiences of bullying and children’s happiness. Participants may run through the process in parallel with the recorded workshop. More information on the project and the Harmonisation Toolbox developed in the project is available on the project’s webpage https://www.coordinate-network.eu/harmonisation or in COORDINATE Harmonisation Workshop 1 (https://www.youtube.com/watch?v=DeyBKxa894E).
A dataset within the Harmonized Database of Western U.S. Water Rights (HarDWR). For a detailed description of the database, please see the meta-record v2.0.

Changelog

v2.0
- Recalculated based on data sourced from WestDAAT.
- Changed from using a Site ID column to identify unique records to using a combination of Site ID and Allocation ID.
- Removed the Water Management Area (WMA) column from the harmonized records. The replacement is a separate file which stores the relationship between allocations and WMAs. This allows allocations to contribute water right amounts to multiple WMAs during the subsequent cumulative process.
- Added a column describing a water right's legal status.
- Added "Unspecified" as a water source category.
- Added an acre-foot (AF) column.
- Added a column for the classification of the right's owner.

v1.02
- Added a .RData file to the dataset as a convenience for anyone exploring our code. This is an internal file, and the one referenced in the analysis scripts, as the data objects are already R data objects.

v1.01
- Updated the names of each file with an ID number of fewer than 3 digits to include leading 0s.

v1.0
- Initial public release.

Description

Here we present an updated database of Western U.S. water right records. This database provides consistent unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of seven broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of inter-sectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, the Water Balance Model (WBM), with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, water management in the U.S. West is a rich area of study (e.g., Anderson and Woosley, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. West. We produced the water rights database presented here in four main steps: (1) data collection, (2) data quality control, (3) data harmonization, and (4) generation of cumulative water rights curves. Each of steps (1)-(3) had to be completed in order to produce (4), the final product that was used in the modeling exercise in Grogan et al. (in review). All data in each step are associated with a spatial unit called a Water Management Area (WMA), which is the unit of water right administration used by the state from which the right came. Steps (2) and (3) required us to make assumptions and interpretations, and to remove records from the raw data collection. We describe each of these assumptions and interpretations below so that other researchers can choose to implement alternative assumptions and interpretations as fit their research aims.

Motivation for Changing Data Sources

The most significant change has been a switch from collecting the raw water rights directly from each state to using the water rights records presented in WestDAAT, a product of the Water Data Exchange (WaDE) Program under the Western States Water Council (WSWC). One of the main reasons for this is that each state of interest is a member of the WSWC, meaning that WaDE is partially funded by these states, as well as many universities.
As WestDAAT is also a database with consistent categorization, it has allowed us to spend less time on data collection and quality control and more time on answering research questions. This has included records from water right sources we had previously not known about when creating v1.0 of this database. The only major downside to utilizing the WestDAAT records as our raw data is that further updates are tied to when WestDAAT is updated, as some states update their public water right records daily. However, as our focus is on cumulative water amounts at the regional scale, it is unlikely that most record updates would have a significant effect on our results. The structure of WestDAAT led to several important changes to how HarDWR is formatted. The most significant change is that WaDE has calculated a field known as SiteUUID, which is a unique identifier for the Point of Diversion (POD), or where the water is drawn from. This is separate from AllocationNativeID, which is the identifier for the allocation of water, or the amount of water associated with the water right. It should be noted that it is possible for a single site to have multiple allocations associated with it and for an allocation to be extractable from multiple sites. The site-allocation structure has allowed us to adopt a more consistent, and hopefully more realistic, approach to organizing the water right records than we had with HarDWR v1.0. This was incredibly helpful, as the raw data from many states had multiple water uses within a single field within a single row, and it was not always clear whether the first water use was the most important or simply first alphabetically. WestDAAT has already addressed this data quality issue. Furthermore, with v1.0, when there were multiple records with the same water right ID, we selected the largest volume or flow amount and disregarded the rest. As WestDAAT is already a common structure for disparate data formats, we were better able to identify sites with multiple allocations and, perhaps more importantly, allocations with multiple sites. This is particularly helpful when an allocation has sites which cross WMA boundaries: instead of assigning the full water amount to a single WMA, we are now able to divide the amount of water between the relevant WMAs. As it is now possible to identify allocations with water used in multiple WMAs, it is no longer practical to store this information within a single column. Instead, the stAllocationToWMATab.csv file was created, which is an allocation-by-WMA matrix containing the percent Place of Use area overlap with each WMA. We then use this percentage to divide the allocation's flow amount between the given WMAs during the cumulation process, to provide more realistic totals of water use in each area. However, not every state provides areas of water use, so, as in HarDWR v1.0, a hierarchical decision tree was used to assign each allocation to a WMA. First, if a WMA could be identified based on the allocation ID, then that WMA was used; typically, when available, this applied to the entire state and no further steps were needed. Second was the spatial analysis of Place of Use to WMAs. Third was a spatial analysis of the POD locations to WMAs, with the assumption that an allocation's POD is within the WMA it should belong to; if an allocation still had multiple WMAs based on its POD locations, then the allocation's flow amount was divided equally between all WMAs. The fourth, and final, step was to include water allocations which spatially fell outside of the state WMA boundaries. This could be due to several reasons, such as coordinate errors or imprecision in the POD location, imprecision in the WMA boundaries, or rights attached to features, such as a reservoir, which cross state boundaries. To include these records, we decided that any POD within one kilometer of the state's edge would be assigned to the nearest WMA.

Other Changes WestDAAT has Allowed

In addition to a more nuanced and consistent method of assigning water rights data to WMAs, there are other benefits gained from using the WestDAAT dataset. Among those is a consistent categorization of a water right's legal status. In HarDWR v1.0, legal status was effectively ignored, which led to many valid concerns about the quality of the database related to the amounts of water the rights allowed to be claimed.
The main issue was that rights with legal statuses such as "application withdrawn", "non-active", or "cancelled" were included within HarDWR v1.0. These, and other water right statuses deemed not to be in use, have been removed from this version of the database. Another major change has been the addition of the "Unspecified" water source category. This is water that can come from either surface water or groundwater, or whose source is unknown. The addition of this source category brings the total number of categories to three. Due to reviewer feedback, we added the acre-foot (AF) column and the ownerClassification column so that the data may be more applicable to a wider audience.

File Descriptions

The dataset is a series of files organized by state sub-directories. In addition, each file begins with the state's name, in case the file is separated from its sub-directory for some reason. After the state name is text which describes the contents of the file. Each file is described in detail below. Note that st is a placeholder for the state's name.

stFullRecords_HarmonizedRights.csv: A file of the complete water records for each state. The column headers for each file of this type are:
state - The name of the state to which the allocations belong.
FIPS - The two-digit numeric state ID code.
siteID - The site location ID for POD locations. A site may have multiple allocations, which are the actual amounts of water which can be drawn. In a simplified hypothetical, a farmstead may have an allocation for "irrigation" and an allocation for "domestic" water use, but the water is drawn from the same pumping equipment. It should be noted that many of the site IDs appear to have been added by WaDE, and therefore may not be recognized by a given state's water rights database.
allocationID - The allocation ID for the water right. For most states this is the water right ID, and what is
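As a minimal illustration of the four-step WMA assignment hierarchy and the flow-splitting by Place of Use overlap described above (all field names are invented; HarDWR's actual processing is spatial and more involved):

```python
# Hypothetical sketch of the WMA assignment hierarchy described above.

def assign_wmas(allocation):
    # 1. WMA encoded in the allocation ID itself (state-specific convention).
    if allocation.get("wma_from_id"):
        return {allocation["wma_from_id"]: 1.0}
    # 2. Percent overlap of the Place of Use polygon with each WMA.
    if allocation.get("pou_overlap"):      # e.g. {"WMA-3": 0.7, "WMA-4": 0.3}
        return allocation["pou_overlap"]
    # 3. POD point-in-WMA; split equally if PODs fall in several WMAs.
    if allocation.get("pod_wmas"):
        wmas = allocation["pod_wmas"]
        return {w: 1.0 / len(wmas) for w in wmas}
    # 4. POD within 1 km of the state's edge: snap to the nearest WMA.
    if allocation.get("nearest_wma_within_1km"):
        return {allocation["nearest_wma_within_1km"]: 1.0}
    return {}

alloc = {"allocation_id": "A-42", "flow_cfs": 10.0,
         "pou_overlap": {"WMA-3": 0.7, "WMA-4": 0.3}}
shares = {wma: frac * alloc["flow_cfs"]
          for wma, frac in assign_wmas(alloc).items()}
print(shares)  # {'WMA-3': 7.0, 'WMA-4': 3.0}
```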
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description and harmonization strategy for the predictor variables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predictor variables used in the analysis and the methods used to harmonize them to categorical variables.
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable the computation of poverty indices, determine the characteristics of the poor, and prepare poverty maps. To achieve these goals, the sample had to be representative at the sub-district level. The raw survey data provided by the Statistical Office were cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international standards for statistics on the distribution of household living standards. Once a dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as the profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices, identifying the characteristics of the poor, and drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those aimed at poverty eradication
National
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The 2008 Household Expenditure and Income Survey sample was designed using a two-stage stratified cluster sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn using probability proportionate to size, with the number of households in each block taken as the block size. The second stage included drawing the household sample (8 households from each PSU) using the systematic sampling method. Four substitute households from each PSU were also drawn, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.
To estimate the sample size, the coefficient of variation and the design effect in each sub-district were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10% and that the number of clusters at the district level was not less than 6, to ensure good cluster representation in the administrative areas and enable the drawing of poverty pockets.
It is worth mentioning that the expected non-response, as well as the areas of the major cities where poor families are concentrated, were taken into consideration in designing the sample. A larger sample was therefore taken from these areas, compared to others, to help reach and cover the poverty pockets.
Face-to-face [f2f]
List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form
Raw Data: The survey design and implementation procedures were:
1. Sample design and selection
2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparation of instruction manuals
3. Design of the table templates to be used for the dissemination of the survey results
4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
5. Selection and training of survey staff to collect data and run the required data checks
6. Preparation and implementation of the pretest phase of the survey, designed to test and refine the forms/questionnaires, instructions and software programs required for data processing and production of survey results
7. Data collection
8. Data checking and coding
9. Data entry
10. Data cleaning using data validation programs
11. Data accuracy and consistency checks
12. Data tabulation and preliminary results
13. Preparation of the final report and dissemination of final results
Harmonized Data:
- The Statistical Package for the Social Sciences (SPSS) was used to clean and harmonize the datasets
- The harmonization process started with cleaning all raw data files received from the Statistical Office
- Cleaned data files were then merged to produce one data file at the individual level containing all variables subject to harmonization
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
- A post-harmonization cleaning process was run on the data
- Harmonized data were saved at both the household and the individual level, in SPSS, and converted to STATA format
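The actual harmonization programs are country-specific SPSS syntax; purely as an illustration of the merge-recode-rename-label pattern described above, a Python/pandas sketch with invented variable names:

```python
import pandas as pd

# Invented household- and individual-level raw files.
hh = pd.DataFrame({"hh_id": [1, 2], "region_raw": ["amman", "irbid"]})
ind = pd.DataFrame({"hh_id": [1, 1, 2], "sex_raw": [1, 2, 1]})

# Merge to one individual-level file, then recode/rename/format.
merged = ind.merge(hh, on="hh_id")
merged["sex"] = merged["sex_raw"].map({1: "male", 2: "female"})  # recode
merged["region"] = merged["region_raw"].str.title()              # format/label

# Save harmonized data at both levels (here just two frames).
harmonized_ind = merged[["hh_id", "sex", "region"]]
harmonized_hh = harmonized_ind.drop_duplicates("hh_id")[["hh_id", "region"]]
print(harmonized_ind)
```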
Metabolomics encounters challenges in cross-study comparisons due to diverse metabolite nomenclature and reporting practices. To bridge this gap, we introduce the Metabolites Merging Strategy (MMS), offering a systematic framework to harmonize multiple metabolite datasets for enhanced interstudy comparability. MMS has three steps. Step 1 translates and merges the different datasets, employing InChIKeys for data integration and translating metabolite names where needed. Step 2 retrieves attributes from the InChIKey, including name descriptors (the title name from PubChem and the RefMet name from Metabolomics Workbench), chemical properties (molecular weight and molecular formula), both systematic (InChI, InChIKey, SMILES) and non-systematic identifiers (PubChem, ChEBI, HMDB, KEGG, LipidMaps, DrugBank, Bin ID and CAS number), and their ontology. Finally, Step 3 applies a meticulous three-part curation process to rectify disparities for conjugated base/acid compounds (an optional step), fill missing attributes, and check synonyms (duplicated information). The MMS procedure is exemplified through a case study of urinary asthma metabolites, where MMS facilitated the identification of significant pathways that remained hidden when no dataset merging strategy was followed. This study highlights the need for standardized and unified metabolite datasets to enhance the reproducibility and comparability of metabolomics studies.
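As an illustration of MMS Step 1, the sketch below merges two invented study tables on InChIKey; the InChIKeys shown are the real keys for citric acid and L-alanine, but the tables and fold changes are made up:

```python
import pandas as pd

# Two invented study tables reporting the same compounds under different names.
study_a = pd.DataFrame({
    "name": ["citric acid", "alanine"],
    "inchikey": ["KRKNYBCHXYNGOX-UHFFFAOYSA-N", "QNAYBMKLOCPYGJ-REOHCLBHSA-N"],
    "fold_change_a": [1.8, 0.6],
})
study_b = pd.DataFrame({
    "name": ["citrate", "L-alanine"],
    "inchikey": ["KRKNYBCHXYNGOX-UHFFFAOYSA-N", "QNAYBMKLOCPYGJ-REOHCLBHSA-N"],
    "fold_change_b": [2.1, 0.7],
})

# The names disagree, but the systematic identifier lines the rows up.
merged = study_a.merge(study_b, on="inchikey", suffixes=("_a", "_b"))
print(merged[["inchikey", "name_a", "name_b",
              "fold_change_a", "fold_change_b"]])
```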
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Meta-analysis sample size of harmonized variables for each study.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The metadata for Open Educational Resources (OER) are often made available in repositories without recourse to uniform value lists and corresponding standards for their attributes. This circumstance complicates data harmonization when OERs from different sources are to be merged in one search environment. With the help of the RDF standard SKOS and the tool SkoHub-Vocabs, the project "WirLernenOnline" has found an innovative, reusable and standards-based solution to this challenge. It involves the creation of SKOS vocabularies that are used during the ETL process to standardize different terms (for example, "math" and "mathematics"). This then forms the basis for providing users with consistent filtering options and a good search experience. The resulting openly licensed vocabularies can easily be reused and linked to overcome this challenge in the future.
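A minimal sketch of the idea using Python's rdflib (the concept URI and labels are invented; WirLernenOnline's actual vocabularies and the SkoHub-Vocabs tooling are more elaborate):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

# A one-concept SKOS vocabulary whose altLabel lets an ETL step normalize
# "math" to "mathematics". The concept URI is invented for the example.
g = Graph()
concept = URIRef("https://example.org/wlo/discipline/mathematics")
g.add((concept, SKOS.prefLabel, Literal("mathematics", lang="en")))
g.add((concept, SKOS.altLabel, Literal("math", lang="en")))

def normalize(term: str) -> str:
    """Return the prefLabel of any concept listing `term` as an altLabel."""
    for c in g.subjects(SKOS.altLabel, Literal(term, lang="en")):
        return str(g.value(c, SKOS.prefLabel))
    return term

print(normalize("math"))  # -> "mathematics"
```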
The Harmonisation extension for CKAN is designed to standardize metadata labels and values, especially those adhering to the ODM (Open Data Monitor) metadata scheme. It facilitates the harmonization of specific metadata fields through a web interface, allowing users to manage and refine the consistency of their datasets. The extension is used in conjunction with MongoDB to store raw and harmonized metadata, and is part of the broader ODM project aiming to improve data quality within the CKAN ecosystem.
Key Features:
- Metadata Harmonization via Web Form: provides a user interface for harmonizing specific metadata fields such as Dates, Resources, Licenses, and Categories, streamlining the data cleaning process for end users.
- Mapping Management: allows administrators and users to add new mappings or update existing ones, enabling customization and continuous improvement of the harmonization rules.
- MongoDB Integration: leverages MongoDB to store both raw and harmonized metadata by connecting to specific collections ('odm' and 'odm_harmonised'), ensuring data persistence and ready access.
- Scheduled Harmonization Jobs: supports automated harmonization tasks through the harmonisation_slave.py script; users can set up cron jobs to run the script periodically, minimizing manual intervention and ensuring data consistency over time.
- ODM Metadata Scheme Compliance: specifically designed to work with metadata that complies with the ODM metadata scheme, thereby improving interoperability and adherence to standards.
Technical Integration: The Harmonisation extension requires updates to the CKAN configuration file (development.ini) to activate the plugin and set up the necessary ODM extension settings. The extension also requires MongoDB to be installed and configured as the metadata repository. Once correctly configured, users can schedule automatic harmonisation jobs by executing the harmonisation_slave.py script as a cron job.
Benefits & Impact: By implementing the Harmonisation extension, organizations can significantly improve the quality and consistency of their metadata. Harmonising key data fields makes the data more reliable and more readily integrable with other systems. This automation streamlines metadata management, reducing manual effort and ensuring that data consistently adhere to the configured standards.
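A hedged sketch of the scheduled-job pattern described above; the real logic lives in the extension's harmonisation_slave.py, and the database name, field names, and license mapping below are invented for the example:

```python
from pymongo import MongoClient

# Example cron entry to run the extension's script nightly at 02:00:
#   0 2 * * * /usr/bin/python /path/to/harmonisation_slave.py
#
# Below, an invented license mapping is applied to records read from the
# raw 'odm' collection and written to 'odm_harmonised' (collection names
# from the description above; everything else is a placeholder).
license_map = {"cc-by 4.0": "CC-BY-4.0", "ccby4": "CC-BY-4.0"}

db = MongoClient("mongodb://localhost:27017")["odm_project"]
for record in db["odm"].find():
    raw = str(record.get("license", "")).strip().lower()
    record["license"] = license_map.get(raw, record.get("license"))
    db["odm_harmonised"].replace_one({"_id": record["_id"]}, record,
                                     upsert=True)
```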
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Eligible studies from the CureSCi Metadata Catalog and their available predictor variables.
The cleaned and harmonized version of the survey data produced and published by the Economic Research Forum represents 100% of the original survey data collected by the Department of Statistics of the Hashemite Kingdom of Jordan.
The Department of Statistics (DOS) carried out four rounds of the 2005 Employment and Unemployment Survey (EUS) during February, May, August and November 2005. The survey rounds covered a total sample of about thirty-nine thousand households nationwide. The sampled households were selected using a stratified multi-stage cluster sampling design. It is noteworthy that the sample represents the national level (Kingdom), the governorates, the three regions (Central, North and South), and urban/rural areas.
The importance of this survey lies in that it provides a comprehensive data base on employment and unemployment that serves decision makers, researchers as well as other parties concerned with policies related to the organization of the Jordanian labor market.
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data from existing labor force surveys in several Arab countries.
Covering a sample representative at the national level (Kingdom), of the governorates, the three regions (Central, North and South), and urban/rural areas.
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
Face-to-face [f2f]
The questionnaire is divided into main topics, each containing a clear and consistent group of questions, and designed in a way that facilitates the electronic data entry and verification. The questionnaire includes the characteristics of household members in addition to the identification information, which reflects the administrative as well as the statistical divisions of the Kingdom.
The tabulation plan for the survey results was guided by former Employment and Unemployment Surveys, which had been previously prepared and tested. The final survey report was then prepared to include all detailed tabulations as well as the survey methodology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Reproducible Brain Charts initiative aims to aggregate and harmonize phenotypic and neuroimaging data to delineate novel mechanisms regarding the developmental basis of psychopathology in youth and to yield reproducible growth charts of brain development. To reach this objective, the second step of our project is to test item-wise matching strategies for phenotypic harmonization between studies using bifactor models of psychopathology. We focused on this model because general and specific aspects of mental health problems can be dissociated, so more specific relationships with the brain can be established. In the current study, we benchmarked six item-matching strategies for harmonizing the Child Behavior Checklist (CBCL) and the Strengths and Difficulties Questionnaire (SDQ) within a bifactor model framework, in two samples that were assessed with both instruments. The study proceeded in the following steps: 1) harmonizing items according to the six strategies, 2) estimating bifactor models with the harmonized items for each sample separately, 3) estimating the factor score correlation between assessment tools in each sample, 4) estimating factor reliability, 5) testing the assessments' invariance according to each strategy, and 6) calculating the root expected mean square difference (REMSD) to estimate the factor score difference of using a proxy measure instead of a target measure while integrating the two samples. We expect that the results of this study will encourage the use of the best strategy to date to increase reproducibility in the field while aggregating data from different contexts and instruments in the context of the bifactor model of psychopathology.
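For reference, the REMSD in step 6 is conventionally the square root of the expected squared difference between the factor scores obtained from the proxy and the target measure; this is stated here as an assumption about the definition used, with hat-theta denoting estimated factor scores:

$$\mathrm{REMSD} = \sqrt{\,\mathbb{E}\!\left[\left(\hat{\theta}_{\mathrm{proxy}} - \hat{\theta}_{\mathrm{target}}\right)^{2}\right]}$$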
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A univariate analysis where hydroxyurea use was modeled as a function of each individual predictor.
The media analysis data were collected for commercial purposes. They are used in media planning as well as in the advertising planning of the different media genres (radio, press media, TV, poster and, since 2010, also online). They are cross-sections that are merged together for one year. ag.ma kindly provides the data for scientific use on an annual basis, with a two-year notice period, to GESIS. In addition, agof has provided documentation regarding data collection (questionnaires, code plans, etc.) for the preparation of the MA IntermediaPlus online bundle. In order to make the data accessible for scientific use, the datasets of the individual years were harmonized and pooled into a longitudinal dataset starting in 2014, as part of the dissertation project 'Audience and Market Fragmentation online' of the Digital Society research program NRW at the Heinrich-Heine-University (HHU) and the University of Applied Sciences Düsseldorf (HSD), funded by the Ministry of Culture and Science of the German State of North Rhine-Westphalia. The prepared Longitudinal IntermediaPlus dataset 2014 to 2016 is 'big data', which is why the entire dataset is only available in the form of a database (MySQL). In this database, the information for a respondent's different variables is stored in long format, one row per variable. The present data documentation covers the total database for online media use for the years 2014 to 2016. The data contain all variables of socio-demography, free-time activities, additional information on a respondent and their household (such as devices in the household), as well as the interview-specific variables and weights. Only the variables concerning the respondent's media use are a selection: the online media use of all full online offerings as well as their single entities is included for all genres whose business model is the provision of content; e-commerce, games, etc. were excluded. The media use of radio, print and TV is not included. Preparation of further years is possible, as is the preparation of cross-media use for radio, press media and TV; harmonization plans for radio and press media are available up to 2015, waiting to be applied. The digital process chain developed for data preparation and harmonization is published at GESIS and available for further projects updating the time series; recourse to these documents (Excel files, scripts, harmonization plans, etc.) is strongly recommended. The preparation and harmonization of the Longitudinal IntermediaPlus 2014 to 2016 database, carried out by Inga Brentel and Céline Fabienne Kampes as part of the dissertation project 'Audience and Market Fragmentation online', was made available in accordance with the FAIR principles (Wilkinson et al. 2016). By harmonizing and pooling the cross-sectional datasets into one longitudinal dataset, the aim is to make the media analysis data source accessible for research on social and media change in Germany.
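To illustrate the long-format storage described above (one row per variable per respondent), a self-contained Python sketch with an invented table and column names, using SQLite as a stand-in for the MySQL database:

```python
import sqlite3
import pandas as pd

# SQLite stands in for the MySQL database; table and column names are invented.
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "respondent_id": [1, 1, 2, 2],
    "variable": ["age_group", "online_news", "age_group", "online_news"],
    "value": ["30-39", "daily", "50-59", "weekly"],
}).to_sql("ma_intermediaplus", con, index=False)

# One row per (respondent, variable): pivot to one row per respondent.
long = pd.read_sql(
    "SELECT respondent_id, variable, value FROM ma_intermediaplus", con)
wide = long.pivot(index="respondent_id", columns="variable", values="value")
print(wide)
```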
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Within the ESA-funded WorldCereal project we have built an open, harmonized reference data repository of global extent for model training and product validation in support of land cover and crop type mapping. Data from 2017 onwards were collected from many different sources and then harmonized, annotated and evaluated. These steps are explained in the harmonization protocol (10.5281/zenodo.7584463). This protocol also clarifies the naming convention of the shapefiles and the WorldCereal attributes (LC, CT, IRR, valtime and sampleID) that were added to the original data sets.
This publication includes those harmonized data sets of which the original data set was published under the CC-BY-SA license or a license similar to CC-BY-SA. See document "_In-situ-data-World-Cereal - license - CC-BY-SA.pdf" for an overview of the original data sets.
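A brief sketch of how one of these harmonized shapefiles might be read and filtered on the WorldCereal attributes using geopandas; the file name and the land cover code below are placeholders, so consult the harmonization protocol for the actual naming convention and codes:

```python
import geopandas as gpd

# Placeholder file name; the protocol (10.5281/zenodo.7584463) documents the
# real naming convention and the codes each attribute can take.
gdf = gpd.read_file("2019_example_harmonized.shp")

# Filter on the WorldCereal attributes named above; the land cover code 11
# is assumed here to denote cropland, which should be checked in the protocol.
cropland = gdf[gdf["LC"] == 11]
print(cropland[["sampleID", "CT", "IRR", "valtime"]].head())
```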
Diatom data have been collected in large-scale biological assessments in the United States, such as the U.S. Environmental Protection Agency’s National Rivers and Streams Assessment (NRSA). However, the effectiveness of diatoms as indicators may suffer if inconsistent taxon identifications across different analysts obscure the relationships between assemblage composition and environmental variables. To reduce these inconsistencies, we harmonized the 2008–2009 NRSA data from nine analysts by updating names to current synonyms and by statistically identifying taxa with high analyst signal (taxa with more variation in relative abundance explained by the analyst factor, relative to environmental variables). We then screened a subset of samples with QA/QC data and combined taxa with mismatching identifications by the primary and secondary analysts. When these combined “slash groups” did not reduce analyst signal, we elevated taxa to the genus level or omitted taxa in difficult species complexes. We examined the variation explained by analyst in the original and revised datasets. Further, we examined how revising the datasets to reduce analyst signal can reduce inconsistency, thereby uncovering the variation in assemblage composition explained by total phosphorus (TP), an environmental variable of high priority for water managers. To produce a revised dataset with the greatest taxonomic consistency, we ultimately made 124 slash groups, omitted 7 taxa in the small naviculoid (e.g., Sellaphora atomoides) species complex, and elevated Nitzschia, Diploneis, and Tryblionella taxa to the genus level. Relative to the original dataset, the revised dataset had more overlap among samples grouped by analyst in ordination space, less variation explained by the analyst factor, and more than double the variation in assemblage composition explained by TP. Elevating all taxa to the genus level did not eliminate analyst signal completely, and analyst remained the most important predictor for the genera Sellaphora, Mayamaea, and Psammodictyon, indicating that these taxa present the greatest obstacle to consistent identification in this dataset. Although our process did not completely remove analyst signal, this work provides a method to minimize analyst signal and improve detection of diatom association with TP in large datasets involving multiple analysts. Examination of variation in assemblage data explained by analyst and taxonomic harmonization may be necessary steps for improving data quality and the utility of diatoms as indicators of environmental variables. This dataset is associated with the following publication: Lee, S., I. Bishop, S. Spaulding, R. Mitchell, and L. Yuan. Taxonomic harmonization may reveal a stronger association between diatom assemblages and total phosphorus in large datasets. Ecological Indicators. Elsevier Science Ltd, New York, NY, USA, 102: 166-174, (2019).
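Purely as an illustration of the two revision moves described (slash groups and genus-level elevation), a pandas sketch with invented taxa and counts; the slash group shown is hypothetical, while the genus elevations and the omitted taxon follow the examples named in the abstract:

```python
import pandas as pd

# Invented revision map combining the moves described above;
# None marks omitted taxa.
revision_map = {
    "Nitzschia palea": "Nitzschia",                 # elevated to genus
    "Nitzschia paleacea": "Nitzschia",
    "Sellaphora atomoides": None,                   # omitted: difficult complex
    "Navicula cryptotenella": "N. cryptotenella/cryptotenelloides",
    "Navicula cryptotenelloides": "N. cryptotenella/cryptotenelloides",
}

counts = pd.DataFrame({
    "sample": ["S1"] * 4,
    "taxon": ["Nitzschia palea", "Nitzschia paleacea",
              "Sellaphora atomoides", "Navicula cryptotenella"],
    "count": [10, 5, 7, 3],
})

# Apply the revisions, drop omitted taxa, and merge counts under revised names.
counts["revised"] = counts["taxon"].map(revision_map)
revised = (counts.dropna(subset=["revised"])
                 .groupby(["sample", "revised"])["count"].sum())
print(revised)
```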