Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multicenter and multi-scanner imaging studies may be necessary to ensure sufficiently large sample sizes for developing accurate predictive models. However, multicenter studies, incorporating varying research participant characteristics, MRI scanners, and imaging acquisition protocols, may introduce confounding factors, potentially hindering the creation of generalizable machine learning models. Models developed using one dataset may not readily apply to another, emphasizing the importance of classification model generalizability in multi-scanner and multicenter studies for producing reproducible results. This study focuses on enhancing generalizability in classifying individual migraine patients and healthy controls using brain MRI data through a data harmonization strategy. We propose identifying a 'healthy core' — a group of homogeneous healthy controls with similar characteristics — from multicenter studies. The Maximum Mean Discrepancy (MMD) in Geodesic Flow Kernel (GFK) space is employed to compare two datasets, capturing data variabilities and facilitating the identification of this 'healthy core'. Homogeneous healthy controls play a vital role in mitigating unwanted heterogeneity, enabling the development of highly accurate classification models with improved performance on new datasets. Extensive experimental results underscore the benefits of leveraging a 'healthy core'. We utilized two datasets: one comprising 120 individuals (66 with migraine and 54 healthy controls), and another comprising 76 individuals (34 with migraine and 42 healthy controls). Notably, a homogeneous dataset derived from a cohort of healthy controls yielded a significant 25% accuracy improvement for both episodic and chronic migraineurs.
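The selection step above — comparing two cohorts with MMD and pruning atypical controls — can be sketched as follows. This is a minimal illustration in plain feature space with an RBF kernel; the study itself computes MMD in GFK space, and the function names, greedy leave-one-out scoring, and parameters here are invented for the example.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel values between rows of X and rows of Y.
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of squared Maximum Mean Discrepancy between
    # the samples X and Y.
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

def healthy_core(X_a, X_b, keep_fraction=0.8, gamma=1.0):
    # Leave-one-out scoring: controls whose removal *increases* the
    # discrepancy between the two cohorts are the ones holding the
    # cohorts together, so keep the top fraction of them.
    n = len(X_a)
    scores = np.array([mmd2(np.delete(X_a, i, axis=0), X_b, gamma)
                       for i in range(n)])
    keep = np.sort(np.argsort(scores)[::-1][:int(keep_fraction * n)])
    return X_a[keep]
```

With two cohorts drawn from the same distribution, `mmd2` is near zero; a shifted cohort yields a larger value, which is what makes the score usable for pruning.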
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This document describes the creation of a global inventory of reference samples and Earth Observation (EO) / gridded datasets for the Global Pasture Watch (GPW) initiative. The inventory supports the training and validation of machine-learning models for GPW grassland mapping. The documentation covers the methodology, data sources, workflow, and results.
Keywords: Grassland, Land Use, Land Cover, Gridded Datasets, Harmonization
Create a global inventory of existing reference samples for land use and land cover (LULC);
Compile global EO / gridded datasets that capture LULC classes and harmonize them to match the GPW classes;
Develop automated scripts for data harmonization and integration.
Datasets incorporated:
Datasets | Spatial distribution | Time period | Number of individual samples
WorldCereal | Global | 2016-2021 | 38,267,911 |
Global Land Cover Mapping and Estimation (GLanCE) | Global | 1985-2021 | 31,061,694 |
EuroCrops | Europe | 2015-2022 | 14,742,648 |
GeoWiki G-GLOPS training dataset | Global | 2021 | 11,394,623 |
MapBiomas Brazil | Brazil | 1985-2018 | 3,234,370 |
Land Use/Land Cover Area Frame Survey (LUCAS) | Europe | 2006-2018 | 1,351,293 |
Dynamic World | Global | 2019-2020 | 1,249,983 |
Land Change Monitoring, Assessment, and Projection (LCMap) | U.S. (CONUS) | 1984-2018 | 874,836 |
GeoWiki 2012 | Global | 2011-2012 | 151,942 |
PREDICTS | Global | 1984-2013 | 16,627 |
CropHarvest | Global | 2018-2021 | 9,714 |
Total: 102,355,642 samples
We harmonized global reference samples and EO/gridded datasets to align with GPW classes, optimizing their integration into the GPW machine-learning workflow.
We considered reference samples derived by visual interpretation with a spatial support of at least 30 m (Landsat and Sentinel) that could represent LULC classes for a point or region.
Each dataset was processed using automated Python scripts to download vector files and convert the original LULC classes into the following GPW classes:
0. Other land cover
1. Natural and Semi-natural grassland
2. Cultivated grassland
3. Crops and other related agricultural practices
We empirically assigned a weight to each sample based on the original dataset's class description, reflecting the level of mixture within the class. The weights range from 1 (Low) to 3 (High), with higher weights indicating greater mixture. Samples with low mixture levels are more accurate and effective for differentiating typologies and for validation purposes.
The harmonized dataset includes these columns:
Attribute Name | Definition |
dataset_name | Original dataset name |
reference_year | Reference year of samples from the original dataset |
original_lulc_class | LULC class from the original dataset |
gpw_lulc_class | Global Pasture Watch LULC class |
sample_weight | Sample's weight based on the mixture level within the original LULC class |
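The conversion described above — automated scripts mapping each source dataset's original LULC classes to the four GPW classes plus a mixture weight, emitting the columns listed — might look like the following sketch. The lookup entries are hypothetical examples, not the project's actual crosswalk tables.

```python
# Hypothetical per-dataset lookup tables: original class -> (GPW class, weight).
# GPW class codes follow the scheme above (0-3); weights run from
# 1 (low mixture) to 3 (high mixture).
GPW_LOOKUP = {
    "LUCAS": {
        "Pure grassland": (1, 1),          # illustrative class names only
        "Wheat": (3, 1),
        "Grassland with shrubs": (1, 2),
    },
}

def harmonize(dataset_name, reference_year, original_class, lookup=GPW_LOOKUP):
    """Return one harmonized record, or None if the class is unmapped."""
    table = lookup.get(dataset_name, {})
    if original_class not in table:
        return None
    gpw_class, weight = table[original_class]
    return {
        "dataset_name": dataset_name,
        "reference_year": reference_year,
        "original_lulc_class": original_class,
        "gpw_lulc_class": gpw_class,
        "sample_weight": weight,
    }
```

Each record carries exactly the five attributes of the harmonized schema, so the per-dataset outputs can be concatenated directly.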
The development of this global inventory of reference samples and EO/gridded datasets relied on valuable contributions from various sources. We would like to express our sincere gratitude to the creators and maintainers of all datasets used in this project.
Brown, C.F., Brumby, S.P., Guzder-Williams, B. et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 9, 251, https://doi.org/10.1038/s41597-022-01307-4 (2022)
Van Tricht, K. et al. WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping. Earth Syst. Sci. Data 15, 5491–5515, https://doi.org/10.5194/essd-15-5491-2023 (2023)
Buchhorn, M., Smets, B., Bertels, L., De Roo, B., Lesiv, M., Tsendbazar, N.E., Linlin, L. & Tarko, A. Copernicus Global Land Service: Land Cover 100m: Version 3 Globe 2015-2019: Product User Manual. Zenodo, Geneva, Switzerland, https://doi.org/10.5281/zenodo.3938963 (2020)
d'Andrimont, R. et al. Harmonised LUCAS in-situ land cover and use database for field surveys from 2006 to 2018 in the European Union. Sci. Data 7, 352, https://doi.org/10.1038/s41597-019-0340-y (2020)
Fritz, S. et al. Geo-Wiki: An online platform for improving global land cover. Environmental Modelling & Software 31, https://doi.org/10.1016/j.envsoft.2011.11.015 (2012)
Fritz, S., See, L., Perger, C. et al. A global dataset of crowdsourced land cover and land use reference data. Sci. Data 4, 170075, https://doi.org/10.1038/sdata.2017.75 (2017)
Schneider, M., Schelte, T., Schmitz, F. & Körner, M. EuroCrops: The largest harmonized open crop dataset across the European Union. Sci. Data 10, 612, https://doi.org/10.1038/s41597-023-02517-0 (2023)
Souza, C. M. et al. Reconstructing Three Decades of Land Use and Land Cover Changes in Brazilian Biomes with Landsat Archive and Earth Engine. Remote Sens. 12, 2735, https://doi.org/10.3390/rs12172735 (2020)
Stanimirova, R. et al. A global land cover training dataset from 1984 to 2020. Sci. Data 10, 879 (2023)
Tsendbazar, N. et al. Product validation report (D12-PVR) v1.1 (2021)
A dataset within the Harmonized Database of Western U.S. Water Rights (HarDWR). For a detailed description of the database, please see the meta-record v2.0.
Changelog
v2.0 - Recalculated based on data sourced from WestDAAT. Changed from using a Site ID column to identify unique records to using a combination of Site ID and Allocation ID. Removed the Water Management Area (WMA) column from the harmonized records; its replacement is a separate file which stores the relationship between allocations and WMAs, allowing allocations to contribute water right amounts to multiple WMAs during the subsequent cumulative process. Added a column describing a water right's legal status. Added "Unspecified" as a water source category. Added an acre-foot (AF) column. Added a column for the classification of the right's owner.
v1.02 - Added a .RData file to the dataset as a convenience for anyone exploring our code. This is an internal file, and the one referenced in analysis scripts, as the data objects are already R data objects.
v1.01 - Updated the names of each file with an ID number of fewer than 3 digits to include leading 0s.
v1.0 - Initial public release.
Description
Here we present an updated database of Western U.S. water right records. This database provides consistent unique identifiers for each water right record and a consistent categorization scheme that puts each water right record into one of seven broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of inter-sectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, the Water Balance Model (WBM), with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, water management in the U.S.
West is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. West. We produced the water rights database presented here in four main steps: (1) data collection, (2) data quality control, (3) data harmonization, and (4) generation of cumulative water rights curves. Each of steps (1)-(3) had to be completed in order to produce (4), the final product used in the modeling exercise in Grogan et al. (in review). All data in each step are associated with a spatial unit called a Water Management Area (WMA), the unit of water right administration used by the state from which the right came. Steps (2) and (3) required us to make assumptions and interpretations, and to remove records from the raw data collection. We describe each of these assumptions and interpretations below so that other researchers can choose to implement alternative assumptions and interpretations as fits their research aims. Motivation for Changing Data Sources: The most significant change has been a switch from collecting the raw water rights directly from each state to using the water rights records presented in WestDAAT, a product of the Water Data Exchange (WaDE) Program under the Western States Water Council (WSWC). One of the main reasons for this is that each state of interest is a member of the WSWC, meaning that WaDE is partially funded by these states, as well as by many universities. As WestDAAT is also a database with consistent categorization, it has allowed us to spend less time on data collection and quality control and more time on answering research questions. This has included records from water right sources we had not known about when creating v1.0 of this database.
The only major downside to utilizing the WestDAAT records as our raw data is that further updates are tied to when WestDAAT is updated, as some states update their public water right records daily. However, as our focus is on cumulative water amounts at the regional scale, it is unlikely that most record updates would have a significant effect on our results. The structure of WestDAAT led to several important changes to how HarDWR is formatted. The most significant change is that WaDE has calculated a field known as SiteUUID, which is a unique identifier for the Point of Diversion (POD), or where the water is drawn from. This is separate from AllocationNativeID, which is the identifier for the allocation of water, or the amount of water associated with the water right. It should be noted that it is possible for a single site to have multiple allocations associated with it, and for an allocation to be extracted from multiple sites. The site-allocation structure has allowed us to adopt a more consistent, and hopefully more realistic, approach to organizing the water right records than we had with HarDWR v1.0. This was incredibly helpful, as the raw data from many states had multiple water uses within a single field within a single row, and it was not always clear whether the first water use was the most important or simply first alphabetically. WestDAAT has already addressed this data quality issue. Furthermore, with v1.0, when there were multiple records with the same water right ID, we selected the largest volume or flow amount and disregarded the rest. As WestDAAT is already a common structure for disparate data formats, we were better able to identify sites with multiple allocations and, perhaps more importantly, allocations with multiple sites. This is particularly helpful when an allocation has sites which cross WMA boundaries: instead of assigning the full water amount to a single WMA, we are now able to divide the amount of water among the relevant WMAs. As it is now possible to identify allocations with water used in multiple WMAs, it is no longer practical to store this information within a single column. Instead, the stAllocationToWMATab.csv file was created, an allocation-by-WMA matrix containing the percent Place of Use area overlap with each WMA. We then use this percentage to divide the allocation's flow amount between the given WMAs during the cumulation process, to provide more realistic totals of water use in each area.
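The division of an allocation's flow among WMAs by fractional Place of Use overlap, followed by per-WMA summation, can be sketched as below. The data structures are illustrative stand-ins for stAllocationToWMATab.csv, not HarDWR's actual code.

```python
def split_allocations(flows, alloc_to_wma):
    """Divide each allocation's flow among WMAs by its fractional
    Place of Use overlap, then sum the shares per WMA.

    flows        : {allocation_id: flow amount}
    alloc_to_wma : {allocation_id: {wma_id: overlap fraction}}
    """
    totals = {}
    for alloc_id, flow in flows.items():
        for wma, frac in alloc_to_wma.get(alloc_id, {}).items():
            # Each WMA receives only its overlapping share of the flow.
            totals[wma] = totals.get(wma, 0.0) + flow * frac
    return totals
```

An allocation straddling two WMAs thus contributes to both totals in proportion to its area overlap, rather than having its full amount assigned to a single WMA.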
However, not every state provides areas of water use, so as in HarDWR v1.0, a hierarchical decision tree was used to assign each allocation to a WMA. First, if a WMA could be identified based on the allocation ID, then that WMA was used; typically, when available, this applied to the entire state and no further steps were needed. Second was a spatial analysis of Place of Use against WMAs. Third was a spatial analysis of the POD locations against WMAs, with the assumption that an allocation's POD is within the WMA it should belong to; if an allocation still had multiple WMAs based on its POD locations, then the allocation's flow amount was divided equally among those WMAs. The fourth, and final, step was to include water allocations which spatially fell outside of the state WMA boundaries. This could be due to several reasons, such as coordinate errors or imprecision in the POD location, imprecision in the WMA boundaries, or rights attached to features, such as a reservoir, which cross state boundaries. To include these records, we decided that any POD within one kilometer of the state's edge would be assigned to the nearest WMA. Other Changes WestDAAT has Allowed: In addition to a more nuanced and consistent method of assigning water rights data to WMAs, there are other benefits gained from using the WestDAAT dataset. Among these is a consistent categorization of a water right's legal status. In HarDWR v1.0, legal status was effectively ignored, which led to many valid concerns about the quality of the database related to the amounts of water the rights allowed to be claimed. The main issue was that rights with legal statuses such as "application withdrawn", "non-active", or "cancelled" were included within HarDWR v1.0. These, and other water right statuses deemed not to be in use, have been removed from this version of the database. Another major change has been the addition of the "Unspecified" water source category.
This is water that can come from either surface water or groundwater, or whose source is unknown. The addition of this source category brings the total number of categories to three. Due to reviewer feedback, we added the acre-foot (AF) column so that the data may be more applicable to a wider audience. We added the ownerClassification column for the same reason. File Descriptions: The dataset is a series of files organized into state sub-directories. In addition, each file begins with the state's name, in case the file is separated from its sub-directory for some reason. After the state name is text describing the contents of the file. Each file is described in detail below; note that st is a placeholder for the state's name. stFullRecords_HarmonizedRights.csv: A file of the complete water records for each state. The column headers for each file of this type are: state - The name of the state to which the allocations belong. FIPS - The two-digit numeric state ID code. siteID - The site location ID for POD locations. A site may have multiple allocations, which are the actual amounts of water which can be drawn. In a simplified hypothetical, a farmstead may have an allocation for "irrigation" and an allocation for "domestic" water use, but the water is drawn from the same pumping equipment. It should be noted that many of the site IDs appear to have been added by WaDE, and therefore may not be recognized by a given state's water rights database. allocationID - The allocation ID for the water right. For most states this is the water right ID, and what is
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Mass spectrometry (MS) measurements are not inherently calibrated. Researchers use various calibration methods to assign meaning to arbitrary signal intensities and improve precision. Internal calibration (IC) methods use internal standards (IS) such as synthesized or recombinant proteins or peptides to calibrate MS measurements by comparing endogenous analyte signal to the signal from known IS concentrations spiked into the same sample. However, recent work suggests that using IS as IC introduces quantitative biases that affect comparison across studies because of the inability of IS to capture all sources of variation present throughout an MS workflow. Here, we describe a single-point external calibration strategy to calibrate signal intensity measurements to a common reference material, placing MS measurements on the same scale and harmonizing signal intensities between instruments, acquisition methods, and sites. We demonstrate data harmonization between laboratories and methodologies using this generalizable approach.
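A minimal sketch of the single-point external calibration idea: each instrument's raw intensities are expressed as ratios to the signal of a shared reference material measured on that same instrument, placing runs from different instruments on a common scale. All names and numbers below are invented for illustration.

```python
def calibrate(intensities, reference_signal):
    """Single-point external calibration: express each analyte's raw
    signal as a ratio to the signal of a common reference material
    measured on the same instrument/run."""
    return {analyte: signal / reference_signal
            for analyte, signal in intensities.items()}

# Hypothetical example: the same sample measured on two instruments,
# where instrument B has roughly twice the gain of instrument A.
run_a = {"pepX": 2.0e6, "pepY": 5.0e5}   # arbitrary units, instrument A
run_b = {"pepX": 4.0e6, "pepY": 1.0e6}   # arbitrary units, instrument B
cal_a = calibrate(run_a, reference_signal=1.0e6)  # reference on A
cal_b = calibrate(run_b, reference_signal=2.0e6)  # reference on B
```

After calibration the two runs report identical relative values, which is the sense in which the reference material harmonizes signal intensities between instruments, acquisition methods, and sites.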
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Goal: Setting up a pipeline for extending, improving, and visualizing time series of municipality characteristics by means of data harmonization and linkage of historical and contemporary data series using Linked Data technologies (RDF). This project focused on increasing the data availability, data quality, and visualization of characteristics of Dutch municipalities for the period 1795-2010. We did so by (1) combining data from historical and contemporary time series, (2) evaluating and improving the quality of these time series, and (3) extending the availability of NLGIS maps for the last two decades in order to visualize municipality characteristics for two centuries.
The Delaware River Basin (DRB) is jointly managed by multiple states and the federal government, and there are many ongoing efforts to characterize and understand water quality in the basin. Many state, federal, and non-profit organizations have collected surface-water-quality samples across the DRB for decades, and many of these data are available through the National Water Quality Monitoring Council's Water Quality Portal (WQP). In this data release, WQP data in the DRB were harmonized, meaning that they were processed to create a clean and readily usable dataset. This harmonization included the synthesis of parameter names and fractions, the condensation of remarks and other data qualifiers, the resolution of duplicate records, an initial quality-control check of the data, and other processing steps described in the metadata. The dataset provides harmonized discrete multisource surface-water-quality data pulled from the WQP for nutrients, sediment, salinity, major ions, bacteria, temperature, dissolved oxygen, pH, and turbidity in the DRB, for all available years.
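The parameter-name synthesis and duplicate-resolution steps mentioned above could be sketched roughly as follows; the synonym table and record fields are hypothetical stand-ins, not the release's actual crosswalk.

```python
# Hypothetical mapping of parameter-name variants to canonical names.
PARAM_SYNONYMS = {
    "Nitrate": "nitrate",
    "Nitrate as N": "nitrate",
    "Phosphorus": "phosphorus",
    "Total Phosphorus, mixed forms": "phosphorus",
}

def harmonize_records(records):
    """Map parameter-name variants to one canonical name and keep the
    first record per (site, date, parameter) key — a rough sketch of
    name synthesis plus duplicate resolution."""
    seen, out = set(), []
    for rec in records:
        name = PARAM_SYNONYMS.get(rec["param"])
        if name is None:
            continue          # unmapped parameter: set aside for review
        key = (rec["site"], rec["date"], name)
        if key in seen:
            continue          # duplicate record: drop
        seen.add(key)
        out.append({**rec, "param": name})
    return out
```

Two records reporting "Nitrate" and "Nitrate as N" at the same site and date collapse to a single harmonized nitrate record under this scheme.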
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average mean difference after mean and variance adjustments under random subsets within dataset.
The cleaned and harmonized version of the survey data produced and published by the Economic Research Forum represents 100% of the original survey data collected by the Department of Statistics of the Hashemite Kingdom of Jordan.
The Department of Statistics (DOS) carried out four rounds of the 2010 Employment and Unemployment Survey (EUS). The survey rounds covered a total sample of about fifty-three thousand households nationwide. The sampled households were selected using a stratified multi-stage cluster sampling design. The sample represents the national level (Kingdom), governorates, and urban/rural areas.
The importance of this survey lies in providing a comprehensive database on employment and unemployment that serves decision makers and researchers, as well as other parties concerned with policies related to the organization of the Jordanian labor market.
It is worth mentioning that the DOS employed new technology in data collection and data processing: data were collected using an electronic questionnaire on a handheld device (PDA) instead of a hard copy.
The survey's main objectives are:
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts were exerted to acquire, clean, harmonize, preserve, and disseminate micro data of existing labor force surveys in several Arab countries.
Covering a sample representative on the national level (Kingdom), governorates, the three Regions (Central, North and South), and the urban/rural areas.
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The sample of this survey is based on the frame provided by the data of the 2004 Population and Housing Census. The Kingdom was divided into strata, where each city with a population of 100,000 persons or more was considered a large city; there are 6 such cities. Each governorate (except for the 6 large cities) was divided into rural and urban areas. The remaining urban areas in each governorate were treated as an independent stratum, as were the rural areas. The total number of strata was 30.
In view of the significant variation in socio-economic characteristics in large cities in particular, and in urban areas in general, each large-city and urban stratum was divided into four sub-strata according to the socio-economic characteristics provided by the Population and Housing Census, with the purpose of providing homogeneous strata.
The frame excludes the population living in remote areas (most of whom are nomads). In addition, the frame does not include collective dwellings, such as hotels, hospitals, work camps, prisons, and the like.
The sample of this survey was designed using the stratified cluster sampling method. It is representative at the Kingdom, rural and urban area, region, and governorate levels. The Primary Sampling Units (clusters) were distributed to governorates, urban and rural areas, and large cities in each governorate according to the weight of persons/households and the variance within each stratum; slight modifications to the number of these units were made. The Primary Sampling Units (PSUs) were ordered within each stratum according to geographic characteristics and then according to socio-economic characteristics, in order to ensure a good spread of the sample. The sample was then selected in two stages. In the first stage, the PSUs were selected using the probability proportional to size (PPS) systematic selection procedure, with the number of households in each PSU serving as its weight or size. In the second stage, the blocks of the PSUs selected in the first stage were updated, and a constant number of households was then selected from each PSU (cluster) using the random systematic sampling method.
It is noteworthy that the sample of the present survey does not represent the non-Jordanian population, because it is based on households living in conventional dwellings; it does not cover collective households living in collective dwellings. Therefore, the non-Jordanian households covered in the present survey are either private households or collective households living in conventional dwellings. In Jordan, it is well known that a large number of non-Jordanian workers live in groups and spend most of their time at their workplaces, so it is unlikely to find them at their residences during the daytime (i.e., when the survey data are collected). Furthermore, most of them live at their workplaces, such as workshops, sales stores, guard posts, or construction sites. Such places are not classified as occupied dwellings for household sampling purposes. For all of these reasons, the coverage of this population in household surveys is incomplete.
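The first sampling stage described above — PPS systematic selection of PSUs by household count — can be sketched as below. PSU names and sizes are invented, and a real frame would first order PSUs geographically and socio-economically, as the survey design specifies.

```python
import random

def pps_systematic(psus, n_select, seed=0):
    """Select n_select PSUs with probability proportional to size
    using systematic sampling: lay selection points at a fixed step
    along the cumulative size axis and pick the PSU each point lands in.

    psus : list of (name, size) pairs, already ordered as in the frame.
    """
    total = sum(size for _, size in psus)
    step = total / n_select
    random.seed(seed)
    start = random.uniform(0, step)          # random start within first step
    points = [start + i * step for i in range(n_select)]
    chosen, cum, j = [], 0.0, 0
    for name, size in psus:
        cum += size
        # A PSU is selected once for every point falling in its interval.
        while j < n_select and points[j] <= cum:
            chosen.append(name)
            j += 1
    return chosen
```

Larger PSUs span wider intervals on the cumulative axis and are therefore proportionally more likely to contain a selection point.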
Computer Assisted Personal Interview [capi]
The questionnaire was designed electronically on the PDA and revised by the DOS technical staff. It was finalized upon completion of the training program. The questionnaire is divided into main topics, each containing a clear and consistent group of questions, and designed in a way that facilitates the electronic data entry and verification. The questionnaire includes the characteristics of household members in addition to the identification information, which reflects the administrative as well as the statistical divisions of the Kingdom.
PDAs were used to input data and transfer it from the interviewees to the database. The tabulation plan for the survey results was guided by former Employment and Unemployment Surveys, which had been previously prepared and tested. When all data processing procedures were completed, the survey results were tabulated using an ORACLE package. The tabulations were then thoroughly checked for consistency of titles, inputs, concepts, and figures. The final survey report was then prepared to include all detailed tabulations as well as the survey methodology.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Trait data represent the basis for ecological and evolutionary research and have relevance for biodiversity conservation, ecosystem management and earth system modelling. The collection and mobilization of trait data has strongly increased over the last decade, but many trait databases still provide only species-level, aggregated trait values (e.g. ranges, means) and lack the direct observations on which those data are based. Thus, the vast majority of trait data measured directly from individuals remains hidden and highly heterogeneous, impeding their discoverability, semantic interoperability, digital accessibility and (re-)use. Here, we integrate quantitative measurements of verbatim trait information from plant individuals (e.g. lengths, widths, counts and angles of stems, leaves, fruits and inflorescence parts) from multiple sources such as field observations and herbarium collections. We develop a workflow to harmonize heterogeneous trait measurements (e.g. trait names and their values and units) as well as additional information related to taxonomy, measurement or fact and occurrence. This data integration and harmonization builds on vocabularies and terminology from existing metadata standards and ontologies such as the Ecological Trait-data Standard (ETS), the Darwin Core (DwC), the Thesaurus Of Plant characteristics (TOP) and the Plant Trait Ontology (TO). A metadata form filled out by data providers enables the automated integration of trait information from heterogeneous datasets. We illustrate our tools with data from palms (family Arecaceae), a globally distributed (pantropical), diverse plant family that is considered a good model system for understanding the ecology and evolution of tropical rainforests. We mobilize nearly 140,000 individual palm trait measurements in an interoperable format, identify semantic gaps in existing plant trait terminology and provide suggestions for the future development of a thesaurus of plant characteristics. 
Our work thereby promotes the semantic integration of plant trait data in a machine-readable way and shows how large amounts of small trait data sets and their metadata can be integrated into standardized data products.
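A toy sketch of the kind of name-and-unit harmonization described above: verbatim trait names are mapped to standard terms and values converted to a canonical unit. The synonym and unit tables here are illustrative placeholders, not the actual ETS/TOP/DwC vocabulary mappings.

```python
# Illustrative lookup tables, not the real controlled vocabularies.
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}
TRAIT_SYNONYMS = {
    "leaf length": "leaf_length",
    "LeafLength": "leaf_length",
    "blade length": "leaf_length",
}

def harmonize_trait(name, value, unit):
    """Map a verbatim trait name to a standard term and its value to a
    canonical unit (mm); return None when either is unrecognized, so
    the record can be flagged for manual curation."""
    std_name = TRAIT_SYNONYMS.get(name)
    factor = UNIT_TO_MM.get(unit)
    if std_name is None or factor is None:
        return None
    return (std_name, value * factor, "mm")
```

Measurements reported as "leaf length, 12 cm" and "LeafLength, 120 mm" thus land on the same standardized term and value, which is the interoperability the workflow aims for.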
Additional data sets on river water quality in Great Britain collected by the Environment Agency under the Harmonised Monitoring Scheme. This provides information about nutrient and heavy metal loads entering the marine environment and contributes to our commitment to report figures to the OSPAR Convention for the Protection of the North Atlantic. Full details available at: OSPAR - Riverine Inputs and Direct Discharges
The full dataset will be available from November 2013 on the Environment Agency datashare site: http://www.geostore.com/environment-agency/WebStore?xml=environment-agency/xml/dataLayers.xml
Release statement - Following a review by Defra and the Environment Agency on reducing the monitoring programme where it is not required under the present regulatory regime, it has been decided that the monitoring under the Harmonised Monitoring Scheme will be discontinued. For further information please contact enviro.statistics Inbox.
Attachment: Annual average, highest and lowest mean concentrations of Dissolved oxygen by region: 1980 to 2013, Great Britain (MS Excel Spreadsheet, 12.8 KB).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Codes that could not be found in the OMOP concept dictionary.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This collection introduces an open-source, anthropomorphic-phantom-based dataset of CT scans for developing harmonization methods for deep-learning-based models. The phantom mimics human anatomy, allowing repeated scans without delivering radiation to real patients and isolating scanner effects by removing inter- and intra-patient variations. The dataset includes 268 image series from 13 scanners, 4 manufacturers, and 8 institutions, repeated 18-30 times at a 10 mGy dose using a harmonized protocol. An additional 1,378 image series were acquired with the same 13 scanners and harmonized protocol but with additional acquisition doses. The phantom scans comprise three compartments: thorax, liver, and test patterns. The 3D-printed liver includes three types of abnormal regions of interest (two cysts, a metastasis, and a hemangioma) with ground-truth segmentation masks that can be used for classification and segmentation.
Recent breakthroughs in data-driven algorithms and artificial intelligence (AI) applications in medical information processing have introduced tremendous potential for AI-assisted image-based personalized medicine that addresses tasks such as segmentation, diagnosis, and prognosis. However, these opportunities come with two challenges: large data requirements and consistency in data distribution. Machine and deep learning algorithms have extreme data demands, which are coupled with the high costs of data acquisition and annotation for a single observation (e.g., one event corresponds to one patient in a survival study). These challenges encourage pooling of data collected from multiple centers and scanners to achieve a critical mass of data for training models. However, pooling data from multiple centers introduces significant variability in the acquisition parameters and specifics of image reconstruction algorithms, leading to data domain shifts and inconsistencies in the collected data. The domain shift introduced by this variability in scanners reduces the value of merging data from multiple centers, reducing the performance of predictive tasks such as segmentation, diagnosis, and prognosis, in centralized as well as federated scenarios. Furthermore, domain shifts between training and test or inference data entail high risks of incorrect and uncontrolled predictions for treatment planning and personalized medicine when the inference is based on a scanner (and/or acquisition setting) that was not represented in the training data. Although this challenge applies to all medical imaging modalities, it is particularly important for computed tomography (CT) images due to the wide range of variability in manufacturers, acquisition parameters and dose, reconstruction algorithms, and customized parameter tunings in different centers.
This dataset provides the material to reproduce several different research works conducted in conjunction with it. Researchers can use this dataset for developing their own harmonization methods at both the image and feature levels to tackle the data drift problem from one scanner to another and across different manufacturers. We also release baseline performance metrics for the similarity of scans in the image domain and feature space without harmonization. This will set a baseline to evaluate the effectiveness of various harmonization techniques in the image and feature domains.
The following subsections provide information about how the data were selected, acquired, and prepared for publication, as well as the approximate date range of the imaging studies.
Before the CT scans of the phantom were acquired, a survey was carried out to collect realistic acquisition and reconstruction parameter settings that are used in clinical thoracoabdominal CT scans for oncological staging, tumor search, and infectious focus detection in the portal venous contrast phase. The survey included 21 CT scanners from 9 centers across Switzerland. The resulting reference protocol corresponds to a tube voltage of 120 kV, a tube current-time product of 148 mAs, a pitch of 1.000, and a rotation time of 0.5 seconds for the Siemens SOMATOM Definition Edge scanner. The collimation was set to 38.4 mm, with a slice thickness/increment of 2.0 mm, and a pixel spacing of 1.367 mm. Due to vendor-specific limitations, the parameters mentioned above were slightly adapted to the closest possible parameters for each given scanner. The scans were repeated for 13 scanners from 4 manufacturers—Siemens, Philips, General Electric (GE), and Toshiba—at five dose levels (1 mGy, 3 mGy, 6 mGy, 10 mGy, 14 mGy). Only the tube current-time product (in mAs) was adjusted to set the various dose levels; all other parameters were kept the same. For each CT scanner and each dose level, 10 repeated scans (identified in the image series as #1 to #10) with identical settings were performed, except inadvertently for the Toshiba Aquilion Prime SP scanner at 10 mGy (9 repeated scans). Thus, a total of 649 CT scans were performed.
Images were reconstructed using two or three different reconstruction algorithms per CT scan, resulting in two or three CT image series per CT scan. For all CT scans, a vendor-specific iterative reconstruction (IR) algorithm with a standard soft tissue kernel was used, resulting in 649 IR CT series. In addition, filtered backprojection (FBP) reconstruction with a standard soft tissue kernel was used for all CT scans, resulting in another 649 FBP CT series. For 2 of the 13 CT scanners, a DL based reconstruction algorithm was available. For one of these scanners, it was used for three dose levels (1 mGy, 3 mGy, 6 mGy), resulting in 30 additional CT series. For the second scanner, DL reconstruction was used for all five dose levels, resulting in 50 additional CT series. In summary, the dataset presented in this work consists of 1378 series reconstructed from 649 CT scans.
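As a quick sanity check, the scan and series totals quoted above can be reproduced from the counts given in the text:

```python
# Reproduce the dataset's scan and series totals from the stated counts.
scanners, dose_levels, repeats = 13, 5, 10
scans = scanners * dose_levels * repeats - 1  # one missing Toshiba repeat at 10 mGy
ir_series = fbp_series = scans                # IR and FBP reconstructions for every scan
dl_series = 3 * 10 + 5 * 10                   # DL recon: 3 dose levels on one scanner, 5 on another
total_series = ir_series + fbp_series + dl_series
print(scans, total_series)  # -> 649 1378
```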
The DICOM data files presented in conjunction with this repository did not undergo any preprocessing steps, in order to preserve all sources of variation—such as spatial shifts and voxel spacing differences introduced by various scanners. However, this repository is linked to a data descriptor paper where we thoroughly analyzed the data, as well as a Git repository that provides the code for resampling the scans to a uniform voxel spacing and performing registration.
The dataset includes original DICOM files with all acquisition parameters stored in the DICOM tags, without any special pre-processing. For each DICOM study, the Study Description tags contain scanner IDs (e.g. "A1" or "H2") which represent 8 institutions. Each DICOM study contains multiple image series reconstructed with different reconstruction methods, plus a series containing the mask related to the various regions of interest in the liver tissue. When downloading these data, the directory and file names will follow the format described in this FAQ entry.
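The scanner ID can be recovered from the Study Description once the DICOM header has been read (e.g. with pydicom's `dcmread`). The sketch below shows only the string parsing; the exact format of the description text is an assumption based on the examples above ("A1", "H2"):

```python
import re

# Hypothetical sketch: pull a scanner ID such as "A1" or "H2" out of a
# DICOM Study Description string. The surrounding text format is assumed.
def scanner_id(study_description):
    m = re.search(r"\b([A-H]\d)\b", study_description)
    return m.group(1) if m else None

print(scanner_id("Phantom scan A1 10mGy"))  # -> A1
```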
This is an integration of 10 independent multi-country, multi-region, multi-cultural social surveys fielded by Gallup International between 2000 and 2013. The integrated data file contains responses from 535,159 adults living in 103 countries. In total, the harmonization project combined 571 social surveys.
These data have value in a number of longitudinal multi-country, multi-regional, and multi-cultural (L3M) research designs. They can be understood as independent, though non-random, L3M samples containing multiple-indicator ASQ (ask-same-questions) and ADQ (ask-different-questions) measures of human development, the environment, international relations, gender equality, security, international organizations, and democracy, among other topics [see full list below].
The data can be used for exploratory and descriptive analysis, with greatest utility at low levels of resolution (e.g. nation-states, supranational groupings). Analyses should be conducted at a sufficiently low level of resolution for confidence intervals to be reasonably approximated.
These data can be used for teaching 3M methods, including data harmonization in L3M, 3M research design, survey design, 3M measurement invariance, analysis, visualization, and reporting. They also offer opportunities to teach about paradata, metadata, and data management in L3M designs.
The country units are an unbalanced panel derived from non-probability samples of countries and respondents. Panels (countries) have left and right censorship and are thus unbalanced. This design limitation can be overcome to the extent that VOTP panels are harmonized with public measurements from other 3M surveys to establish balance in terms of panels and occasions of measurement. Should L3M harmonization occur, these data can be assigned confidence weights to reflect the amount of error in these surveys.
Pooled public opinion surveys (country means), when combined with higher-quality country measurements of the same concepts (ASQ, ADQ), can be leveraged to increase the statistical power of pooled public opinion research designs (multiple L3M datasets); that is, in studies of public, rather than personal, beliefs.
The Gallup Voice of the People survey data are based on underspecified sampling methods. Country sampling is non-random. The sampling method appears to be primarily probability and quota sampling, with occasional oversampling of urban populations in difficult-to-survey populations. The sampling units (countries and individuals) are poorly defined, suggesting these data have more value in research designs calling for independent-samples replication and repeated-measures frameworks.
The Voice of the People Survey Series is WIN/Gallup International Association's End of Year survey and is a global study that collects the public's view on the challenges that the world faces today. Ongoing since 1977, the purpose of WIN/Gallup International's End of Year survey is to provide a platform for respondents to speak out concerning government and corporate policies. The Voice of the People, End of Year Surveys for 2012, fielded June 2012 to February 2013, were conducted in 56 countries to solicit public opinion on social and political issues. Respondents were asked whether their country was governed by the will of the people, as well as their attitudes about their society. Additional questions addressed respondents' living conditions and feelings of safety around their living area, as well as personal happiness. Respondents' opinions were also gathered in relation to business development and their views on the effectiveness of the World Health Organization. Respondents were also surveyed on ownership and use of mobile devices. Demographic information includes sex, age, income, education level, employment status, and type of living area.
Background: A consensual definition of occupational burnout is currently lacking. We aimed to harmonize the definition of occupational burnout as a health outcome in medical research and to reach a consensus on this definition within the Network on the Coordination and Harmonisation of European Occupational Cohorts (OMEGA-NET). Methods: First, we performed a systematic review in MEDLINE, PsycINFO and EMBASE (January 1990 to August 2018) and a semantic analysis of the available definitions. We used the definitions of burnout and burnout-related concepts from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) to formulate a consistent harmonized definition of the concept. Second, we sought to obtain consensus on the proposed definition using the Delphi technique. Results: We identified 88 unique definitions of burnout and assigned each of them to one of the 11 original definitions. The semantic analysis yielded a semantic proposal, formulated in accordance with SNOMED-CT as follows: “In a worker, occupational burnout or occupational physical AND emotional exhaustion state is an exhaustion due to prolonged exposure to work-related problems”. A panel of 50 experts (researchers and healthcare professionals with an interest for occupational burnout) reached consensus on this proposal at the second round of the Delphi, with 82% of experts agreeing on it. Conclusion: This study resulted in a harmonized definition of occupational burnout approved by experts from 29 countries within the OMEGA-NET. Future research should address the reproducibility of the Delphi consensus in a larger panel of experts, representing more countries, and examine the practicability of the definition.
International
Number of citations per original and secondary definition of occupational burnout among studies included in the systematic review
Three CSV files. The first one (ResearchStrings.csv) presents the literature search strings applied to MEDLINE, EMBASE, and PsycINFO, respectively. The second file (DefinitionsIndexation&Citation_OriginaVsUniqueDef.csv) presents the statements of the different definitions of occupational burnout identified within the systematic review, their references, and the references of studies citing them. Finally, the third file (DefinitionsIndexation&Citation_UniqueDefinitionSummary.csv) presents the correspondence between these "unique" definitions and their "original" definitions.
After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes. Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the …
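A minimal sketch of the offset logic described above (an illustration of the idea, not ESA's or Earth Engine's actual implementation): scenes with PROCESSING_BASELINE '04.00' or above carry digital numbers (DN) shifted up by 1000, so subtracting 1000 puts them back on the older scenes' scale.

```python
# Illustrative only: undo the +1000 DN shift introduced with
# PROCESSING_BASELINE '04.00', so all scenes share one value range.
def harmonize_dn(dn, processing_baseline):
    # Lexicographic comparison works for zero-padded "MM.mm" baselines.
    return dn - 1000 if processing_baseline >= "04.00" else dn

print(harmonize_dn(2500, "04.00"))  # -> 1500
print(harmonize_dn(1500, "03.01"))  # -> 1500 (older scene, unchanged)
```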
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The analysis of large, multisite neuroimaging datasets provides a promising means for robust characterization of brain networks that can reduce false positives and improve reproducibility. However, the use of different MRI scanners introduces variability to the data. Managing those sources of variability is increasingly important for the generation of accurate group-level inferences. ComBat is one of the most promising tools for multisite (multiscanner) harmonization of structural neuroimaging data, but no study has examined its application to graph theory metrics derived from the structural brain connectome. The present work evaluates the use of ComBat for multisite harmonization in the context of structural network analysis of diffusion-weighted scans from the Advancing Concussion Assessment in Pediatrics (A-CAP) study. Scans were acquired on six different scanners from 484 children aged 8.00–16.99 years [Mean = 12.37 ± 2.34 years; 289 (59.7%) Male] ~10 days following mild traumatic brain injury (n = 313) or orthopedic injury (n = 171). Whole brain deterministic diffusion tensor tractography was conducted and used to construct a 90 x 90 weighted (average fractional anisotropy) adjacency matrix for each scan. ComBat harmonization was applied separately at one of two different stages during data processing, either on the (i) weighted adjacency matrices (matrix harmonization) or (ii) global network metrics derived using unharmonized weighted adjacency matrices (parameter harmonization). Global network metrics based on unharmonized adjacency matrices and each harmonization approach were derived. Robust scanner effects were found for unharmonized metrics. Some scanner effects remained significant for matrix harmonized metrics, but effect sizes were less robust. Parameter harmonized metrics did not differ by scanner. 
Intraclass correlations (ICC) indicated good to excellent within-scanner consistency between metrics calculated before and after both harmonization approaches. Age correlated with unharmonized network metrics, but was more strongly correlated with network metrics based on both harmonization approaches. Parameter harmonization successfully controlled for scanner variability while preserving network topology and connectivity weights, indicating that harmonization of global network parameters based on unharmonized adjacency matrices may provide optimal results. The current work supports the use of ComBat for removing multiscanner effects on global network topology.
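For readers unfamiliar with ComBat, its core idea is a per-scanner location-and-scale adjustment. The sketch below shows that idea in plain NumPy; it omits the empirical-Bayes shrinkage and covariate preservation that real ComBat implementations (e.g. neuroCombat) perform, so it is a conceptual illustration only.

```python
import numpy as np

# ComBat-style location/scale harmonization, simplified: shift each
# scanner's metrics to the pooled mean and rescale to the pooled SD.
# (No empirical-Bayes shrinkage, no covariates -- a sketch, not ComBat.)
def harmonize(metrics: np.ndarray, scanner: np.ndarray) -> np.ndarray:
    out = metrics.astype(float).copy()
    grand_mean, grand_sd = metrics.mean(), metrics.std()
    for s in np.unique(scanner):
        idx = scanner == s
        m, sd = metrics[idx].mean(), metrics[idx].std()
        out[idx] = (metrics[idx] - m) / sd * grand_sd + grand_mean
    return out
```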
This dataset is the current 2025 Harmonized Tariff Schedule plus all revisions for the current year. It provides the applicable tariff rates and statistical categories for all merchandise imported into the United States; it is based on the international Harmonized System, the global system of nomenclature that is used to describe most world trade in goods.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE NATIONAL INSTITUTE OF STATISTICS (INS) - TUNISIA
The survey aims at estimating the demographic and educational characteristics of the population. It also calculates the economic indicators of the population such as the number of active individuals, the additional demand for jobs, the number of employed and their characteristics, the number of jobs created, the characteristics of the unemployed and the unemployment rate. Furthermore, this survey estimates these indicators on the household level and their living conditions.
The results of this survey were compared with the results of the second quarter of the national survey on population and employment 2011. It should also be noted that the National Institute of Statistics - Tunisia uses the unemployment definition and concepts adopted by the International Labour Organization. This definition implies that the individual did not work during the week preceding the interview, was looking for a job in the month preceding the interview, and is available to start work within two weeks of the interview.
In 2010, the National Institute of Statistics adopted the strict ILO definition of unemployment, requiring that the person took concrete steps to search for a job in the month preceding the day of the interview.
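The ILO definition described above can be read as a conjunction of three conditions. The predicate below is a reading aid only, not the actual INS questionnaire logic:

```python
# Reading aid for the ILO unemployment definition stated above:
# unemployed = did not work last week AND searched last month AND
# available to start within two weeks.
def is_unemployed(worked_last_week: bool,
                  searched_last_month: bool,
                  available_within_two_weeks: bool) -> bool:
    return (not worked_last_week) and searched_last_month and available_within_two_weeks

print(is_unemployed(False, True, True))   # -> True
print(is_unemployed(False, False, True))  # -> False (no active search)
```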
Covering a representative sample at the national and regional level (governorates).
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The sample is drawn from the frame of the 2004 General Census of Population and Housing.
Face-to-face [f2f]
Three modules were designed for data collection:
Household Questionnaire (Module 1): Includes questions regarding household characteristics, living conditions, individuals and their demographic, educational and economic characteristics. This module also provides information on internal and external migration.
Active Employed Questionnaire (Module 2): Includes questions regarding the characteristics of the employed individuals as occupation, industry and wages for employees.
Active Unemployed Questionnaire (Module 3): Includes questions regarding the characteristics of the unemployed, such as unemployment duration, the last occupation and activity, and the number of days worked during the last year, etc.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Harmonized IACS inventory of Europe-LAND is a harmonized collection of data from the Geospatial Aid (GSA) system of the Integrated Control and Administration System (IACS), which manages and controls agricultural subsidies in the European Union (EU). The GSA data are a unique data source with field-level land-use information generated annually. The data carry information on the crops grown per field, a unique identifier of the subsidy applicants that allows fields to be aggregated to farms, information on organic cultivation, and animal numbers per farm.
Due to General Data Protection Regulations (GDPR), we are not allowed to share all data that we collected and harmonized. Therefore, there are two versions of the inventory, a public version and an internal version. The internal version contains more information and covers more countries and years.
The public version contains all data that can be shared under the GDPR constraints of the data providers. It covers 18 countries with time series of up to 17 years. For most countries, only the crop information can be shared. However, for six countries the applicant identifier can also be shared, and for two of these, the organic management information as well. If you use the data, please also cite the original sources of the data. You can find the references in the provided documentation in the "_Documentation.zip".
The crop information was harmonized using the Hierarchical Crop and Agriculture Taxonomy (HCAT) of the EuroCrops project (Schneider et al., 2023). To allow for interoperability with EuroCrops, the harmonized Europe-LAND data come with the same column names for the crop information. All crop mapping tables can be found in our GitHub repository.
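Applying a crop-mapping table then amounts to a lookup from the national crop label to its HCAT name and code. The labels and codes below are purely illustrative placeholders, not entries from the actual EuroCrops tables:

```python
# Illustrative crop-to-HCAT lookup; labels and codes are invented
# placeholders, not real entries from the EuroCrops mapping tables.
crop_to_hcat = {
    "winterweizen": ("winter_common_wheat", "code_wheat"),
    "silomais": ("silage_maize", "code_maize"),
}

def harmonize_crop(label):
    # Unmapped labels fall back to a catch-all class, as taxonomies often do.
    return crop_to_hcat.get(label.lower(), ("not_known_and_other", None))

print(harmonize_crop("Winterweizen")[0])  # -> winter_common_wheat
```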
More detailed information for all countries in our harmonized inventory (including those that are not publicly available) can also be found in the documentation.
The inventory will be updated at least annually. In future versions, we will add a new crop classification, harmonized animal data, and harmonized agri-environmental measures/eco-schemes.
All files come as GeoParquet (.geoparquet) files to stay within the space limitations of Zenodo. GeoParquet files can simply be opened in QGIS via drag and drop. Additionally, various libraries in different programming languages can handle GeoParquet, e.g. geoarrow and sfarrow in R, GDAL/OGR in C++, GeoParquet.jl in Julia, or Fiona in Python.
We bundled multiple years of each country to stay below the file number limitation of Zenodo. Each zip file name indicates the country, region, or federal state and the years covered. The meaning of the abbreviations of the countries, regions, and federal states can be found in the "country_region_codes.xlsx" in the "_Documentation.zip".
The Spanish data are also bundled across regions, as they are separated into more than 50 regions. See the country_regions_codes.xlsx tables for the meaning of the abbreviations:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The table includes sample size, mean age with standard deviation, gender distribution (female/male), and the presence of aura.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multicenter and multi-scanner imaging studies may be necessary to ensure sufficiently large sample sizes for developing accurate predictive models. However, multicenter studies, incorporating varying research participant characteristics, MRI scanners, and imaging acquisition protocols, may introduce confounding factors, potentially hindering the creation of generalizable machine learning models. Models developed using one dataset may not readily apply to another, emphasizing the importance of classification model generalizability in multi-scanner and multicenter studies for producing reproducible results. This study focuses on enhancing generalizability in classifying individual migraine patients and healthy controls using brain MRI data through a data harmonization strategy. We propose identifying a 'healthy core'—a group of homogeneous healthy controls with similar characteristics—from multicenter studies. The Maximum Mean Discrepancy (MMD) in Geodesic Flow Kernel (GFK) space is employed to compare two datasets, capturing data variabilities and facilitating the identification of this 'healthy core'. Homogeneous healthy controls play a vital role in mitigating unwanted heterogeneity, enabling the development of highly accurate classification models with improved performance on new datasets. Extensive experimental results underscore the benefits of leveraging a 'healthy core'. We utilized two datasets: one comprising 120 individuals (66 with migraine and 54 healthy controls), and another comprising 76 individuals (34 with migraine and 42 healthy controls). Notably, a homogeneous dataset derived from a cohort of healthy controls yielded a significant 25% accuracy improvement for both episodic and chronic migraineurs.
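For readers unfamiliar with MMD, a minimal (biased) RBF-kernel estimator of the squared discrepancy between two samples looks like the sketch below. Note that the study computes MMD in Geodesic Flow Kernel space, which is not reproduced here; this shows only the generic MMD idea.

```python
import numpy as np

# Biased RBF-kernel MMD^2 estimator between samples X and Y:
# MMD^2 = mean k(X,X) + mean k(Y,Y) - 2 * mean k(X,Y).
# Zero for identical samples; grows as the distributions diverge.
def mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```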