License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Multicenter and multi-scanner imaging studies may be necessary to ensure sufficiently large sample sizes for developing accurate predictive models. However, multicenter studies, incorporating varying research participant characteristics, MRI scanners, and imaging acquisition protocols, may introduce confounding factors, potentially hindering the creation of generalizable machine learning models. Models developed using one dataset may not readily apply to another, emphasizing the importance of classification model generalizability in multi-scanner and multicenter studies for producing reproducible results. This study focuses on enhancing generalizability in classifying individual migraine patients and healthy controls using brain MRI data through a data harmonization strategy. We propose identifying a 'healthy core', a group of homogeneous healthy controls with similar characteristics, from multicenter studies. The Maximum Mean Discrepancy (MMD) in Geodesic Flow Kernel (GFK) space is employed to compare two datasets, capturing data variabilities and facilitating the identification of this 'healthy core'. Homogeneous healthy controls play a vital role in mitigating unwanted heterogeneity, enabling the development of highly accurate classification models with improved performance on new datasets. Extensive experimental results underscore the benefits of leveraging a 'healthy core'. We utilized two datasets: one comprising 120 individuals (66 with migraine and 54 healthy controls), and another comprising 76 individuals (34 with migraine and 42 healthy controls). Notably, a homogeneous dataset derived from a cohort of healthy controls yielded a significant 25% accuracy improvement for both episodic and chronic migraineurs.
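As a rough illustration of the selection idea (not the study's implementation), the sketch below uses a plain RBF-kernel MMD computed in the original feature space rather than in GFK space, together with a greedy backward-elimination rule; the function names and the keep_fraction parameter are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared Euclidean distances between rows, then RBF kernel values.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of the squared Maximum Mean Discrepancy between samples X and Y.
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

def select_healthy_core(hc_site_a, hc_site_b, keep_fraction=0.8, gamma=1.0):
    """Greedily drop site-A controls whose removal most reduces MMD to site B."""
    hc_site_a, hc_site_b = np.asarray(hc_site_a), np.asarray(hc_site_b)
    keep = list(range(len(hc_site_a)))
    n_target = int(keep_fraction * len(keep))
    while len(keep) > n_target:
        scores = [mmd2(np.delete(hc_site_a[keep], i, axis=0), hc_site_b, gamma)
                  for i in range(len(keep))]
        keep.pop(int(np.argmin(scores)))   # removing this subject helps most
    return hc_site_a[keep]
```

The retained subjects would then serve as the homogeneous 'healthy core' against which patient data from either site are classified.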
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Background: Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.

Methods: This study utilized data from two GERAS cohort studies (European [EU] and Japan [JP]) obtained through the Alzheimer’s Disease (AD) Data Initiative’s AD workbench. We first manually created a dataset by matching 347 EU variables with 1322 candidate JP variables and treated matched variable pairs as positive instances and unmatched pairs as negative instances. We then developed four natural language processing (NLP) methods using state-of-the-art LLMs (E5, MPNet, MiniLM, and BioLORD-2023) to estimate variable similarity based on variable labels and derivation rules. A lexical matching method using fuzzy matching was included as a baseline model. In addition, we developed an ensemble-learning method, using the Random Forest (RF) model, to integrate the individual NLP methods. RF was trained and evaluated on 50 trials. Each trial had a random split (4:1) of training and test sets, with the model’s hyperparameters optimized through cross-validation on the training set. For each EU variable, the 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF’s probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR).

Results: E5 performed best among the individual methods, achieving 0.898 HR-30 and 0.700 MRR. RF performed better than E5 on all metrics over 50 trials (P
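As a hedged sketch of how such ranking could be set up (not the study's code), the snippet below scores toy EU/JP variable labels with a sentence-embedding model and a fuzzy-matching baseline, then computes HR-n and MRR; the model name "intfloat/e5-base-v2" and all labels are illustrative assumptions.

```python
import numpy as np
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer

# Hypothetical variable labels; the real study used the GERAS EU/JP data dictionaries.
eu_labels = ["Age at baseline", "MMSE total score"]
jp_labels = ["Baseline age (years)", "Mini-Mental State Examination total",
             "Caregiver hours per week"]

# Embedding-based similarity (E5-style bi-encoder; the model name is an assumption).
model = SentenceTransformer("intfloat/e5-base-v2")
eu_emb = model.encode(eu_labels, normalize_embeddings=True)
jp_emb = model.encode(jp_labels, normalize_embeddings=True)
emb_sim = eu_emb @ jp_emb.T                      # cosine similarities, rows = EU variables

# Lexical baseline: fuzzy token-sort ratio rescaled to [0, 1].
lex_sim = np.array([[fuzz.token_sort_ratio(e, j) / 100.0 for j in jp_labels]
                    for e in eu_labels])

def rank_metrics(sim, true_idx, n=30):
    """Top-n hit ratio and mean reciprocal rank for one similarity matrix."""
    order = np.argsort(-sim, axis=1)                       # best candidate first
    ranks = np.array([np.where(order[i] == t)[0][0] + 1    # 1-based rank of the true match
                      for i, t in enumerate(true_idx)])
    return (ranks <= n).mean(), (1.0 / ranks).mean()

true_idx = [0, 1]   # gold JP match for each EU variable in this toy example
print(rank_metrics(emb_sim, true_idx), rank_metrics(lex_sim, true_idx))
```

In the ensemble step described above, the per-pair scores from the individual methods would form the feature vector fed to a Random Forest classifier (for example, sklearn.ensemble.RandomForestClassifier), whose match probabilities are then used for ranking.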
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Goal: setting up a pipeline for extending, improving, and visualizing time series of municipality characteristics by means of data harmonization and linkage of historical and contemporary data series using Linked Data technologies (RDF).

This project focused on increasing the data availability, data quality, and visualization of characteristics of Dutch municipalities for the period 1795-2010. We did so by (1) combining data from historical and contemporary time series, (2) evaluating and improving the quality of these time series, and (3) extending the availability of NLGIS maps for the last two decades in order to visualize municipality characteristics across two centuries.
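As a minimal sketch of the linkage idea (assuming rdflib, with entirely hypothetical namespaces, property names, and placeholder codes), a municipality resource can carry both a historical and a contemporary identifier so the two series can be joined on the same node:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Hypothetical namespaces and properties; only the linkage pattern matters here.
MUN = Namespace("https://example.org/municipality/")
PROP = Namespace("https://example.org/def/")

g = Graph()
amsterdam = MUN["amsterdam"]
g.add((amsterdam, RDF.type, PROP.Municipality))
g.add((amsterdam, PROP.historicalCode, Literal("HIST-0363")))   # placeholder historical code
g.add((amsterdam, PROP.cbsCode, Literal("GM0363")))             # contemporary CBS code
g.add((amsterdam, PROP.population, Literal(800000, datatype=XSD.integer)))  # placeholder value

print(g.serialize(format="turtle"))
```

Once both identifiers hang off the same resource, historical and contemporary observations can be queried together (for example with SPARQL) and fed into the NLGIS visualizations.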
The Delaware River Basin (DRB) is jointly managed by multiple states and the federal government, and there are many ongoing efforts to characterize and understand water quality across the basin. Many state, federal, and non-profit organizations have collected surface-water-quality samples across the DRB for decades, and many of these data are available through the National Water Quality Monitoring Council's Water Quality Portal (WQP). In this data release, WQP data in the DRB were harmonized, meaning that they were processed to create a clean and readily usable dataset. This harmonization included the synthesis of parameter names and fractions, the condensation of remarks and other data qualifiers, the resolution of duplicate records, an initial quality control check of the data, and other processing steps described in the metadata. The dataset provides harmonized discrete multisource surface-water-quality data pulled from the WQP for nutrients, sediment, salinity, major ions, bacteria, temperature, dissolved oxygen, pH, and turbidity in the DRB, for all available years.
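As a simplified illustration of two of these steps (parameter-name synthesis and duplicate resolution), the pandas sketch below uses standard WQP column names but an invented name mapping; it is not the release's actual processing code.

```python
import pandas as pd

# Toy WQP pull; column names follow the Water Quality Portal schema, values are invented.
raw = pd.DataFrame({
    "CharacteristicName": ["Nitrogen, mixed forms", "Phosphorus", "Phosphorus"],
    "ResultSampleFractionText": ["Dissolved", "Total", "Total"],
    "ResultMeasureValue": [1.2, 0.08, 0.08],
    "ActivityStartDate": ["2020-05-01", "2020-05-01", "2020-05-01"],
    "MonitoringLocationIdentifier": ["USGS-01463500"] * 3,
})

# 1) Synthesize parameter names and fractions into one harmonized parameter label.
name_map = {"Nitrogen, mixed forms": "nitrogen", "Phosphorus": "phosphorus"}
raw["parameter"] = (raw["CharacteristicName"].map(name_map)
                    + "_" + raw["ResultSampleFractionText"].str.lower())

# 2) Resolve exact duplicate records (same site, date, parameter, and value).
harmonized = raw.drop_duplicates(
    subset=["MonitoringLocationIdentifier", "ActivityStartDate",
            "parameter", "ResultMeasureValue"])
```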
See S2 Table for more background on these data sets and S3 Table on the recommended citation and data licence.
The analysis of large, multisite neuroimaging datasets provides a promising means for robust characterization of brain networks that can reduce false positives and improve reproducibility. However, the use of different MRI scanners introduces variability to the data. Managing those sources of variability is increasingly important for the generation of accurate group-level inferences. ComBat is one of the most promising tools for multisite (multiscanner) harmonization of structural neuroimaging data, but no study has examined its application to graph theory metrics derived from the structural brain connectome. The present work evaluates the use of ComBat for multisite harmonization in the context of structural network analysis of diffusion-weighted scans from the Advancing Concussion Assessment in Pediatrics (A-CAP) study. Scans were acquired on six different scanners from 484 children aged 8.00–16.99 years [Mean = 12.37 ± 2.34 years; 289 (59.7%) Male] ~10 days following mild traumatic brain injury (n = 313) or orthopedic injury (n = 171). Whole brain deterministic diffusion tensor tractography was conducted and used to construct a 90 x 90 weighted (average fractional anisotropy) adjacency matrix for each scan. ComBat harmonization was applied separately at one of two different stages during data processing, either on the (i) weighted adjacency matrices (matrix harmonization) or (ii) global network metrics derived using unharmonized weighted adjacency matrices (parameter harmonization). Global network metrics based on unharmonized adjacency matrices and each harmonization approach were derived. Robust scanner effects were found for unharmonized metrics. Some scanner effects remained significant for matrix harmonized metrics, but effect sizes were less robust. Parameter harmonized metrics did not differ by scanner. Intraclass correlations (ICC) indicated good to excellent within-scanner consistency between metrics calculated before and after both harmonization approaches. Age correlated with unharmonized network metrics, but was more strongly correlated with network metrics based on both harmonization approaches. Parameter harmonization successfully controlled for scanner variability while preserving network topology and connectivity weights, indicating that harmonization of global network parameters based on unharmonized adjacency matrices may provide optimal results. The current work supports the use of ComBat for removing multiscanner effects on global network topology.
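The sketch below illustrates where the two harmonization stages sit in the pipeline, using a simplified per-scanner location/scale adjustment as a stand-in for ComBat's empirical Bayes procedure; it is an assumption-laden illustration, not the A-CAP processing code.

```python
import numpy as np

def scanner_adjust(features, scanner_ids):
    """Simplified location/scale harmonization: standardize each feature within
    scanner, then map back to the pooled mean/SD. This omits ComBat's empirical
    Bayes shrinkage and covariate preservation; it only shows where the
    adjustment is applied."""
    features = np.asarray(features, dtype=float)    # shape: (subjects, features)
    scanner_ids = np.asarray(scanner_ids)
    out = np.empty_like(features)
    pooled_mu, pooled_sd = features.mean(axis=0), features.std(axis=0)
    for s in np.unique(scanner_ids):
        idx = scanner_ids == s
        mu, sd = features[idx].mean(axis=0), features[idx].std(axis=0)
        out[idx] = (features[idx] - mu) / np.where(sd > 0, sd, 1) * pooled_sd + pooled_mu
    return out

# Matrix harmonization: adjust the vectorized upper triangle of each 90 x 90 FA
# adjacency matrix, rebuild the matrices, then compute graph metrics.
# Parameter harmonization: compute global network metrics from the unharmonized
# matrices first, then adjust the metrics themselves, e.g.
#   harmonized_metrics = scanner_adjust(global_metrics, scanner_ids)
```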
Background: A consensual definition of occupational burnout is currently lacking. We aimed to harmonize the definition of occupational burnout as a health outcome in medical research and to reach a consensus on this definition within the Network on the Coordination and Harmonisation of European Occupational Cohorts (OMEGA-NET).

Methods: First, we performed a systematic review in MEDLINE, PsycINFO and EMBASE (January 1990 to August 2018) and a semantic analysis of the available definitions. We used the definitions of burnout and burnout-related concepts from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) to formulate a consistent harmonized definition of the concept. Second, we sought to obtain consensus on the proposed definition using the Delphi technique.

Results: We identified 88 unique definitions of burnout and assigned each of them to one of the 11 original definitions. The semantic analysis yielded a semantic proposal, formulated in accordance with SNOMED-CT as follows: “In a worker, occupational burnout or occupational physical AND emotional exhaustion state is an exhaustion due to prolonged exposure to work-related problems”. A panel of 50 experts (researchers and healthcare professionals with an interest in occupational burnout) reached consensus on this proposal in the second round of the Delphi, with 82% of experts agreeing on it.

Conclusion: This study resulted in a harmonized definition of occupational burnout approved by experts from 29 countries within OMEGA-NET. Future research should address the reproducibility of the Delphi consensus in a larger panel of experts, representing more countries, and examine the practicability of the definition.
Number of citations per original and secondary definition of occupational burnout among studies included in the systematic review
Three CSV files. The first (ResearchStrings.csv) presents the literature search strings applied to MEDLINE, EMBASE, and PsycINFO, respectively. The second (DefinitionsIndexation&Citation_OriginaVsUniqueDef.csv) presents the statements of the different definitions of occupational burnout identified within the systematic review, their references, and the references of studies citing them. Finally, the third (DefinitionsIndexation&Citation_UniqueDefinitionSummary.csv) presents the correspondence between these “unique” definitions and their “original” definitions.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Random Forest and E5 model performance comparison.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Average mean difference after mean and variance adjustments under random subsets within dataset.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Average mean difference under random subsets within dataset.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This visualization product displays the abundance of smoking-related items per beach per year obtained during research and cleaning operations. EMODnet Chemistry included the gathering of marine litter in its 3rd phase. Since the beginning of 2018, beach litter data have been gathered and processed in the EMODnet Chemistry Marine Litter Database (MLDB). Harmonizing all the data has been the most challenging task given the heterogeneity of the data sources, sampling protocols, and reference lists used on a European scale. Preliminary processing was necessary to harmonize all the data:
- exclusion of the OSPAR 1000 protocol,
- separation of monitoring surveys from research and cleaning operations,
- exclusion of beaches with no coordinates,
- normalization of survey lengths and survey numbers per year,
- removal of some categories and some litter types.
Abundances have been calculated for each beach and year using the following computation: cigarette related items abundance = (total number of cigarette related items, normalized to 100 m) / (number of surveys in the year). Percentiles 50, 75, and 95 have been calculated taking into account data from all years. The cigarette related items reference codes taken into account for this product, and information on data processing and calculation, are detailed in the attached document (p. 15).
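As a small worked example of that computation (with invented counts and column names, not the MLDB schema), the per-beach, per-year abundance and the pooled percentiles can be derived as follows:

```python
import pandas as pd

# Toy per-survey counts already normalized to a 100 m stretch of beach.
surveys = pd.DataFrame({
    "beach": ["A", "A", "A", "B", "B"],
    "year": [2019, 2019, 2020, 2019, 2019],
    "cigarette_items_per_100m": [42, 58, 30, 12, 20],
})

# Abundance = total normalized count / number of surveys in that beach-year.
abundance = (surveys.groupby(["beach", "year"])["cigarette_items_per_100m"]
             .agg(total="sum", n_surveys="count"))
abundance["abundance"] = abundance["total"] / abundance["n_surveys"]

# Percentiles 50, 75 and 95 computed over all beach-year abundances, pooling all years.
print(abundance["abundance"].quantile([0.50, 0.75, 0.95]))
```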
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Transformation from the source table to the target table.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Each row indicates the specific experiment, dataset, and comparison groups: healthy controls (HC), episodic migraine (EM) patients, and chronic migraine (CM) patients.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Accuracy, specificity, sensitivity, and F1-score of four models with and without the 'healthy core' for CM and EM classification.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Number and area of urban parks before and after harmonization obtained by GMaps and OSM tools for the 16 cities with official spatial data.