Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). The lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems, and it can lead to issues related to reproducibility, debugging of AI models, bias and fairness, and compliance and regulation. We introduce a formal data preparation pipeline specification to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications, with a focus on traceability.
Methods: We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly those conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We use FHIR profiling to develop a common data model tailored to an AI use case, which enables the explicit declaration of the needed information such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health records data with irregular time series sampling to a flat structure by defining a target population, feature groups and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.
Results: We implement a scalable, high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets and their metadata, including accompanying statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing feature values of individual entities during online prediction. We deployed and tested the proposed methodology and the implementation in three different research projects. We present the developed FHIR profiles as a common data model, together with feature group definitions and feature definitions within a data preparation pipeline used while training an AI model for “predicting complications after cardiac surgeries”.
Discussion: The implementation across various pilot use cases demonstrates that our framework has the breadth and flexibility needed to define a diverse array of features, each tailored to specific temporal and contextual criteria.
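The flattening step described in the Methods can be illustrated with a minimal sketch. This is not the framework's declarative language: it is plain Python, and every name in it (`Observation`, `flatten_features`, the `heart_rate` code, the 24-hour lookback window) is a hypothetical choice made for illustration. It aggregates irregularly sampled observations into one flat feature row per member of a target population, relative to a reference time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Observation:
    patient_id: str
    code: str          # hypothetical feature code, e.g. "heart_rate"
    time: datetime
    value: float


def flatten_features(observations, reference_times, code, window):
    """Return one flat row per patient: count, mean and last value of `code`
    observed in the interval [reference - window, reference)."""
    rows = {}
    for pid, ref in reference_times.items():
        in_window = sorted(
            (o for o in observations
             if o.patient_id == pid and o.code == code
             and ref - window <= o.time < ref),
            key=lambda o: o.time,
        )
        values = [o.value for o in in_window]
        rows[pid] = {
            f"{code}_count": len(values),
            f"{code}_mean": mean(values) if values else None,
            f"{code}_last": values[-1] if values else None,
        }
    return rows


# Hypothetical reference time (e.g. start of surgery) and observations.
surgery = datetime(2024, 1, 10, 8, 0)
obs = [
    Observation("p1", "heart_rate", surgery - timedelta(hours=30), 90.0),  # outside the window
    Observation("p1", "heart_rate", surgery - timedelta(hours=5), 70.0),
    Observation("p1", "heart_rate", surgery - timedelta(hours=1), 80.0),
]
rows = flatten_features(obs, {"p1": surgery, "p2": surgery},
                        "heart_rate", timedelta(hours=24))
print(rows["p1"])
```

Patients with no observations in the window (here `p2`) still get a row, with a count of zero and empty aggregates, so the resulting table keeps one row per member of the target population.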
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Harmony is a data harmonisation project that uses Natural Language Processing to help researchers make better use of existing data from different studies by supporting them with the harmonisation of various measures and items used in different studies. Harmony is a collaboration project between the University of Ulster, University College London, the Universidade Federal de Santa Maria in Brazil, and Fast Data Science Ltd.
You can read more at https://harmonydata.org.
There is a live demo at: https://app.harmonydata.org/
These are the datasets used to validate Harmony. The Excel file is McElroy et al.'s data, and the zip file contains the English and Portuguese GAD-7s.
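To give a feel for what matching questionnaire items across studies involves, the sketch below pairs items from two instruments by plain token overlap (Jaccard similarity). This is a deliberately simple stand-in, not Harmony's NLP method, and the item wordings are hypothetical examples rather than entries from the validation datasets.

```python
import re


def tokens(text):
    """Lowercase word tokens of a questionnaire item."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two items, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_matches(items_a, items_b):
    """For each item in instrument A, return the most similar item in B."""
    return {a: max(items_b, key=lambda b: jaccard(a, b)) for a in items_a}


# Hypothetical items from two instruments (illustrative wording only).
instrument_a = ["Feeling nervous, anxious or on edge", "Trouble relaxing"]
instrument_b = ["I had trouble relaxing", "I felt nervous or anxious"]

matches = best_matches(instrument_a, instrument_b)
for item, match in matches.items():
    print(f"{item!r} -> {match!r}")
```

Token overlap only works when items share vocabulary, so it cannot relate items written in different languages such as the English and Portuguese GAD-7s; that limitation is one reason to prefer a language-model-based approach.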
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created the harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Uganda - National Panel Survey 2019-2020” and “Uganda - High-Frequency Phone Survey on COVID-19 2020-2021” documentation available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Uganda National Panel Survey 2019-2020 and Uganda High-Frequency Phone Survey on COVID-19 2020-2021 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
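The round-naming convention can be handled programmatically. The sketch below is a minimal example that assumes the “_rX” marker is a suffix at the end of the variable name (the text only says it appears in the name); the function name and the sample variable names are hypothetical.

```python
import re

# Assumption: the round marker is a trailing suffix such as "_r3".
_ROUND_SUFFIX = re.compile(r"_r(\d+)$")


def split_round(variable_name):
    """Return (base_name, round_number); names without "_rX" map to Round 0."""
    match = _ROUND_SUFFIX.search(variable_name)
    if match is None:
        return variable_name, 0
    return variable_name[:match.start()], int(match.group(1))


# Hypothetical variable names, for illustration only.
for name in ["food_insecure_r3", "hh_size"]:
    print(name, "->", split_round(name))
```

Grouping columns by the returned base name gives the per-round series of a variable, with the Round 0 value coming from the face-to-face survey.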
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created the harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Ethiopia - Socioeconomic Survey 2018-2019” and “Ethiopia - COVID-19 High Frequency Phone Survey of Households 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Ethiopia Socioeconomic Survey (ESS) 2018-2019 and Ethiopia COVID-19 High Frequency Phone Survey of Households (HFPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
The harmonized data set on health, created and published by the ERF, is a subset of Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules, collected in the context of the above-mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.
----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:
Iraq is considered a leader in household expenditure and income surveys: the first was conducted in 1946, followed by surveys in 1954 and 1961. After the establishment of the Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years (1971/1972, 1976, 1979, 1984/1985, 1988, 1993, 2002/2007). As part of the cooperation between the CSO and the World Bank (WB), the Central Statistical Organization (CSO) and the Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
The survey has six main objectives. These objectives are:
The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.
National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.
1- Household/family. 2- Individual/person.
The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
Sample survey data [ssd]
----> Design:
The sample size was 25488 households for the whole of Iraq: 216 households in each of the 118 districts, organized into 2832 clusters of 9 households each, distributed across districts and governorates for rural and urban areas.
----> Sample frame:
The listing and numbering results of the 2009-2010 Population and Housing Survey were adopted in all governorates, including Kurdistan Region, as the frame for selecting households. The sample was selected in two stages. Stage 1: Primary sampling units (blocks) within each stratum (district), for urban and rural areas, were systematically selected with probability proportional to size to reach 2832 units (clusters). Stage 2: 9 households were selected from each primary sampling unit to create a cluster. The total sample was thus 25488 households distributed across the governorates, with 216 households in each district.
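The sample-size figures quoted above are mutually consistent, which a few lines of arithmetic confirm. The variable names are only illustrative; the numbers are those given in the design description.

```python
DISTRICTS = 118
TOTAL_CLUSTERS = 2832
HOUSEHOLDS_PER_CLUSTER = 9

# Clusters are spread evenly over the districts (strata).
assert TOTAL_CLUSTERS % DISTRICTS == 0
clusters_per_district = TOTAL_CLUSTERS // DISTRICTS

households_per_district = clusters_per_district * HOUSEHOLDS_PER_CLUSTER
households_total = TOTAL_CLUSTERS * HOUSEHOLDS_PER_CLUSTER

print(clusters_per_district, households_per_district, households_total)
```

This yields 24 clusters and 216 households per district and 25488 households overall.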
----> Sampling Stages:
In each district, the sample was selected in two stages. Stage 1: Based on the 2010 listing and numbering frame, 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, in addition to the implicit urban/rural breakdown and geographic breakdown (sub-district, quarter, street, county, village and block). Stage 2: Using households as secondary sampling units, 9 households were selected from each sample point using systematic equal probability sampling. Sampling frames for each stage can be developed based on the 2010 building listing and numbering without updating household lists. In some small districts, the random selection of primary sampling units may yield fewer than 24 distinct units, so a sampling unit may be selected more than once; where necessary, two or more clusters may be drawn from the same enumeration unit.
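The two stages can be sketched in code. The example below is a generic textbook version of systematic sampling with probability proportional to size (PPS) followed by systematic equal-probability sampling; it is not the survey team's actual procedure, and the block sizes are made-up numbers. It also shows why a unit can be drawn more than once when it is large relative to the sampling interval.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n units systematically with probability proportional to size.

    Returns unit indices in order; a unit whose size exceeds the sampling
    interval can appear more than once.
    """
    interval = sum(sizes) / n
    start = rng.uniform(0, interval)
    selected = []
    cumulative = 0.0
    unit = 0
    for k in range(n):
        target = start + k * interval
        # Advance to the unit whose cumulative size range contains the target.
        while unit < len(sizes) - 1 and cumulative + sizes[unit] <= target:
            cumulative += sizes[unit]
            unit += 1
        selected.append(unit)
    return selected


def systematic_equal(units, n, rng):
    """Select n of the listed units systematically with equal probability."""
    interval = len(units) / n
    start = rng.uniform(0, interval)
    last = len(units) - 1
    return [units[min(int(start + k * interval), last)] for k in range(n)]


rng = random.Random(0)

# Stage 1: 24 sample points from a district of 200 blocks (hypothetical sizes).
block_sizes = [rng.randint(50, 400) for _ in range(200)]
blocks = pps_systematic(block_sizes, 24, rng)

# Stage 2: 9 households from the household list of the first selected block.
household_list = list(range(block_sizes[blocks[0]]))
households = systematic_equal(household_list, 9, rng)

# A small district dominated by one large block: that block is hit repeatedly.
small_district = pps_systematic([1000, 10, 10], 3, random.Random(1))
print(blocks, households, small_district)
```

In the last call the sampling interval (340) is much smaller than the largest block (1000), so block 0 is selected at least twice, mirroring the case described for small districts.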
Face-to-face [f2f]
----> Preparation:
The questionnaire of the 2006 survey was adopted as the basis for the 2012 questionnaire, to which many revisions were made. Two rounds of pre-tests were carried out. Revisions were made based on feedback from the fieldwork team, World Bank consultants and others, and further revisions were made before the final version was implemented in a pilot survey in September 2011. After the pilot survey, additional revisions were made based on the challenges and feedback that emerged during its implementation, producing the final version used in the actual survey.
----> Questionnaire Parts:
The questionnaire consists of four parts, each with several sections:
Part 1: Socio-Economic Data:
- Section 1: Household Roster
- Section 2: Emigration
- Section 3: Food Rations
- Section 4: Housing
- Section 5: Education
- Section 6: Health
- Section 7: Physical measurements
- Section 8: Job seeking and previous job
Part 2: Monthly, Quarterly and Annual Expenditures:
- Section 9: Expenditures on Non-Food Commodities and Services (past 30 days)
- Section 10: Expenditures on Non-Food Commodities and Services (past 90 days)
- Section 11: Expenditures on Non-Food Commodities and Services (past 12 months)
- Section 12: Expenditures on Non-food Frequent Food Stuff and Commodities (7 days)
- Section 12, Table 1: Meals Had Within the Residential Unit
- Section 12, Table 2: Number of Persons Participate in the Meals within Household Expenditure Other Than its Members
Part 3: Income and Other Data:
- Section 13: Job
- Section 14: Paid jobs
- Section 15: Agriculture, forestry and fishing
- Section 16: Household non-agricultural projects
- Section 17: Income from ownership and transfers
- Section 18: Durable goods
- Section 19: Loans, advances and subsidies
- Section 20: Shocks and strategy of dealing in the households
- Section 21: Time use
- Section 22: Justice
- Section 23: Satisfaction in life
- Section 24: Food consumption during past 7 days
Part 4: Diary of Daily Expenditures: The diary of expenditure is an essential component of this survey. It is left at the household to record all daily purchases, such as expenditures on food and frequent non-food items (gasoline, newspapers, etc.), during 7 days. Two pages were allocated for recording the expenditures of each day, so the roster consists of 14 pages.
----> Raw Data:
Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages:
1. Interviewer: Checks all answers on the household questionnaire, confirming that they are clear and correct.
2. Local Supervisor: Checks to make sure that questions have been correctly completed.
3. Statistical analysis: After exporting data files from Excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values in addition to auditing some variables.
4. World Bank consultants in coordination with the CSO data management team: The World Bank technical consultants use additional programs in SPSS and STAT to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameter for each variable.
----> Harmonized Data:
The Iraq Household Socio Economic Survey (IHSES) reached a total of 25488 households. The number of households that refused to respond was 305, and the response rate was 98.6%. The highest interview rates were in Ninevah and Muthanna (100%), while the lowest rate was in Sulaimaniya (92%).
This is an integration of 10 independent multi-country, multi-region, multi-cultural social surveys fielded by Gallup International between 2000 and 2013. The integrated data file contains responses from 535,159 adults living in 103 countries. In total, the harmonization project combined 571 social surveys.
These data have value in a number of longitudinal multi-country, multi-regional, and multi-cultural (L3M) research designs. They can be understood as independent, though non-random, L3M samples containing a number of multiple-indicator ASQ (ask same questions) and ADQ (ask different questions) measures of human development, the environment, international relations, gender equality, security, international organizations, and democracy, to name a few [see full list below].
The data can be used for exploratory and descriptive analysis, with greatest utility at low levels of resolution (e.g. nation-states, supranational groupings). Level of resolution in analysis of these data should be sufficiently low to approximate confidence intervals.
These data can be used for teaching 3M methods, including data harmonization in L3M, 3M research design, survey design, 3M measurement invariance, analysis, visualization, and reporting. They also offer opportunities to teach about paradata, metadata, and data management in L3M designs.
The country units are an unbalanced panel derived from non-probability samples of countries and respondents. Panels (countries) have left and right censorship and are thus unbalanced. This design limitation can be overcome to the extent that VOTP panels are harmonized with public measurements from other 3M surveys to establish balance in terms of panels and occasions of measurement. Should L3M harmonization occur, these data can be assigned confidence weights to reflect the amount of error in these surveys.
Pooled public opinion surveys (country means), when combined with higher quality country measurements of the same concepts (ASQ, ADQ), can be leveraged to increase the statistical power of pooled public opinion research designs (multiple L3M datasets), that is, in studies of public, rather than personal, beliefs.
The Gallup Voice of the People survey data are based on uncertain, underspecified sampling methods. Country sampling is non-random. The sampling method appears to be primarily probability and quota sampling, with occasional oversampling of urban populations in difficult-to-survey populations. The sampling units (countries and individuals) are poorly defined, suggesting these data have more value in research designs calling for independent-samples replication and repeated-measures frameworks.
The Voice of the People Survey Series is WIN/Gallup International Association's End of Year survey and is a global study that collects the public's view on the challenges that the world faces today. Ongoing since 1977, the purpose of WIN/Gallup International's End of Year survey is to provide a platform for respondents to speak out concerning government and corporate policies. The Voice of the People, End of Year Surveys for 2012, fielded June 2012 to February 2013, were conducted in 56 countries to solicit public opinion on social and political issues. Respondents were asked whether their country was governed by the will of the people, as well as their attitudes about their society. Additional questions addressed respondents' living conditions and feelings of safety around their living area, as well as personal happiness. Respondents' opinions were also gathered in relation to business development and their views on the effectiveness of the World Health Organization. Respondents were also surveyed on ownership and use of mobile devices. Demographic information includes sex, age, income, education level, employment status, and type of living area.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the ON-Harmony data resource associated with the manuscript:
Warrington et al. (2025), "A multi-site, multi-modal travelling-heads resource for brain MRI harmonisation", under review, Scientific Data
Phase A
N = 10 healthy participants (mean age at recruitment 34 ± 9.4 years; 8 male, 2 female)
Subject: Sessions
Phase B
N = 10 healthy participants (mean age at recruitment 29.8 ± 11.4 years; 5 male, 5 female)
Subject: Sessions
Scanners:
Phase A data were acquired using Scanners 1-6. Phase B data were acquired using Scanners 1, 2, 4, 5, 7 and 8. Scanners 1, 2, 4 and 5 overlap between the two phases.
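When selecting sessions for cross-phase comparisons, the scanner allocation above can be expressed as sets. This is only a small sketch using the scanner numbers stated in the text; the variable names are illustrative.

```python
phase_a = set(range(1, 7))        # Scanners 1-6
phase_b = {1, 2, 4, 5, 7, 8}

shared = sorted(phase_a & phase_b)        # scanners used in both phases
only_a = sorted(phase_a - phase_b)        # scanners used only in Phase A
only_b = sorted(phase_b - phase_a)        # scanners used only in Phase B
all_scanners = sorted(phase_a | phase_b)  # every scanner in the resource

print(shared, only_a, only_b, len(all_scanners))
```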
Modalities:
Scanners are located across five sites in the United Kingdom.
Anatomical data have been defaced following the UKBiobank defacing procedure. Defacing masks are available in each session directory as sub-
At the time of release, SWI data were not yet incorporated into the BIDS standard. The SWI extension proposal (https://bids-specification.readthedocs.io/en/v1.2.1/06-extensions.html: accessed Autumn 2022) was used to define the SWI data structure.
Minor deviations of protocols for individual scans
The majority of subjects were acquired using the same protocols. However, for some subjects there were minor deviations in some protocol parameters that we describe here for completeness.
Phase A
- GE MR 750 rfMRI: For one subject (03286), we acquired two versions of the rfMRI data with 1) an isotropic spatial resolution of 2.4mm and 2) a spatial resolution of 2.2x2.2x3.4mm. Neither matched the 3.3mm isotropic data acquired for other subjects. For 6 subjects (03286, 03997, 10975, 12813, 14482, 14221) there was a mismatch in PE direction between dMRI and fMRI that we accounted for in the processing pipeline.
- Philips Achieva dMRI: In most cases, the dMRI protocol included 6 b=0 s/mm2 volumes. Four subjects (13305, 13192, 14229, 14230) were acquired with 2 b=0 s/mm2 volumes.
- Philips Ingenia dMRI: In most cases, dMRI data were acquired using an in-plane acceleration factor of 1.5 (TE=98ms, TR=4.4s). For 4 subjects (13305, 13192, 14229, 14230), the dMRI data were acquired using an in-plane acceleration factor of 2 (TE=92ms, TR=3.9s).
Phase B
- Philips Achieva fMRI: 16975 (16975_NOT1ACH001) has a single volume missing, due to data corruption during reconstruction.
- Philips Ingenia fMRI: 15320 (15320_NOT2ING006) has a single volume missing, due to data corruption during reconstruction.
- GE Premier 21 (Oxford) dMRI: For session 16793_OXF4GEP001, the dMRI reversed-phase-encode b=0 was incorrectly acquired with a left-right (LR) phase-encode direction instead of posterior-anterior (PA). We accounted for this by running a bespoke fieldmap estimation with FSL's topup (Andersson et al. NeuroImage 2003), which was fed into the UKBB pipeline for further processing. For session 16745_OXF4GEP001, dMRI were acquired with a slice thickness of 2.4 mm, instead of 2.0 mm.
- GE Premier 21 (Oxford) anatomical: For subjects 15320, 16793, 16794, 16974, 16766, 16745, the T2w FLAIR was acquired with a 1.3x1x1 mm resolution instead of 1x1x1 mm. For sessions 16793 and 16794, the T1w MPRAGE was acquired with a 212x256x256 matrix size, instead of 256x256x212.
- GE Premier 42 (Nottingham) dMRI: For session 16794_NOT4GEP001, no dMRI reversed-phase-encode b=0 image was acquired. In this case, we generated a synthetic distortion-free b=0 image using Synb0-DisCo (v3.0) (Schilling et al. Magn Reson Imaging. 2019, Schilling et al. PLoS ONE 2020) with the same phase-encoding (anterior-posterior, AP) direction as the main data. A fieldmap was then estimated from the synthetic distortion-free b=0 and acquired b=0 data using FSL's topup, which was then used for further processing.
Incidental Findings
- A cortical hypointensity visible on the right hemisphere (for fMRI, dMRI and SWI) was observed for Subject 16766, phase B. The subject is healthy, and incidental-finding inspection deemed this a non-pathological feature. Overall, the z-scored IDPs and QC measures for that subject's scans were not considerably different from the mean of all subjects, so we kept these scans.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A new methodology and comprehensive database (Multi-scale harmonisation Across Physical and Socio-Economic Characteristics of a City region, MAPSECC) is developed that connects physical characteristics of a city (building morphology and materials, land-surface cover) with socio-economic aspects (building function, microenvironments of activity, urban transport infrastructure, residential and workplace populations, human activities), and is demonstrated for London, UK (MAPSECC: London). The database fulfils input requirements for dynamic and multi-scale urban modelling approaches. Dataset components combine and harmonise information from primary sources (often government agencies) through novel downscaling and aggregation methods to give a traceable, repeatable methodology.
Files in this archive
Documentation
MAPSECC_London_documentation.pdf
Processing grid
Main dataset: London_500m_grid.zip
Code: London_500m_grid_code.zip
Land-cover fractions
Main dataset: London_landcover.zip
Auxiliary data: London_landcover_auxiliary.zip
Code: London_landcover_code.zip
Building typologies (with population statistics)
Main dataset: Building_typologies.zip
Auxiliary data: Building_typologies_auxiliary.zip
Code: Building_typologies_code.zip
Building material parameters
Main dataset: Materials_layer_info.zip, Materials_layer_processed.zip, Materials_parameters.zip
Code: Materials_code.zip
Human activity profiles
Main dataset: UK_TUS2014-15_activity_profiles.zip
Auxiliary data: Activity_profiles_auxiliary.zip
Code: Activity_profiles_code.zip
Transport database
Main dataset: London_transport_database.zip
Code: London_transport_code.zip
Road lengths by type
Main dataset: London_roads_by_type.zip
Auxiliary data: London_roads_auxiliary.zip
Code: London_roads_code.zip
Spatial attractors
Main dataset: London_attractors.zip
Auxiliary data: London_attractors_auxiliary.zip
Code: London_attractors_code.zip
Disclaimer notice
disclaimer_note.txt
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created the harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Nigeria General Household Survey, Panel (GHS-Panel) 2018-2019 and Nigeria COVID-19 National Longitudinal Phone Survey (COVID-19 NLPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table includes figures on the effects of rent harmonisation and renovation on the average rent increase. A distinction is made here between rental of dwellings by social and other landlords, mid-tier rental and liberalised rental.
Data available from: 2015.
Status of the figures: The figures in this table are definitive.
Changes as of 10 October 2025: The figures for 'Share of harmonisation of dwellings' have been corrected for the reporting year 2025. In the earlier calculation, not all homes were correctly classified. This has no impact on the other figures in this table.
Changes as of 5 September 2025: The 'Mid-tier rental' category has been added to the dimension 'Type of rental'. The figures of 2025 have been published.
When will new figures be published? New figures will become available in September 2026.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AusENDVI (Australian Empirical NDVI) is a monthly, 5-km gridded estimate of NDVI across Australia from 1982-2022. It is built by calibrating and harmonising NOAA's Climate Data Record AVHRR NDVI data to MODIS MCD43A4 NDVI using a gradient boosting ensemble decision tree method. Additionally, the datasets are gap-filled using a synthetic NDVI dataset. The methods are extensively described in an Earth System Science Data pre-publication.
AusENDVI consists of several datasets, each dataset has a description in the attributes of the NetCDF file that describes its provenance. The naming convention is "AusENDVI_
All datasets are in "EPSG:4326" projection, and have a spatial resolution of 0.05 degrees. Geographic coordinate information is contained in the "spatial_ref" variable.
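The 0.05-degree resolution corresponds only approximately to 5 km, because the east-west extent of a grid cell in a geographic (latitude/longitude) projection shrinks with latitude. The sketch below estimates cell dimensions under a spherical-Earth approximation; the radius value and the function name are assumptions made for illustration and are not part of the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (spherical approximation)
RES_DEG = 0.05             # grid resolution stated for the datasets


def cell_size_km(lat_deg, res_deg=RES_DEG):
    """Approximate (north-south, east-west) extent in km of one grid cell."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = res_deg * km_per_degree
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


# Latitudes roughly spanning the Australian mainland.
for lat in (-12.0, -25.0, -38.0):
    ns, ew = cell_size_km(lat)
    print(f"lat {lat:6.1f}: {ns:.2f} km N-S x {ew:.2f} km E-W")
```

Under this approximation a cell is about 5.6 km north to south everywhere and about 5.0 km east to west at 25 degrees south, which is consistent with describing the product as a 5-km grid.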
A Jupyter Notebook is also provided that shows how to load, plot, QC mask, reproject, and gap-fill AusENDVI datasets. The notebook is effectively a 'readme' file.
An open-source GitHub repository details the methods used to create these datasets.
Facebook
Twitterhttps://www.etalab.gouv.fr/licence-ouverte-open-licencehttps://www.etalab.gouv.fr/licence-ouverte-open-licence
Data produced by BRGM and funded by DREAL CFB. Harmonisation, at a scale of 1/50 000, of the eight previously produced harmonised departmental maps. This harmonised digital map makes it possible, at the regional level, to search or process information quickly, for example by selecting specific levels or formations (by age, by lithology, etc.). It is intended, among other things, to support the regional quarry scheme, which, in accordance with the new regulations, will replace the pre-existing departmental quarry schemes.
For the Burgundy-Franche-Comté region, the eight harmonised geological maps of its departments were completed in:
- 2003 for Côte-d’Or and Nièvre,
- 2004 for Haute-Saône,
- 2005 for Yonne,
- 2007 for Jura,
- 2008 for Doubs, Saône-et-Loire and Territoire-de-Belfort.
Attention: The harmonisation work, both at the departmental and regional level, was carried out solely on the basis of existing geological maps, without new fieldwork. The accuracy of the harmonised maps therefore depends on the accuracy of each original 1/50 000 sheet and on its survey date specified above.
Stages of harmonisation work:
1- Harmonise the four departments of the former Franche-Comté region to produce a harmonised geological map of that region.
2- Harmonise the two harmonised regional geological maps with each other: former Burgundy region and former Franche-Comté region.
The final product is delivered as 4 layers:
GEO050K_HARM_BOU-FRC_L_DIVERS_2154
GEO050K_HARM_BOU-FRC_L_FGEOL_2154
GEO050K_HARM_BOU-FRC_L_STRUCT_2154
GEO050K_HARM_BOU-FRC_S_FGEOL_2154
Background: A consensual definition of occupational burnout is currently lacking. We aimed to harmonize the definition of occupational burnout as a health outcome in medical research and to reach a consensus on this definition within the Network on the Coordination and Harmonisation of European Occupational Cohorts (OMEGA-NET).
Methods: First, we performed a systematic review in MEDLINE, PsycINFO and EMBASE (January 1990 to August 2018) and a semantic analysis of the available definitions. We used the definitions of burnout and burnout-related concepts from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) to formulate a consistent harmonized definition of the concept. Second, we sought to obtain consensus on the proposed definition using the Delphi technique.
Results: We identified 88 unique definitions of burnout and assigned each of them to one of the 11 original definitions. The semantic analysis yielded a semantic proposal, formulated in accordance with SNOMED-CT as follows: “In a worker, occupational burnout or occupational physical AND emotional exhaustion state is an exhaustion due to prolonged exposure to work-related problems”. A panel of 50 experts (researchers and healthcare professionals with an interest in occupational burnout) reached consensus on this proposal at the second round of the Delphi, with 82% of experts agreeing on it.
Conclusion: This study resulted in a harmonized definition of occupational burnout approved by experts from 29 countries within the OMEGA-NET. Future research should address the reproducibility of the Delphi consensus in a larger panel of experts, representing more countries, and examine the practicability of the definition.
International
Number of citations per original and secondary definition of occupational burnout among studies included in the systematic review
Three CSV files. The first one (ResearchStrings.csv) presents the literature search strings applied to MEDLINE, EMBASE, and PsycINFO, respectively. The second file (DefinitionsIndexation&Citation_OriginaVsUniqueDef.csv) presents the statements of the different definitions of occupational burnout identified within the systematic review, their references and the references of studies citing them. Finally, the third file (DefinitionsIndexation&Citation_UniqueDefinitionSummary.csv) presents the correspondence between these “unique” definitions and their “original” definitions.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE NATIONAL INSTITUTE OF STATISTICS (INS) - TUNISIA
The survey aims to estimate the demographic and educational characteristics of the population. It also calculates economic indicators for the population, such as the number of active individuals, the additional demand for jobs, the number of employed and their characteristics, the number of jobs created, the characteristics of the unemployed and the unemployment rate. Furthermore, the survey estimates these indicators at the household level, along with household living conditions.
The results of this survey were compared with the results of the second quarter of the national survey on population and employment 2011. It should also be noted that the National Institute of Statistics - Tunisia uses the unemployment definition and concepts adopted by the International Labour Organization. Under this definition, an unemployed individual did not work during the week preceding the day of the interview, was looking for a job in the month preceding the date of the interview, and is available to work within two weeks after the day of the interview.
In 2010, the National Institute of Statistics adopted a strict ILO definition of unemployment, requiring that the person must have taken active steps to search for a job in the month preceding the day of the interview.
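The two unemployment definitions described above can be read as a simple rule over survey answers. The following is a minimal sketch only, not the INS or ERF implementation; the `Respondent` class and all of its field names are hypothetical and simply mirror the conditions stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    # Hypothetical field names, one per condition stated in the text.
    worked_last_week: bool            # worked during the week before the interview
    sought_job_last_month: bool       # looked for a job in the preceding month
    available_within_two_weeks: bool  # can start work within two weeks
    took_active_search_steps: bool = False  # extra condition of the strict definition


def is_unemployed(r: Respondent, strict: bool = False) -> bool:
    """Apply the three ILO conditions; with strict=True, also require
    active job-search steps, as in the definition adopted in 2010."""
    base = (
        not r.worked_last_week
        and r.sought_job_last_month
        and r.available_within_two_weeks
    )
    if strict:
        return base and r.took_active_search_steps
    return base
```

For example, a respondent who meets the three base conditions but reports no active search steps counts as unemployed under the general definition and not under the strict one.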
Covering a representative sample at the national and regional level (governorates).
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The sample is drawn from the frame of the 2004 General Census of Population and Housing.
Face-to-face [f2f]
Three modules were designed for data collection:
Household Questionnaire (Module 1): Includes questions regarding household characteristics, living conditions, individuals and their demographic, educational and economic characteristics. This module also provides information on internal and external migration.
Active Employed Questionnaire (Module 2): Includes questions regarding the characteristics of employed individuals, such as occupation, industry and wages for employees.
Active Unemployed Questionnaire (Module 3): Includes questions regarding the characteristics of the unemployed, such as unemployment duration, the last occupation, activity, and the number of days worked during the last year, etc.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MedNorm2SnomedCT2UMLS Paper on Mednorm and harmonisation: https://aclanthology.org/W19-3204.pdf The medical concept normalisation task aims to map textual descriptions to standard terminologies such as SNOMED-CT or MedDRA. Existing publicly available datasets annotated using different terminologies cannot be simply merged and utilised, and therefore become less valuable when developing machine learning-based concept normalisation systems. To address that, we designed a data harmonisation… See the full description on the dataset page: https://huggingface.co/datasets/awacke1/MedNorm2SnomedCT2UMLS.
DRAKO is a Mobile Location Data provider with a programmatic trading desk specializing in geolocation analytics and programmatic advertising. Our Consumer Travel History Data has helped cities, counties, and states better understand who their visitors are so that they can effectively develop and deliver advertising campaigns. We’re in a unique position to deliver enriched insight beyond traditional surveying or other data sources because of our rich dataset, proprietary modelling capabilities, and analytical capabilities.
MAIDs (Mobile Advertising IDs) are unique device identifiers associated with consenting mobile devices that can be utilized for geolocation based analyses and audiences. Drako uses MAIDs to fuel our Consumer Travel History Data utilizing our Home Location Model. The Home Location of a MAID is determined based on where that MAID is seen most frequently between the hours of 11pm and 6am (local time). Using this we are able to determine the Home Location of a user which in turn allows us to identify when and where they are travelling.
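The home-location rule described above (the place where a MAID is seen most often between 11pm and 6am local time) can be sketched as follows. This is an illustrative sketch only, not DRAKO's proprietary model; the function names, the `(timestamp, location_id)` input shape, and the treatment of exactly 6:00am as outside the window are all assumptions.

```python
from collections import Counter
from datetime import datetime


def is_night(ts: datetime) -> bool:
    # Night window from the text: 11pm to 6am local time.
    # Assumption: 23:00 is included and 06:00 is excluded.
    return ts.hour >= 23 or ts.hour < 6


def home_location(pings):
    """Return the most frequent night-time location for one MAID.

    pings: iterable of (local_timestamp, location_id) pairs (hypothetical shape).
    Returns None when the device has no night-time pings.
    """
    counts = Counter(loc for ts, loc in pings if is_night(ts))
    if not counts:
        return None
    # Ties are broken by first appearance, which is an arbitrary choice here.
    return counts.most_common(1)[0][0]
```

A device whose daytime pings cluster at an office but whose night-time pings cluster elsewhere is assigned the night-time location, which is what lets later steps flag visits far from that location as travel.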
Beyond identifying that users are tourists, we can also classify them into different bins by their frequency / dwell time over their estimated number of visits. Using our data and frequency, we can identify: overnight visitors, weekend visits, short-term stays, long-term stays, or frequent holiday visitors.
Beyond Consumer Travel History Data in your defined geography alone, we are also able to provide:
- Home location - Find out where your audience is coming from using our home location technology
- Movement - Quantify how far users have travelled between locations
- Demographics - Discover neighborhood-level characteristics such as income, ethnicity, and more
- Brand index - Learn which major brands and retailers your audience is visiting the most
- Visitation index - See which destinations your visitors are visiting the most
- Addressable audience - Customize your audiences for your campaigns using our analytic insights
Moreover, if you’re looking to activate your Consumer Travel History Data for advertising, we’re always able to further refine or filter your desired audience with our other Audience Data, such as: Brand visits, Geodemographics, Ticketed Event visits, Purchase Intent (in Canada), Purchase History (in USA), and more.
Data Compliance: All of our Consumer Travel History Data is fully CCPA compliant and 100% sourced from SDKs (Software Development Kits), the most reliable and consistent mobile data stream with end user consent, available with only a 4-5 day delay. This means that our location and device ID data comes from partnerships with over 1,500 mobile apps. This data comes with an associated location, which is how we are able to segment using geofences.
Data Quality: In addition to partnering with trusted SDKs, DRAKO has additional screening methods to ensure that our mobile location data is consistent and reliable. This includes data harmonization and quality scoring from all of our partners in order to disregard MAIDs with a low quality score.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Harmonisation data set and infilling database used in creating input for CMIP7's ScenarioMIP. These were used for harmonisation before gridding, as well as for creating 'complete' scenarios for running simple climate models (the so-called 'global-workflow' files focus on this application).
DRAKO is a Mobile Location Audience Data provider with a programmatic trading desk specialising in geolocation analytics and programmatic advertising. Through our customised approach, we offer business and consumer insights as well as addressable audiences for advertising.
Mobile Location Data can be meaningfully transformed into Audience Data when used in conjunction with other datasets. Our expansive POI Data allows us to segment users by visitation to major brands and retailers and to categorize them into syndicated segments. Beyond POI visits, our proprietary Home Location Model determines residents of geographic areas such as Designated Market Areas, Counties, or States. Relatedly, our Home Location Model also fuels our Geodemographic Census Data segments, as we are able to determine residents of the smallest census units. Additionally, we also have audiences of: ticketed event and venue visitors; survey data; and retail data.
All of our Audience Data is 100% deterministic in that it only includes high-quality, real visits to locations as defined by a POI's building contour in satellite imagery. We never use a radius when building an audience unless requested.
Overview of our Syndicated Audience Data Segments:
- Brand/POI segments (specific named stores and locations)
- Categories (behavioural segments - revealed habits)
- Census demographic segments (HH income, race, religion, age, family structure, language, etc.)
- Events segments (ticketed live events, conferences, and seminars)
- Resident segments (State/province, CMAs, DMAs, city, county, sub-county)
- Political segments (Canadian Federal and Provincial, US Congressional Upper and Lower House, US States, City elections, etc.)
- Survey Data (Psychosocial/Demographic survey data)
- Retail Data (Receipt/transaction data)
All of our syndicated segments are customizable. That means you can limit them to people within a certain geography, remove employees, include only the most frequent visitors, define your own custom lookback, or extend our audiences using our Home, Work, and Social Extensions.
In addition to our syndicated segments, we’re also able to run custom queries that return all the Mobile Ad IDs (MAIDs) seen at a specific location (address; latitude and longitude; or WKT84 Polygon) or in your defined geographic area of interest (political districts, DMAs, Zip Codes, etc.).
Beyond just returning all the MAIDs seen within a geofence, we are also able to offer additional customizable advantages:
- Average precision between 5 and 15 meters
- CRM list activation + extension
- Extend beyond Mobile Location Data (MAIDs) with our device graph
- Filter by frequency of visitations
- Home and Work targeting (retrieve only employees or residents of an address)
- Home extensions (devices that reside in the same dwelling from your seed geofence)
- Rooftop level address geofencing precision (no radius used EVER unless user specified)
- Social extensions (devices in the same social circle as users in your seed geofence)
- Turn analytics into addressable audiences
- Work extensions (coworkers of users in your seed geofence)
Data Compliance: All of our Audience Data is fully CCPA compliant and 100% sourced from SDKs (Software Development Kits), the most reliable and consistent mobile data stream with end user consent, available with only a 4-5 day delay. This means that our location and device ID data comes from partnerships with over 1,500 mobile apps. This data comes with an associated location, which is how we are able to segment using geofences.
Data Quality: In addition to partnering with trusted SDKs, DRAKO has additional screening methods to ensure that our mobile location data is consistent and reliable. This includes data harmonization and quality scoring from all of our partners in order to disregard MAIDs with a low quality score.
The purpose of the NINDS Common Data Elements (CDEs) Project is to standardize the collection of investigational data in order to facilitate comparison of results across studies and more effectively aggregate information into significant metadata results. The goal of the National Institute of Neurological Disorders and Stroke (NINDS) CDE Project specifically is to develop data standards for clinical research within the neurological community. Central to this Project is the creation of common definitions and data sets so that information (data) is consistently captured and recorded across studies. To harmonize data collected from clinical studies, the NINDS Office of Clinical Research is spearheading the effort to develop CDEs in neuroscience. This Web site outlines these data standards and provides accompanying tools to help investigators and research teams collect and record standardized clinical data. The Institute still encourages creativity and uniqueness by allowing investigators to independently identify and add their own critical variables. The CDEs have been identified through review of the documentation of numerous studies funded by NINDS, review of the literature and regulatory requirements, and review of other Institute's common data efforts. Other data standards such as those of the Clinical Data Interchange Standards Consortium (CDISC), the Clinical Data Acquisition Standards Harmonization (CDASH) Initiative, ClinicalTrials.gov, the NINDS Genetics Repository, and the NIH Roadmap efforts have also been followed to ensure that the NINDS CDEs are comprehensive and as compatible as possible with those standards.
CDEs now available:
* General (CDEs that cross diseases) Updated Feb. 2011!
* Congenital Muscular Dystrophy
* Epilepsy (Updated Sept 2011)
* Friedreich's Ataxia
* Parkinson's Disease
* Spinal Cord Injury
* Stroke
* Traumatic Brain Injury
CDEs in development:
* Amyotrophic Lateral Sclerosis (Public review Sept 15 through Nov 15)
* Frontotemporal Dementia
* Headache
* Huntington's Disease
* Multiple Sclerosis
* Neuromuscular Diseases **
** Adult and pediatric working groups are being finalized and these groups will focus on: Duchenne Muscular Dystrophy, Facioscapulohumeral Muscular Dystrophy, Myasthenia Gravis, Myotonic Dystrophy, and Spinal Muscular Atrophy
The following tools are available through this portal:
* CDE Catalog - includes the universe of all CDEs. Users are able to search the full universe to isolate a subset of the CDEs (e.g., all stroke-specific CDEs, all pediatric epilepsy CDEs, etc.) and download details about those CDEs.
* CRF Library - (a.k.a., Library of Case Report Form Modules and Guidelines) contains all the CRF Modules that have been created through the NINDS CDE Project as well as various guideline documents. Users are able to search the library to find CRF Modules and Guidelines of interest.
* Form Builder - enables users to start the process of assembling a CRF or form by allowing them to choose the CDEs they would like to include on the form. This tool is intended to assist data managers and database developers to create data dictionaries for their study forms.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Please see our GitHub repository here: https://github.com/BIH-CEI/rd-cdm/ Please see our RD CDM documentation here: https://rd-cdm.readthedocs.io/en/latest/index.html/ Attention: The RD CDM paper is currently under review (version 2.0.0.dev0). As soon as the paper is accepted, we will publish v2.0.0. For more information please see our ChangeLog: https://rd-cdm.readthedocs.io/en/latest/changelog.html
We introduce our RD CDM v2.0.0, a common data model specifically designed for rare diseases. This RD CDM simplifies the capture, storage, and exchange of complex clinical data, enabling researchers and healthcare providers to work with harmonized datasets across different institutions and countries. The RD CDM is based on the ERDRI-CDS, a common data set developed by the European Rare Disease Research Infrastructure (ERDRI) to support the collection of harmonized data for rare disease research. By extending the ERDRI-CDS with additional concepts and relationships, based on HL7 FHIR v4.0.1 and the GA4GH Phenopacket Schema v2.0, the RD CDM provides a comprehensive model for capturing detailed clinical information alongside precise genetic data on rare diseases.
Background: Rare diseases (RDs), though individually rare, collectively impact over 260 million people worldwide, with over 17 million affected in Europe. These conditions, defined by their low prevalence of fewer than 5 in 10,000 individuals, are often genetically driven, with over 70% of cases suspected to have a genetic cause. Despite significant advances in medical research, RD patients still face lengthy diagnostic delays, often due to a lack of awareness in general healthcare settings and the rarity of RD-specific knowledge among clinicians. Misdiagnosis and underrepresentation in routine care further compound the challenges, leaving many patients without timely and accurate diagnoses.
Interoperability plays a critical role in addressing these challenges, ensuring the seamless exchange and interpretation of medical data through the use of internationally agreed standards. In the field of rare diseases, where data is often scarce and scattered, the importance of structured, standardized, and reusable medical records cannot be overstated. Interoperable data formats allow for more efficient research, better care coordination, and a clearer understanding of complex clinical cases. However, existing medical systems often fail to support the depth of phenotypic and genotypic data required for rare disease research and treatment, making interoperability a crucial enabler for improving outcomes in RD care.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). The lack of transparency in the data preparation process is a significant obstacle in developing reliable AI systems and can lead to issues related to reproducibility, debugging AI models, bias and fairness, and compliance and regulation. We introduce a formal data preparation pipeline specification to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications, with a focus on traceability.
Methods: We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly those conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We utilize FHIR profiling to develop a common data model tailored to an AI use case to enable the explicit declaration of the needed information such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health records data represented with irregular time series sampling to a flat structure by defining a target population, feature groups and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.
Results: We implement a scalable and high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets and their metadata, including many accompanying statistics, but also serves as a pluggable component of a decision support application based on a trained AI model during online prediction to automatically prepare feature values of individual entities. We deployed and tested the proposed methodology and the implementation in three different research projects. We present the developed FHIR profiles as a common data model, feature group definitions and feature definitions within a data preparation pipeline while training an AI model for “predicting complications after cardiac surgeries.”
Discussion: Through the implementation across various pilot use cases, it has been demonstrated that our framework possesses the necessary breadth and flexibility to define a diverse array of features, each tailored to specific temporal and contextual criteria.
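To make the idea of flattening irregularly sampled records through a target population and feature definitions more concrete, here is a minimal sketch. It is not the authors' declarative language or feature repository, and it does not use FHIR; the `PIPELINE` structure, its keys, the record fields, and the `build_dataset` function are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical declarative specification: a population filter plus
# named features, each an aggregation over one observation code.
PIPELINE = {
    "population": {"field": "had_cardiac_surgery", "equals": True},
    "features": {
        "last_creatinine": {"code": "creatinine", "agg": "last"},
        "max_heart_rate": {"code": "heart_rate", "agg": "max"},
    },
}

# Aggregators reduce a non-empty list of time-stamped observations to one value.
AGGREGATORS = {
    "last": lambda obs: max(obs, key=lambda o: o["time"])["value"],
    "max": lambda obs: max(o["value"] for o in obs),
}


def build_dataset(patients, observations, reference_time, spec=PIPELINE):
    """Return one flat row per patient in the target population.

    Only observations at or before reference_time are used, so each
    feature reflects what was known at that point in time.
    """
    pop = spec["population"]
    rows = []
    for patient in patients:
        if patient.get(pop["field"]) != pop["equals"]:
            continue  # outside the target population
        row = {"patient_id": patient["id"]}
        for name, feat in spec["features"].items():
            obs = [
                o for o in observations
                if o["patient_id"] == patient["id"]
                and o["code"] == feat["code"]
                and o["time"] <= reference_time
            ]
            # Missing measurements are kept as None rather than imputed.
            row[name] = AGGREGATORS[feat["agg"]](obs) if obs else None
        rows.append(row)
    return rows
```

Keeping the specification as plain data, separate from the code that executes it, is one way to make the extraction traceable: the same specification can be stored alongside the resulting dataset as a record of how each column was derived.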