Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.

Methods
This study utilized data from two GERAS cohort studies (European [EU] and Japan [JP]) obtained through the Alzheimer’s Disease (AD) Data Initiative’s AD workbench. We first manually created a dataset by matching 347 EU variables with 1322 candidate JP variables, treating matched variable pairs as positive instances and unmatched pairs as negative instances. We then developed four natural language processing (NLP) methods using state-of-the-art LLMs (E5, MPNet, MiniLM, and BioLORD-2023) to estimate variable similarity based on variable labels and derivation rules. A lexical matching method using fuzzy matching was included as a baseline model. In addition, we developed an ensemble-learning method, using a Random Forest (RF) model, to integrate the individual NLP methods. RF was trained and evaluated over 50 trials. Each trial used a random 4:1 split into training and test sets, with the model’s hyperparameters optimized through cross-validation on the training set. For each EU variable, the 1322 candidate JP variables were ranked by NLP-derived similarity scores or by RF probability scores denoting their likelihood of matching the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR).

Results
E5 performed best among the individual methods, achieving 0.898 HR-30 and 0.700 MRR. RF performed better than E5 on all metrics over 50 trials (P
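For intuition, the ranking metrics used above can be sketched in plain Python. The function names are illustrative; the study's own implementation is not part of this abstract:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (e.g. LLM label embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reciprocal_rank(ranked, true_match):
    """1/rank of the true match in a ranked candidate list; 0 if absent."""
    for rank, cand in enumerate(ranked, start=1):
        if cand == true_match:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, truths):
    """MRR averaged over all query (EU) variables."""
    return sum(reciprocal_rank(r, t) for r, t in zip(rankings, truths)) / len(truths)

def hit_ratio_at_n(rankings, truths, n):
    """HR-n: share of queries whose true match appears among the top-n candidates."""
    return sum(1 for r, t in zip(rankings, truths) if t in r[:n]) / len(truths)
```

With 1322 JP candidates per EU variable, each ranking list would be the candidates sorted by descending similarity (or RF probability) score.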
A dataset of 470 soil laboratory spectra has been aligned using the white Lucky Bay sands as an internal soil standard (ISS). Soil samples were collected in Sweden by SLU, in Italy by CNR, in France by INRAE and in Poland by IUNG, and the spectra were acquired in the lab on dry samples. Each partner scanned the ISS using the same instrument as the soil samples, which allowed a correction factor to be computed for each instrument.
Along with the main dataset, an explanation document and an R script are provided.
The R script can be used to align spectral data acquired on different instruments. The provided file "CF_lb" contains 5 correction factors for 5 instruments; these were computed from the Lucky Bay spectra scanned by each of the 5 instruments (ISS) and the master Lucky Bay spectrum acquired in the CSIRO lab.
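The alignment idea, a per-wavelength ratio of the master ISS spectrum to each instrument's own ISS scan, can be sketched as follows. This is an illustration in Python; the authoritative implementation is the provided R script, and the function names here are invented:

```python
def correction_factor(master_iss, instrument_iss):
    """Per-wavelength correction factor: master Lucky Bay ISS spectrum
    divided by the same standard scanned on this instrument.
    (Ratio form is an assumption; see the provided R script and
    explanation document for the actual formulation.)"""
    return [m / i for m, i in zip(master_iss, instrument_iss)]

def align_spectrum(spectrum, cf):
    """Align a soil spectrum from one instrument by applying that
    instrument's correction factor wavelength by wavelength."""
    return [s * f for s, f in zip(spectrum, cf)]
```

Applying each instrument's factor to its own soil spectra brings all 470 spectra onto the scale of the master instrument.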
"id": survey identifer
"rowid": unique observation identifier:
"fct_visit_concert": visiting frequency to concert, catogorical representation,
"fct_visit_library": visiting frequency to concert, catogorical representation,
"fct_visit_museum" : visiting frequency to museums or galleries, catogorical representation,
"visit_concert": visiting frequency to concert, numerical representation,
"visit_library": visiting frequency to concert, numerical representation,
"visit_museum": visiting frequency to concert, numerical representation,
age_education": school leaving age
"age_exact": age
"is_student" : respondent still studying (1=yes, 0=no)
"geo": geographic concept
"w1": post-stratification weight for geo
"w_uk": post-stratification weight for Northern Ireland and Great Britain
"w_de":
"wex": projected post stratification weight
"country_code": country code, unifying Germany and the United Kingdom (originally separate samples)
"w": post_stratification weight for country_code
"year_survey": year of the survey
"is_visit_concert": binary variable, 0 if the person did not visit concerts, public libraries, musea...
"is_visit_library": : binary variable, 0 if the person did not visit public libraries
"is_visit_museum":: binary variable, 0 if the person did not visit museums or galleries
Introduction
Meta-analysis is a powerful means for leveraging the hundreds of experiments being run worldwide into more statistically powerful analyses. This is also true for the analysis of omic data, including genome-wide DNA methylation. In particular, thousands of DNA methylation profiles generated using the Illumina 450k BeadChip are stored in the publicly accessible Gene Expression Omnibus (GEO) repository. Often, however, the intensity values produced by the BeadChip (raw data) are not deposited, so only pre-processed values, obtained after computational manipulation, are available. Pre-processing possibly differs among studies and may then affect meta-analysis by introducing non-biological sources of variability.

Material and methods
To systematically investigate the effect of pre-processing on meta-analysis, we analysed four different collections of DNA methylation samples (datasets), each composed of two subsets, for which raw data from controls (i.e. healthy subjects) and cases (i.e. patients) are available. We pre-processed the data from each dataset with nine of the most common pipelines found in the literature. Moreover, we evaluated the performance of regRCPqn, a modification of the RCP algorithm that aims to improve data consistency. For each combination of pre-processing (9 × 9), we first evaluated the between-sample variability among control subjects and then identified genomic positions that are differentially methylated between cases and controls (differential analysis).

Results and conclusion
The pre-processing of DNA methylation data affects both the between-sample variability and the loci identified as differentially methylated, and the effects of pre-processing are strongly dataset-dependent. By contrast, application of our renormalization algorithm regRCPqn (i) reduces variability and (ii) increases agreement between meta-analysed datasets, both critical components of data harmonization.
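As a toy illustration of the between-sample variability comparison, per-position dispersion across control samples could be summarised like this. The summary statistic (median of per-position standard deviations) and the data layout are assumptions for illustration, not the study's exact procedure:

```python
import statistics

def between_sample_variability(beta_matrix):
    """beta_matrix: one row per genomic position, one column per
    control sample, with methylation values in [0, 1]. Returns the
    median across positions of the per-position standard deviation
    across samples (an illustrative summary choice)."""
    sds = [statistics.pstdev(position) for position in beta_matrix]
    return statistics.median(sds)
```

Comparing this summary across the nine pre-processing pipelines would show how strongly pipeline choice inflates non-biological variability.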
The harmonized data set on health, created and published by the ERF, is a subset of the Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules collected in the context of the above-mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.
----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:
Iraq is considered a leader in household expenditure and income surveys: the first was conducted in 1946, followed by surveys in 1954 and 1961. After the establishment of the Central Statistical Organization (CSO), household expenditure and income surveys were carried out every 3-5 years (1971/1972, 1976, 1979, 1984/1985, 1988, 1993, 2002/2007). In cooperation with the World Bank, the CSO and the Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
The survey has six main objectives.
The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.
National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.
1- Household/family. 2- Individual/person.
The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
Sample survey data [ssd]
----> Design:
The sample size was 25488 households for the whole of Iraq: 216 households in each of the 118 districts, grouped into 2832 clusters of 9 households each, distributed across districts and governorates, urban and rural.
----> Sample frame:
The listing and numbering results of the 2009-2010 Population and Housing Survey were adopted in all governorates, including Kurdistan Region, as a frame for selecting households. The sample was selected in two stages. Stage 1: primary sampling units (blocks) within each stratum (district), urban and rural, were selected systematically with probability proportional to size, yielding 2832 units (clusters). Stage 2: 9 households were selected from each primary sampling unit to form a cluster, so that the total survey sample comprised 25488 households distributed across the governorates, 216 households in each district.
----> Sampling Stages:
In each district, the sample was selected in two stages. Stage 1: based on the 2010 listing and numbering frame, 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, with implicit stratification by urban/rural and geography (sub-district, quarter, street, county, village and block). Stage 2: using households as secondary sampling units, 9 households were selected from each sample point by systematic equal-probability sampling. The sampling frames for each stage can be developed from the 2010 building listing and numbering without updating household lists. In some small districts, the random selection of primary sampling units may yield fewer than 24 distinct units; a sampling unit is then selected more than once, so that two or more clusters may come from the same enumeration unit when necessary.
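The first-stage selection, systematic sampling with probability proportional to size, can be sketched as follows. This is an illustrative Python sketch, not the survey's actual selection program:

```python
import random

def pps_systematic(sizes, n, start=None):
    """Systematic PPS selection of n primary sampling units.
    sizes: measure of size (e.g. household counts) per block within
    one stratum (district). A unit larger than the sampling interval
    can be selected more than once, as noted for small districts."""
    total = sum(sizes)
    interval = total / n
    if start is None:
        start = random.uniform(0, interval)  # random start within one interval
    selected, cum, i = [], 0.0, 0
    for k in range(n):
        point = start + k * interval
        # advance to the unit whose cumulative-size range contains this point
        while cum + sizes[i] <= point:
            cum += sizes[i]
            i += 1
        selected.append(i)
    return selected
```

In the second stage, 9 households would then be drawn from each selected block by equal-probability systematic sampling.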
Face-to-face [f2f]
----> Preparation:
The questionnaire of the 2006 survey was adopted in designing the 2012 questionnaire, with many revisions. Two rounds of pre-testing were carried out, and revisions were made based on feedback from the fieldwork team, World Bank consultants and others. Further revisions were made before a final version was implemented in a pilot survey in September 2011. After the pilot survey, additional revisions were made based on the challenges and feedback that emerged during its implementation, and the final version was used in the actual survey.
----> Questionnaire Parts:
The questionnaire consists of four parts, each with several sections: Part 1: Socio-Economic Data: - Section 1: Household Roster - Section 2: Emigration - Section 3: Food Rations - Section 4: Housing - Section 5: Education - Section 6: Health - Section 7: Physical Measurements - Section 8: Job Seeking and Previous Job
Part 2: Monthly, Quarterly and Annual Expenditures: - Section 9: Expenditures on Non-Food Commodities and Services (past 30 days) - Section 10: Expenditures on Non-Food Commodities and Services (past 90 days) - Section 11: Expenditures on Non-Food Commodities and Services (past 12 months) - Section 12: Expenditures on Non-Food Frequent Food Stuff and Commodities (7 days) - Section 12, Table 1: Meals Had Within the Residential Unit - Section 12, Table 2: Number of Persons Participating in the Meals within Household Expenditure Other Than its Members
Part 3: Income and Other Data: - Section 13: Job - Section 14: Paid Jobs - Section 15: Agriculture, Forestry and Fishing - Section 16: Household Non-Agricultural Projects - Section 17: Income from Ownership and Transfers - Section 18: Durable Goods - Section 19: Loans, Advances and Subsidies - Section 20: Shocks and Coping Strategies in the Households - Section 21: Time Use - Section 22: Justice - Section 23: Satisfaction in Life - Section 24: Food Consumption During Past 7 Days
Part 4: Diary of Daily Expenditures: The diary of expenditure is an essential component of this survey. It is left with the household to record all daily purchases, such as expenditures on food and frequent non-food items (gasoline, newspapers, etc.), during 7 days. Two pages are allocated for recording each day's expenditures, so the diary consists of 14 pages.
----> Raw Data:
Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages: 1. Interviewer: checks all answers on the household questionnaire, confirming that they are clear and correct. 2. Local supervisor: checks that the questions have been correctly completed. 3. Statistical analysis: after exporting the data files from Excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values, in addition to auditing some variables. 4. World Bank consultants in coordination with the CSO data management team: the World Bank technical consultants use additional programs in SPSS and Stata to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameters for each variable.
----> Harmonized Data:
The Iraq Household Socio Economic Survey (IHSES) reached a total of 25488 households. 305 households refused to respond, giving a response rate of 98.6%. The highest interview rates were in Ninevah and Muthanna (100%), while the lowest was in Sulaimaniya (92%).
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey: 1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales. 2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Ethiopia - Socioeconomic Survey 2018-2019” and “Ethiopia - COVID-19 High Frequency Phone Survey of Households 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Ethiopia Socioeconomic Survey (ESS) 2018-2019 and Ethiopia COVID-19 High Frequency Phone Survey of Households (HFPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round is noted with “_rX” in the variable name, where X represents the round number. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
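The round-suffix convention is easy to handle programmatically. A small helper, assuming the suffix appears at the end of the variable name:

```python
import re

def survey_round(varname):
    """Return the phone-survey round a harmonized variable came from.
    Variables without an '_rX' suffix come from Round 0, the
    face-to-face baseline survey. Assumes the suffix, when present,
    terminates the variable name."""
    match = re.search(r"_r(\d+)$", varname)
    return int(match.group(1)) if match else 0
```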
This is an integration of 10 independent multi-country, multi-region, multi-cultural social surveys fielded by Gallup International between 2000 and 2013. The integrated data file contains responses from 535,159 adults living in 103 countries. In total, the harmonization project combined 571 social surveys.
These data have value in a number of longitudinal multi-country, multi-regional, and multi-cultural (L3M) research designs. They can be understood as independent, though non-random, L3M samples containing a number of multiple-indicator ASQ (ask-same-questions) and ADQ (ask-different-questions) measures of human development, the environment, international relations, gender equality, security, international organizations, and democracy, to name a few [see full list below].
The data can be used for exploratory and descriptive analysis, with greatest utility at low levels of resolution (e.g. nation-states, supranational groupings). The level of resolution in analysis of these data should be sufficiently low to approximate confidence intervals.
These data can also be used for teaching 3M methods, including data harmonization in L3M, 3M research design, survey design, 3M measurement invariance, analysis, visualization, and reporting, and they offer opportunities to teach about paradata, metadata, and data management in L3M designs.
The country units are an unbalanced panel derived from non-probability samples of countries and respondents. Panels (countries) have left and right censorship and are thus unbalanced. This design limitation can be overcome to the extent that VOTP panels are harmonized with public measurements from other 3M surveys to establish balance in terms of panels and occasions of measurement. Should L3M harmonization occur, these data can be assigned confidence weights to reflect the amount of error in these surveys.
Pooled public opinion surveys (country means), when combined with higher-quality country measurements of the same concepts (ASQ, ADQ), can be leveraged to increase the statistical power of pooled public opinion research designs (multiple L3M datasets), that is, in studies of public, rather than personal, beliefs.
The Gallup Voice of the People survey data rest on underspecified sampling methods. Country sampling is non-random. The sampling method appears to be primarily probability and quota sampling, with occasional oversamples of urban populations in difficult-to-survey countries. The sampling units (countries and individuals) are poorly defined, suggesting these data have more value in research designs calling for independent-samples replication and repeated-measures frameworks.
The Voice of the People Survey Series is WIN/Gallup International Association's End of Year survey and is a global study that collects the public's view on the challenges that the world faces today. Ongoing since 1977, the purpose of WIN/Gallup International's End of Year survey is to provide a platform for respondents to speak out concerning government and corporate policies. The Voice of the People, End of Year Surveys for 2012, fielded June 2012 to February 2013, were conducted in 56 countries to solicit public opinion on social and political issues. Respondents were asked whether their country was governed by the will of the people, as well as their attitudes about their society. Additional questions addressed respondents' living conditions and feelings of safety around their living area, as well as personal happiness. Respondents' opinions were also gathered in relation to business development and their views on the effectiveness of the World Health Organization. Respondents were also surveyed on ownership and use of mobile devices. Demographic information includes sex, age, income, education level, employment status, and type of living area.
https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf
The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period ranging from 1995 to 2012. The data record offers harmonised cross-calibrated spectra with focus on spectral windows in the Ultraviolet-Visible-Near Infrared regions for the retrieval of critical atmospheric constituents like ozone (O3), sulphur dioxide (SO2) and nitrogen dioxide (NO2) column densities, alongside cloud parameters. The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series.

Product format
In many respects, the FDR product improves on the existing individual mission datasets:
- GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data.
- Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites.
- SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range. This simplifies the use of the SCIAMACHY spectra, which were split into a complex cluster structure (each with its own integration time) in the original Level 1b data.
- The harmonisation process mitigates the viewing-angle dependency observed in the UV spectral region for GOME data.
- Uncertainties are provided.

Each FDR product provides, within the same file, irradiance/reflectance data for the UV-VIS-NIR spectral regions across all orbits on a single day, including information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements. The FDR has been generated in two formats, Level 1A and Level 1B, targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users; in case of specific requirements, please contact EOHelp. Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities; for further details, please consult our Terms and Conditions page.

Uncertainty characterisation
One of the main aspects of the project was the characterisation of Level 1 uncertainties for both instruments, based on metrological best practices. The following documents are provided:
- General guidance on a metrological approach to Fundamental Data Records (FDR)
- Uncertainty Characterisation document
- Effect tables
- NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scenes for 2003 and 2010) and GOME (Atlantic scene for 2003): reflectance_uncertainty_example_FDR4ATMOS_GOME.nc, reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc

Known Issues
Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0: in the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause lies in incorrect indexing of the lambda variable during the NetCDF writing process; notably, the wavelength values themselves are calculated correctly within the processing chain.

Temporary Workaround
The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results.
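The workaround for the non-monotonous wavelength axis can be sketched in memory on a nested-list stand-in for the lambda variable. With a real product one would read and rewrite the variable using a NetCDF library such as netCDF4; the [time][scanline][pixel] layout assumed here should be checked against the actual file's dimensions:

```python
def repair_lambda(lambda_var):
    """Copy the wavelength axis of the first record (time=0, scanline=0)
    to every record of one spectral range, as the README's workaround
    prescribes. Must be repeated per spectral range (UV, VIS, NIR) and
    per daily product."""
    first = list(lambda_var[0][0])  # the only record with a correct axis
    return [[list(first) for _ in scanlines] for scanlines in lambda_var]
```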
Diatom data have been collected in large-scale biological assessments in the United States, such as the U.S. Environmental Protection Agency’s National Rivers and Streams Assessment (NRSA). However, the effectiveness of diatoms as indicators may suffer if inconsistent taxon identifications across different analysts obscure the relationships between assemblage composition and environmental variables. To reduce these inconsistencies, we harmonized the 2008–2009 NRSA data from nine analysts by updating names to current synonyms and by statistically identifying taxa with high analyst signal (taxa with more variation in relative abundance explained by the analyst factor, relative to environmental variables). We then screened a subset of samples with QA/QC data and combined taxa with mismatching identifications by the primary and secondary analysts. When these combined “slash groups” did not reduce analyst signal, we elevated taxa to the genus level or omitted taxa in difficult species complexes. We examined the variation explained by analyst in the original and revised datasets. Further, we examined how revising the datasets to reduce analyst signal can reduce inconsistency, thereby uncovering the variation in assemblage composition explained by total phosphorus (TP), an environmental variable of high priority for water managers. To produce a revised dataset with the greatest taxonomic consistency, we ultimately made 124 slash groups, omitted 7 taxa in the small naviculoid (e.g., Sellaphora atomoides) species complex, and elevated Nitzschia, Diploneis, and Tryblionella taxa to the genus level. Relative to the original dataset, the revised dataset had more overlap among samples grouped by analyst in ordination space, less variation explained by the analyst factor, and more than double the variation in assemblage composition explained by TP.
Elevating all taxa to the genus level did not eliminate analyst signal completely, and analyst remained the most important predictor for the genera Sellaphora, Mayamaea, and Psammodictyon, indicating that these taxa present the greatest obstacle to consistent identification in this dataset. Although our process did not completely remove analyst signal, this work provides a method to minimize analyst signal and improve detection of diatom association with TP in large datasets involving multiple analysts. Examination of variation in assemblage data explained by analyst and taxonomic harmonization may be necessary steps for improving data quality and the utility of diatoms as indicators of environmental variables. This dataset is associated with the following publication: Lee, S., I. Bishop, S. Spaulding, R. Mitchell, and L. Yuan. Taxonomic harmonization may reveal a stronger association between diatom assemblages and total phosphorus in large datasets. ECOLOGICAL INDICATORS. Elsevier Science Ltd, New York, NY, USA, 102: 166-174, (2019).
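The renaming-and-merging steps (synonym updates, then slash groups) can be sketched as a two-pass mapping over taxon counts. The mappings below are invented for illustration and are not the actual NRSA revisions:

```python
def harmonize_taxa(counts, synonyms, slash_groups):
    """Update taxon names to current synonyms, then fold taxa with
    mismatched identifications into combined 'slash groups', summing
    their counts per sample."""
    merged = {}
    for taxon, count in counts.items():
        name = synonyms.get(taxon, taxon)        # pass 1: synonym update
        name = slash_groups.get(name, name)      # pass 2: slash-group merge
        merged[name] = merged.get(name, 0) + count
    return merged
```

The same mapping approach would also cover elevating problem taxa to the genus level, by mapping species names to their genus.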
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Malawi - Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102 EAs)” and “Malawi - High-Frequency Phone Survey on COVID-19” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Malawi Integrated Household Panel Survey (IHPS) 2019 and Malawi High-Frequency Phone Survey on COVID-19 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round is noted with “_rX” in the variable name, where X represents the round number. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices, determine the characteristics of the poor, and prepare poverty maps. To achieve these goals, the sample had to be representative at the sub-district level.

The raw survey data provided by the Statistical Office were cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistical standards on household living standards. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as the profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The 2008 Household Expenditure and Income Survey sample was designed using a two-stage cluster stratified sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn with probability proportionate to size, taking the number of households in each block as the block size. In the second stage, the household sample (8 households from each PSU) was drawn using the systematic sampling method. Four substitute households were also drawn from each PSU, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.
To estimate the sample size, the coefficient of variation and design effect in each subdistrict were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level, to ensure good cluster representation in the administrative areas and enable identifying poverty pockets.
It is worth mentioning that the expected non-response, as well as areas in the major cities where poor families are concentrated, were taken into consideration in designing the sample. A larger sample was therefore taken from these areas, in order to help reach and cover the poverty pockets.
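The two-stage design described above can be sketched as follows (illustrative pure-Python code, not the Statistical Office's actual selection program; PSU sizes and seeds are made up):

```python
import random

def pps_systematic(psu_sizes, n_psus, rng=None):
    """Stage 1: draw PSUs (blocks) with probability proportionate to size,
    using systematic PPS sampling on the cumulative household counts."""
    rng = rng or random.Random(0)
    total = sum(psu_sizes)
    step = total / n_psus
    start = rng.uniform(0, step)
    points = [start + i * step for i in range(n_psus)]
    chosen, cum = [], 0
    it = iter(points)
    p = next(it, None)
    for idx, size in enumerate(psu_sizes):
        cum += size
        while p is not None and p <= cum:
            chosen.append(idx)  # selection point fell inside this block
            p = next(it, None)
    return chosen

def systematic_sample(n_units, n_draws, rng=None):
    """Stage 2: systematic sample of households within a selected PSU
    (8 main households per PSU in this survey; the 4 substitutes are
    drawn the same way)."""
    rng = rng or random.Random(0)
    step = n_units / n_draws
    start = rng.uniform(0, step)
    return [int(start + i * step) for i in range(n_draws)]
```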
Face-to-face [f2f]
List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form
Raw Data: The design and implementation procedures of this survey were:
1. Sample design and selection
2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparation of instruction manuals
3. Design of the table templates to be used for the dissemination of the survey results
4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
5. Selection and training of survey staff to collect data and run required data checks
6. Preparation and implementation of the pretest phase, designed to test and develop forms/questionnaires, instructions and software programs required for data processing and production of survey results
7. Data collection
8. Data checking and coding
9. Data entry
10. Data cleaning using data validation programs
11. Data accuracy and consistency checks
12. Data tabulation and preliminary results
13. Preparation of the final report and dissemination of final results
Harmonized Data:
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets
- The harmonization process started with cleaning all raw data files received from the Statistical Office
- Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
- A post-harmonization cleaning process was run on the data
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format
https://www.gesis.org/en/institute/data-usage-terms
+++++++++++++++ Version 3.0.0 +++++++++++++++
We carried out a harmonization of the Eurobarometer 2004-2021 (spring). This dataset includes 35 single standard Eurobarometers and more than 140 variables about EU policies, attitudes towards Europe and the EU, identity, cognitive mobilization, political institutions, socio-political characteristics, partisanship, etc.
The harmonization was carried out using existing Eurobarometer datasets published by GESIS. To allow the user to replicate the harmonization and be able to modify some codes if needed, we publish one example of do-file used to pursue the harmonization, as well as the corresponding (harmonized) dataset. The user can find the do-file containing the codes used to modify and clean EB 953 (ZA7783, conducted in spring 2021) according to the harmonization procedure that we followed. Moreover, the user can find the cleaned dataset for EB 953 that was obtained after running the do-file. The files are named “EB 953.do” and “953_new.dta”.
We include:
- a harmonized dataset ("harmonised_EB_2004-2021.dta"),
- a technical report ("User Guide Harmonized Eurobarometer 2004-2021"),
- a summary of the original survey questions corresponding to the variables included in the dataset ("Trends_EBs_1970-2021.xlsx"),
- one of the do-files used to carry out the harmonization (“EB 953.do”),
- one of the datasets used before merging all datasets (“953_new.dta”).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains taxonomically harmonized pollen records from 2831 sites: 1032 sites in North America, 1075 in Europe, 488 in Asia, 150 in South America, 54 in Africa and 32 in the Indopacific. Most of the data were retrieved from the Neotoma Paleoecology Database (https://www.neotomadb.org/), with additional data from Cao et al. (2020; https://doi.org/10.5194/essd-12-119-2020), Cao et al. (2013; https://doi.org/10.1016/j.revpalbo.2013.02.003) and our own collection for the Asian sector. The ages of the samples refer to the newly established LegacyAge 1.0 framework (https://doi.pangaea.de/10.1594/PANGAEA.933132). The 10,110 original pollen taxa names and notations were harmonized to 1002 taxa names. We present a table documenting the harmonization approach, cross-referencing the original taxa with the harmonized taxa names. The harmonized pollen data are presented as counts (when available) and as percentage values. We complement the data publication by providing the source information on the references (most data are related to Neotoma) as a table linked to each Dataset ID. The data set and site IDs are from Neotoma if the data sets are derived from the Neotoma repository. For our own data collection efforts (Cao et al. (2020), Cao et al. (2013) and our own data), we used already published PANGAEA event names where they relate to the data, or created site names referencing geographical regions, following a naming principle similar to Neotoma's.
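Such a cross-reference table works as a simple lookup that collapses original counts onto harmonized taxa and derives the percentage values. A minimal sketch, using a made-up fragment of the mapping rather than the actual 10,110-row table:

```python
# Hypothetical fragment of the cross-reference table:
# original taxon name/notation -> harmonized taxon name.
taxa_map = {
    "Pinus subg. Strobus": "Pinus",
    "Pinus undiff.": "Pinus",
    "Betula nana-type": "Betula",
}

def harmonize_counts(raw_counts, taxa_map):
    """Collapse raw per-taxon pollen counts onto harmonized taxa and
    return both counts and percentage values."""
    counts = {}
    for taxon, n in raw_counts.items():
        name = taxa_map.get(taxon, taxon)  # unmapped names pass through
        counts[name] = counts.get(name, 0) + n
    total = sum(counts.values())
    percents = {t: 100.0 * n / total for t, n in counts.items()}
    return counts, percents
```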
The Delaware River Basin (DRB) is jointly managed by multiple states and the federal government, and there are many ongoing efforts to characterize and understand water quality in the basin. Many state, federal and non-profit organizations have collected surface-water-quality samples across the DRB for decades, and many of these data are available through the National Water Quality Monitoring Council's Water Quality Portal (WQP). In this data release, WQP data in the DRB were harmonized, meaning that they were processed to create a clean and readily usable dataset. This harmonization processing included the synthesis of parameter names and fractions, the condensation of remarks and other data qualifiers, the resolution of duplicate records, an initial quality control check of the data, and other processing steps described in the metadata. This data set provides harmonized discrete multisource surface-water-quality data pulled from the WQP for nutrients, sediment, salinity, major ions, bacteria, temperature, dissolved oxygen, pH, and turbidity in the DRB, for all available years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contact: Md Safat Sikder (mssikder@illinois.edu), Jida Wang (jidaw@illinois.edu)
Citation
Sikder, M. S., Wang, J., Allen, G. H., Sheng, Y., Yamazaki, D., Crétaux, J.-F., and Pavelsky, T. M., 2024. HarP: Harmonized Prior river-lake database. Zenodo, https://doi.org/10.5281/zenodo.14205131.
If you only use the PLD-TopoCat dataset, please cite the following paper:
Sikder, M. S., Wang, J., Allen, G. H., Sheng, Y., Yamazaki, D., Song, C., Ding, M., Crétaux, J.-F., and Pavelsky, T. M., 2023. Lake-TopoCat: A global lake drainage topology and catchment dataset. Earth System Science Data, 15, 3483-3511, https://doi.org/10.5194/essd-15-3483-2023.
Data description and components
The Harmonized Prior river-lake database (HarP) for SWOT integrated the SWOT River Database (SWORD) (Altenau et al., 2021) and the SWOT Prior Lake Database (PLD) (Wang et al., 2023) into a geometrically (lake/river) explicit but topologically harmonized vector database to allow for coupled fluvial-lacustrine applications, including a synergistic use of both river and lake products from SWOT.
In addition to the input river network (SWORD v16) and lake database (PLD v106), we used the MERIT Hydro v1.0.1 (Yamazaki et al., 2019), a high-resolution (~90 m) global hydrography dataset, to develop this database.
The SWORD-PLD harmonization process involves three major steps, with Step 3 divided into three sub-steps. The processing chain is illustrated in the attached Figure "SWORD-PLD_harmonization_steps.jpg", as well as in Section 2 of the product description document. The HarP database consists of the outputs from each of these steps. For convenience, the global landmass (excluding Antarctica) was partitioned into 68 Pfafstetter Level-2 basins/regions, with their IDs shown in the attached Figure "Pfaf2_basins.jpg".
The HarP database consists of five datasets or components (outputs from each step), each with multiple features. The five datasets are described below, and more details are elaborated in the product description document.
Harmonized SWORD-PLD (file name "Harmonized_SWORD_PLD"): This is the fully harmonized SWORD-PLD dataset, the primary product of HarP (i.e., output of Step 3.3 in Figure "SWORD-PLD_harmonization_steps.jpg"). This dataset couples SWORD and PLD into a geometrically segmented but topologically integrated dataset at the node, reach, and catchment scales (stored by three feature layers, respectively):
(a) Harmonized feature nodes: Harmonized_feature_nodes_pfaf_xx (b) Harmonized river network: Harmonized_river_network_pfaf_xx (c) Harmonized feature catchments: Harmonized_feature_catchments_pfaf_xx Note: ''pfaf_xx'' indicates the Pfafstetter Level-2 basin ID (shown in Fig. 'Pfaf2_basins.jpg').
Figure "HarP_example.jpg", attached to this database, is an example of the fully harmonized SWORD-PLD dataset for the Ohio River Basin. The example shows three main features of the dataset: feature nodes (i.e., reach downstream ends, lake inlets, and lake outlets; see Fig. 3 in the product description document for definitions), river reaches (i.e., reaches characterized by SWORD alone, characterized by TopoCat alone, and shared by both SWORD and TopoCat), and catchments segmented by each of the feature nodes.
Intersected SWORD-PLD drainage configuration (file name "Intersected_SWORD_PLD"): This dataset is the intersected SWORD-PLD (prior river-lake) features (i.e., output of Step 2 in Figure "SWORD-PLD_harmonization_steps.jpg"). This dataset was constructed independently from Step 1 and Step 3. In this dataset, the original geometries of SWORD and PLD are not altered, but instead, their geometric and drainage topological relationships are configured in the attribute tables. This dataset consists of three features:
(a) Intersected reaches: Intersected_SWORD_reaches_pfaf_xx (b) Intersected nodes: Intersected_SWORD_nodes_pfaf_xx (c) Intersected lakes: Intersected_PLD_lakes_pfaf_xx
PLD-TopoCat (file name "PLD_TopoCat"): This dataset is the lake drainage topology and catchments (TopoCat) for PLD lakes (i.e., output of Step 1 in Figure "SWORD-PLD_harmonization_steps.jpg"). PLD-TopoCat was developed to generate detailed lake drainage topology and connecting paths, which were later used to configure the off-SWORD-network PLD lakes into the tributaries that drain to SWORD. PLD-TopoCat was generated from PLD v106 and MERIT Hydro. Details of the development process and algorithm for TopoCat can be found in Sikder et al. (2023). The PLD-TopoCat dataset contains six features:
(a) Lake original polygon: PLD_lakes_pfaf_xx (b) Lake raster polygon: Lake_raster_polygons_pfaf_xx (c) Lake outlets: Lake_outlets_pfaf_xx (d) Lake catchments: Lake_catchments_pfaf_xx (e) Inter-lake reaches: Inter_lake_reaches_pfaf_xx (f) Lake-network basins: Lake_network_basins_pfaf_xx Note: full version of the PLD-TopoCat is available here.
SWORD-mirror network (file name "SWORD_mirror"): The SWORD-mirror network was constructed to facilitate the SWORD-TopoCat network merging process (i.e., output of Step 3.1 in Figure "SWORD-PLD_harmonization_steps.jpg"). It is essentially a replica of SWORD except that the original SWORD reaches are geometrically modified to be aligned with the topological/hydrographic information depicted in MERIT Hydro. The SWORD-mirror network consists of four features:
(a) SWORD-original reaches: SWORD_original_reaches_pfaf_xx (b) SWORD-mirror prelim. reaches: SWORD_mirror_prelim_reaches_pfaf_xx (c) SWORD-mirror reaches: SWORD_mirror_reaches_pfaf_xx (d) SWORD-mirror reach catchments: SWORD_mirror_reach_catchments_pfaf_xx
Merged SWORD-mirror – TopoCat network (file name "SWORD_TopoCat_merged"): This dataset is the output of Step 3.2 in Figure "SWORD-PLD_harmonization_steps.jpg". It is essentially the merged product of the inter-lake reaches (from Step 2) and SWORD-mirror reaches (from Step 3.1). The merged SWORD-mirror – TopoCat network consists of three features:
(a) Merged SWORD-TopoCat reaches: SWORD_TopoCat_merged_reaches_pfaf_xx (b) SWORD nodes at SWORD-TopoCat confluence: SWORD_TopoCat_confluence_nodes_pfaf_xx (c) Reach catchments for merged network: SWORD_TopoCat_reach_catchments_pfaf_xx
The attribute tables for each of the feature components are explained in Section 4 of the product description document. All files of HarP are available in both shapefile and geodatabase formats.
Disclaimer: The authors of this dataset claim no responsibility or liability for any consequences related to the use, citation, or dissemination of HarP. For any questions, please contact Safat Sikder and Jida Wang.
To better understand the impact of the shock induced by the COVID-19 pandemic on micro and small enterprises in Tunisia and to assess the policy responses in a rapidly changing context, reliable data are imperative, and a dynamic data collection tool is essential at a time when countries in the region are in a state of flux. The COVID-19 MENA Monitor Survey was led by the Economic Research Forum (ERF) to provide data for researchers and policy makers on the economic and labor market impact of the global COVID-19 pandemic on enterprises.
The ERF COVID-19 MENA Monitor Survey is constructed as a series of short panel phone surveys, conducted approximately every two months. It covers business closure (temporary/permanent) due to lockdowns, ability to telework/deliver the service, disruptions to supply chains (for inputs and outputs), loss of product markets, increased cost of supplies, worker layoffs, salary adjustments, access to lines of credit and delays in transportation. Understanding the strategies of enterprises (particularly micro and small enterprises) to cope with the crisis is one of the main objectives of this survey. Specific constraints such as weak access to the internet in some areas or laws constraining goods' delivery will be analyzed. Enterprise owners are also asked about prospects for the future, including their ability to stay open, and whether they benefited from any measures to support their businesses. The ERF COVID-19 MENA Monitor Survey is a wide-ranging, nationally representative panel survey. Wave 3 of this dataset was collected from August to September 2021 and harmonized by the Economic Research Forum (ERF), and is featured as enterprise data.
The harmonization was designed to create comparable data that can facilitate cross-country and comparative research between other Arab countries (Morocco, Egypt, and Jordan). All the COVID-19 MENA Monitor surveys incorporate similar survey designs, with data on enterprises within Arab countries (Egypt, Jordan, Tunisia, and Morocco).
National
Enterprises
The sample universe for the enterprise survey was enterprises that had 6-199 workers pre-COVID-19
Sample survey data [ssd]
Stratified random samples were used to ensure adequate sample size in key strata. A target sample of 500 firms was set. Up to five attempts were made to reach each firm if a phone number was not picked up/answered, was disconnected or busy, or was picked up but the respondent could not complete the interview at that time. After the fifth failed attempt, a firm was treated as a non-response and a random firm from the same stratum was used as an alternate.
The National Institute of Statistics (INS) and Agency for the Promotion of Industry and Innovation (APII) databases were used as follows:
- Tunisia did not have a Yellow Pages or similar database, so administrative/statistical data sources had to be used.
- The sample started with the INS frame of 1,238 enterprises with 6-200 wage employees:
  - Enterprises were stratified into: (1) Agriculture (2) Industry (3) Construction (4) Trade (5) Accommodation (6) Service
  - Enterprises were also stratified by size: 6-49 versus 50-200 employees
  - A random stratified sample (order) was selected
  - The sample was further restricted to enterprises with 6-199 workers in February 2020, based on an eligibility question during the phone interview
  - This sample frame was eventually exhausted
- After the INS sample was exhausted, the APII sample was used:
  - APII only covered enterprises with 10+ workers
  - APII only covered (1) services & transport, and (2) industry
- Weights are based on the underlying data on all enterprises from INS, specifically: Entreprises privées selon l'activité principale et la tranche de salariés (RNE 2019).
  - The Tunisia weights are ultimately stratified by industry and enterprise size: 6-9 employees (since APII only covered 10+), 10-49, and 50-199.
Computer Assisted Telephone Interview [cati]
The enterprise questionnaire is carried out to understand the strategies of enterprises -particularly micro and small enterprises- to cope with the crisis as well as related constraints and prospects for the future. It includes questions on business closure (temporary/permanent) due to lockdowns, ability to telework/deliver the service, disruptions to supply chains (for inputs and outputs), loss of product markets, increased cost of supplies, worker layoffs, salary adjustments, access to lines of credit and delays in transportation.
Note: The questionnaire can be seen in the documentation materials tab.
This SOils DAta Harmonization (SoDaH) database is designed to bring together soil carbon data from diverse research networks into a harmonized dataset that can be used for synthesis activities and model development. The research network sources for SoDaH span different biomes and climates, encompass multiple ecosystem types, and have collected data across a range of spatial, temporal, and depth gradients. The rich data sets assembled in SoDaH consist of observations from monitoring efforts and long-term ecological experiments. The SoDaH database also incorporates related environmental covariate data pertaining to climate, vegetation, soil chemistry, and soil physical properties. The data are harmonized and aggregated using open-source code that enables a scripted, repeatable approach for soil data synthesis.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset unifies three major NASA Exoplanet Archive catalogs — KOI, K2, and TOI — into a single machine-learning–ready dataset for exoplanet classification.
It harmonizes feature names, fills missing data, and projects all missions into a common PCA feature space.
The goal is to provide a consistent and comprehensive dataset for training models that can distinguish between the following disposition classes:
Labels: CONFIRMED: 0, CANDIDATE: 1, FALSE POSITIVE: 2, REFUTED: 3
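This label scheme maps directly to a small encoder (a sketch; the dataset's actual disposition column values may be spelled differently):

```python
# Integer class labels used by the unified dataset
LABELS = {"CONFIRMED": 0, "CANDIDATE": 1, "FALSE POSITIVE": 2, "REFUTED": 3}

def encode_disposition(disposition):
    """Map an archive disposition string to its integer class label."""
    return LABELS[disposition.strip().upper()]

print(encode_disposition("confirmed"))       # 0
print(encode_disposition("False Positive"))  # 2
```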
| Mission | Description | Source |
|---|---|---|
| KOI | Kepler Objects of Interest | https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+q1_q17_dr25_koi&format=csv |
| K2 | Kepler/K2 extended mission | https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+k2pandc&format=csv |
| TOI | TESS Objects of Interest | https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+toi&format=csv |
Each catalog contains different column names and structures, so a column mapping was created to standardize the schema across all missions.
Below is an example of the unified mapping used to align equivalent features between missions:
column_map = {
# Coordinates
"ra": {"KOI": "ra", "K2": "ra", "TOI": "ra"},
"dec": {"KOI": "dec", "K2": "dec", "TOI": "dec"},
# Orbital parameters
"orbital_period": {"KOI": "koi_period", "K2": "pl_orbper", "TOI": "pl_orbper"},
"planet_radius": {"KOI": "koi_prad", "K2": "pl_rade", "TOI": "pl_rade"},
"semi_major_axis": {"KOI": "koi_sma", "K2": "pl_orbsmax", "TOI": None},
"transit_depth": {"KOI": "koi_depth", "K2": "pl_trandep", "TOI": "pl_trandep"},
# Stellar parameters
"stellar_teff": {"KOI": "koi_steff", "K2": "st_teff", "TOI": "st_teff"},
"stellar_radius": {"KOI": "koi_srad", "K2": "st_rad", "TOI": "st_rad"},
"stellar_logg": {"KOI": "koi_slogg", "K2": "st_logg", "TOI": "st_logg"},
"stellar_met": {"KOI": "koi_smet", "K2": "st_met", "TOI": "st_met"},
# Photometry
"gmag": {"KOI": "koi_gmag", "K2": "sy_gmag", "TOI": "st_gmag"},
"rmag": {"KOI": "koi_rmag", "K2": "sy_rmag", "TOI": "st_rmag"},
"imag": {"KOI": "koi_imag", "K2": "sy_imag", "TOI": "st_imag"},
"zmag": {"KOI": "koi_zmag", "K2": "sy_zmag", "TOI": "st_zmag"},
"jmag": {"KOI": "koi_jmag", "K2": "sy_jmag", "TOI": "st_jmag"},
"hmag": {"KOI": "koi_hmag", "K2": "sy_hmag", "TOI": "st_hmag"},
"kmag": {"KOI": "koi_kmag", "K2": "sy_kmag", "TOI": "st_kmag"},
"tmag": {"KOI": None, "K2": None, "TOI": "st_tmag"},
}
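As a sketch of how such a mapping is applied (pandas-free for illustration; a real pipeline might use `DataFrame.rename` instead), each mission's raw records can be projected onto the unified schema, with unmapped fields set to None:

```python
# Subset of the unified mapping, repeated here so the example is self-contained:
column_map = {
    "orbital_period": {"KOI": "koi_period", "K2": "pl_orbper", "TOI": "pl_orbper"},
    "semi_major_axis": {"KOI": "koi_sma", "K2": "pl_orbsmax", "TOI": None},
}

def standardize(rows, mission, column_map):
    """Project one mission's raw records onto the unified schema.

    rows: list of dicts keyed by the mission's original column names.
    mission: "KOI", "K2", or "TOI".
    Unified columns with no source in this mission become None.
    """
    out = []
    for row in rows:
        rec = {}
        for unified, per_mission in column_map.items():
            src = per_mission.get(mission)
            rec[unified] = row.get(src) if src is not None else None
        out.append(rec)
    return out

# Hypothetical KOI record for illustration:
print(standardize([{"koi_period": 3.52, "koi_sma": 0.047}], "KOI", column_map))
```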
The KOI, K2, and TOI catalogs were downloaded from the NASA Exoplanet Archive (q1_q17_dr25_koi, k2pandc, toi tables).
Columns were standardized using a unified mapping to ensure consistent names across catalogs (e.g. planet_radius, orbital_period, stellar_teff, disposition, etc.).
Only relevant physical and photometric parameters were kept. Missing fields were filled with NaN to be handled later.
Although raw light curves were not directly modeled, several summary features derived from light curve analysis were included:
lc_model_snr: signal-to-noise ratio of the transit model
lc_max_single, lc_max_multi: maximum signal from single and multiple events
lc_time0: time of first detected transit
These features capture key aspects of the transit detection without requiring the raw flux data.
To reduce redundancy and extract latent relationships, a PCA was performed on five transit-related variables:
['transit_depth', 'transit_duration', 'lc_model_snr', 'ror_ratio', 'transit_depth2']
The PCA components were concatenated with the non-PCA columns from all three catalogs to form a single dataset.
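To illustrate the idea, the first principal component of such a feature block can be computed by power iteration on the covariance matrix. This is a pure-Python sketch with made-up data; the actual pipeline would typically use sklearn.decomposition.PCA on the five standardized columns:

```python
import math

def first_pc(rows):
    """First principal component via power iteration on the sample
    covariance matrix of rows (a list of equal-length feature lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # Center the data, then form the (d x d) sample covariance matrix.
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(200):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

# Made-up 2-feature data lying along the direction (1, 2):
pc = first_pc([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```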
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country's latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country's high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. It includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, tractor use, and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Nigeria General Household Survey, Panel (GHS-Panel) 2018-2019 and Nigeria COVID-19 National Longitudinal Phone Survey (COVID-19 NLPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round is noted with “_rX” in the variable name, where X is the round number. For example, a variable ending in “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without an “_rX” suffix were extracted from Round 0.
After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes. Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the …
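A minimal sketch of that adjustment (illustrative only, not the collection's actual implementation; it assumes baseline strings are two-digit zero-padded so they compare lexicographically, e.g. '03.01' < '04.00'):

```python
def harmonize_dn(dn_values, processing_baseline):
    """Shift DN values of scenes processed with baseline '04.00' or above
    back into the pre-2022 range by removing the +1000 offset.

    Note: the string comparison below assumes two-digit zero-padded
    baseline identifiers (e.g. '03.01', '04.00', '05.09').
    """
    offset = 1000 if processing_baseline >= "04.00" else 0
    return [dn - offset for dn in dn_values]

print(harmonize_dn([1500, 2000], "04.00"))  # [500, 1000]
print(harmonize_dn([500], "03.01"))         # [500]
```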
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.
Methods
This study utilized data from two GERAS cohort studies (European [EU] and Japan [JP]) obtained through the Alzheimer’s Disease (AD) Data Initiative’s AD workbench. We first manually created a dataset by matching 347 EU variables with 1322 candidate JP variables and treated matched variable pairs as positive instances and unmatched pairs as negative instances. We then developed four natural language processing (NLP) methods using state-of-the-art LLMs (E5, MPNet, MiniLM, and BioLORD-2023) to estimate variable similarity based on variable labels and derivation rules. A lexical matching method using fuzzy matching was included as a baseline model. In addition, we developed an ensemble-learning method, using the Random Forest (RF) model, to integrate individual NLP methods. RF was trained and evaluated on 50 trials. Each trial had a random split (4:1) of training and test sets, with the model’s hyperparameters optimized through cross-validation on the training set. For each EU variable, 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF’s probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR).
Results
E5 performed best among individual methods, achieving 0.898 HR-30 and 0.700 MRR.
RF performed better than E5 on all metrics over 50 trials (P