47 datasets found
  1. Feature ablation analysis.

    • plos.figshare.com
    xls
    Updated Jul 24, 2025
    Cite
    Zexu Li; Suraj P. Prabhu; Zachary T. Popp; Shubhi S. Jain; Vijetha Balakundi; Ting Fang Alvin Ang; Rhoda Au; Jinying Chen (2025). Feature ablation analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0328262.t006
    Available download formats: xls
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Zexu Li; Suraj P. Prabhu; Zachary T. Popp; Shubhi S. Jain; Vijetha Balakundi; Ting Fang Alvin Ang; Rhoda Au; Jinying Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.

    Methods: This study utilized data from two GERAS cohort studies (European [EU] and Japan [JP]) obtained through the Alzheimer’s Disease (AD) Data Initiative’s AD workbench. We first manually created a dataset by matching 347 EU variables with 1322 candidate JP variables and treated matched variable pairs as positive instances and unmatched pairs as negative instances. We then developed four natural language processing (NLP) methods using state-of-the-art LLMs (E5, MPNet, MiniLM, and BioLORD-2023) to estimate variable similarity based on variable labels and derivation rules. A lexical matching method using fuzzy matching was included as a baseline model. In addition, we developed an ensemble-learning method, using the Random Forest (RF) model, to integrate individual NLP methods. RF was trained and evaluated on 50 trials. Each trial had a random split (4:1) of training and test sets, with the model’s hyperparameters optimized through cross-validation on the training set. For each EU variable, 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF’s probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR).

    Results: E5 performed best among individual methods, achieving 0.898 HR-30 and 0.700 MRR. RF performed better than E5 on all metrics over 50 trials (P
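
    The two ranking metrics named in the abstract, HR-n and MRR, can be illustrated with a minimal Python sketch. This is not the authors' code: the variable names (`eu_age`, `jp_mmse`, and so on) and the toy similarity scores are hypothetical, and the sketch assumes HR-n counts a query as a hit when any correct match appears in the top n, and that MRR uses the rank of the first correct match.

    ```python
    def rank_candidates(scores):
        """Return candidate names sorted from highest to lowest score."""
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [name for name, _ in ordered]


    def hit_ratio(rankings, gold, n):
        """Share of queries with at least one correct match in the top n."""
        hits = 0
        for query, ranked in rankings.items():
            if any(candidate in gold[query] for candidate in ranked[:n]):
                hits += 1
        return hits / len(rankings)


    def mean_reciprocal_rank(rankings, gold):
        """Average of 1/rank of the first correct match (0 if none is found)."""
        total = 0.0
        for query, ranked in rankings.items():
            for position, candidate in enumerate(ranked, start=1):
                if candidate in gold[query]:
                    total += 1.0 / position
                    break  # only the first correct match counts
        return total / len(rankings)


    # Hypothetical similarity scores for two EU variables against three JP candidates.
    toy_scores = {
        "eu_age": {"jp_age": 0.9, "jp_sex": 0.4, "jp_mmse": 0.1},
        "eu_mmse": {"jp_age": 0.2, "jp_sex": 0.7, "jp_mmse": 0.5},
    }
    toy_gold = {"eu_age": {"jp_age"}, "eu_mmse": {"jp_mmse"}}

    toy_rankings = {q: rank_candidates(s) for q, s in toy_scores.items()}
    hr1 = hit_ratio(toy_rankings, toy_gold, 1)
    hr2 = hit_ratio(toy_rankings, toy_gold, 2)
    mrr = mean_reciprocal_rank(toy_rankings, toy_gold)
    ```

    In this toy case the correct match is ranked first for one query and second for the other, so HR-1 is 0.5, HR-2 is 1.0 and MRR is 0.75.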

  2. D3.3_20230919_ProbeField_Aligned_Spectra_V1

    • data.niaid.nih.gov
    Updated Sep 16, 2024
    Cite
    Castaldi, Fabio (2024). D3.3_20230919_ProbeField_Aligned_Spectra_V1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13757028
    Dataset updated
    Sep 16, 2024
    Dataset provided by
    National Research Council
    Authors
    Castaldi, Fabio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of 470 laboratory soil spectra has been aligned using the white Lucky Bay sand as the internal soil standard (ISS). Soil samples were collected in Sweden by SLU, in Italy by CNR, in France by INRAE and in Poland by IUNG, and the spectra were acquired in the lab on dry samples. Each partner scanned the ISS using the same instrument as for the soil samples, which made it possible to compute a correction factor for each instrument.

    Along with the main dataset, an explanatory document and an R script are provided.

    The R script can be used to align spectral data acquired with different instruments. The provided file "CF_lb" contains 5 correction factors for 5 instruments. The correction factors were computed using the Lucky Bay spectra scanned by each of the 5 instruments (ISS) and the master Lucky Bay spectrum acquired in the CSIRO lab.
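
    To make the alignment idea concrete, here is a minimal sketch in Python (not the R script shipped with the dataset). It assumes the correction factor is the per-wavelength ratio of the master ISS spectrum to the instrument's own ISS spectrum, and that alignment multiplies each soil spectrum by that factor; the actual formula in the provided script may differ. All reflectance values are invented for illustration.

    ```python
    def correction_factor(master_iss, instrument_iss):
        """Per-wavelength ratio master / instrument (assumed definition)."""
        if len(master_iss) != len(instrument_iss):
            raise ValueError("spectra must share the same wavelength grid")
        return [m / s for m, s in zip(master_iss, instrument_iss)]


    def align_spectrum(spectrum, factor):
        """Apply the correction factor to one spectrum, band by band."""
        if len(spectrum) != len(factor):
            raise ValueError("spectrum and factor must have the same length")
        return [r * c for r, c in zip(spectrum, factor)]


    # Hypothetical reflectance values on a three-band grid.
    master_lucky_bay = [0.80, 0.82, 0.84]       # master ISS scan
    instrument_lucky_bay = [0.76, 0.82, 0.88]   # same ISS on a partner instrument
    soil_sample = [0.30, 0.35, 0.40]            # soil scanned on that instrument

    cf = correction_factor(master_lucky_bay, instrument_lucky_bay)
    aligned_soil = align_spectrum(soil_sample, cf)
    # Sanity check: aligning the instrument's own ISS scan recovers the master.
    aligned_iss = align_spectrum(instrument_lucky_bay, cf)
    ```

    Under this assumption, an instrument that reads the standard too low in a band has its soil spectra scaled up in that band, and vice versa.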

  3. Harmonized Cultural Access & Participation Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    csv, pdf
    Updated Jul 17, 2024
    Cite
    Daniel Antal (2024). Harmonized Cultural Access & Participation Dataset [Dataset]. http://doi.org/10.5281/zenodo.5781672
    Available download formats: csv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Antal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "id": survey identifier
    "rowid": unique observation identifier
    "fct_visit_concert": visiting frequency to concerts, categorical representation
    "fct_visit_library": visiting frequency to public libraries, categorical representation
    "fct_visit_museum": visiting frequency to museums or galleries, categorical representation
    "visit_concert": visiting frequency to concerts, numerical representation
    "visit_library": visiting frequency to public libraries, numerical representation
    "visit_museum": visiting frequency to museums or galleries, numerical representation
    "age_education": school leaving age
    "age_exact": age
    "is_student": respondent still studying (1=yes, 0=no)
    "geo": geographic concept
    "w1": post-stratification weight for geo
    "w_uk": post-stratification weight for Northern Ireland and Great Britain
    "w_de":
    "wex": projected post-stratification weight
    "country_code": country code, unifying Germany and the United Kingdom (originally separate samples)
    "w": post-stratification weight for country_code
    "year_survey": year of the survey
    "is_visit_concert": binary variable, 0 if the person did not visit concerts, public libraries, musea...
    "is_visit_library": binary variable, 0 if the person did not visit public libraries
    "is_visit_museum": binary variable, 0 if the person did not visit museums or galleries
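
    As an illustration of how the binary indicators and weights in this codebook might be combined, the following Python sketch computes a weighted participation share per country. It is a hedged example, not part of the dataset: the rows are invented, the helper `weighted_share` is hypothetical, and using `w` together with `country_code` is an assumption based on the variable descriptions above.

    ```python
    from collections import defaultdict


    def weighted_share(rows, flag, weight="w", group="country_code"):
        """Weighted mean of a 0/1 indicator within each group."""
        numerator = defaultdict(float)
        denominator = defaultdict(float)
        for row in rows:
            w = float(row[weight])
            numerator[row[group]] += w * float(row[flag])
            denominator[row[group]] += w
        return {g: numerator[g] / denominator[g] for g in denominator}


    # Invented observations using the codebook's column names.
    rows = [
        {"country_code": "DE", "w": 1.0, "is_visit_museum": 1},
        {"country_code": "DE", "w": 3.0, "is_visit_museum": 0},
        {"country_code": "UK", "w": 2.0, "is_visit_museum": 1},
        {"country_code": "UK", "w": 2.0, "is_visit_museum": 1},
    ]

    museum_share = weighted_share(rows, "is_visit_museum")
    ```

    With these invented rows the weighted share of museum visitors is 0.25 for "DE" and 1.0 for "UK".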

  4. Evaluation of pre-processing on the meta-analysis of DNA methylation data...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    Cite
    Claudia Sala; Pietro Di Lena; Danielle Fernandes Durso; Andrea Prodi; Gastone Castellani; Christine Nardini (2023). Evaluation of pre-processing on the meta-analysis of DNA methylation data from the Illumina HumanMethylation450 BeadChip platform [Dataset]. http://doi.org/10.1371/journal.pone.0229763
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Claudia Sala; Pietro Di Lena; Danielle Fernandes Durso; Andrea Prodi; Gastone Castellani; Christine Nardini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Meta-analysis is a powerful means for leveraging the hundreds of experiments being run worldwide into more statistically powerful analyses. This is also true for the analysis of omic data, including genome-wide DNA methylation. In particular, thousands of DNA methylation profiles generated using the Illumina 450k are stored in the publicly accessible Gene Expression Omnibus (GEO) repository. Often, however, the intensity values produced by the BeadChip (raw data) are not deposited, therefore only pre-processed values (obtained after computational manipulation) are available. Pre-processing is possibly different among studies and may then affect meta-analysis by introducing non-biological sources of variability.

    Material and methods: To systematically investigate the effect of pre-processing on meta-analysis, we analysed four different collections of DNA methylation samples (datasets), each composed of two subsets, for which raw data from controls (i.e. healthy subjects) and cases (i.e. patients) are available. We pre-processed the data from each dataset with nine among the most common pipelines found in literature. Moreover, we evaluated the performance of regRCPqn, a modification of the RCP algorithm that aims to improve data consistency. For each combination of pre-processing (9 × 9), we first evaluated the between-sample variability among control subjects and, then, we identified genomic positions that are differentially methylated between cases and controls (differential analysis).

    Results and conclusion: The pre-processing of DNA methylation data affects both the between-sample variability and the loci identified as differentially methylated, and the effects of pre-processing are strongly dataset-dependent. By contrast, application of our renormalization algorithm regRCPqn: (i) reduces variability and (ii) increases agreement between meta-analysed datasets, both critical components of data harmonization.

  5. Household Health Survey 2012-2013, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Jun 26, 2017
    + more versions
    Cite
    Central Statistical Organization (CSO) (2017). Household Health Survey 2012-2013, Economic Research Forum (ERF) Harmonization Data - Iraq [Dataset]. https://catalog.ihsn.org/index.php/catalog/6937
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    Central Statistical Organization (CSO)
    Economic Research Forum
    Kurdistan Regional Statistics Office (KRSO)
    Time period covered
    2012 - 2013
    Area covered
    Iraq
    Description

    Abstract

    The harmonized data set on health, created and published by the ERF, is a subset of the Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules collected in the context of the above-mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.

    ----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:

    Iraq is considered a leader in household expenditure and income surveys: the first was conducted in 1946, followed by surveys in 1954 and 1961. After the establishment of the Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years (1971/1972, 1976, 1979, 1984/1985, 1988, 1993, 2002/2007). In cooperation with the World Bank (WB), the Central Statistical Organization (CSO) and the Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year, covering all governorates including those in the Kurdistan Region.

    The survey has six main objectives. These objectives are:

    1. Provide data for poverty analysis and measurement, and monitor, evaluate and update the implementation of the Poverty Reduction National Strategy issued in 2009.
    2. Provide a comprehensive data system to assess household social and economic conditions and prepare the indicators related to human development.
    3. Provide data that meet the needs and requirements of national accounts.
    4. Provide detailed indicators on consumption expenditure that support decision-making related to production, consumption, export and import.
    5. Provide detailed indicators on the sources of household and individual income.
    6. Provide the data necessary for formulating a new consumer price index number.

    The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum to create a version comparable with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables to the harmonized ones, as well as on the variables' availability in both survey years and relevant comments.

    Geographic coverage

    National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.

    Analysis unit

    1- Household/family. 2- Individual/person.

    Universe

    The survey was carried out over a full year covering all governorates including those in Kurdistan Region.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    ----> Design:

    The sample size was 25488 households for the whole of Iraq: 216 households in each of the 118 districts, organized in 2832 clusters of 9 households each, distributed across districts and governorates for rural and urban areas.

    ----> Sample frame:

    The listing and numbering results of the 2009-2010 Population and Housing Survey were adopted in all governorates, including the Kurdistan Region, as the frame for selecting households. The sample was selected in two stages. Stage 1: Primary sampling units (blocks) within each stratum (district), for urban and rural areas, were systematically selected with probability proportional to size to reach 2832 units (clusters). Stage 2: 9 households from each primary sampling unit were selected to create a cluster. The total sample size was thus 25488 households distributed across the governorates, 216 households in each district.
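
    The sample-size figures quoted in the design can be cross-checked with a few lines of arithmetic. The Python sketch below only restates numbers given in the text (118 districts, 216 households per district, 9 households per cluster); the variable names are illustrative.

    ```python
    districts = 118
    households_per_district = 216
    households_per_cluster = 9

    # Total households: 118 districts x 216 households each.
    total_households = districts * households_per_district

    # Total clusters: every cluster holds 9 households.
    total_clusters = total_households // households_per_cluster

    # Clusters (primary sampling units) per district.
    clusters_per_district = households_per_district // households_per_cluster
    ```

    The results, 25488 households, 2832 clusters and 24 clusters per district, agree with the totals stated in the sampling description.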

    ----> Sampling Stages:

    In each district, the sample was selected in two stages. Stage 1: Based on the 2010 listing and numbering frame, 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, in addition to the implicit breakdown into urban and rural areas and the geographic breakdown (sub-district, quarter, street, county, village and block). Stage 2: Using households as secondary sampling units, 9 households were selected from each sample point using systematic equal probability sampling. The sampling frames for each stage can be developed based on the 2010 building listing and numbering without updating household lists. In some small districts, the random selection of primary sampling units may yield fewer than 24 distinct units, so a sampling unit may be selected more than once; where necessary, two or more clusters may be drawn from the same enumeration unit.
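
    The two selection stages can be sketched in Python as follows. This is a generic textbook version of systematic sampling with probability proportional to size (stage 1) and systematic equal probability sampling (stage 2), not the survey's actual procedure; the block sizes, function names and random seed are hypothetical.

    ```python
    import random


    def systematic_pps(sizes, n, rng):
        """Select n units with probability proportional to size.

        Walks the cumulative size scale at a fixed step from a random start;
        a unit larger than the step can be selected more than once.
        """
        step = sum(sizes) / n
        start = rng.uniform(0, step)
        targets = [start + k * step for k in range(n)]
        selected = []
        cumulative = 0.0
        next_target = 0
        for unit, size in enumerate(sizes):
            cumulative += size
            while next_target < n and targets[next_target] < cumulative:
                selected.append(unit)
                next_target += 1
        return selected


    def systematic_equal(n_units, n, rng):
        """Select n of n_units listed households at a fixed interval."""
        step = n_units / n
        start = rng.uniform(0, step)
        return [int(start + k * step) for k in range(n)]


    rng = random.Random(0)
    # Hypothetical district: five small blocks and one very large block.
    block_sizes = [10, 10, 10, 10, 10, 500]
    clusters = systematic_pps(block_sizes, 4, rng)
    # Stage 2: 9 households out of 40 listed in one hypothetical sample point.
    households = systematic_equal(40, 9, rng)
    ```

    Because the large block is bigger than the sampling step, it is drawn several times, which mirrors the note above that a small district may contribute two or more clusters from the same enumeration unit.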

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    ----> Preparation:

    The questionnaire of the 2006 survey was used as the basis for designing the 2012 questionnaire, to which many revisions were made. Two rounds of pre-tests were carried out. Revisions were made based on feedback from the fieldwork team, World Bank consultants and others, and further revisions were made before the final version was implemented in a pilot survey in September 2011. After the pilot survey, additional revisions were made based on the challenges and feedback that emerged during its implementation, leading to the final version used in the actual survey.

    ----> Questionnaire Parts:

    The questionnaire consists of four parts, each with several sections:

    Part 1: Socio – Economic Data:
    - Section 1: Household Roster
    - Section 2: Emigration
    - Section 3: Food Rations
    - Section 4: housing
    - Section 5: education
    - Section 6: health
    - Section 7: Physical measurements
    - Section 8: job seeking and previous job

    Part 2: Monthly, Quarterly and Annual Expenditures:
    - Section 9: Expenditures on Non – Food Commodities and Services (past 30 days).
    - Section 10: Expenditures on Non – Food Commodities and Services (past 90 days).
    - Section 11: Expenditures on Non – Food Commodities and Services (past 12 months).
    - Section 12: Expenditures on Non-food Frequent Food Stuff and Commodities (7 days).
    - Section 12, Table 1: Meals Had Within the Residential Unit.
    - Section 12, Table 2: Number of Persons Participate in the Meals within Household Expenditure Other Than its Members.

    Part 3: Income and Other Data:
    - Section 13: Job
    - Section 14: paid jobs
    - Section 15: Agriculture, forestry and fishing
    - Section 16: Household non – agricultural projects
    - Section 17: Income from ownership and transfers
    - Section 18: Durable goods
    - Section 19: Loans, advances and subsidies
    - Section 20: Shocks and strategy of dealing in the households
    - Section 21: Time use
    - Section 22: Justice
    - Section 23: Satisfaction in life
    - Section 24: Food consumption during past 7 days

    Part 4: Diary of Daily Expenditures: The diary of expenditure is an essential component of this survey. It is left with the household to record all daily purchases, such as expenditures on food and frequent non-food items (gasoline, newspapers, etc.), during 7 days. Two pages were allocated for recording the expenditures of each day, so the roster consists of 14 pages.

    Cleaning operations

    ----> Raw Data:

    Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages:
    1. Interviewer: Checks all answers on the household questionnaire, confirming that they are clear and correct.
    2. Local Supervisor: Checks to make sure that questions have been correctly completed.
    3. Statistical analysis: After exporting data files from Excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values in addition to auditing some variables.
    4. World Bank consultants in coordination with the CSO data management team: The World Bank technical consultants use additional programs in SPSS and STAT to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameter for each variable.

    ----> Harmonized Data:

    • The SPSS package is used to harmonize the Iraq Household Socio Economic Survey (IHSES) 2007 with Iraq Household Socio Economic Survey (IHSES) 2012.
    • The harmonization process starts with raw data files received from the Statistical Office.
    • A program is generated for each dataset to create harmonized variables.
    • Data is saved on the household and individual level, in SPSS and then converted to STATA, to be disseminated.

    Response rate

    The Iraq Household Socio Economic Survey (IHSES) reached a total of 25488 households. The number of households that refused to respond was 305, and the response rate was 98.6%. The highest interview rates were in Ninevah and Muthanna (100%), while the lowest rate was in Sulaimaniya (92%).

  6. COVID-19 High Frequency Phone Survey of Households 2020 - World Bank LSMS...

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1 more
    Updated Oct 25, 2021
    + more versions
    Cite
    Central Statistics Agency of Ethiopia (2021). COVID-19 High Frequency Phone Survey of Households 2020 - World Bank LSMS Harmonized Dataset - Ethiopia [Dataset]. https://microdata.worldbank.org/index.php/catalog/4072
    Dataset updated
    Oct 25, 2021
    Dataset authored and provided by
    Central Statistics Agency of Ethiopia
    Time period covered
    2018 - 2021
    Area covered
    Ethiopia
    Description

    Abstract

    To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.

    The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.

    Two harmonized datafiles are prepared for each survey:
    1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractor and crop sales.
    2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.

    Geographic coverage

    National coverage

    Analysis unit

    • Households
    • Individuals

    Universe

    The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    See “Ethiopia - Socioeconomic Survey 2018-2019” and “Ethiopia - COVID-19 High Frequency Phone Survey of Households 2020” available in the Microdata Library for details.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Cleaning operations

    Ethiopia Socioeconomic Survey (ESS) 2018-2019 and Ethiopia COVID-19 High Frequency Phone Survey of Households (HFPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).

    The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round is noted with “_rX” in the variable name, where X represents the number of the round. For example, “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
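
    The “_rX” naming rule lends itself to a small parsing helper. The Python sketch below is illustrative only: it assumes the round marker appears as a suffix at the end of the variable name, and the example variable names are hypothetical rather than taken from the datafiles.

    ```python
    import re

    # Matches a trailing "_r" followed by the round number (assumed to be a suffix).
    ROUND_SUFFIX = re.compile(r"_r(\d+)$")


    def split_round(variable_name):
        """Return (base_name, round_number); names without a suffix are Round 0."""
        match = ROUND_SUFFIX.search(variable_name)
        if match is None:
            return variable_name, 0
        return variable_name[:match.start()], int(match.group(1))


    # Hypothetical variable names.
    parsed = [split_round(name) for name in ["food_insecure_r3", "hhsize", "work_r12"]]
    ```

    Here `parsed` is `[("food_insecure", 3), ("hhsize", 0), ("work", 12)]`, so variables can be grouped by base name and compared across rounds.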

    Response rate

    See “Ethiopia - Socioeconomic Survey 2018-2019” and “Ethiopia - COVID-19 High Frequency Phone Survey of Households 2020” available in the Microdata Library for details.

  7. VOTP Dataset

    • kaggle.com
    zip
    Updated Apr 10, 2017
    Cite
    sdorius (2017). VOTP Dataset [Dataset]. https://www.kaggle.com/sdorius/votpharm
    Available download formats: zip (24823052 bytes)
    Dataset updated
    Apr 10, 2017
    Authors
    sdorius
    Description

    This is an integration of 10 independent multi-country, multi-region, multi-cultural social surveys fielded by Gallup International between 2000 and 2013. The integrated data file contains responses from 535,159 adults living in 103 countries. In total, the harmonization project combined 571 social surveys.

    These data have value in a number of longitudinal multi-country, multi-regional, and multi-cultural (L3M) research designs. They can be understood as independent, though non-random, L3M samples containing a number of multiple-indicator ASQ (ask same questions) and ADQ (ask different questions) measures of human development, the environment, international relations, gender equality, security, international organizations, and democracy, to name a few [see full list below].

    The data can be used for exploratory and descriptive analysis, with greatest utility at low levels of resolution (e.g. nation-states, supranational groupings). Level of resolution in analysis of these data should be sufficiently low to approximate confidence intervals.

    These data can be used for teaching 3M methods, including data harmonization in L3M, 3M research design, survey design, 3M measurement invariance, analysis, visualization, and reporting. They also offer opportunities to teach about paradata, metadata, and data management in L3M designs.

    The country units are an unbalanced panel derived from non-probability samples of countries and respondents. Panels (countries) have left and right censorship and are thus unbalanced. This design limitation can be overcome to the extent that VOTP panels are harmonized with public measurements from other 3M surveys to establish balance in terms of panels and occasions of measurement. Should L3M harmonization occur, these data can be assigned confidence weights to reflect the amount of error in these surveys.

    Pooled public opinion surveys (country means), when combined with higher-quality country measurements of the same concepts (ASQ, ADQ), can be leveraged to increase the statistical power of pooled public opinion research designs (multiple L3M datasets), that is, in studies of public, rather than personal, beliefs.

    The Gallup Voice of the People survey data are based on uncertain and underspecified sampling methods. Country sampling is non-random. The sampling method appears to be primarily probability and quota sampling, with occasional oversampling of urban populations in difficult-to-survey populations. The sampling units (countries and individuals) are poorly defined, suggesting these data have more value in research designs calling for independent samples replication and repeated-measures frameworks.

    The Voice of the People Survey Series is WIN/Gallup International Association's End of Year survey and is a global study that collects the public's view on the challenges that the world faces today. Ongoing since 1977, the purpose of WIN/Gallup International's End of Year survey is to provide a platform for respondents to speak out concerning government and corporate policies. The Voice of the People, End of Year Surveys for 2012, fielded June 2012 to February 2013, were conducted in 56 countries to solicit public opinion on social and political issues. Respondents were asked whether their country was governed by the will of the people, as well as their attitudes about their society. Additional questions addressed respondents' living conditions and feelings of safety around their living area, as well as personal happiness. Respondents' opinions were also gathered in relation to business development and their views on the effectiveness of the World Health Organization. Respondents were also surveyed on ownership and use of mobile devices. Demographic information includes sex, age, income, education level, employment status, and type of living area.

  8. Fundamental Data Record for Atmospheric Composition [ATMOS_L1B]

    • earth.esa.int
    Updated Jul 1, 2024
    + more versions
    Cite
    European Space Agency (2024). Fundamental Data Record for Atmospheric Composition [ATMOS_L1B] [Dataset]. https://earth.esa.int/eogateway/catalog/fdr-for-atmospheric-composition
    Dataset updated
    Jul 1, 2024
    Dataset authored and provided by
    European Space Agency (http://www.esa.int/)
    License

    https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf

    Time period covered
    Jun 28, 1995 - Apr 7, 2012
    Description

    The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period ranging from 1995 to 2012. The data record offers harmonised cross-calibrated spectra with focus on spectral windows in the Ultraviolet-Visible-Near Infrared regions for the retrieval of critical atmospheric constituents like ozone (O3), sulphur dioxide (SO2), nitrogen dioxide (NO2) column densities, alongside cloud parameters. The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series.

    Product format

    For many aspects, the FDR product has improved compared to the existing individual mission datasets:

    • GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data.
    • Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites.
    • SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range. This simplifies the use of the SCIAMACHY spectra, which were split in a complex cluster structure (with own integration time) in the original Level 1b data.
    • The harmonization process applied mitigates the viewing angle dependency observed in the UV spectral region for GOME data.
    • Uncertainties are provided.

    Each FDR product provides, within the same file, irradiance/reflectance data for UV-VIS-NIR spectral regions across all orbits on a single day, including therein information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements. FDR has been generated in two formats: Level 1A and Level 1B, targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users. In case of specific requirements, please contact EOHelp. Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities. For further details, please consult our Terms and Conditions page.

    Uncertainty characterisation

    One of the main aspects of the project was the characterization of Level 1 uncertainties for both instruments, based on metrological best practices. The following documents are provided:

    • General guidance on a metrological approach to Fundamental Data Records (FDR)
    • Uncertainty Characterisation document
    • Effect tables
    • NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scene for 2003 and 2010) and GOME (Atlantic scene for 2003): reflectance_uncertainty_example_FDR4ATMOS_GOME.nc, reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc

    Known Issues

    Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0

    In the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause of the issue lies in the incorrect indexing of the lambda variable during the NetCDF writing process. Notably, the wavelength values themselves are calculated correctly within the processing chain.

    Temporary Workaround

    The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results.

    Python pseudo-code example: lambda_...
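
    A minimal Python sketch of the workaround, written independently of the README's own pseudo-code: it assumes the wavelength variable has already been read into a nested list indexed as [time][scanline][pixel], which matches the statement that the first two indices are time and scanline. The function name and the wavelength values are hypothetical, and reading the NetCDF file itself (for example with a NetCDF library) is left out.

    ```python
    def repair_wavelength_axis(lambda_values):
        """Copy the wavelength axis of record [0][0] to every other record.

        lambda_values: nested list indexed as [time][scanline][pixel]
        (assumed layout). Returns a new nested list; the input is untouched.
        """
        reference = list(lambda_values[0][0])
        return [
            [list(reference) for _ in scanlines]
            for scanlines in lambda_values
        ]


    # Hypothetical UV channel: 2 time steps x 2 scanlines x 3 pixels.
    raw_lambda = [
        [[240.0, 240.1, 240.2], [240.2, 240.0, 240.1]],
        [[240.1, 240.2, 240.0], [240.0, 240.2, 240.1]],
    ]
    fixed_lambda = repair_wavelength_axis(raw_lambda)
    ```

    As the text notes, the same step would need to be repeated for each spectral range (UV, VIS, NIR) and for every daily product.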

  9. Supplementary material for Lee et al. 2019 Taxonomic harmonization may...

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Supplementary material for Lee et al. 2019 Taxonomic harmonization may reveal a stronger association between diatom assemblages and total phosphorus in large datasets [Dataset]. https://catalog.data.gov/dataset/supplementary-material-for-lee-et-al-2019-taxonomic-harmonization-may-reveal-a-stronger-as
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Diatom data have been collected in large-scale biological assessments in the United States, such as the U.S. Environmental Protection Agency’s National Rivers and Streams Assessment (NRSA). However, the effectiveness of diatoms as indicators may suffer if inconsistent taxon identifications across different analysts obscure the relationships between assemblage composition and environmental variables. To reduce these inconsistencies, we harmonized the 2008–2009 NRSA data from nine analysts by updating names to current synonyms and by statistically identifying taxa with high analyst signal (taxa with more variation in relative abundance explained by the analyst factor, relative to environmental variables). We then screened a subset of samples with QA/QC data and combined taxa with mismatching identifications by the primary and secondary analysts. When these combined “slash groups” did not reduce analyst signal, we elevated taxa to the genus level or omitted taxa in difficult species complexes. We examined the variation explained by analyst in the original and revised datasets. Further, we examined how revising the datasets to reduce analyst signal can reduce inconsistency, thereby uncovering the variation in assemblage composition explained by total phosphorus (TP), an environmental variable of high priority for water managers. To produce a revised dataset with the greatest taxonomic consistency, we ultimately made 124 slash groups, omitted 7 taxa in the small naviculoid (e.g., Sellaphora atomoides) species complex, and elevated Nitzschia, Diploneis, and Tryblionella taxa to the genus level. Relative to the original dataset, the revised dataset had more overlap among samples grouped by analyst in ordination space, less variation explained by the analyst factor, and more than double the variation in assemblage composition explained by TP. 
Elevating all taxa to the genus level did not eliminate analyst signal completely, and analyst remained the most important predictor for the genera Sellaphora, Mayamaea, and Psammodictyon, indicating that these taxa present the greatest obstacle to consistent identification in this dataset. Although our process did not completely remove analyst signal, this work provides a method to minimize analyst signal and improve detection of diatom association with TP in large datasets involving multiple analysts. Examination of variation in assemblage data explained by analyst and taxonomic harmonization may be necessary steps for improving data quality and the utility of diatoms as indicators of environmental variables. This dataset is associated with the following publication: Lee, S., I. Bishop, S. Spaulding, R. Mitchell, and L. Yuan. Taxonomic harmonization may reveal a stronger association between diatom assemblages and total phosphorus in large datasets. ECOLOGICAL INDICATORS. Elsevier Science Ltd, New York, NY, USA, 102: 166-174, (2019).
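The renaming rules described above (omitting taxa, merging taxa into slash groups, and elevating selected genera to the genus level) can be sketched as a small lookup routine. This is a hedged illustration only: the order in which the rules are applied, the function names, the "Taxon A/B/C" placeholders, and all counts are assumptions, not the authors' workflow or data.

```python
def harmonize_taxon(name, slash_groups, genus_level, omitted):
    """Return the harmonized name for one taxon, or None if it is omitted."""
    if name in omitted:
        return None
    if name in slash_groups:
        return slash_groups[name]
    genus = name.split()[0]
    if genus in genus_level:
        return genus  # elevate to the genus level
    return name


def harmonize_counts(counts, slash_groups, genus_level, omitted):
    """Sum counts under their harmonized names."""
    result = {}
    for name, count in counts.items():
        key = harmonize_taxon(name, slash_groups, genus_level, omitted)
        if key is not None:
            result[key] = result.get(key, 0) + count
    return result


# Genera elevated in the description; the slash group and counts are made up.
genus_level = {"Nitzschia", "Diploneis", "Tryblionella"}
omitted = {"Sellaphora atomoides"}
slash_groups = {"Taxon A": "Taxon A/Taxon B", "Taxon B": "Taxon A/Taxon B"}
counts = {
    "Nitzschia palea": 5,
    "Nitzschia amphibia": 7,
    "Taxon A": 3,
    "Taxon B": 4,
    "Sellaphora atomoides": 2,
    "Taxon C": 4,
}
harmonized = harmonize_counts(counts, slash_groups, genus_level, omitted)
```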

  10. High-Frequency Phone Survey on COVID-19 - World Bank LSMS Harmonized Dataset...

    • catalog.ihsn.org
    • microdata.worldbank.org
    Updated Jan 3, 2022
    + more versions
    Cite
    Malawi National Statistical Office (NSO) (2022). High-Frequency Phone Survey on COVID-19 - World Bank LSMS Harmonized Dataset - Malawi [Dataset]. https://catalog.ihsn.org/catalog/9901
    Explore at:
    Dataset updated
    Jan 3, 2022
    Dataset provided by
    National Statistical Office of Malawi (http://www.nsomalawi.mw/)
    Authors
    Malawi National Statistical Office (NSO)
    Time period covered
    2019 - 2021
    Area covered
    Malawi
    Description

    Abstract

    To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created the harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.

    The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.

    Two harmonized datafiles are prepared for each survey:
    1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractor and crop sales.
    2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.

    Geographic coverage

    National coverage

    Analysis unit

    • Households
    • Individuals

    Universe

    The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    See “Malawi - Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102 EAs)” and “Malawi - High-Frequency Phone Survey on COVID-19” available in the Microdata Library for details.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Cleaning operations

    Malawi Integrated Household Panel Survey (IHPS) 2019 and Malawi High-Frequency Phone Survey on COVID-19 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).

    The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey which has become the sample frame for the high-frequency phone surveys on COVID-19. When the variables are without “_rX”, they were extracted from Round 0.
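The round-naming convention above can be read off programmatically. The following is a minimal sketch that assumes the “_rX” marker appears at the end of the variable name; the function name and the example variable names are hypothetical and are not taken from the datafiles.

```python
import re

# Assumption: the round marker is a suffix such as "_r3" at the end of the name.
_ROUND_SUFFIX = re.compile(r"_r(\d+)$")


def survey_round(variable_name):
    """Return the survey round encoded in a variable name (0 if no marker)."""
    match = _ROUND_SUFFIX.search(variable_name)
    return int(match.group(1)) if match else 0


rounds = {name: survey_round(name) for name in ["hhsize", "work_r3", "food_r12"]}
```

Variables without a marker map to Round 0, which corresponds to the face-to-face survey used as the sample frame.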

    Response rate

    See “Malawi - Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102 EAs)” and “Malawi - High-Frequency Phone Survey on COVID-19” available in the Microdata Library for details.

  11. Household Expenditure and Income Survey 2008, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    Updated Jan 12, 2022
    + more versions
    Cite
    Department of Statistics (2022). Household Expenditure and Income Survey 2008, Economic Research Forum (ERF) Harmonization Data - Jordan [Dataset]. https://catalog.ihsn.org/index.php/catalog/7661
    Explore at:
    Dataset updated
    Jan 12, 2022
    Dataset authored and provided by
    Department of Statistics
    Time period covered
    2008 - 2009
    Area covered
    Jordan
    Description

    Abstract

    The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.

    Data collected through the survey helped in achieving the following objectives:
    1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
    2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
    3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
    4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
    5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
    6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor as well as drawing poverty maps
    7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty

    Geographic coverage

    National

    Analysis unit

    • Household/families
    • Individuals

    Universe

    The survey covered a national sample of households and all individuals permanently residing in surveyed households.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2008 Household Expenditure and Income Survey sample was designed using a two-stage stratified cluster sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn with probability proportionate to size, taking the number of households in each block as the block size. The second stage included drawing the household sample (8 households from each PSU) using the systematic sampling method. Four substitute households from each PSU were also drawn, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.

    To estimate the sample size, the coefficient of variation and design effect in each sub-district were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level, to ensure good cluster representation in the administrative areas and to enable drawing poverty pockets.

    It is worth mentioning that the expected non-response in addition to areas where poor families are concentrated in the major cities were taken into consideration in designing the sample. Therefore, a larger sample size was taken from these areas compared to other ones, in order to help in reaching the poverty pockets and covering them.
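The two-stage design described above can be illustrated with a short sketch: blocks are selected with probability proportionate to size, then 8 households are drawn systematically within each selected block. The systematic variant of PPS selection used here is an assumption (the description does not say how the PPS draw was implemented), and the block sizes, identifiers, and function names are made up; stratification and substitute households are left out.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n unit indices with probability proportionate to size."""
    step = sum(sizes) / n
    start = rng.random() * step
    selected, cumulative, idx = [], 0, 0
    for i in range(n):
        target = start + i * step
        # Advance to the unit whose cumulative size interval contains the target.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= target:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


def systematic_sample(items, n, rng):
    """Draw n items at a fixed interval from a random start."""
    step = len(items) / n
    start = rng.random() * step
    return [items[min(int(start + i * step), len(items) - 1)] for i in range(n)]


rng = random.Random(2008)
# Hypothetical frame: 20 blocks holding 10 to 29 households each.
blocks = {b: [f"block{b}-hh{h}" for h in range(10 + b)] for b in range(20)}
block_ids = list(blocks)
chosen = pps_systematic([len(blocks[b]) for b in block_ids], 3, rng)
sample = {block_ids[i]: systematic_sample(blocks[block_ids[i]], 8, rng) for i in chosen}
```

Note that with this PPS variant a block larger than the sampling interval could be selected more than once; the synthetic sizes above are small enough that this does not happen.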

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form

    Cleaning operations

    Raw Data

    The design and implementation of the survey procedures were:
    1. Sample design and selection
    2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparing instruction manuals
    3. Design of the table templates to be used for the dissemination of the survey results
    4. Preparation of the fieldwork phase including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
    5. Selection and training of survey staff to collect data and run required data checks
    6. Preparation and implementation of the pretest phase for the survey designed to test and develop forms/questionnaires, instructions and software programs required for data processing and production of survey results
    7. Data collection
    8. Data checking and coding
    9. Data entry
    10. Data cleaning using data validation programs
    11. Data accuracy and consistency checks
    12. Data tabulation and preliminary results
    13. Preparation of the final report and dissemination of final results

    Harmonized Data

    • The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets
    • The harmonization process started with cleaning all raw data files received from the Statistical Office
    • Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
    • A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
    • A post-harmonization cleaning process was run on the data
    • Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format

  12. Harmonized Eurobarometer 2004-2021

    • search.gesis.org
    Cite
    Russo, Luana; Bräutigam, Milena, Harmonized Eurobarometer 2004-2021 [Dataset]. http://doi.org/10.7802/2539
    Explore at:
    Dataset provided by
    GESIS, Köln
    GESIS search
    Authors
    Russo, Luana; Bräutigam, Milena
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    +++++++++++++++ Version 3.0.0 +++++++++++++++

    We carried out a harmonization of the Eurobarometer 2004-2021 (spring). This dataset includes 35 single standard Eurobarometers and more than 140 variables about EU policies, attitudes towards Europe and the EU, identity, cognitive mobilization, political institutions, socio-political characteristics and partisanship, etc.

    The harmonization was carried out using existing Eurobarometer datasets published by GESIS. To allow the user to replicate the harmonization and modify some of the code if needed, we publish one example of a do-file used to carry out the harmonization, as well as the corresponding (harmonized) dataset. The user can find the do-file containing the code used to modify and clean EB 953 (ZA7783, conducted in spring 2021) according to the harmonization procedure that we followed. Moreover, the user can find the cleaned dataset for EB 953 that was obtained after running the do-file. The files are named “EB 953.do” and “953_new.dta”.

    We include:
    • a harmonized dataset ("harmonised_EB_2004-2021.dta"),
    • a technical report ("User Guide Harmonized Eurobarometer 2004-2021"),
    • a summary of the original survey questions corresponding to the variables included in the dataset ("Trends_EBs_1970-2021.xlsx"),
    • one of the do-files used to carry out the harmonization (“EB 953.do”),
    • one of the datasets used before merging all datasets (“953_new.dta”).

  13. Global taxonomically harmonized pollen data set for Late Quaternary with...

    • doi.pangaea.de
    zip
    Updated Mar 29, 2021
    Cite
    Ulrike Herzschuh; Chenzhi Li; Thomas Böhmer; Birgit Heim; Xianyong Cao; Mareike Wieczorek (2021). Global taxonomically harmonized pollen data set for Late Quaternary with revised chronologies (LegacyPollen 1.0) [Dataset]. http://doi.org/10.1594/PANGAEA.929773
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 29, 2021
    Dataset provided by
    PANGAEA
    Authors
    Ulrike Herzschuh; Chenzhi Li; Thomas Böhmer; Birgit Heim; Xianyong Cao; Mareike Wieczorek
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 26, 1938 - Mar 18, 2014
    Area covered
    Description

    This data set contains the taxonomically harmonized pollen data from 2831 sites. 1032 sites are located in North America, 1075 sites in Europe, 488 sites in Asia, 150 sites in South America, 54 in Africa and 32 in the Indopacific. Most of the data were retrieved from the Neotoma Paleoecology Database (https://www.neotomadb.org/), with additional data from Cao et al. (2020; https://doi.org/10.5194/essd-12-119-2020), Cao et al. (2013, https://doi.org/10.1016/j.revpalbo.2013.02.003) and our own collection for the Asian sector. The ages of the samples refer to the newly established LegacyAge 1.0 framework (https://doi.pangaea.de/10.1594/PANGAEA.933132). The 10,110 original pollen taxa names and notations were harmonized to 1002 taxa names. We present the table with the harmonization approach cross-referencing the original taxa with the harmonized taxa names. The harmonised pollen data are presented as counts (when available) and as percentage values. We complement the data publication by providing the source information on the references (most data are related to Neotoma) as a table linked to each Dataset ID. The data set and site IDs are from Neotoma if the data sets are derived from the Neotoma repository. For our own data collection efforts (Cao et al. (2020), Cao et al. (2013) and our own data), we used the already published PANGAEA event names where they relate to the data, or created our own site names referencing geographical regions, similar to the Neotoma data naming principle.
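As a rough illustration of how a cross-reference table of this kind can be applied, the sketch below maps original taxa names to harmonized names, sums the counts, and converts them to percentages. The lookup entries, the counts, and the function name are hypothetical and do not come from the LegacyPollen tables.

```python
def harmonize_and_percent(counts, lookup):
    """Aggregate counts by harmonized taxon and express them as percentages."""
    harmonized = {}
    for original, count in counts.items():
        name = lookup[original]  # harmonized name for this original taxon
        harmonized[name] = harmonized.get(name, 0) + count
    total = sum(harmonized.values())
    percentages = {name: 100.0 * c / total for name, c in harmonized.items()}
    return harmonized, percentages


# Hypothetical cross-reference table and sample counts.
lookup = {"Pinus sylvestris": "Pinus", "Pinus undiff.": "Pinus", "Betula": "Betula"}
counts = {"Pinus sylvestris": 30, "Pinus undiff.": 20, "Betula": 50}
harmonized, percentages = harmonize_and_percent(counts, lookup)
```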

  14. Multisource surface-water-quality data for the Delaware River Basin

    • catalog.data.gov
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Multisource surface-water-quality data for the Delaware River Basin [Dataset]. https://catalog.data.gov/dataset/multisource-surface-water-quality-data-for-the-delaware-river-basin
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Delaware River
    Description

    The Delaware River Basin (DRB) is jointly managed by multiple states and the federal government, and there are many ongoing efforts to characterize and understand its water quality. Many State, Federal and non-profit organizations have collected surface-water-quality samples across the DRB for decades and many of these data are available through the National Water Quality Monitoring Council's Water Quality Portal (WQP). In this data release, WQP data in the DRB were harmonized, meaning that they were processed to create a clean and readily usable dataset. This harmonization processing included the synthesis of parameter names and fractions, the condensation of remarks and other data qualifiers, the resolution of duplicate records, an initial quality control check of the data, and other processing steps described in the metadata. This data set provides harmonized discrete multisource surface-water-quality data pulled from the WQP for nutrients, sediment, salinity, major ions, bacteria, temperature, dissolved oxygen, pH, and turbidity in the DRB, for all available years.

  15. HarP: Harmonized Prior river-lake database

    • nde-dev.biothings.io
    Updated Dec 4, 2024
    Cite
    Wang, Jida (2024). HarP: Harmonized Prior river-lake database [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_14205130
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Pavelsky, Tamlin M.
    Crétaux, Jean-François
    Sheng, Yongwei
    Yamazaki, Dai
    Allen, George H.
    Wang, Jida
    Sikder, Md Safat
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contact: Md Safat Sikder (mssikder@illinois.edu), Jida Wang (jidaw@illinois.edu)

    Citation

    Sikder, M. S., Wang, J., Allen, G. H., Sheng, Y., Yamazaki, D., Crétaux, J.-F., and Pavelsky, T. M., 2024. HarP: Harmonized Prior river-lake database. Zenodo, https://doi.org/10.5281/zenodo.14205131.

    If you only use the PLD-TopoCat dataset, please cite the following paper:

    Sikder, M. S., Wang, J., Allen, G. H., Sheng, Y., Yamazaki, D., Song, C., Ding, M., Crétaux, J.-F., and Pavelsky, T. M., 2023. Lake-TopoCat: A global lake drainage topology and catchment dataset. Earth System Science Data, 15, 3483-3511, https://doi.org/10.5194/essd-15-3483-2023.

    Data description and components

    The Harmonized Prior river-lake database (HarP) for SWOT integrated the SWOT River Database (SWORD) (Altenau et al., 2021) and the SWOT Prior Lake Database (PLD) (Wang et al., 2023) into a geometrically (lake/river) explicit but topologically harmonized vector database to allow for coupled fluvial-lacustrine applications, including a synergistic use of both river and lake products from SWOT.

    In addition to the input river network (SWORD v16) and lake database (PLD v106), we used the MERIT Hydro v1.0.1 (Yamazaki et al., 2019), a high-resolution (~90 m) global hydrography dataset, to develop this database.

    The SWORD-PLD harmonization process involves three major steps, with Step 3 being divided into three sub-steps. The processing chain is illustrated in the attached Figure "SWORD-PLD_harmonization_steps.jpg", as well as in Section 2 of the product description document. The HarP database consists of the outputs from each of the steps. For convenience, the global landmass (excluding Antarctica) was partitioned into 68 Pfafstetter Level-2 basins/regions, with their IDs shown in Figure "Pfaf2_basins.jpg" attached.

    The HarP database consists of five datasets or components (outputs from each step), each with multiple features. The five datasets are described below, and more details are elaborated in the product description document.

    1. Harmonized SWORD-PLD (file name "Harmonized_SWORD_PLD"): This is the fully harmonized SWORD-PLD dataset, the primary product of HarP (i.e., output of Step 3.3 in Figure "SWORD-PLD_harmonization_steps.jpg"). This dataset couples SWORD and PLD into a geometrically segmented but topologically integrated dataset at the node, reach, and catchment scales (stored by three feature layers, respectively):

      (a) Harmonized feature nodes: Harmonized_feature_nodes_pfaf_xx
      (b) Harmonized river network: Harmonized_river_network_pfaf_xx
      (c) Harmonized feature catchments: Harmonized_feature_catchments_pfaf_xx
      Note: "pfaf_xx" indicates the Pfafstetter Level-2 basin ID (shown in Figure "Pfaf2_basins.jpg").

    Figure "HarP_example.jpg", attached to this database, is an example of the fully harmonized SWORD-PLD dataset for the Ohio River Basin. The example shows three main features of the dataset: feature nodes (i.e., reach downstream ends, lake inlets, and lake outlets; see Fig. 3 in the product description document for definitions), river reaches (i.e., reaches characterized by SWORD alone, characterized by TopoCat alone, and shared by both SWORD and TopoCat), and catchments segmented by each of the feature nodes.

    2. Intersected SWORD-PLD drainage configuration (file name "Intersected_SWORD_PLD"): This dataset is the intersected SWORD-PLD (prior river-lake) features (i.e., output of Step 2 in Figure "SWORD-PLD_harmonization_steps.jpg"). This dataset was constructed independently from Step 1 and Step 3. In this dataset, the original geometries of SWORD and PLD are not altered, but instead, their geometric and drainage topological relationships are configured in the attribute tables. This dataset consists of three features:

      (a) Intersected reaches: Intersected_SWORD_reaches_pfaf_xx
      (b) Intersected nodes: Intersected_SWORD_nodes_pfaf_xx
      (c) Intersected lakes: Intersected_PLD_lakes_pfaf_xx

    3. PLD-TopoCat (file name "PLD_TopoCat"): This dataset is the lake drainage topology and catchments (TopoCat) for PLD lakes (i.e., output of Step 1 in Figure "SWORD-PLD_harmonization_steps.jpg"). PLD-TopoCat was developed to generate detailed lake drainage topology and connecting paths, which were later used to configure the off-SWORD-network PLD lakes into the tributaries that drain to SWORD. PLD-TopoCat was generated from PLD v106 and MERIT Hydro. Details of the development process and algorithm for TopoCat can be found in Sikder et al. (2023). The PLD-TopoCat dataset contains six features:

      (a) Lake original polygon: PLD_lakes_pfaf_xx
      (b) Lake raster polygon: Lake_raster_polygons_pfaf_xx
      (c) Lake outlets: Lake_outlets_pfaf_xx
      (d) Lake catchments: Lake_catchments_pfaf_xx
      (e) Inter-lake reaches: Inter_lake_reaches_pfaf_xx
      (f) Lake-network basins: Lake_network_basins_pfaf_xx
      Note: full version of the PLD-TopoCat is available here.

    4. SWORD-mirror network (file name "SWORD_mirror"): The SWORD-mirror network was constructed to facilitate the SWORD-TopoCat network merging process (i.e., output of Step 3.1 in Figure "SWORD-PLD_harmonization_steps.jpg"). It is essentially a replica of SWORD except that the original SWORD reaches are geometrically modified to be aligned with the topological/hydrographic information depicted in MERIT Hydro. The SWORD-mirror network consists of four features:

      (a) SWORD-original reaches: SWORD_original_reaches_pfaf_xx
      (b) SWORD-mirror prelim. reaches: SWORD_mirror_prelim_reaches_pfaf_xx
      (c) SWORD-mirror reaches: SWORD_mirror_reaches_pfaf_xx
      (d) SWORD-mirror reach catchments: SWORD_mirror_reach_catchments_pfaf_xx

    5. Merged SWORD-mirror – TopoCat network (file name "SWORD_TopoCat_merged"): This dataset is the output of Step 3.2 in Figure "SWORD-PLD_harmonization_steps.jpg". It is essentially the merged product of the inter-lake reaches (from Step 2) and SWORD-mirror reaches (from Step 3.1). The merged SWORD-mirror – TopoCat network consists of three features:

      (a) Merged SWORD-TopoCat reaches: SWORD_TopoCat_merged_reaches_pfaf_xx
      (b) SWORD nodes at SWORD-TopoCat confluence: SWORD_TopoCat_confluence_nodes_pfaf_xx
      (c) Reach catchments for merged network: SWORD_TopoCat_reach_catchments_pfaf_xx

    The attribute tables for each of the feature components are explained in Section 4 of the product description document. All files of HarP are available in both shapefile and geodatabase formats.
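The layer naming convention listed above can be expanded per basin with a small helper. This is a sketch under stated assumptions: the layer prefixes are copied from the list for two of the five components, but the way "xx" is substituted (the plain basin ID, here an illustrative value of 11) and the helper name are guesses and should be checked against the product description document.

```python
# Layer prefixes copied from the component list above (two components shown).
HARP_LAYERS = {
    "Harmonized_SWORD_PLD": [
        "Harmonized_feature_nodes",
        "Harmonized_river_network",
        "Harmonized_feature_catchments",
    ],
    "Intersected_SWORD_PLD": [
        "Intersected_SWORD_reaches",
        "Intersected_SWORD_nodes",
        "Intersected_PLD_lakes",
    ],
}


def layer_names(component, basin_id):
    """Build the layer names of one component for one Pfafstetter Level-2 basin.

    Assumption: "xx" in "pfaf_xx" is replaced by the basin ID as written.
    """
    return [f"{prefix}_pfaf_{basin_id}" for prefix in HARP_LAYERS[component]]


names = layer_names("Harmonized_SWORD_PLD", 11)
```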

    Disclaimer: Authors of this dataset claim no responsibility or liability for any consequences related to the use, citation, or dissemination of HarP. For any questions, please contact Safat Sikder and Jida Wang.

  16. COVID 19 MENA Monitor Enterprise Surveys, CMMENT – Wave 3 - Tunisia

    • erfdataportal.com
    • mail.erfdataportal.com
    Updated Oct 13, 2021
    + more versions
    Cite
    Economics Research Forum (2021). COVID 19 MENA Monitor Enterprise Surveys, CMMENT – Wave 3 - Tunisia [Dataset]. https://erfdataportal.com/index.php/catalog/229
    Explore at:
    Dataset updated
    Oct 13, 2021
    Dataset authored and provided by
    Economics Research Forum
    Time period covered
    2021
    Area covered
    Tunisia
    Description

    Abstract

    To better understand the impact of the shock induced by the COVID-19 pandemic on micro and small enterprises in Tunisia and assess the policy responses in a rapidly changing context, reliable data is imperative, and the need to resort to a dynamic data collection tool at a time when countries in the region are in a state of flux cannot be overstated. The COVID-19 MENA Monitor Survey was led by the Economic Research Forum (ERF) to provide data for researchers and policy makers on the economic and labor market impact of the global COVID-19 pandemic on enterprises.

    The ERF COVID-19 MENA Monitor Survey is constructed using a series of short panel phone surveys that are conducted approximately every two months, and it will cover business closure (temporary/permanent) due to lockdowns, ability to telework/deliver the service, disruptions to supply chains (for inputs and outputs), loss of product markets, increased cost of supplies, worker layoffs, salary adjustments, access to lines of credit and delays in transportation. Understanding the strategies of enterprises (particularly micro and small enterprises) to cope with the crisis is one of the main objectives of this survey. Specific constraints such as weak access to the internet in some areas or laws constraining goods' delivery will be analyzed. Enterprise owners will also be asked about prospects for the future, including ability to stay open, and whether they benefited from any measures to support their businesses. The ERF COVID-19 MENA Monitor Survey is a wide-ranging, nationally representative panel survey. Wave 3 of this dataset was collected from August to September 2021, harmonized by the Economic Research Forum (ERF), and is featured as enterprise data.

    The harmonization was designed to create comparable data that can facilitate cross-country and comparative research with other Arab countries (Morocco, Egypt, and Jordan). All the COVID-19 MENA Monitor surveys incorporate similar survey designs, with data on enterprises within Arab countries (Egypt, Jordan, Tunisia, and Morocco).

    Geographic coverage

    National

    Analysis unit

    Enterprises

    Universe

    The sample universe for the enterprise survey was enterprises that had 6-199 workers pre-COVID-19

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample universe for the firm survey was firms that had 6-199 workers pre-COVID-19. Stratified random samples were used to ensure adequate sample size in key strata. A target of 500 firms was set as a sample. Up to five attempts were made to ensure response if a phone number was not picked up/answered, was disconnected or busy, or was picked up but the interview could not be completed at that time. After the fifth failed attempt, a firm was treated as a non-response and a random firm from the same stratum was used as an alternate.
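The contact rule above (up to five call attempts, then a random alternate from the same stratum) can be sketched as follows. This is an illustration under assumptions, not the survey's fieldwork software: the firm labels, the `try_call` stand-in for a real call outcome, and the function names are all hypothetical.

```python
import random

MAX_ATTEMPTS = 5  # a firm counts as a non-response after the fifth failed attempt


def contact_firm(firm, try_call):
    """Call a firm up to MAX_ATTEMPTS times; return True once an interview succeeds."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if try_call(firm, attempt):
            return True
    return False


def run_stratum(primary, alternates, try_call, rng):
    """Interview primary firms, replacing non-responses with random alternates."""
    pool = list(alternates)
    rng.shuffle(pool)  # alternates are taken in random order
    completed = []
    for firm in primary:
        current = firm
        while current is not None and not contact_firm(current, try_call):
            current = pool.pop() if pool else None
        if current is not None:
            completed.append(current)
    return completed


# Hypothetical outcomes: attempt number on which each firm answers (None = never).
answers_on = {"A": 3, "B": None, "X": 1}
calls = {}


def try_call(firm, attempt):
    calls[firm] = calls.get(firm, 0) + 1
    return answers_on[firm] == attempt


completed = run_stratum(["A", "B"], ["X"], try_call, random.Random(0))
```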

    The National Institute of Statistics (INS) and Agency for the Promotion of Industry and Innovation (APII) databases were used as follows:
    • Tunisia did not have a Yellow Pages or similar database, so administrative/statistics data sources had to be used
    • The sample started with the INS frame with 1,238 enterprises with 6-200 wage employees
      ◦ Enterprises were stratified into: (1) Agriculture (2) Industry (3) Construction (4) Trade (5) Accommodation (6) Service
      ◦ Enterprises were also stratified by size in terms of 6-49 versus 50-200 employees
      ◦ A random stratified sample (order) was selected
      ◦ Further restricted to enterprises with 6-199 workers in February 2020 based on an eligibility question during the phone interview
      ◦ This sample frame was eventually exhausted
    • After the INS sample was exhausted, the APII sample was used
      ◦ APII only covered enterprises with 10+ workers
      ◦ APII only covered (1) services & transport, and (2) industry
    • Weights are based on the underlying data on all enterprises from INS, specifically: Entreprises privées selon l'activité principale et la tranche de salariés (RNE 2019).
      ◦ We ultimately stratify the Tunisia weights by industry and enterprises sized: 6-9 employees (since APII only covered 10+), 10-49, and 50-199.
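
    The description does not give the weighting formula. One common approach, shown here purely as an assumption, is a post-stratification weight equal to the number of enterprises in a stratum of the population divided by the number of responding firms in that stratum. The stratum labels and counts below are made up for illustration and are not INS figures.

```python
from collections import Counter


def stratum_weights(population_counts, sample_strata):
    """Post-stratification weights: population count / sample count per stratum.

    `population_counts` maps a (industry, size band) stratum to its population
    total; `sample_strata` lists the stratum of each responding firm.
    """
    sample_counts = Counter(sample_strata)
    return {s: population_counts[s] / n for s, n in sample_counts.items()}


# Hypothetical counts, for illustration only.
population = {("Industry", "6-9"): 400, ("Industry", "10-49"): 300}
respondents = [("Industry", "6-9")] * 20 + [("Industry", "10-49")] * 30
weights = stratum_weights(population, respondents)
```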

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The enterprise questionnaire is designed to capture the strategies enterprises (particularly micro and small enterprises) use to cope with the crisis, as well as related constraints and prospects for the future. It includes questions on business closure (temporary/permanent) due to lockdowns, ability to telework/deliver the service, disruptions to supply chains (for inputs and outputs), loss of product markets, increased cost of supplies, worker layoffs, salary adjustments, access to lines of credit and delays in transportation.

    Note: The questionnaire can be seen in the documentation materials tab.

  17. Data from: SOils DAta Harmonization database (SoDaH): an open-source...

    • search.dataone.org
    • portal.edirepository.org
    Updated Jul 15, 2020
    Cite
    William R Wieder; Derek Pierson; Stevan R Earl; Kate Lajtha; Sara Baer; Ford Ballantyne; Asmeret A Berhe; Sharon Billings; Laurel M Brigham; Stephany S Chacon; Jennifer Fraterrigo; Serita D Frey; Katerina Georgiou; Marie-Anne de Graaff; A S Grandy; Melannie D Hartman; Sarah E Hobbie; Chris Johnson; Jason Kaye; Emily Snowman; Marcy E Litvak; Michelle C Mack; Avni Malhotra; Jessica A M Moore; Knute Nadelhoffer; Craig Rasmussen; Whendee L Silver; Benjamin N Sulman; Xanthe Walker; Samantha Weintraub (2020). SOils DAta Harmonization database (SoDaH): an open-source synthesis of soil data from research networks [Dataset]. https://search.dataone.org/view/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fedi%2F521%2F1
    Dataset updated
    Jul 15, 2020
    Dataset provided by
    Environmental Data Initiative
    Authors
    William R Wieder; Derek Pierson; Stevan R Earl; Kate Lajtha; Sara Baer; Ford Ballantyne; Asmeret A Berhe; Sharon Billings; Laurel M Brigham; Stephany S Chacon; Jennifer Fraterrigo; Serita D Frey; Katerina Georgiou; Marie-Anne de Graaff; A S Grandy; Melannie D Hartman; Sarah E Hobbie; Chris Johnson; Jason Kaye; Emily Snowman; Marcy E Litvak; Michelle C Mack; Avni Malhotra; Jessica A M Moore; Knute Nadelhoffer; Craig Rasmussen; Whendee L Silver; Benjamin N Sulman; Xanthe Walker; Samantha Weintraub
    Variables measured
    K, Ca, L1, L2, L3, L4, L5, Mg, Na, bs, and 147 more
    Description

    This SOils DAta Harmonization (SoDaH) database is designed to bring together soil carbon data from diverse research networks into a harmonized dataset that can be used for synthesis activities and model development. The research network sources for SoDaH span different biomes and climates, encompass multiple ecosystem types, and have collected data across a range of spatial, temporal, and depth gradients. The rich data sets assembled in SoDaH consist of observations from monitoring efforts and long-term ecological experiments. The SoDaH database also incorporates related environmental covariate data pertaining to climate, vegetation, soil chemistry, and soil physical properties. The data are harmonized and aggregated using open-source code that enables a scripted, repeatable approach for soil data synthesis.

  18. Exoplanet Classification Dataset

    • kaggle.com
    zip
    Updated Oct 7, 2025
    Cite
    Agostina Silva (2025). Exoplanet Classification Dataset [Dataset]. https://www.kaggle.com/datasets/datatalesbyagos/exoplanet-classification-dataset
    Available download formats
    zip (1321859 bytes)
    Dataset updated
    Oct 7, 2025
    Authors
    Agostina Silva
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Exoplanet Classification Dataset

    Overview

    This dataset unifies three major NASA Exoplanet Archive catalogs — KOI, K2, and TOI — into a single machine-learning–ready dataset for exoplanet classification.
    It harmonizes feature names, fills missing data, and projects all missions into a common PCA feature space.

    The goal is to provide a consistent and comprehensive dataset for training models that can distinguish between the following classes:

    Labels
    CONFIRMED: 0
    CANDIDATE: 1
    FALSE POSITIVE: 2
    REFUTED: 3
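
    As a small illustration of the label scheme above, the sketch below turns a disposition string into its integer class. The function name and the whitespace/case normalization are assumptions; only the four label values come from the description.

```python
# Integer classes as listed in the dataset description.
LABELS = {"CONFIRMED": 0, "CANDIDATE": 1, "FALSE POSITIVE": 2, "REFUTED": 3}


def encode_disposition(value):
    """Map a disposition string to its class id, or None if it is not recognized."""
    key = " ".join(str(value).upper().split())  # normalize case and spacing
    return LABELS.get(key)
```

    Returning `None` for unknown strings keeps unexpected catalog values visible instead of silently assigning them to a class.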

    1. Source Data

    Mission | Description | Source
    KOI | Kepler Objects of Interest | https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+q1_q17_dr25_koi&format=csv
    K2 | Kepler/K2 extended mission | https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+k2pandc&format=csv
    TOI | TESS Objects of Interest | https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+toi&format=csv

    Each catalog contains different column names and structures, so a column mapping was created to standardize the schema across all missions.

    2. Column Harmonization

    Below is an example of the unified mapping used to align equivalent features between missions:

    column_map = {
      # Coordinates
      "ra": {"KOI": "ra", "K2": "ra", "TOI": "ra"},
      "dec": {"KOI": "dec", "K2": "dec", "TOI": "dec"},
    
      # Orbital parameters
      "orbital_period": {"KOI": "koi_period", "K2": "pl_orbper", "TOI": "pl_orbper"},
      "planet_radius": {"KOI": "koi_prad", "K2": "pl_rade", "TOI": "pl_rade"},
      "semi_major_axis": {"KOI": "koi_sma", "K2": "pl_orbsmax", "TOI": None},
      "transit_depth": {"KOI": "koi_depth", "K2": "pl_trandep", "TOI": "pl_trandep"},
    
      # Stellar parameters
      "stellar_teff": {"KOI": "koi_steff", "K2": "st_teff", "TOI": "st_teff"},
      "stellar_radius": {"KOI": "koi_srad", "K2": "st_rad", "TOI": "st_rad"},
      "stellar_logg": {"KOI": "koi_slogg", "K2": "st_logg", "TOI": "st_logg"},
      "stellar_met": {"KOI": "koi_smet", "K2": "st_met", "TOI": "st_met"},
    
      # Photometry
      "gmag": {"KOI": "koi_gmag", "K2": "sy_gmag", "TOI": "st_gmag"},
      "rmag": {"KOI": "koi_rmag", "K2": "sy_rmag", "TOI": "st_rmag"},
      "imag": {"KOI": "koi_imag", "K2": "sy_imag", "TOI": "st_imag"},
      "zmag": {"KOI": "koi_zmag", "K2": "sy_zmag", "TOI": "st_zmag"},
      "jmag": {"KOI": "koi_jmag", "K2": "sy_jmag", "TOI": "st_jmag"},
      "hmag": {"KOI": "koi_hmag", "K2": "sy_hmag", "TOI": "st_hmag"},
      "kmag": {"KOI": "koi_kmag", "K2": "sy_kmag", "TOI": "st_kmag"},
      "tmag": {"KOI": None, "K2": None, "TOI": "st_tmag"},
    }
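
    One way such a mapping could be applied is sketched below, using plain dictionaries for rows. The helper name `harmonize_row`, the abbreviated three-entry copy of the mapping, and the sample values are illustrative assumptions; features that a mission does not provide (mapped to `None`) become NaN, in line with the data preparation notes.

```python
import math

# Abbreviated copy of the mapping above (three features only).
column_map = {
    "orbital_period": {"KOI": "koi_period", "K2": "pl_orbper", "TOI": "pl_orbper"},
    "semi_major_axis": {"KOI": "koi_sma", "K2": "pl_orbsmax", "TOI": None},
    "tmag": {"KOI": None, "K2": None, "TOI": "st_tmag"},
}


def harmonize_row(row, mission, mapping):
    """Rename one catalog row to the unified schema, filling gaps with NaN."""
    unified = {"mission": mission}
    for name, sources in mapping.items():
        source_col = sources.get(mission)
        # A None mapping or a column absent from the row both yield NaN.
        unified[name] = row.get(source_col, math.nan) if source_col else math.nan
    return unified


# Hypothetical rows, for illustration only.
koi_row = harmonize_row({"koi_period": 9.49, "koi_sma": 0.085}, "KOI", column_map)
toi_row = harmonize_row({"pl_orbper": 2.17, "st_tmag": 9.6}, "TOI", column_map)
```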
    

    3. Data Preparation

    The KOI, K2, and TOI catalogs were downloaded from the NASA Exoplanet Archive (q1_q17_dr25_koi, k2pandc, toi tables).

    Columns were standardized using a unified mapping to ensure consistent names across catalogs (e.g. planet_radius, orbital_period, star_teff, disposition, etc.).

    Only relevant physical and photometric parameters were kept. Missing fields were filled with NaN to be handled later.

    Light Curve Derived Features

    Although raw light curves were not directly modeled, several summary features derived from light curve analysis were included:

    lc_model_snr: signal-to-noise ratio of the transit model

    lc_max_single, lc_max_multi: maximum signal from single and multiple events

    lc_time0: time of first detected transit

    These features capture key aspects of the transit detection without requiring the raw flux data.

    Principal Component Analysis

    To reduce redundancy and extract latent relationships, a PCA was performed on five transit-related variables:

    ['transit_depth', 'transit_duration', 'lc_model_snr', 'ror_ratio', 'transit_depth2']
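
    The PCA step could look roughly like the sketch below, which uses NumPy's SVD. It assumes the five columns are standardized first and contain no missing values (NaNs would need to be imputed or dropped beforehand); the description does not say how many components were kept or how scaling was done, so those choices and the random stand-in data are assumptions.

```python
import numpy as np


def pca_project(X, n_components):
    """Project rows of X onto the first principal components.

    Returns the component scores and the explained-variance ratio of each
    retained component.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # avoid division by zero for constant columns
    Z = (X - mu) / sigma             # standardize each feature
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Z @ Vt[:n_components].T, explained[:n_components]


# Random stand-in for the five transit-related columns (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = 0.9 * X[:, 0] + 0.1 * X[:, 4]   # make two columns strongly correlated
components, explained = pca_project(X, n_components=2)
```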

    Integration

    The PCA components were concatenated with the non-PCA columns from all three catalogs to form a single dataset.

  19. COVID-19 National Longitudinal Phone Survey 2020 – World Bank LSMS...

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 25, 2021
    Cite
    National Bureau of Statistics (NBS) (2021). COVID-19 National Longitudinal Phone Survey 2020 – World Bank LSMS Harmonized Dataset - Nigeria [Dataset]. https://microdata.worldbank.org/index.php/catalog/3856
    Dataset updated
    Oct 25, 2021
    Dataset authored and provided by
    National Bureau of Statistics (NBS)
    Time period covered
    2018 - 2021
    Area covered
    Nigeria
    Description

    Abstract

    To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.

    The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.

    Two harmonized datafiles are prepared for each survey:
    1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractor and crop sales.
    2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.

    Geographic coverage

    National coverage

    Analysis unit

    • Households
    • Individuals

    Universe

    The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Cleaning operations

    Nigeria General Household Survey, Panel (GHS-Panel) 2018-2019 and Nigeria COVID-19 National Longitudinal Phone Survey (COVID-19 NLPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).

    The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, a variable with “_r3” indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
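
    The naming rule can be illustrated with a small parser. It assumes the “_rX” marker appears as a suffix at the end of the variable name, which the description does not state explicitly; the function name and the example variable names are hypothetical.

```python
import re

# Matches a trailing round marker such as "_r3" or "_r12".
_ROUND_SUFFIX = re.compile(r"_r(\d+)$")


def split_round(name):
    """Return (base_name, round_number); names without a marker map to Round 0."""
    match = _ROUND_SUFFIX.search(name)
    if match is None:
        return name, 0
    return name[:match.start()], int(match.group(1))
```

    For example, `split_round("food_insecure_r3")` yields the base name together with round 3, while a name with no marker is assigned to Round 0, the face-to-face survey.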

    Response rate

    See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.

  20. Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-2A (SR)

    • developers.google.com
    Updated Jan 30, 2020
    Cite
    European Union/ESA/Copernicus (2020). Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-2A (SR) [Dataset]. https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED
    Dataset updated
    Jan 30, 2020
    Dataset provided by
    European Space Agency (http://www.esa.int/)
    Time period covered
    Mar 28, 2017 - Dec 2, 2025
    Description

    After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes. Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission supporting Copernicus Land Monitoring studies, including the …
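
    The effect of the harmonization can be sketched for a plain list of DN values. The sketch assumes the shift in newer scenes is an added offset of 1000 that is subtracted to match older scenes, and that the processing baseline string parses as a number; the function name is hypothetical and this is not the Earth Engine implementation.

```python
def harmonize_dn(dn_values, processing_baseline, offset=1000):
    """Shift DN values of newer-baseline scenes back into the older range.

    Scenes with baseline '04.00' or above are assumed to carry an added
    offset; older scenes are returned unchanged.
    """
    if float(processing_baseline) >= 4.0:
        return [value - offset for value in dn_values]
    return list(dn_values)
```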
