69 datasets found
  1. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Available download formats
    txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
    "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
    "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.
    "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
    "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter code: The starter code to help users conduct exposome analysis consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.
    "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
    "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
    "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
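    The merge step described for the curated modules can be pictured with a small sketch. This is not the authors' R tutorial: it is a minimal Python illustration that left-joins two tiny in-memory tables on a shared participant identifier. The identifier column name (SEQN), the other column names, and all values are assumptions made up for the example.

```python
import csv
import io

# Hypothetical miniature "demographics" and "chemicals" modules.
# Column names and values are illustrative assumptions, not real NHANES data.
DEMOGRAPHICS_CSV = """SEQN,age,sex
1,34,F
2,51,M
3,27,F
"""

CHEMICALS_CSV = """SEQN,blood_lead
1,1.2
3,0.8
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def left_join(left, right, key):
    """Left-join two lists of dicts on `key`; unmatched right-hand columns become None."""
    right_by_key = {row[key]: row for row in right}
    right_cols = [c for c in (right[0].keys() if right else []) if c != key]
    merged = []
    for row in left:
        match = right_by_key.get(row[key], {})
        combined = dict(row)
        for col in right_cols:
            combined[col] = match.get(col)
        merged.append(combined)
    return merged


demographics = read_rows(DEMOGRAPHICS_CSV)
chemicals = read_rows(CHEMICALS_CSV)
merged = left_join(demographics, chemicals, "SEQN")

for row in merged:
    print(row)
```

    A left join keeps every participant from the first module even when the second module has no measurement for them, which is usually what is wanted when only a subsample was measured.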

  2. France Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). France Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/france-weekly-real-estate-listings-2022-2023
    Available download formats
    zip (2750497 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    France
    Description

    These Kaggle datasets provide downloaded real estate listings from the French real estate market, capturing data from a leading platform in France (Seloger), reminiscent of the approach taken for the US dataset from Redfin and the UK dataset from Zoopla. They encompass detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.

    The cleaning process mirrored that of the US dataset, involving removing irrelevant features, normalizing variable names for dataset consistency with USA and UK, and adjusting variable value ranges to get rid of extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.

    For exact column descriptions, see columns for France_clean_unique.csv and my thesis.

    Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming some columns.

  3. Merger of BNV-D data (2008 to 2019) and enrichment

    • data.europa.eu
    zip
    Updated Jan 16, 2025
    Cite
    Patrick VINCOURT (2025). Merger of BNV-D data (2008 to 2019) and enrichment [Dataset]. https://data.europa.eu/data/datasets/5f1c3eca9d149439e50c740f?locale=en
    Available download formats
    zip (18530465)
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    Patrick VINCOURT
    Description

    Merging (in Table R) of the data published at https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, joined with two other sources of information associated with marketing authorisations (MAs):
    - uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/
    - information on the "Biocontrol" status of the product, from document DGAL/SDQSPV/2020-784 published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole

    All the initial files (.csv transformed into .txt), the R code used to merge the data, and the different output files are collected in a zip.

    NB:
    1) "YASCUB" stands for {year,AMM,Substance_active,Classification,Usage,Statut_"BioConttrol"}; substances not on the DGAL/SDQSPV list are coded NA.
    2) The file of biocontrol products must be cleaned of the duplicates generated by marketing authorisations that lead to several trade names.
    3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region,Region,Dept,Code_Dept,Anne,Usage,Classification,Type_BioC,Quantite_substance}.

  4. KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-aircraft-merge-data-files-9bba5
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements, including both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  5. USA Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). USA Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/usa-weekly-real-estate-listings
    Available download formats
    zip (66961155 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United States
    Description

    These Kaggle datasets offer a comprehensive analysis of the US real estate market, leveraging data sourced from Redfin via an unofficial API. They contain weekly snapshots stored in CSV files, reflecting the dynamic nature of property listings, prices, and market trends across various states and cities, except for Wyoming, Montana, and North Dakota, and with specific data generation for Texas cities. Notably, the dataset includes a prepared version, USA_clean_unique, which has undergone initial cleaning steps as outlined in the thesis. These datasets were part of my thesis; the other two countries were France and the UK.

    These steps include:
    - Removal of irrelevant features for statistical analysis.
    - Renaming variables for consistency across international datasets.
    - Adjustment of variable value ranges for a more refined analysis.

    Unique aspects such as Redfin’s “hot” label algorithm, property search status, and detailed categorizations of property types (e.g., single-family residences, condominiums/co-ops, multi-family homes, townhouses) provide deep insights into the market. Additionally, external factors like interest rates, stock market volatility, unemployment rates, and crime rates have been integrated to enrich the dataset and offer a multifaceted view of the real estate market's drivers.

    The USA_clean_unique dataset represents a key step before data normalization/trimming, containing variables both in their raw form and categorized based on predefined criteria, such as property size, year of construction, and number of bathrooms/bedrooms. This structured approach aims to capture the non-linear relationships between various features and property prices, enhancing the dataset's utility for predictive modeling and market analysis.

    See columns from USA_clean_unique.csv and my Thesis (Table 2.8) for exact column descriptions.

    Table 2.4 and Section 2.2.3, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to USA_clean_unique.csv. Multiple steps include cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming columns for consistency.
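    The "merge weekly snapshots and remove duplicates" step can be sketched as follows. This is a minimal Python illustration, not the author's Tableau Prep and R workflow. The field names (listing_id, price), the snapshot dates, and the rule of keeping the most recent row per listing are all assumptions; the source does not say which duplicate is retained.

```python
# Hypothetical weekly snapshots keyed by ISO date; field names are assumptions.
snapshots = {
    "2022-10-03": [
        {"listing_id": "a1", "price": 300000},
        {"listing_id": "b2", "price": 450000},
    ],
    "2022-10-10": [
        {"listing_id": "a1", "price": 295000},
        {"listing_id": "c3", "price": 210000},
    ],
}


def merge_unique(snapshots):
    """Merge snapshots, keeping the most recent row for each listing_id."""
    latest = {}
    for week in sorted(snapshots):  # ISO dates sort chronologically as strings
        for row in snapshots[week]:
            # Later weeks overwrite earlier ones for the same listing.
            latest[row["listing_id"]] = {**row, "snapshot": week}
    return [latest[key] for key in sorted(latest)]


clean_unique = merge_unique(snapshots)
for row in clean_unique:
    print(row)
```

    Keeping the first appearance instead of the last would only require iterating the weeks in reverse order.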

  6. SBI Cruise NBP03-04a merged bottle dataset

    • data.ucar.edu
    • arcticdata.io
    • +1more
    ascii
    Updated Aug 1, 2025
    + more versions
    Cite
    Dennis Hansell; Nick R. Bates; Service Group, Scripps Institution of Oceanography, University of California - San Diego; Steven Roberts (2025). SBI Cruise NBP03-04a merged bottle dataset [Dataset]. https://data.ucar.edu/dataset/sbi-cruise-nbp03-04a-merged-bottle-dataset
    Available download formats
    ascii
    Dataset updated
    Aug 1, 2025
    Authors
    Dennis Hansell; Nick R. Bates; Service Group, Scripps Institution of Oceanography, University of California - San Diego; Steven Roberts
    Time period covered
    Jul 5, 2003 - Aug 20, 2003
    Area covered
    Description

    This data set contains merged bottle data from the SBI cruise on the United States Coast Guard Cutter (USCGC) Nathaniel B. Palmer (NBP03-04a). During this cruise rosette casts were conducted and a bottle data file was generated by the Scripps Service group from these water samples. Additional groups were funded to measure supplementary parameters from these same water samples. This data set is the first version of the merging of the Scripps Service group bottle data file with these data gathered by these additional groups.

  7. UK Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). UK Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/uk-weekly-real-estate-listings-2022-2023
    Available download formats
    zip (29112488 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    These Kaggle datasets provide downloaded real estate listings from the UK real estate market, capturing data from a leading platform in the UK (Zoopla), reminiscent of the approach taken for the US dataset from Redfin and the French dataset from Seloger. They encompass detailed property listings, pricing, and market trends across the UK, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named UK_clean_unique.csv.

    The cleaning process mirrored that of the US and French datasets, involving removing irrelevant features, normalizing variable names for dataset consistency with the USA and France, and adjusting variable value ranges to get rid of extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on the UK's real estate market drivers.

    For exact column descriptions, see columns for UK_clean_unique.csv and my thesis.

    Table 2.6 and Section 2.2.2, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to UK_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming some columns.

  8. Multilevel modeling of time-series cross-sectional data reveals the dynamic...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Mar 6, 2020
    Cite
    Kodai Kusano (2020). Multilevel modeling of time-series cross-sectional data reveals the dynamic interaction between ecological threats and democratic development [Dataset]. http://doi.org/10.5061/dryad.547d7wm3x
    Available download formats
    zip
    Dataset updated
    Mar 6, 2020
    Dataset provided by
    University of Nevada, Reno
    Authors
    Kodai Kusano
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    What is the relationship between environment and democracy? The framework of cultural evolution suggests that societal development is an adaptation to ecological threats. Pertinent theories assume that democracy emerges as societies adapt to ecological factors such as higher economic wealth, lower pathogen threats, less demanding climates, and fewer natural disasters. However, previous research confused within-country processes with between-country processes and erroneously interpreted between-country findings as if they generalize to within-country mechanisms. In this article, we analyze a time-series cross-sectional dataset to study the dynamic relationship between environment and democracy (1949-2016), accounting for previous misconceptions in levels of analysis. By separating within-country processes from between-country processes, we find that the relationship between environment and democracy not only differs by countries but also depends on the level of analysis. Economic wealth predicts increasing levels of democracy in between-country comparisons, but within-country comparisons show that democracy declines as countries become wealthier over time. This relationship is only prevalent among historically wealthy countries but not among historically poor countries, whose wealth also increased over time. By contrast, pathogen prevalence predicts lower levels of democracy in both between-country and within-country comparisons. Our longitudinal analyses identifying temporal precedence reveal that not only reductions in pathogen prevalence drive future democracy, but also democracy reduces future pathogen prevalence and increases future wealth. These nuanced results contrast with previous analyses using narrow, cross-sectional data. As a whole, our findings illuminate the dynamic process by which environment and democracy shape each other.

    Methods: Our time-series cross-sectional data combine various online databases. Country names were first identified and matched using the R package "countrycode" (Arel-Bundock, Enevoldsen, & Yetman, 2018) before all datasets were merged. Occasionally, we modified unidentified country names to be consistent across datasets. We then transformed "wide" data into "long" data and merged them using R's Tidyverse framework (Wickham, 2014). Our analysis begins with the year 1949 because one of the key time-variant level-1 variables, pathogen prevalence, was only available from 1949 on. See our Supplemental Material for all data, Stata syntax, R Markdown for visualization, supplemental analyses, and detailed results (available at https://osf.io/drt8j/).
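    The wide-to-long reshaping and merge described in the Methods can be illustrated with a short sketch. The authors worked in R with the Tidyverse; the version below is a plain Python stand-in, and the variable names (gdp, pathogen), country codes, and values are invented for the example.

```python
# Hypothetical wide-format tables: one row per country, one column per year.
gdp_wide = [
    {"country": "A", "1949": 1.0, "1950": 1.1},
    {"country": "B", "1949": 2.0, "1950": 2.2},
]
pathogen_wide = [
    {"country": "A", "1949": 0.5, "1950": 0.4},
    {"country": "B", "1949": 0.9, "1950": 0.8},
]


def wide_to_long(rows, id_col, value_name):
    """Turn one row per unit into one entry per (unit, year) key."""
    long_rows = {}
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            long_rows[(row[id_col], int(col))] = {value_name: value}
    return long_rows


def merge_long(*tables):
    """Merge long tables on (country, year), keeping keys present in every table."""
    shared_keys = set(tables[0])
    for table in tables[1:]:
        shared_keys &= set(table)
    merged = []
    for key in sorted(shared_keys):
        record = {"country": key[0], "year": key[1]}
        for table in tables:
            record.update(table[key])
        merged.append(record)
    return merged


panel = merge_long(
    wide_to_long(gdp_wide, "country", "gdp"),
    wide_to_long(pathogen_wide, "country", "pathogen"),
)
for record in panel:
    print(record)
```

    The result has one record per country-year, which is the layout a multilevel model with years nested in countries expects.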

  9. Scripts for Analysis

    • figshare.com
    txt
    Updated Jul 18, 2018
    Cite
    Sneddon Lab UCSF (2018). Scripts for Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.6783569.v2
    Available download formats
    txt
    Dataset updated
    Jul 18, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sneddon Lab UCSF
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Scripts used for analysis of V1 and V2 Datasets.

    seurat_v1.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, PCA analysis, clustering, tSNE visualization. Used for v1 datasets.
    merge_seurat.R - merge two or more seurat objects into one seurat object. Perform linear regression to remove batch effects from separate objects. Used for v1 datasets.
    subcluster_seurat_v1.R - subcluster clusters of interest from Seurat object. Determine variable genes, perform regression and PCA. Used for v1 datasets.
    seurat_v2.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, and PCA analysis. Used for v2 datasets.
    clustering_markers_v2.R - clustering and tSNE visualization for v2 datasets.
    subcluster_seurat_v2.R - subcluster clusters of interest from Seurat object. Determine variable genes, perform regression and PCA analysis. Used for v2 datasets.
    seurat_object_analysis_v1_and_v2.R - downstream analysis and plotting functions for seurat object created by seurat_v1.R or seurat_v2.R.
    merge_clusters.R - merge clusters that do not meet gene threshold. Used for both v1 and v2 datasets.
    prepare_for_monocle_v1.R - subcluster cells of interest and perform linear regression, but not scaling, in order to input normalized, regressed values into monocle with monocle_seurat_input_v1.R.
    monocle_seurat_input_v1.R - monocle script using seurat batch corrected values as input for v1 merged timecourse datasets.
    monocle_lineage_trace.R - monocle script using nUMI as input for v2 lineage traced dataset.
    monocle_object_analysis.R - downstream analysis for monocle object - BEAM and plotting.
    CCA_merging_v2.R - script for merging v2 endocrine datasets with canonical correlation analysis and determining the number of CCs to include in downstream analysis.
    CCA_alignment_v2.R - script for downstream alignment, clustering, tSNE visualization, and differential gene expression analysis.

  10. Mixmix-LLaMAX

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Marcus Cedric R. Idia (2025). Mixmix-LLaMAX [Dataset]. https://huggingface.co/datasets/marcuscedricridia/Mixmix-LLaMAX
    Dataset updated
    Apr 3, 2025
    Authors
    Marcus Cedric R. Idia
    Description

    Merged UI Dataset: Mixmix-LLaMAX

    This dataset was automatically generated by merging and processing the following sources: marcuscedricridia/s1K-claude-3-7-sonnet, marcuscedricridia/Creative_Writing-ShareGPT-deepclean-sharegpt, marcuscedricridia/Medical-R1-Distill-Data-deepclean-sharegpt, marcuscedricridia/Open-Critic-GPT-deepclean-sharegpt, marcuscedricridia/kalo-opus-instruct-22k-no-refusal-deepclean-sharegpt, marcuscedricridia/unAIthical-ShareGPT-deepclean-sharegpt… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/Mixmix-LLaMAX.

  11. Data from: RAW data from Towards Holistic Environmental Policy Assessment:...

    • research.science.eus
    • data.europa.eu
    Updated 2024
    Cite
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber (2024). RAW data from Towards Holistic Environmental Policy Assessment: Multi-Criteria Frameworks and recommendations for modelers paper [Dataset]. https://research.science.eus/documentos/685699066364e456d3a65172
    Dataset updated
    2024
    Authors
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber
    Description

    Name: Data used to rate the relevance of each dimension necessary for a Holistic Environmental Policy Assessment.

    Summary: This dataset contains answers from a panel of experts and the public rating the relevance of each dimension on a scale of 0 (Not relevant at all) to 100 (Extremely relevant).

    License: CC-BY-SA

    Acknowledge: These data have been collected in the framework of the DECIPHER project. This project has received funding from the European Union’s Horizon Europe programme under grant agreement No. 101056898.

    Disclaimer: Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

    Collection Date: 2024-1 / 2024-04

    Publication Date: 22/04/2025

    DOI: 10.5281/zenodo.13909413

    Other repositories: -

    Author: University of Deusto

    Objective of collection: This data was originally collected to prioritise the dimensions to be further used for Environmental Policy Assessment and IAMs enlarged scope.

    Description:

    Data Files (CSV)

    decipher-public.csv : Public participants' general survey results in the framework of the Decipher project, including socio demographic characteristics and overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    decipher-risk.csv : Contains individual survey responses regarding prioritisation of dimensions in risk situations. Includes demographic and opinion data from a targeted sample.

    decipher-experts.csv : Experts’ opinions collected on risk topics through surveys in the framework of Decipher Project, targeting professionals in relevant fields.

    decipher-modelers.csv: Answers given by the developers of models about the characteristics of the models and dimensions covered by them.

    prolific_export_risk.csv : Exported survey data from Prolific, focusing specifically on ratings in risk situations. Includes response times, demographic details, and survey metadata.

    prolific_export_public_{1,2}.csv : Public survey exports from Prolific, gathering prioritisation of dimensions necessary for environmental policy assessment.

    curated.csv : Final cleaned and harmonized dataset combining multiple survey sources. Designed for direct statistical analysis with standardized variable names.

    Scripts files (R)

    decipher-modelers.R: Script to assess the answers given by modelers about the characteristics of the models.

    joint.R: Script to clean and join the RAW answers from the different surveys to retrieve the overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    Report Files

    decipher-modelers.pdf: Diagram with the result of the

    full-Country.html : Full interactive report showing dimension prioritisation broken down by participant country.

    full-Gender.html : Visualization report displaying differences in dimension prioritisation by gender.

    full-Education.html : Detailed breakdown of dimension prioritisation results based on education level.

    full-Work.html : Report focusing on participant occupational categories and associated dimension prioritisation.

    full-Income.html : Analysis report showing how income level correlates with dimension prioritisation.

    full-PS.html : Report analyzing Political Sensitivity scores across all participants.

    full-type.html : Visualization report comparing participant dimensions prioritisation (public vs experts) in normal and risk situations.

    full-joint-Country.html : Joint analysis report integrating multiple dimensions of country-based dimension prioritisation in normal and risk situations. Combines demographic and response patterns.

    full-joint-Gender.html : Combined gender-based analysis across datasets, exploring intersections of demographic factors and dimensions prioritisation in normal and risk situations.

    full-joint-Education.html : Education-focused report merging various datasets to show consistent or divergent patterns of dimensions prioritisation in normal and risk awareness.

    full-joint-Work.html : Cross-dataset analysis of occupational groups and their dimensions prioritisation in normal and risk situations.

    full-joint-Income.html : Income-stratified joint analysis, merging public and expert datasets to find common trends and significant differences during dimensions prioritisation in normal and risk situations.

    full-joint-PS.html : Comprehensive Political Sensitivity score report from merged datasets, highlighting general patterns and subgroup variations in normal and risk situations.

    5 star: ⭐⭐⭐

    Preprocessing steps: The data has been re-coded and cleaned using the scripts provided.

    Reuse: NA

    Update policy: No more updates are planned.

    Ethics and legal aspects: Names of the persons involved have been removed.

    Technical aspects:

    Other:

  12. Web page phishing detection dataset

    • resodate.org
    • service.tib.eu
    Updated Jan 2, 2025
    Cite
    Jesher Joshua M; Sree Dananjay S; Adhithya R; M Revathi (2025). Web page phishing detection dataset [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvd2ViLXBhZ2UtcGhpc2hpbmctZGV0ZWN0aW9uLWRhdGFzZXQ=
    Dataset updated
    Jan 2, 2025
    Dataset provided by
    Leibniz Data Manager
    Authors
    Jesher Joshua M; Sree Dananjay S; Adhithya R; M Revathi
    Description

    The dataset used in this study is the result of merging two publicly available datasets: the 'Web page phishing detection' dataset and the 'Phishing Websites Dataset'. A subset of the most relevant features was selectively included in the merged dataset to avoid redundancy and focus on the shared characteristics between the two original datasets.
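    One simple way to picture "merging two datasets on their shared characteristics" is to keep only the columns both sources have and stack the rows. The Python sketch below does exactly that. It is an illustration under stated assumptions: the feature names and values are invented, and a plain column intersection stands in for the authors' selection of "most relevant" features, which the description does not specify.

```python
# Hypothetical records from two sources; feature names and values are illustrative.
dataset_a = [
    {"url_length": 54, "nb_dots": 3, "page_rank": 4, "label": 1},
    {"url_length": 21, "nb_dots": 1, "page_rank": 7, "label": 0},
]
dataset_b = [
    {"url_length": 80, "nb_dots": 5, "has_ip": 1, "label": 1},
]


def shared_columns(*datasets):
    """Return the columns present in every row of every dataset, sorted by name."""
    common = None
    for rows in datasets:
        for row in rows:
            common = set(row) if common is None else common & set(row)
    return sorted(common or [])


def merge_on_shared(*datasets):
    """Stack all rows from all datasets, keeping only the shared columns."""
    cols = shared_columns(*datasets)
    return [{c: row[c] for c in cols} for rows in datasets for row in rows]


columns = shared_columns(dataset_a, dataset_b)
merged = merge_on_shared(dataset_a, dataset_b)
print(columns)
print(len(merged))
```

    Columns that exist in only one source (page_rank and has_ip here) are dropped, which avoids filling the merged table with missing values.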

  13. Mixmix-LLaMAX-Mini

    • huggingface.co
    Updated Apr 4, 2025
    Cite
    Marcus Cedric R. Idia (2025). Mixmix-LLaMAX-Mini [Dataset]. https://huggingface.co/datasets/marcuscedricridia/Mixmix-LLaMAX-Mini
    Dataset updated
    Apr 4, 2025
    Authors
    Marcus Cedric R. Idia
    Description

    Merged UI Dataset: Mixmix-LLaMAX-Mini

    This dataset was automatically generated by merging and processing the following sources: marcuscedricridia/Creative_Writing-ShareGPT-deepclean-sharegpt-deepclean-sharegpt, marcuscedricridia/ldjnr-combined-deepclean-sharegpt-deepclean-sharegpt, marcuscedricridia/Open-Critic-GPT-deepclean-sharegpt, marcuscedricridia/self-instruct-starcoder-deepclean-sharegpt, marcuscedricridia/s1K-claude-3-7-sonnet… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/Mixmix-LLaMAX-Mini.

  14. Aerial Vehicle OBB Dataset

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    Mridankan Mandal (2025). Aerial Vehicle OBB Dataset [Dataset]. https://www.kaggle.com/datasets/redzapdos123/aerial-vehicle-obb-dataset
    Available download formats: zip (11517085012 bytes)
    Dataset updated
    Aug 29, 2025
    Authors
    Mridankan Mandal
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Aerial Vehicles OBB Dataset (YOLOv11-OBB Format):

    A large scale, merged dataset for oriented vehicle detection in aerial imagery, preformatted for YOLOv11-OBB models.

    Overview:

    This dataset combines three distinct aerial imagery collections (VSAI, DroneVehicles, and DIOR-R) into a unified resource for training and benchmarking oriented object detection models. It has been specifically preprocessed and formatted for use with Ultralytics' YOLOv11-OBB models.

    The primary goal is to provide a detailed dataset for tasks like aerial surveillance, traffic monitoring, and vehicle detection from a drone's perspective. All annotations have been converted to the YOLO OBB format, and the classes have been simplified for focused vehicle detection tasks.

    Key Features:

    • Merged & Simplified: Combines three popular aerial vehicle datasets.
    • Two Class System: Simplifies detection by categorizing all objects into small-vehicle and large-vehicle.
    • YOLOv11-OBB Ready: Preformatted with normalized OBB annotations and a data.yaml configuration file for immediate use in YOLO training pipelines.
    • Cleaned & Split: Empty annotations have been removed, and the data is organized into standard train, validation, and test sets.

    Data Description:

    Source Datasets:

    1. VSAI Dataset: Contains aerial imagery for traffic analysis by DroneVision.
    2. DroneVehicles Dataset: A collection of vehicle images from a drone's perspective, originally provided in YOLO OBB format.
    3. DIOR-R Dataset: A large scale benchmark for object detection in optical remote sensing images. Only the 'vehicle' class was extracted for this merged dataset.

    Preprocessing and Modifications:

    • Class Merging: All vehicle types from the source datasets were mapped to two parent classes: small-vehicle and large-vehicle. The vehicle class from the DIOR-R dataset was mapped to large-vehicle.
    • Data Cleaning: Image and label pairs with empty annotation files were removed to ensure dataset integrity.
    • Formatting: All annotations were converted to the YOLOv11-OBB format, with coordinates normalized between 0 and 1.
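
    The class-merging and cleaning steps above could look roughly like the sketch below. Only the mapping of the DIOR-R 'vehicle' class to large-vehicle comes from the description; the other source class names (car, van, truck, bus) are hypothetical placeholders, since the actual per-dataset mapping is not listed here.

```python
# Target class ids as given in the dataset description.
SMALL_VEHICLE, LARGE_VEHICLE = 0, 1

# Assumed mapping: only "vehicle" -> large-vehicle is stated by the source.
CLASS_MAP = {
    "car": SMALL_VEHICLE,      # hypothetical source class
    "van": SMALL_VEHICLE,      # hypothetical source class
    "truck": LARGE_VEHICLE,    # hypothetical source class
    "bus": LARGE_VEHICLE,      # hypothetical source class
    "vehicle": LARGE_VEHICLE,  # DIOR-R, per the description
}


def remap_and_clean(samples):
    """Map source class names to the two parent classes and drop empty images.

    samples: dict of image name -> list of (source_class_name, coords).
    """
    cleaned = {}
    for image, objects in samples.items():
        remapped = [(CLASS_MAP[name], coords)
                    for name, coords in objects if name in CLASS_MAP]
        if remapped:  # images left without annotations are removed
            cleaned[image] = remapped
    return cleaned


box = (0.1, 0.2, 0.3, 0.2, 0.3, 0.4, 0.1, 0.4)
samples = {"a.jpg": [("car", box)], "b.jpg": [], "c.jpg": [("vehicle", box)]}
cleaned = remap_and_clean(samples)
```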

    Classes:

    Class ID   Class Name      Source Dataset(s)
    0          small-vehicle   VSAI, DroneVehicles
    1          large-vehicle   VSAI, DroneVehicles, DIOR-R

    Dataset Statistics:

    • Total Labeled Images: 29,125
      • Training Set: 18,274 images
      • Validation Set: 5,420 images
      • Test Set: 5,431 images

    Annotation Format:

    Each image has a corresponding .txt label file. Each line in the file represents one object in the YOLOv11-OBB format: class_id x1 y1 x2 y2 x3 y3 x4 y4

    • class_id: The class index (0 for small-vehicle, 1 for large-vehicle).
    • (x1, y1)...(x4, y4): The four corner points of the oriented bounding box, with all coordinates normalized to a range of [0, 1].
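
    As an illustration of the label format above, the following sketch parses one line into a class id and four corner points and checks that the coordinates are normalized. The function name parse_obb_line is made up for this example and is not part of any YOLO tooling.

```python
def parse_obb_line(line):
    """Parse 'class_id x1 y1 x2 y2 x3 y3 x4 y4' into (class_id, corners)."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError("expected 9 fields, got %d" % len(parts))
    class_id = int(parts[0])
    if class_id not in (0, 1):  # 0 = small-vehicle, 1 = large-vehicle
        raise ValueError("unknown class id %d" % class_id)
    coords = [float(value) for value in parts[1:]]
    if any(not 0.0 <= value <= 1.0 for value in coords):
        raise ValueError("coordinates must be normalized to [0, 1]")
    # Pair up x and y values into four (x, y) corner points.
    corners = list(zip(coords[0::2], coords[1::2]))
    return class_id, corners


class_id, corners = parse_obb_line("1 0.1 0.2 0.3 0.2 0.3 0.4 0.1 0.4")
```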

    File and Folder Structure:

    The dataset is organized into a standard YOLO directory structure for easy integration with training programs.

    RoadVehiclesYOLOOBBDataset/
    ├── train/
    │  ├── images/ #18,274 images
    │  └── labels/ #18,274 labels
    ├── val/
    │  ├── images/ #5,420 images
    │  └── labels/ #5,420 labels
    ├── test/
    │  ├── images/ #5,431 images
    │  └── labels/ #5,431 labels
    ├── data.yaml  #YOLO dataset configuration file.
    └── ReadMe.md  #Documentation
    

    Usage:

    To use this dataset with YOLOv11 or other compatible frameworks, simply point your training script to the included data.yaml file.

    Example data.yaml:

    #Dataset configuration.
    path: RoadVehiclesYOLOOBBDataset/
    train: train/images
    val: val/images
    test: test/images
    
    #Number of classes.
    nc: 2
    
    #Class names.
    names:
     0: small-vehicle
     1: large-vehicle
    

    License:

    This merged dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0), which is the most restrictive license among its sources.

    • You are free to:
      • Share and adapt the material for any non-commercial purpose.
    • Under the following terms:
      • Attribution: You must give appropriate credit to the original authors and the creator of this merged dataset.
      • NonCommercial: You may not use the material for commercial purposes.
      • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license.

    Citation and Attribution:

    When using this dataset, please provide attribution to all original sources as follows:

    - VSAI_Dataset: by DroneVision, licensed under CC BY-NC-SA 4.0.
    - DroneVehicles Dataset: by Yiming Sun, Bing Cao, Pengfei Zhu, and Qin G. Hu and modified by Mridankan Mandal, licensed under CC BY-NC-SA 4.0.
    - DIOR-R dataset: by the DIOR...
    
  15. MM-MergeBench

    • huggingface.co
    Updated Sep 25, 2025
    Cite
    Fanhu Zeng (2025). MM-MergeBench [Dataset]. https://huggingface.co/datasets/AuroraZengfh/MM-MergeBench
    Dataset updated
    Sep 25, 2025
    Authors
    Fanhu Zeng
    Description

    Card of Dataset for Multimodal Large Language Model

      Dataset details and sources
    

    This dataset is constructed using publicly available instruction tuning datasets, including COCO, ScienceQA, VizWiz, ImageNet, VQAv2, ImageNet-R, Flickr30k, OCRVQA, Screen2words and TabMWP.

      Seen datasets for merging
    

    Dataset Image Source Download Path

    ScienceQA ScienceQA images

    VizWiz VizWiz images

    ImageNet ImageNet images

    VQAv2, Flickr30k COCO2014 images

    IconQA… See the full description on the dataset page: https://huggingface.co/datasets/AuroraZengfh/MM-MergeBench.

  16. unAIthical-ShareGPT

    • huggingface.co
    Updated Apr 3, 2025
    + more versions
    Cite
    Marcus Cedric R. Idia (2025). unAIthical-ShareGPT [Dataset]. https://huggingface.co/datasets/marcuscedricridia/unAIthical-ShareGPT
    Dataset updated
    Apr 3, 2025
    Authors
    Marcus Cedric R. Idia
    Description

    Merged UI Dataset: unAIthical-ShareGPT

    This dataset was automatically generated by merging and processing the following sources: marcuscedricridia/nowarning-filtered-polarity, marcuscedricridia/toxic-full-filtered-polarity, marcuscedricridia/toxi-dpo-flitered-polarity, marcuscedricridia/harmful-filtered-polarity, marcuscedricridia/amoralqa-filtered-polarity

    Generation Timestamp: 2025-04-03 15:01:39
    Processing Time: 14.73 seconds
    Output Format: sharegpt

      Processing Summary… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/unAIthical-ShareGPT.
    
  17. free-to-use-signs

    • huggingface.co
    Cite
    Bagheera, free-to-use-signs [Dataset]. https://huggingface.co/datasets/bghira/free-to-use-signs
    Authors
    Bagheera
    License

    https://choosealicense.com/licenses/unlicense/

    Description

    Free-to-Use Signs

    This dataset is a unique curation of typography data released under a free-to-use license. Specifically, this dataset contains images of signs.

      Dataset Details
    

    This dataset contains 952 images which have been captioned by BLIP3 (MM-XGEN).

      Dataset Sources
    

    Repository: Reddit (/r/signs)

      Uses

      Direct Use

    • Training a LoRA for typography
    • Merging this dataset into a larger set

      Out-of-Scope Use
    

    Hate speech or other… See the full description on the dataset page: https://huggingface.co/datasets/bghira/free-to-use-signs.

  18. BRAINTEASER ALS and MS Datasets

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    + more versions
    Cite
    Zenodo (2025). BRAINTEASER ALS and MS Datasets [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-14857741?locale=lv
    Available download formats: unknown
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    Description

    BRAINTEASER (Bringing Artificial Intelligence home for a better care of amyotrophic lateral sclerosis and multiple sclerosis) is a data science project that seeks to exploit the value of big data, including those related to health, lifestyle habits, and environment, to support patients with Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) and their clinicians. Taking advantage of cost-efficient sensors and apps, BRAINTEASER will integrate large clinical datasets that host both patient-generated and environmental data.

    As part of its activities, BRAINTEASER organized three open evaluation challenges on Intelligent Disease Progression Prediction (iDPP): iDPP@CLEF 2022, iDPP@CLEF 2023, and iDPP@CLEF 2024, co-located with the Conference and Labs of the Evaluation Forum (CLEF). The goal of iDPP@CLEF is to design and develop an evaluation infrastructure for AI algorithms able to: better describe disease mechanisms; stratify patients according to their phenotype assessed all over the disease evolution; predict disease progression in a probabilistic, time-dependent fashion. The iDPP@CLEF challenges relied on retrospective and prospective ALS and MS patient data made available by the clinical partners of the BRAINTEASER consortium.

    Retrospective Dataset

    We release three retrospective datasets, one for ALS and two for MS. Of the two retrospective MS datasets, one consists of clinical data only and the other of clinical data plus environmental/pollution data. The retrospective datasets contain data about 2,204 ALS patients (static variables, ALSFRS-R questionnaires, spirometry tests, environmental/pollution data) and 1,792 MS patients (static variables, EDSS scores, evoked potentials, relapses, MRIs). A subset of 280 MS patients contains environmental and pollution data.

    In more detail, the BRAINTEASER project retrospective datasets were derived by merging existing datasets obtained by the clinical centers involved in the BRAINTEASER Project. The ALS dataset was obtained by merging and homogenising the Piemonte and Valle d’Aosta Registry for Amyotrophic Lateral Sclerosis (PARALS, Chiò et al., 2017) and the Lisbon ALS clinic dataset (CENTRO ACADÉMICO DE MEDICINA DE LISBOA, Centro Hospitalar Universitário de Lisboa-Norte, Hospital de Santa Maria, Lisbon, Portugal). Both datasets were initiated in 1995 and are currently maintained by researchers of the ALS Regional Expert Centre (CRESLA), University of Turin, and of the CENTRO ACADÉMICO DE MEDICINA DE LISBOA-Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa. They include demographic and clinical data, comprising both static and dynamic variables. The MS dataset was obtained from the Pavia MS clinical dataset, which was started in 1990 and contains demographic and clinical information that is continuously updated by the researchers of the Institute, and from the Turin MS clinic dataset (Department of Neurosciences and Mental Health, Neurology Unit 1, Città della Salute e della Scienza di Torino).

    Retrospective environmental data are accessible at various scales at the individual subject level. Thus, environmental data have been retrieved at different scales. To gather macroscale air pollution data, we leveraged data from public monitoring stations that cover the whole extent of the involved countries, namely the European Air Quality Portal. Data from a network of air quality sensors (PurpleAir - Outdoor Air Quality Monitor / PurpleAir PA-II) installed at different points of the city of Pavia (Italy) were extracted as well. In both cases, the environmental data were already publicly available. To merge environmental data with individual subject locations, we used postcodes (the postcode of the pollutant monitoring station and the postcode of the subject's address). Data were merged following an anonymization procedure based on hash keys. Environmental exposure trajectories have been pre-processed and aggregated in order to avoid fine temporal and spatial granularities, so that individual exposure information cannot disclose personal addresses.

    The retrospective datasets are shared in two formats: RDF (serialized in Turtle) modeled according to the BRAINTEASER Ontology (BTO); CSV, as shared during the iDPP@CLEF 2022 and 2023 challenges, split into training and test. Each format corresponds to a specific folder in the datasets, where a dedicated README file provides further details on the datasets. Note that the ALS dataset is split into multiple ZIP files due to the size of the environmental data.

    Prospective Dataset

    For the iDPP@CLEF 2024 challenge, the datasets contain prospective data about 86 ALS patients (static variables, ALSFRS-R questionnaires compiled by clinicians or patients using the BRAINTEASER mobile application, sensors data). The prospective datasets are shared in two formats: RDF (serialized in Turtle) modeled according to the BRAINTEASER Ontology (BTO); CSV, as shared durin
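
    The postcode-based merge with hash keys mentioned in this description could be sketched as follows. The project's actual anonymization procedure is not described here, so everything in this example is an assumption: the salted SHA-256 key, the field layout, the postcodes, and the averaging of station readings are all illustrative.

```python
import hashlib


def hash_key(postcode, salt):
    """Derive an opaque join key from a postcode (assumed: salted SHA-256)."""
    normalized = postcode.replace(" ", "").upper()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def link_exposure(subjects, stations, salt):
    """Attach the mean station reading to each subject via hashed postcodes.

    subjects: list of (subject_id, postcode)
    stations: list of (postcode, pollutant_value)
    """
    readings_by_key = {}
    for postcode, value in stations:
        readings_by_key.setdefault(hash_key(postcode, salt), []).append(value)
    linked = {}
    for subject_id, postcode in subjects:
        values = readings_by_key.get(hash_key(postcode, salt))
        # Subjects without a matching station get no exposure value.
        linked[subject_id] = sum(values) / len(values) if values else None
    return linked


# Hypothetical identifiers, postcodes, and readings.
subjects = [("s1", "27100"), ("s2", "10126")]
stations = [("27100", 12.0), ("27100", 18.0)]
linked = link_exposure(subjects, stations, salt="example-salt")
```

    Once both tables carry only the hashed key, the join can be done without either side exposing the raw postcode.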

  19. Cyclistic

    • kaggle.com
    zip
    Updated May 12, 2022
    + more versions
    Cite
    Salam Ibrahim (2022). Cyclistic [Dataset]. https://www.kaggle.com/datasets/salamibrahim/cyclistic
    Available download formats: zip (209748131 bytes)
    Dataset updated
    May 12, 2022
    Authors
    Salam Ibrahim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    This case study is based on Cyclistic, a bike sharing company in Chicago. I will perform the tasks of a junior data analyst to answer business questions. I will do this by following a process that includes the following phases: ask, prepare, process, analyze, share and act.

    Background

    Cyclistic is a bike sharing company that operates 5828 bikes within 692 docking stations. The company has been around since 2016 and separates itself from the competition by offering a variety of bike services, including assistive options. Lily Moreno is the director of the marketing team and will be the person to receive the insights from this analysis.

    Case study and business task

    Lily Moreno's perspective on how to generate more income by marketing Cyclistic's services correctly includes converting casual riders (one-day passes and/or pay-per-ride customers) into annual riders with a membership. Annual riders are more profitable than casual riders according to the finance analysts. She would rather see a campaign converting casual riders into annual riders than campaigns targeting new customers. So her strategy as the manager of the marketing team is simply to maximize the number of annual riders by converting casual riders.

    In order to make a data-driven decision, Moreno needs the following insights:
    - A better understanding of how casual riders and annual riders differ
    - Why a casual rider would become an annual one
    - How digital media can affect the marketing tactics

    Moreno has directed me to the first question: how do casual riders and annual riders differ?

    Stakeholders

    - Lily Moreno, manager of the marketing team
    - Cyclistic marketing team
    - Executive team

    Data sources and organization

    The data used in this report is made available and licensed by Motivate International Inc. Personal data is hidden to protect personal information. The data covers the past 12 months (01/04/2021 – 31/03/2022) of the bike share dataset.

    Merging all 12 monthly bike share files returned an extensive dataset of 5,400,000 rows, all of which were included in this analysis.

    Data security and limitations: Personal information is secured and hidden to prevent unlawful use. Original files are backed up in folders and subfolders.

    Tools and documentation of cleaning process

    The tools used for data verification and data cleaning are Microsoft Excel and R programming. The original files made accessible by Motivate International Inc. are backed up in their original format and in separate files.

    Microsoft Excel is used to look through the dataset and get an overview of the content. I performed simple checks of the data by filtering, sorting, formatting and standardizing the data to make it easily mergeable. In Excel, I also changed data types to the right format, removed unnecessary data where it was incomplete or incorrect, created new columns to subtract and reformat existing columns, and deleted empty cells. These tasks are easily done in spreadsheets and provide an initial cleaning of the data.

    R will be used to perform queries on bigger datasets such as this one. R will also be used to create visualizations to answer the question at hand.

    Limitations

    Microsoft Excel has a limit of 1,048,576 rows, while the 12 months of data combined are over 5,500,000 rows. When combining the 12 months of data into one table/sheet, Excel is no longer efficient, so I switched over to R programming.
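
    The step of combining the monthly files, and the reason a spreadsheet cannot hold the result, can be illustrated with a short sketch. The case study itself used R; this Python version is only an illustration, and the column names (ride_id, member_casual) and sample rows are assumptions rather than the actual file contents.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # per-sheet row limit cited in the case study


def combine_csv(sources):
    """Concatenate CSV sources that share one header; return (header, rows)."""
    header, rows = None, []
    for source in sources:
        reader = csv.reader(source)
        file_header = next(reader, None)
        if file_header is None:  # skip empty files
            continue
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("monthly files must have identical columns")
        rows.extend(reader)
    return header, rows


def fits_in_excel(data_rows):
    """True if the data rows plus one header row fit in a single sheet."""
    return data_rows + 1 <= EXCEL_MAX_ROWS


# Two tiny in-memory "monthly files" standing in for the 12 real ones.
april = io.StringIO("ride_id,member_casual\n1,member\n2,casual\n")
may = io.StringIO("ride_id,member_casual\n3,casual\n")
header, rows = combine_csv([april, may])
```

    With roughly 5.4 million combined rows, fits_in_excel returns False, which matches the decision to move the merged table out of Excel.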

  20. Harmonization of sediment diatoms from hundreds of lakes in the northeastern...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Sep 13, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Harmonization of sediment diatoms from hundreds of lakes in the northeastern United States [Dataset]. https://catalog.data.gov/dataset/harmonization-of-sediment-diatoms-from-hundreds-of-lakes-in-the-northeastern-united-states
    Dataset updated
    Sep 13, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    United States, Northeastern United States
    Description

    Sediment diatoms are widely used to track environmental histories of lakes and their watersheds, but merging datasets generated by different researchers for further large-scale studies is challenging because of the taxonomic discrepancies caused by rapidly evolving diatom nomenclature and taxonomic concepts. Here we collated five datasets of lake sediment diatoms from the northeastern USA using a harmonization process which included updating synonyms, tracking the identity of inconsistently identified taxa, and grouping those that could not be resolved taxonomically.

    The dataset consists of a Portable Document Format (.pdf) file of the Voucher Flora, six Microsoft Excel (.xlsx) data files, an R script, and five output Comma Separated Values (.csv) files:

    • NE_Lakes_Voucher_Flora_102421.pdf: the Voucher Flora, which documents the morphological species concepts in the dataset using diatom images compiled into plates.
    • VoucherFloraTranslation_102421.xlsx: the translation scheme of the OTU codes to diatom scientific or provisional names, with identification sources, references, and notes.
    • Slide_accession_numbers_102421.xlsx: slide accession numbers in the ANS Diatom Herbarium.
    • “DiatomHarmonization_032222_files for R.zip”: an archive containing four Excel input data files, the R code, and a subfolder “OUTPUT” with five .csv files.
    • Counts_original_long_102421.xlsx: original diatom count data in long format.
    • Harmonization_102421.xlsx: the taxonomic harmonization scheme with notes and references.
    • SiteInfo_031922.xlsx: sampling site- and sample-level information.
    • WaterQualityData_021822.xlsx: a supplementary file with water quality data.
    • DiatomHarmonization_032222.R: the R code used to apply the harmonization scheme to the original diatom counts to produce the output files.

    The five resulting output files comprise wide-format files containing diatom count data at different harmonization steps (Counts_1327_wide.csv, Step1_1327_wide.csv, Step2_1327_wide.csv, Step3_1327_wide.csv) and the summary of the Indicator Species Analysis (INDVAL_RESULT.csv). The harmonization scheme (Harmonization_102421.xlsx) can be further modified based on additional taxonomic investigations, while the associated R code (DiatomHarmonization_032222.R) provides a straightforward mechanism for diatom data versioning.

    This dataset is associated with the following publication: Potapova, M., S. Lee, S. Spaulding, and N. Schulte. A harmonized dataset of sediment diatoms from hundreds of lakes in the northeastern United States. Scientific Data. Springer Nature, New York, NY, 9(540): 1-8, (2022).
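
    The general idea of applying a harmonization scheme to long-format counts and producing a wide table can be sketched as below. This is not the authors' R script (DiatomHarmonization_032222.R); it is a simplified Python illustration in which the sample ids, counts, and the single synonym pair are example values, and the scheme is reduced to a plain name-to-name lookup.

```python
from collections import defaultdict


def harmonize_counts(long_counts, scheme):
    """Apply a name-mapping scheme to long counts and pivot to wide form.

    long_counts: list of (sample_id, taxon_name, count)
    scheme: dict mapping an original taxon name to its harmonized name
    Returns {sample_id: {harmonized_name: summed_count}}.
    """
    wide = defaultdict(lambda: defaultdict(int))
    for sample_id, taxon, count in long_counts:
        harmonized = scheme.get(taxon, taxon)  # unmapped names pass through
        wide[sample_id][harmonized] += count   # merged taxa are summed
    return {sample: dict(taxa) for sample, taxa in wide.items()}


# Illustrative input: one synonym is folded into its updated name.
scheme = {"Achnanthes minutissima": "Achnanthidium minutissimum"}
long_counts = [
    ("L1", "Achnanthes minutissima", 10),
    ("L1", "Achnanthidium minutissimum", 5),
    ("L2", "Achnanthes minutissima", 7),
]
wide_counts = harmonize_counts(long_counts, scheme)
```

    Keeping the scheme as data separate from the code mirrors the versioning idea in the description: editing the mapping and re-running regenerates the harmonized counts.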

Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9

Cleaned NHANES 1988-2018

4 scholarly articles cite this dataset
Available download formats: txt
Dataset updated
Feb 18, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

• demographics (281 variables),
• dietary consumption (324 variables),
• physiological functions (1,040 variables),
• occupation (61 variables),
• questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
• medications (29 variables),
• mortality information linked from the National Death Index (15 variables),
• survey weights (857 variables),
• environmental exposure biomarker measurements (598 variables), and
• chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.

• The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
• "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
• "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
• “dictionary_drug_codes.csv” contains the dictionary of descriptors for the drug codes.
• “nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries (cleaned versions only) can be downloaded as a .zip file, which includes an .RData file and an .R file.

• “w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
• “m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.

• “example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.
• “example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
• “example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
• “example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.
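
The two ideas behind the first tutorials, merging modules on a participant identifier and weighting by survey weights, can be sketched as below. This is not the authors' starter code (which is written in R). The column names (SEQN_new, survey_weight, blood_lead) and all values are assumptions for illustration, and the weighted mean is only a point estimate: a proper design-based analysis also needs the strata and sampling-unit variables for variance estimation.

```python
def merge_modules(left_rows, right_rows, key="SEQN_new"):
    """Inner-join two lists of dict records on a participant id column."""
    right_by_id = {row[key]: row for row in right_rows}
    return [{**row, **right_by_id[row[key]]}
            for row in left_rows if row[key] in right_by_id]


def weighted_mean(rows, value, weight):
    """Survey-weighted mean of one variable (point estimate only)."""
    total_weight = sum(row[weight] for row in rows)
    return sum(row[value] * row[weight] for row in rows) / total_weight


# Hypothetical rows standing in for the demographics and chemicals modules.
demographics = [
    {"SEQN_new": "C-1", "survey_weight": 2.0},
    {"SEQN_new": "C-2", "survey_weight": 1.0},
    {"SEQN_new": "C-3", "survey_weight": 1.0},
]
chemicals = [
    {"SEQN_new": "C-1", "blood_lead": 1.0},
    {"SEQN_new": "C-2", "blood_lead": 4.0},
]
merged = merge_modules(demographics, chemicals)
mean_lead = weighted_mean(merged, "blood_lead", "survey_weight")
```

Participants missing from either module are dropped by the inner join, so the weighted estimate only reflects those with a measurement.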
