22 datasets found
  1. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Sep 17, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Available download formats
    bin, csv, zip, pdf
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

    • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
    • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
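
    The second step, simulating death events year by year from estimated risk relationships, can be illustrated with a minimal Python sketch. It is not the authors' R code: the baseline hazard, the log hazard ratio, the reference exposure, and the exposure values below are all hypothetical numbers chosen for illustration, and only a single predictor (PM2.5) is used.

```python
import math
import random

# Hypothetical values for illustration only; none are taken from the article.
BASELINE_ANNUAL_HAZARD = 0.005   # assumed baseline hazard per year
LOG_HR_PER_UNIT_PM25 = 0.01      # assumed log hazard ratio per unit of PM2.5
REFERENCE_PM25 = 10.0            # assumed reference exposure level


def annual_death_probability(pm25):
    """Probability of dying within one year under a proportional-hazards form."""
    hazard = BASELINE_ANNUAL_HAZARD * math.exp(
        LOG_HR_PER_UNIT_PM25 * (pm25 - REFERENCE_PM25)
    )
    return 1.0 - math.exp(-hazard)


def simulate_death_year(annual_pm25, rng):
    """Return the index of the first year with a simulated death, or None."""
    for year_index, pm25 in enumerate(annual_pm25):
        if rng.random() < annual_death_probability(pm25):
            return year_index
    return None


rng = random.Random(42)
exposures = [11.2, 10.8, 10.1, 9.7, 9.5]  # made-up annual PM2.5 values
death_year = simulate_death_year(exposures, rng)
```

    Because the hazard is scaled by an exponential of the exposure, a higher PM2.5 value always gives a higher annual death probability in this sketch, which is the property that lets simulated data reproduce an exposure-mortality association.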

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.

  2. Linkage-Disequilibrium (LD) matrices for six continental ancestry groups...

    • zenodo.org
    application/gzip
    Updated Jan 8, 2025
    Cite
    Shadi Zabad (2025). Linkage-Disequilibrium (LD) matrices for six continental ancestry groups from the UK Biobank [Dataset]. http://doi.org/10.5281/zenodo.14614207
    Available download formats
    application/gzip
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shadi Zabad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Linkage Disequilibrium (LD) matrices for six ancestry groups from the UK Biobank.

    LD matrices record the SNP-by-SNP correlations in a given sample of individuals from the general population. In this case, we threshold the matrices so that we only record the correlations between variants in the same LD block (defined by LDetect). The continental ancestry groups are defined by the Pan-UKB initiative as:

    • EUR = European ancestry (N=362446)
    • CSA = Central/South Asian ancestry (N=8284)
    • AFR = African ancestry (N=6255)
    • EAS = East Asian ancestry (N=2700)
    • MID = Middle Eastern ancestry (N=1567)
    • AMR = Admixed American ancestry (N=987)

    The sample sizes here are restricted to unrelated individuals in the UK Biobank. The matrices were computed using magenpy and quantized to int8 data type for better compressibility. The standard matrices (EUR.tar.gz, AFR.tar.gz, ...) contain pairwise correlations for 1.4 million HapMap3+ variants. For European samples, we also provide LD matrices that record pairwise correlations for up to 18 million variants (EUR_18m_variants.tar.gz).
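
    To give a sense of what int8 quantization of correlations involves, here is a minimal Python sketch. The description does not state the exact scheme magenpy uses, so the symmetric linear scaling to the range [-127, 127] below is an assumption made for illustration, and the function names are hypothetical.

```python
def quantize_int8(r):
    """Map a correlation in [-1, 1] to an integer code in [-127, 127]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return int(round(r * 127))


def dequantize_int8(q):
    """Recover an approximate correlation from its integer code."""
    return q / 127


correlations = [1.0, 0.5, -0.25, 0.013, 0.0]  # made-up values
codes = [quantize_int8(r) for r in correlations]
restored = [dequantize_int8(q) for q in codes]
max_error = max(abs(a - b) for a, b in zip(correlations, restored))
```

    Under this scheme each correlation fits in one byte instead of four or eight, and the rounding error is at most half a quantization step (0.5/127, roughly 0.004).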

    For more details on how these matrices were computed, please consult our manuscript:

    Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms
    Shadi Zabad, Chirayu Anant Haryan, Simon Gravel, Sanchit Misra, Yue Li

    To access these matrices, consult the codebase of magenpy, our custom Python package with special data structures for processing these LD matrices.

  3. GWAS summary statistics for Standing Height from the UK Biobank (5-fold...

    • zenodo.org
    application/gzip
    Updated Dec 3, 2024
    Cite
    Shadi Zabad (2024). GWAS summary statistics for Standing Height from the UK Biobank (5-fold cross-validation) [Dataset]. http://doi.org/10.5281/zenodo.14270953
    Available download formats
    application/gzip
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shadi Zabad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains GWAS summary statistics for Standing Height in the UK Biobank.

    The GWAS study used data from "White British" samples (N = 337225), which were randomly divided into 5 folds for the purposes of cross-validation. The upload contains, for each fold, GWAS summary statistics for the training and test set. The test summary statistics can be used to evaluate PRS models via pseudo-validation methods. Association testing was done with plink2.
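
    The random division into five folds can be pictured with a short Python sketch. This is not the authors' procedure: the shuffling seed, the toy sample IDs, and the function names are assumptions made for illustration.

```python
import random


def split_into_folds(sample_ids, n_folds=5, seed=0):
    """Shuffle sample IDs and deal them into n_folds nearly equal folds."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def train_test_sets(folds, test_index):
    """Use one fold as the test set and the remaining folds as training."""
    test = folds[test_index]
    train = [s for i, fold in enumerate(folds) if i != test_index for s in fold]
    return train, test


sample_ids = [f"sample_{i}" for i in range(23)]  # toy IDs, not real participants
folds = split_into_folds(sample_ids)
train, test = train_test_sets(folds, test_index=0)
```

    Each sample lands in exactly one fold, so the training and test sets for a given fold never overlap.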

    The structure of the data is as follows:

    • train
      • fold_1
        • chr_1.PHENO1.glm.linear
        • chr_2.PHENO1.glm.linear
        • ...
      • fold_2
      • fold_3
      • ...
    • test
      • fold_1
      • fold_2
      • fold_3
      • ...
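
    The layout above can be turned into file paths with a small Python helper. This is a sketch under assumptions: the root folder name `height_gwas` is hypothetical, chromosomes are assumed to be the autosomes 1 to 22, and the test folders are assumed to mirror the file naming shown for the training folders.

```python
from pathlib import PurePosixPath


def summary_stats_path(root, split, fold, chromosome):
    """Build the expected path of one per-chromosome summary-statistics file."""
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    file_name = f"chr_{chromosome}.PHENO1.glm.linear"
    return PurePosixPath(root) / split / f"fold_{fold}" / file_name


# Assumption: folds 1-5 and autosomes 1-22; adjust if the archive differs.
paths = [
    summary_stats_path("height_gwas", "train", fold, chrom)
    for fold in range(1, 6)
    for chrom in range(1, 23)
]
```

    Generating the paths up front makes it easy to check that every expected file is present after extracting the archive.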

    For more details about the GWAS study, Quality Control (QC) criteria, or other information, please consult our publication:

    Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009

    If you use this data in your work, please cite the publication above.

  4. NURTuRE Chronic Kidney Disease (NCKD)

    • healthdatagateway.org
    unknown
    Updated Jun 14, 2017
    Cite
    NURTuRE (2017). NURTuRE Chronic Kidney Disease (NCKD) [Dataset]. https://healthdatagateway.org/en/dataset/1396
    Available download formats
    unknown
    Dataset updated
    Jun 14, 2017
    Dataset authored and provided by
    NURTuRE
    License

    https://saildatabank.com/data/apply-to-work-with-the-data/

    Description

    The NURTuRE project was devised to create a national kidney biobank, as recommended in the UK Renal Research Strategy 2016.

    Strategic Aims: to work towards achieving this, NURTuRE will:

    • Create a national Kidney Bio Bank for collection and storage of biological samples from 3,000 CKD patients and up to 800 NS patients, to provide a strategic resource for fundamental and translational research.
    • Develop and implement proactive UK protocol-driven cohort studies in CKD and NS to investigate determinants of and risk factors for clinically important adverse outcomes.
    • Engage patient cohorts, with consent to approach for any future research study.

    NURTuRE Objectives:

    • The provision of comprehensive clinical and laboratory data from cohort studies.
    • The provision of high-quality bio-samples with centralised storage/retrieval.
    • To carry out core biomarker analysis of biopsy specimens in biofluids of all patients recruited and parallel assessment.
    • Follow-up specimen collection.

    First patient recruitment: by 31 June 2017.

    • CKD: baseline and 100% follow-up collections, over 2 years.
    • NS: baseline and 20% follow-up, over 3 years.
    • Healthy volunteers: baseline.

    Biological samples availability - Samples are available via the NURTuRE biobank - https://nurturebiobank.org/

  5. Association of all hypermetropia and hypermetropia (low or moderate/high),...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Cite
    Phillippa M. Cumberland; Yanchun Bao; Pirro G. Hysi; Paul J. Foster; Christopher J. Hammond; Jugnoo S. Rahi (2023). Association of all hypermetropia and hypermetropia (low or moderate/high), by key socio-demographic factors. [Dataset]. http://doi.org/10.1371/journal.pone.0139780.t004
    Available download formats
    xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Phillippa M. Cumberland; Yanchun Bao; Pirro G. Hysi; Paul J. Foster; Christopher J. Hammond; Jugnoo S. Rahi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Association of all hypermetropia and hypermetropia (low or moderate/high), by key socio-demographic factors.

    • No qualifications, State school examinations at 16 years of age (‘O’ levels), at 18 years (‘A’ levels) or University/other professional qualification.
    • +: Number of eyes.
    • ++: model adjusted for eye laterality, gender, age (continuous), educational qualification, accommodation tenure, ethnicity and test centre.
  6. DreamData

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    RairoRapha (2025). DreamData [Dataset]. https://www.kaggle.com/datasets/thetraveller/dreambiome
    Available download formats
    zip (18982050 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    RairoRapha
    Description

    📊 Data Sources (Scientific & Real-World)

    DreamBiome is built entirely on real human data — no synthetic, no invented corpora.
    The system integrates three research-backed data pillars:

    1. Dream Reports (DreamBank + Dryad RSOS)

    • 140+ real dream narratives
    • Includes series such as jasmine1, norms-m, etc.
    • Used in cognitive science, psychology, and computational linguistics
    • Contains word counts, metadata, and (Dryad) numerical annotations

    Sources:
    - DreamBank (Hall & Van de Castle / UC Santa Cruz): https://dreambank.net/
    - Dryad RSOS “textual dream analysis” dataset: https://doi.org/10.5061/dryad.4t880

    2. Sleep Architecture (Sleep-EDF Hypnogram Database)

    • 70+ professionally scored nights
    • 30-second epochs labeled across W, N1, N2, N3, REM
    • Includes efficiency, total sleep time, awakenings, and REM%
    • Used to generate the “sleep arcs” (dominant stage per quarter-night)

    Source:
    - PhysioNet Sleep-EDF Expanded Dataset: https://physionet.org/content/sleep-edfx/1.0.0/
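
    The "dominant stage per quarter-night" idea can be sketched in a few lines of Python. The description gives no further detail on how the sleep arcs are computed, so the quarter boundaries, the definition of efficiency as the fraction of non-wake epochs, and the toy hypnogram below are assumptions made for illustration.

```python
from collections import Counter

EPOCH_SECONDS = 30  # epoch length stated in the description


def dominant_stage_per_quarter(hypnogram):
    """Split a night into four quarters and return the most common stage in each."""
    n = len(hypnogram)
    if n < 4:
        raise ValueError("need at least four epochs")
    bounds = [round(i * n / 4) for i in range(5)]
    return [
        Counter(hypnogram[bounds[i]:bounds[i + 1]]).most_common(1)[0][0]
        for i in range(4)
    ]


def total_sleep_seconds(hypnogram):
    """Total time spent in any stage other than wake ('W')."""
    return sum(EPOCH_SECONDS for stage in hypnogram if stage != "W")


def sleep_efficiency(hypnogram):
    """Fraction of recorded time spent asleep (assumed definition)."""
    return total_sleep_seconds(hypnogram) / (EPOCH_SECONDS * len(hypnogram))


# Toy hypnogram of 16 epochs (8 minutes); real nights have hundreds of epochs.
night = ["W", "W", "N1", "W", "N2", "N2", "N2", "N1",
         "N3", "N3", "N3", "N2", "REM", "REM", "N2", "REM"]
arc = dominant_stage_per_quarter(night)
```

    For this toy night the arc moves from wake through N2 and N3 to REM, which is the kind of four-step summary the description calls a sleep arc.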

    3. Epidemiological Context (Insomnia Prevalence)

    • Lightweight region-level insomnia statistics
    • Global, UK Biobank, and East Asia meta-estimates
    • Used to contextualize DreamBiome World generation

    Sources:
    - Ohayon (2002). Epidemiology of insomnia. Sleep Medicine Reviews.
    - Lane et al. (2019). UK Biobank sleep data.
    - Jiang et al. (2015). East Asia insomnia prevalence. Journal of Sleep Research.

  7. Effect sizes for 200+ polygenic scores

    • figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Florian Privé (2023). Effect sizes for 200+ polygenic scores [Dataset]. http://doi.org/10.6084/m9.figshare.14074760.v2
    Available download formats
    txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Privé
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • PGS-effects.csv.gz: vectors of effect sizes for 215 polygenic scores (PGS)
    • pred-cor: partial correlations of these PGS with the corresponding phenotypes, in eight ancestry groups from the UK Biobank
    • phenotype-description.xlsx: description of all phenotypes used in the study (30 were discarded due to very low prediction)

    These report the best prediction from penalized regression and LDpred2. We also provide these files separately for penalized regression (PLR) and LDpred2-auto (without using the test set). The effect size file for penalized regression is very small because vectors of effects are very sparse. Those are based on the UK Biobank data only.
  8. 24 genome-wide significant loci discovered in the metaUSAT multivariable...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Waheed Ul-Rahman Ahmed; Manal I. A. Patel; Michael Ng; James McVeigh; Krina Zondervan; Akira Wiberg; Dominic Furniss (2023). 24 genome-wide significant loci discovered in the metaUSAT multivariable meta-analysis of inguinal, femoral, umbilical, hiatus hernia in 57,418 cases and 287,090 controls in UK Biobank. [Dataset]. http://doi.org/10.1371/journal.pone.0272261.t004
    Available download formats
    xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Waheed Ul-Rahman Ahmed; Manal I. A. Patel; Michael Ng; James McVeigh; Krina Zondervan; Akira Wiberg; Dominic Furniss
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistically significant signals from the metaUSAT analysis are shown in the left-hand column. The central column shows the association p-values for those SNPs in the six original GWAS analyses, with the direction of effect indicated by a + or – sign. Candidate genes are those selected from the prioritised genes (using the four mapping strategies described previously for all GWAS-discovered loci) or genes in proximity as identified within the UCSC genome browser.

  9. Data from: Brain Ages Derived from Different MRI Modalities are Associated...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Apr 24, 2025
    Cite
    Andrei-Claudiu Roibu; Stanislaw Adaszewski; Torsten Schindler; Stephen M. Smith; Ana I.L. Namburete; Frederik J. Lange (2025). Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes [Dataset]. http://doi.org/10.5281/zenodo.8110876
    Available download formats
    csv
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrei-Claudiu Roibu; Stanislaw Adaszewski; Torsten Schindler; Stephen M. Smith; Ana I.L. Namburete; Frederik J. Lange
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Brain ageing is a highly variable, spatially and temporally heterogeneous process, marked by numerous structural and functional changes. These can cause discrepancies between individuals’ chronological age and the apparent age of their brain, as inferred from neuroimaging data. Machine learning models, and particularly Convolutional Neural Networks (CNNs), have proven adept in capturing patterns relating to ageing induced changes in the brain. The differences between the predicted and chronological ages, referred to as brain age deltas, have emerged as useful biomarkers for exploring those factors which promote accelerated ageing or resilience, such as pathologies or lifestyle factors. However, previous studies rely only on structural neuroimaging for predictions, overlooking potentially informative functional and microstructural changes. Here we show that multiple contrasts derived from different MRI modalities can predict brain age, each encoding bespoke brain ageing information. By using 3D CNNs and UK Biobank data, we found that 57 contrasts derived from structural, susceptibility-weighted, diffusion, and functional MRI can successfully predict brain age. For each contrast, different patterns of association with non-imaging phenotypes were found, resulting in a total of 191 unique, statistically significant associations. Furthermore, we found that ensembling data from multiple contrasts results in both higher prediction accuracies and stronger correlations to non-imaging measurements. Our results demonstrate that other 3D contrasts and modalities, which have not been considered so far for the task of brain age prediction, encode different information about the ageing brain. We envision our work as being the starting point for future investigations into the causal links underpinning the observed brain age deltas and non-imaging measurement associations. 
    For instance, drug effects can be monitored, given that certain medications correlated with accelerated brain ageing. Furthermore, continued development of brain age models could facilitate their deployment in clinical trials for recruitment and monitoring, and hospitals for diagnostic and screening tasks.

    Data Description

    This dataset contains the full correlation results with all nIDPs in the UK Biobank. These are presented in datasets split by sex into Female and Male subjects. For easier data manipulation, two smaller datasets have also been made available, containing just those correlations that pass the False Discovery Rate (FDR) threshold.
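
    The description does not say which FDR procedure or threshold level was applied. As an illustration only, the following Python sketch implements the widely used Benjamini-Hochberg procedure; the choice of procedure, the level of 0.05, and the p-values are assumptions rather than details of this dataset.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which p-values pass the BH procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value satisfies p_(k) <= k / m * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every p-value ranked at or below the cutoff is declared significant.
    passed = [False] * m
    for idx in order[:cutoff_rank]:
        passed[idx] = True
    return passed


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]  # made-up values
significant = benjamini_hochberg(p_values)
```

    Filtering a table of associations with a mask like `significant` is one way to obtain a smaller file that keeps only the results passing the threshold.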

    As experiments were also conducted for ensembles using multiple contrasts, similar datasets are provided for those.

    Finally, global datasets are also provided. These are the concatenation of the associations contained in the Male and Female datasets.

    Paper & Code

    The original paper for this article can be accessed here:

    To access the codes relevant for this project, please access the project GitHub Repos:

    If using this work, please cite it based on the above paper, or using the following BibTex:

    @inproceedings{roibu2023brain,
     title={Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes},
     author={Roibu, Andrei-Claudiu and Adaszewski, Stanislaw and Schindler, Torsten and Smith, Stephen M and Namburete, Ana IL and Lange, Frederik J},
     booktitle={2023 10th IEEE Swiss Conference on Data Science (SDS)},
     pages={17--25},
     year={2023},
     organization={IEEE},
     doi={10.1109/SDS57534.2023.00010}
    }

    Data Access

    The data for this project is freely available upon application at the UK Biobank. For more information regarding the individual nIDPs, please access the UK Biobank Showcase website at: https://biobank.ctsu.ox.ac.uk/showcase/search.cgi

    Funding

    ACR is supported by EPSRC Grant EP/S024093/1, F. Hoffmann-La Roche AG and a 2021 Industrial Fellowship offered by the Royal Commission for the Exhibition of 1851. SMS is supported by a Wellcome Trust Collaborative Award 215573/Z/19/Z. AILN is grateful for support from the Academy of Medical Sciences under the Springboard Awards scheme (SBF005/1136), and the Bill and Melinda Gates Foundation. FJL is supported by a Wellcome Trust Collaborative Award (215573/Z/19/Z). The WIN is supported by core funding from the Wellcome Trust (203139/Z/16/Z). The computational aspects were supported by the Wellcome Trust (203141/Z/16/Z) and the NIHR Oxford BRC. Corresponding authors: ACR (andreiroibu@icloud.com), SA (stanislaw.adaszewski@roche.com) and AILN (ana.namburete@cs.ox.ac.uk).

  10. Four loci significantly associated with overlap hernia in 5,219 cases and...

    • figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Waheed Ul-Rahman Ahmed; Manal I. A. Patel; Michael Ng; James McVeigh; Krina Zondervan; Akira Wiberg; Dominic Furniss (2023). Four loci significantly associated with overlap hernia in 5,219 cases and 26,095 controls in UK Biobank. [Dataset]. http://doi.org/10.1371/journal.pone.0272261.t002
    Available download formats
    xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Waheed Ul-Rahman Ahmed; Manal I. A. Patel; Michael Ng; James McVeigh; Krina Zondervan; Akira Wiberg; Dominic Furniss
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Four loci significantly associated with overlap hernia in 5,219 cases and 26,095 controls in UK Biobank.

  11. SynergiQC

    • catalog.paradim.science
    Updated Oct 31, 2025
    Cite
    Philippe Joubert (2025). SynergiQC [Dataset]. https://catalog.paradim.science/index.php/en/synergiqc
    Dataset updated
    Oct 31, 2025
    Dataset provided by
    Cartographies RSN
    Authors
    Philippe Joubert
    Description

    The dataset contains lung cancer CT images (DICOM format) with segmentations (DICOM SEG) of tumors. Clinical and research data associated with images are available through IUCPQ-UL Biobank.

  12. European (British) LD files for GhostKnockoffGWAS

    • zenodo.org
    • nde-dev.biothings.io
    zip
    Updated Apr 10, 2025
    Cite
    benjamin chu (2025). European (British) LD files for GhostKnockoffGWAS [Dataset]. http://doi.org/10.5281/zenodo.15191305
    Available download formats
    zip
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    benjamin chu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 10, 2025
    Area covered
    United Kingdom
    Description

    This contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on unrelated British samples of the UK-Biobank (n = 306604). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.

    • This is the output of applying the solveblock executable directly on 306,604 unrelated British samples of the UK-Biobank.
    • Quasi-independent blocks are computed by applying the snp_ldsplit function with parameters thr_r2=0.01, max_r2=0.3, min_size = 500, and max_size = {1000, 1500, 3000, 6000, 10000}.
    • SNPs with minor allele frequency less than 0.01 or Hardy-Weinberg equilibrium p-value less than 1e-6 are removed.
    • Only HG19 coordinates are available.
    • Knockoff optimization was carried out by the Knockoffs.jl Julia package: https://github.com/biona001/Knockoffs.jl
    • The results (i.e. the files available on this site) are saved as .csv and .h5 files for easier access, and are directly readable by GhostKnockoffGWAS.
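
    The SNP filter in the list above (minor allele frequency below 0.01 or Hardy-Weinberg p-value below 1e-6) can be sketched in Python as follows. This is not the pipeline's own code: the record layout, the SNP IDs, and the genotype counts are hypothetical, and the Hardy-Weinberg p-value is assumed to have been computed elsewhere.

```python
MAF_MIN = 0.01     # threshold stated in the description
HWE_P_MIN = 1e-6   # threshold stated in the description


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Compute the minor allele frequency from genotype counts (AA, AB, BB)."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_b = (2 * n_bb + n_ab) / total_alleles
    return min(freq_b, 1.0 - freq_b)


def passes_qc(snp):
    """Keep a SNP unless its MAF is below 0.01 or its HWE p-value is below 1e-6."""
    maf = minor_allele_frequency(snp["n_aa"], snp["n_ab"], snp["n_bb"])
    return maf >= MAF_MIN and snp["hwe_p"] >= HWE_P_MIN


# Toy records with hypothetical IDs; hwe_p is assumed to be precomputed.
snps = [
    {"id": "snp_a", "n_aa": 900, "n_ab": 95, "n_bb": 5, "hwe_p": 0.31},
    {"id": "snp_b", "n_aa": 995, "n_ab": 5, "n_bb": 0, "hwe_p": 0.90},
    {"id": "snp_c", "n_aa": 500, "n_ab": 100, "n_bb": 400, "hwe_p": 1e-9},
]
kept = [snp["id"] for snp in snps if passes_qc(snp)]
```

    In this toy example the second SNP is dropped for being too rare and the third for departing from Hardy-Weinberg equilibrium, so only the first is kept.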

    Note: We previously released another set of EUR LD files. This set of LD files should be preferred over the previous one. The main difference is that the previous entry used quasi-independent blocks from LDetect computed on the 1000 Genomes Project. Here we compute the independent blocks using snp_ldsplit directly on the UK-Biobank British samples.

  13. ASPREE Genome-wide SNP Genotyping Dataset

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Nov 16, 2022
    Cite
    Paul Lacaze; John McNeill (2022). ASPREE Genome-wide SNP Genotyping Dataset [Dataset]. http://doi.org/10.26180/21097654.V1
    Dataset updated
    Nov 16, 2022
    Dataset provided by
    Monash University
    Authors
    Paul Lacaze; John McNeill
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ASPirin in Reducing Events in the Elderly (ASPREE) is a clinical trial and longitudinal study of healthy ageing that enrolled 16,703 Australians aged over 70 years and 2,411 Americans aged over 65. The primary ASPREE trial has received >$60M in NIH funding since 2012. Each ASPREE participant’s health is tracked longitudinally through extensive phenotyping and collection of clinical outcome data. The ASPREE Healthy Ageing Biobank is an associated biorepository of blood, saliva and urine samples from >15,000 ASPREE participants, 10,000 of whom have now provided a matched follow-up 3-year sample. Biospecimens are consented for genetic and biomarker studies, enabling ASPREE to conduct molecular epidemiology and healthy ageing research. ASPREE facilitates over 12 sub-studies funded through NHMRC.

    Genome wide association analysis has been undertaken in multiple areas:

    • All-cause and vascular dementia
    • Stroke
    • Genetic modifiers for Alzheimer's disease
    • Polygenic resilience scores capture genetic effect for Alzheimer's disease
    • Genome wide association analysis for Alzheimer's disease

    https://bioplatforms.com/projects/aspree-framework-initiative/

    To make data requests, you may undertake an approval process by contacting Paul Lacaze on Paul.Lacaze@monash.edu



  14. Segmentation Networks and Representative Meshes from UK Biobank

    • zenodo.org
    zip
    Updated Jun 14, 2025
    Cite
    Devran Ugurlu; Shuang Qian; Elliot Fairweather; Charlene Mauger; Bram Ruijsink; Laura Dal Toso; Yu Deng; Marina Strocchi; Reza Razavi; Alistair Young; Pablo Lamata; Steven Niederer; Martin Bishop (2025). Segmentation Networks and Representative Meshes from UK Biobank [Dataset]. http://doi.org/10.5281/zenodo.15649643
    Available download formats
    zip
    Dataset updated
    Jun 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Devran Ugurlu; Shuang Qian; Elliot Fairweather; Charlene Mauger; Bram Ruijsink; Laura Dal Toso; Yu Deng; Marina Strocchi; Reza Razavi; Alistair Young; Pablo Lamata; Steven Niederer; Martin Bishop
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We present a database of representative left and right ventricular meshes constructed from patient-specific models based on a large cohort of ~55k participants from UK Biobank. It comprises 1423 representative tetrahedral finite element meshes across sex (male, female), body mass index (range: 16 - 42 kg/m²) and age (range: 49 - 80 years).

    For each mesh, it also includes:

    • a realistic biventricular myocardial fibre structure
    • a morphological coordinate system which describes the positions within ventricles based on (1) the apical-basal (Z), (2) transmural (ρ) (from endocardium to epicardium), (3) rotational (Φ) (anterior, anteroseptal, inferior, inferolateral, anterolateral) and (4) chamber-wise (left ventricle and right ventricle) coordinates.

    We also present trained network weights and nnUNet plan and hyperparameter selection files for cine MR segmentation models trained separately for the following views: 2 chamber, 3 chamber, 4 chamber and short axis. These are supplied as a zip of relevant nnUNet files for each view: Dataset101_UKBB_LAX_2Ch.zip, Dataset102_UKBB_LAX_3Ch.zip, Dataset103_UKBB_LAX_4Ch.zip, Dataset100_UKBB_Petersen_SAX.zip.

  15. Regional association plots for ancestry groups in the discovery cohort.

    • figshare.com
    bin
    Updated Aug 28, 2023
    Cite
    Amy D. Stockwell; Michael C. Chang; Anubha Mahajan; William Forrest; Neha Anegondi; Rion K. Pendergrass; Suresh Selvaraj; Jens Reeder; Eric Wei; Victor A. Iglesias; Natalie M. Creps; Laura Macri; Andrea N. Neeranjan; Marcel P. van der Brug; Suzie J. Scales; Mark I. McCarthy; Brian L. Yaspan (2023). Regional association plots for ancestry groups in the discovery cohort. [Dataset]. http://doi.org/10.1371/journal.pgen.1010609.s004
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Amy D. Stockwell; Michael C. Chang; Anubha Mahajan; William Forrest; Neha Anegondi; Rion K. Pendergrass; Suresh Selvaraj; Jens Reeder; Eric Wei; Victor A. Iglesias; Natalie M. Creps; Laura Macri; Andrea N. Neeranjan; Marcel P. van der Brug; Suzie J. Scales; Mark I. McCarthy; Brian L. Yaspan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description
    • AFR = African descent; AMR = Admixed American descent; EAS = East Asian descent; EUR = European descent. (XLSX)
  16. eRNA GReX

    • zenodo.org
    zip
    Updated Jun 12, 2024
    Cite
    Michael J. Betti; Eric Gamazon (2024). eRNA GReX [Dataset]. http://doi.org/10.5281/zenodo.11212496
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael J. Betti; Eric Gamazon
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains all model weights and corresponding datasets generated by Betti et al. in the manuscript Genetically regulated enhancer RNA expression predicts enhancer-promoter contact frequency and reveals genetic mechanisms at complex trait-associated loci. The following are the contents of the sub-directories in this dataset:

    • coloc: Colocalization results for genome-wide significant (p < 5 × 10⁻⁸) GWAS associations in the UK Biobank with eRNA and canonical gene eQTLs (Supplementary Tables 11 and 12).
    • contact_model_training: Input datasets from whole blood and brain, respectively, that were used to train the neural network-based models of contact frequency.
    • eqtl_mapping: eQTLs mapped across 49 cell and tissue types for both eRNAs and canonical genes.
    • scz_mr: Inputs and results for Mendelian randomization analysis of eRNA and canonical gene-based TWAS of schizophrenia.
    • scz_twas: eRNA and canonical gene-based TWAS results of schizophrenia.
    • trained_models: Model weights and SNP covariance matrices for genetically regulated eRNA expression (GReX) across 49 cell and tissue types.
    • uk_biobank_twas: eRNA-based TWAS summary statistics for 4,671 UK Biobank traits across 49 cell and tissue types.

    Please cite:

    Betti, M.J., Aldrich, M.C., Lin, P., & Gamazon, E.R. (2024). Genetically regulated enhancer RNA expression predicts enhancer-promoter contact frequency and reveals genetic mechanisms at complex trait-associated loci. Preprint.

    Betti, M.J., Aldrich, M.C., Lin, P., & Gamazon, E.R. (2024). eRNA GReX (Version 1.0). Zenodo. 10.5281/zenodo.11212496

  17. Caribbean LD files for GhostKnockoffGWAS

    • zenodo.org
    zip
    Updated Apr 10, 2025
    Cite
    benjamin chu (2025). Caribbean LD files for GhostKnockoffGWAS [Dataset]. http://doi.org/10.5281/zenodo.15192021
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    benjamin chu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Apr 10, 2025
    Area covered
    Caribbean
    Description

    This dataset contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on Caribbean samples of the UK Biobank (n = 4517). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.

    • This is the output of applying the solveblock executable directly to 4517 Caribbean samples of the UK Biobank.
    • Quasi-independent blocks are computed by applying the snp_ldsplit function with parameters thr_r2 = 0.01, max_r2 = 0.3, min_size = 500, and max_size = {1000, 1500, 3000, 6000, 10000}.
    • SNPs with minor allele frequency less than 0.01 or Hardy-Weinberg equilibrium p-value less than 1e-6 are removed.
    • Only HG19 coordinates are available.
    • Knockoff optimization was carried out with the Knockoffs.jl Julia package: https://github.com/biona001/Knockoffs.jl
    • The results (i.e. the files available on this site) are saved as .csv and .h5 files for easier access, and are directly readable by GhostKnockoffGWAS.
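
As an illustration of the SNP filter described in the list above, the following is a minimal Python sketch. It is not the pipeline's own code: the record layout, the field names (`maf`, `hwe_p`) and the example values are hypothetical, and only the two thresholds come from the description.

```python
# Thresholds quoted in the dataset description.
MAF_MIN = 0.01
HWE_MIN = 1e-6


def passes_qc(snp: dict) -> bool:
    """Keep a SNP unless its MAF is below 0.01 or its HWE p-value is below 1e-6."""
    return snp["maf"] >= MAF_MIN and snp["hwe_p"] >= HWE_MIN


# Hypothetical records, for illustration only.
snps = [
    {"id": "snp_a", "maf": 0.25, "hwe_p": 0.40},   # passes both filters
    {"id": "snp_b", "maf": 0.004, "hwe_p": 0.90},  # removed: MAF too low
    {"id": "snp_c", "maf": 0.12, "hwe_p": 1e-9},   # removed: fails HWE
]

kept = [s["id"] for s in snps if passes_qc(s)]
```

A SNP is dropped as soon as either condition fails, which matches the "or" in the description.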
  18. Chinese LD files for GhostKnockoffGWAS

    • zenodo.org
    zip
    Updated Apr 11, 2025
    Cite
    benjamin chu (2025). Chinese LD files for GhostKnockoffGWAS [Dataset]. http://doi.org/10.5281/zenodo.15198714
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    benjamin chu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on Chinese samples of the UK Biobank (n = 1574). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.

    • This is the output of applying the solveblock executable directly to 1574 Chinese samples of the UK Biobank.
    • Quasi-independent blocks are computed by applying the snp_ldsplit function with parameters thr_r2 = 0.01, max_r2 = 0.3, min_size = 500, and max_size = {1000, 1500, 3000, 6000, 10000}.
    • SNPs with minor allele frequency less than 0.01 or Hardy-Weinberg equilibrium p-value less than 1e-6 are removed.
    • Only HG19 coordinates are available.
    • Knockoff optimization was carried out with the Knockoffs.jl Julia package: https://github.com/biona001/Knockoffs.jl
    • The results (i.e. the files available on this site) are saved as .csv and .h5 files for easier access, and are directly readable by GhostKnockoffGWAS.
  19. Data from: In search of the genetic variants of human sex ratio at birth:...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Aug 4, 2025
    Cite
    Siliang Song; Jianzhi Zhang (2025). In search of the genetic variants of human sex ratio at birth: Was Fisher wrong about sex ratio evolution? [Dataset]. http://doi.org/10.5061/dryad.vdncjsz43
    Explore at:
    Dataset updated
    Aug 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Siliang Song; Jianzhi Zhang
    Description

    The human sex ratio (fraction of males) at birth is close to 0.5 at the population level, an observation commonly explained by Fisher's principle. However, past human studies yielded conflicting results regarding the existence of sex ratio-influencing mutations, a prerequisite to Fisher’s principle, raising the question of whether the nearly even population sex ratio is instead dictated by the random X/Y chromosome segregation in male meiosis. Here we show that, because a person’s offspring sex ratio (OSR) has an enormous measurement error, a gigantic sample is required to detect OSR-influencing genetic variants. Conducting a UK Biobank-based genome-wide association study that is more powerful than previous studies, we detect an OSR-associated genetic variant, which awaits verification in independent samples. Given the abysmal precision in measuring OSR, it is unsurprising that the estimated heritability of OSR is effectively zero. We further show that OSR’s estimated heritability would ...

    GWAS: When conducting the GWAS in the UKB, we did not simply use the sibling sex ratio as the trait, because it is difficult to account for the different estimation errors of the sibling sex ratio across families that result from variation in family size. For example, individual A has one brother and zero sisters, while individual B has four brothers and one sister. Although A has a higher sibling sex ratio than B, B’s siblings obviously provide stronger evidence for a male-biased sibling sex ratio than A’s siblings. To properly weight the data by family size, we considered the birth of each sibling as an independent event. In the above example, we would associate A’s genotype with one male birth and associate B’s genotype with four male births and one female birth. In the GWAS, a male birth is coded as 1 and a female birth is coded as 0. The UKB participants have a total of 873,715 full siblings, leading to unprecedented statistical power.
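
The per-birth coding described above can be sketched in a few lines of Python. This is only an illustration of the A/B example from the text, not the authors' script; the function name and the input layout (brothers, sisters per individual) are assumptions.

```python
from typing import Dict, List, Tuple


def expand_births(sibling_counts: Dict[str, Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Turn (brothers, sisters) counts into one row per sibling birth.

    Each row pairs the individual's ID with the birth outcome:
    1 for a male birth, 0 for a female birth.
    """
    rows: List[Tuple[str, int]] = []
    for person, (brothers, sisters) in sibling_counts.items():
        rows.extend((person, 1) for _ in range(brothers))
        rows.extend((person, 0) for _ in range(sisters))
    return rows


# Individuals A and B from the example in the text.
counts = {"A": (1, 0), "B": (4, 1)}
births = expand_births(counts)
```

Individual A contributes a single male birth, while B contributes four male births and one female birth, so larger families carry proportionally more weight in the association test.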
    In our GWAS in the UKB, we i...

    In search of the genetic variants of human sex ratio at birth: Was Fisher wrong about sex ratio evolution?

    https://doi.org/10.5061/dryad.vdncjsz43

    Description of the data and file structure

    GWAS summary statistics and simulation data of the paper "In search of the genetic variants of human sex ratio at birth: Was Fisher wrong about sex ratio evolution?"

    Files and variables

    File: Human_sex_ratio_scrit.zip

    Description: Scripts for the project. For a description of each script file, see README.md in the zip file or https://github.com/song88180/Human_sex_ratio

    File: GWAS_OSR_cov_logistic.tsv

    Description: GWAS summary statistics of offspring sex ratio. A cell containing "NA" means the value is not available.

    Variables
    • CHROM: chromosome number
    • POS: SNP position (GRCh37)
    • ID: rsid of the SNP
    • REF: reference allele
    • ALT: alternative allele
    • P: P-value
    • t...
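
A minimal sketch of reading a summary statistics table with the columns listed above, using only the Python standard library. The sample rows and rsids are invented for illustration, the variable list is truncated in the source so only the columns shown are used, and the 5e-8 cut-off is the conventional genome-wide significance threshold rather than a value stated for this dataset.

```python
import csv
import io

# Invented rows in the documented column layout; "NA" marks a missing value.
SAMPLE = (
    "CHROM\tPOS\tID\tREF\tALT\tP\n"
    "1\t12345\trs0000001\tA\tG\t0.52\n"
    "2\t67890\trs0000002\tC\tT\tNA\n"
    "3\t13579\trs0000003\tG\tA\t3e-9\n"
)


def read_sumstats(handle):
    """Parse tab-separated summary statistics, mapping "NA" P-values to None."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rec["POS"] = int(rec["POS"])
        rec["P"] = None if rec["P"] == "NA" else float(rec["P"])
        rows.append(rec)
    return rows


rows = read_sumstats(io.StringIO(SAMPLE))

# Assumed threshold: conventional genome-wide significance.
hits = [r["ID"] for r in rows if r["P"] is not None and r["P"] < 5e-8]
```

To read the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open handle on GWAS_OSR_cov_logistic.tsv.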
  20. Data files for the manuscript entitled, "Single-cell DNA methylome and 3D...

    • zenodo.org
    txt, zip
    Updated Jun 11, 2025
    Cite
    Zeyuan Johnson Chen; Sankha Subhra Das; Asha Kar; Seung Hyuk Tony Lee; Kevin Abuhanna; Marcus Alvarez; Mihir Sukhatme; Zitian Wang; Kyla Gelev; Sandhya Rajkumar; Matthew Heffel; Yi Zhang; Oren Avram; Elior Rahmani; Sriram Sankararaman; Sini Heinonen; Peltoniemi Hilkka; Eran Halperin; Kirsi Pietiläinen; Chongyuan Luo; Paivi Pajukanta (2025). Data files for the manuscript entitled, "Single-cell DNA methylome and 3D genome atlas of human subcutaneous adipose tissue." [Dataset]. http://doi.org/10.5281/zenodo.15318595
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zeyuan Johnson Chen; Sankha Subhra Das; Asha Kar; Seung Hyuk Tony Lee; Kevin Abuhanna; Marcus Alvarez; Mihir Sukhatme; Zitian Wang; Kyla Gelev; Sandhya Rajkumar; Matthew Heffel; Yi Zhang; Oren Avram; Elior Rahmani; Sriram Sankararaman; Sini Heinonen; Peltoniemi Hilkka; Eran Halperin; Kirsi Pietiläinen; Chongyuan Luo; Paivi Pajukanta
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    These are the genome-wide association study (GWAS) summary statistics from the UK Biobank and the Source Data files for our paper: Chen ZJ, Das SS, Kar A, Lee SHT, Abuhanna KD, Alvarez M, Sukhatme MG, Wang Z, Gelev KZ, Heffel MG, Zhang Y, Avram O, Rahmani E, Sankararaman S, Heinonen S, Peltoniemi H, Halperin E, Pietiläinen KH, Luo C, Pajukanta P. Single-cell DNA methylome and 3D genome atlas of human subcutaneous adipose tissue.
    Further details of these analyses can be found in the Methods and Results sections of the paper.

    Repository contents

    GWAS summary statistics in the UK Biobank for C-reactive protein (CRP), body mass index (BMI), metabolic-dysfunction associated steatotic liver disease (MASLD), and waist-to-hip ratio adjusted for BMI (WHRadjBMI):

    • GWAS.zip

    Figure source data:

    • Figure2.zip
    • Figure3.zip
    • Figure4.zip
    • Figure5.zip
    • Figure6.zip
    • ExtendedDataFigure1.zip
    • ExtendedDataFigure2.zip
    • ExtendedDataFigure3.zip
    • ExtendedDataFigure4.zip
    • ExtendedDataFigure5.zip
    • ExtendedDataFigure6.zip
    • ExtendedDataFigure7.zip
    • ExtendedDataFigure8.zip
    • ExtendedDataFigure9.zip
    • ExtendedDataFigure10.zip
    • SupplementaryFigure1.zip
    • SupplementaryFigure2.zip
    • SupplementaryFigure3.zip
    • SupplementaryFigure4.zip
    • SupplementaryFigure5.zip
    • SupplementaryFigure6.zip
    • SupplementaryFigure7.zip
Cite
Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170

Synthetic datasets of the UK Biobank cohort

Explore at:
Available download formats: bin, csv, zip, pdf
Dataset updated
Sep 17, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Antonio Gasparrini; Jacopo Vanoli
License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

  • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
  • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

Content

The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

  • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
  • synthbdbasevar: baseline variables, mostly collected at recruitment.
  • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
  • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

In addition, this repository provides the following files:

  • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
  • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
  • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

Generation of the synthetic data

The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
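
The second step can be illustrated with a small Python sketch of simulating yearly death events under a proportional hazards assumption. This is not the authors' code (their R scripts are in the GitHub repo): the constant baseline hazard, the conversion of a yearly hazard into a probability, the coefficient values and all names below are assumptions made for illustration.

```python
import math
import random


def yearly_death_prob(baseline_hazard, log_hr, covariates):
    """Probability of dying within one year under a proportional hazards form.

    hazard = baseline_hazard * exp(sum of log hazard ratio * covariate value),
    converted to a one-year probability as 1 - exp(-hazard).
    """
    linear_predictor = sum(log_hr[name] * covariates[name] for name in log_hr)
    hazard = baseline_hazard * math.exp(linear_predictor)
    return 1.0 - math.exp(-hazard)


def simulate_death_year(years, baseline_hazard, log_hr, covariates_by_year, rng):
    """Return the first year in which a simulated death occurs, or None if alive."""
    for year in years:
        p = yearly_death_prob(baseline_hazard, log_hr, covariates_by_year[year])
        if rng.random() < p:
            return year
    return None


# Hypothetical coefficients and exposures, for illustration only.
log_hr = {"pm25": 0.05}
exposure = {2010: {"pm25": 10.0}, 2011: {"pm25": 12.0}, 2012: {"pm25": 9.0}}
rng = random.Random(1)
death_year = simulate_death_year([2010, 2011, 2012], 0.005, log_hr, exposure, rng)
```

Because the exposure enters through the linear predictor, a year with higher PM2.5 gets a higher simulated death probability, which is how the exposure-mortality association would carry over into synthetic outcomes in a scheme of this kind.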

This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
