7 datasets found
  1. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Feb 6, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License
    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The original datasets are described in the article by Vanoli et al in Epidemiology (2024) (DOI: 10.1097/EDE.0000000000001796) [freely available here], which also provides information about the data sources.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    The repository also provides the following files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data including the annual PM2.5 levels in a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables as well as the mortality risks resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
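
    The second generation step described above (using fitted hazard relationships to simulate death events in each year) can be illustrated with a minimal Python sketch. This is not the authors' R code: the constant yearly baseline hazard, the single PM2.5 coefficient, and all numeric values are hypothetical assumptions chosen only to show the mechanics of a year-by-year simulation.

```python
import math
import random


def simulate_death_year(pm25_by_year, baseline_hazard, log_hr_per_unit, rng):
    """Return the first year in which a simulated death occurs, or None.

    pm25_by_year: dict mapping calendar year -> annual PM2.5 exposure.
    baseline_hazard: assumed constant yearly baseline hazard (hypothetical).
    log_hr_per_unit: assumed log hazard ratio per unit of PM2.5 (hypothetical).
    rng: a random.Random instance, so that runs are reproducible.
    """
    for year in sorted(pm25_by_year):
        # Proportional-hazards form: baseline hazard scaled by exp(beta * x).
        hazard = baseline_hazard * math.exp(log_hr_per_unit * pm25_by_year[year])
        # Probability of dying within the year under a constant hazard.
        p_death = 1.0 - math.exp(-hazard)
        if rng.random() < p_death:
            return year
    return None  # survived the whole follow-up


# Illustrative values only; none of these numbers come from the source.
exposure = {2010: 10.2, 2011: 9.8, 2012: 9.5}
death_year = simulate_death_year(
    exposure, baseline_hazard=0.005, log_hr_per_unit=0.01, rng=random.Random(42)
)
```

    A real simulation would use the full set of predictors and the baseline hazard estimated from the Cox model, as described in the article.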

  2. GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank...

    • zenodo.org
    application/gzip
    Updated Jan 7, 2025
    Cite
    Shadi Zabad (2025). GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank (5-fold cross-validation) [Dataset]. http://doi.org/10.5281/zenodo.14612130
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shadi Zabad
    License
    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank.

    The phenotypes are:

    • HEIGHT: Standing height (Data-Field: 50)
    • BMI: Body mass index (Data-Field: 21001)
    • WC: Waist circumference (Data-Field: 48)
    • HC: Hip circumference (Data-Field: 49)
    • BW: Birth weight (Data-Field: 20022)
    • FVC: Forced vital capacity (Data-Field: 3062)
    • FEV1: Forced expiratory volume in 1-second (Data-Field: 3063)
    • HDL: HDL cholesterol (Data-Field: 30760)
    • LDL: LDL cholesterol (Data-Field: 30780)

    The GWAS study used data from "White British" samples (N = 337225), which were randomly divided into 5 folds for the purposes of cross-validation. The upload contains, for each fold, GWAS summary statistics for the training, validation, and test set. The validation summary statistics can be used for model selection/tuning. The test summary statistics can be used to evaluate PRS models via pseudo-validation methods. Association testing was done with plink2.

    The structure of the data is as follows:

    • train
      • fold_1
        • chr_1.PHENO1.glm.linear
        • chr_2.PHENO1.glm.linear
        • ...
      • fold_2
      • fold_3
      • ...
    • validation
      • fold_1
      • fold_2
      • fold_3
      • ...
    • test
      • fold_1
      • fold_2
      • fold_3
      • ...
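
    The directory layout above can be turned into file paths programmatically. The sketch below is an illustration only: the root folder name and the assumption that each fold holds one file per autosome (chromosomes 1 to 22) are not stated in the source, and the listing does not show how the 9 phenotypes are separated within the archive.

```python
from pathlib import PurePosixPath

SPLITS = ("train", "validation", "test")
N_FOLDS = 5
CHROMOSOMES = range(1, 23)  # assumption: autosomes 1-22 only


def sumstats_paths(root, split, fold):
    """List the per-chromosome summary-statistics paths for one split and fold."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if not 1 <= fold <= N_FOLDS:
        raise ValueError(f"fold must be between 1 and {N_FOLDS}")
    fold_dir = PurePosixPath(root) / split / f"fold_{fold}"
    # File names follow the pattern shown in the listing above.
    return [fold_dir / f"chr_{c}.PHENO1.glm.linear" for c in CHROMOSOMES]


# "sumstats" is a hypothetical extraction directory.
paths = sumstats_paths("sumstats", "train", 1)
```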

    For more details about the GWAS study, Quality Control (QC) criteria, or other information, please consult our publication:

    Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009

    If you use this data in your work, please cite the publication above.

  3. Data from: Brain Ages Derived from Different MRI Modalities are Associated...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Apr 24, 2025
    Cite
    Andrei-Claudiu Roibu; Stanislaw Adaszewski; Torsten Schindler; Stephen M. Smith; Ana I.L. Namburete; Frederik J. Lange (2025). Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes [Dataset]. http://doi.org/10.5281/zenodo.8110876
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrei-Claudiu Roibu; Stanislaw Adaszewski; Torsten Schindler; Stephen M. Smith; Ana I.L. Namburete; Frederik J. Lange
    License
    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Brain ageing is a highly variable, spatially and temporally heterogeneous process, marked by numerous structural and functional changes. These can cause discrepancies between individuals’ chronological age and the apparent age of their brain, as inferred from neuroimaging data. Machine learning models, and particularly Convolutional Neural Networks (CNNs), have proven adept in capturing patterns relating to ageing-induced changes in the brain. The differences between the predicted and chronological ages, referred to as brain age deltas, have emerged as useful biomarkers for exploring those factors which promote accelerated ageing or resilience, such as pathologies or lifestyle factors. However, previous studies rely only on structural neuroimaging for predictions, overlooking potentially informative functional and microstructural changes. Here we show that multiple contrasts derived from different MRI modalities can predict brain age, each encoding bespoke brain ageing information. By using 3D CNNs and UK Biobank data, we found that 57 contrasts derived from structural, susceptibility-weighted, diffusion, and functional MRI can successfully predict brain age. For each contrast, different patterns of association with non-imaging phenotypes were found, resulting in a total of 191 unique, statistically significant associations. Furthermore, we found that ensembling data from multiple contrasts results in both higher prediction accuracies and stronger correlations to non-imaging measurements. Our results demonstrate that other 3D contrasts and modalities, which have not been considered so far for the task of brain age prediction, encode different information about the ageing brain. We envision our work as being the starting point for future investigations into the causal links underpinning the observed brain age deltas and non-imaging measurement associations. For instance, drug effects can be monitored, given that certain medications correlated with accelerated brain ageing. Furthermore, continued development of brain age models could facilitate their deployment in clinical trials for recruitment and monitoring, and hospitals for diagnostic and screening tasks.

    Data Description

    This dataset contains the full correlation results with all nIDPs in the UK Biobank. These are presented in datasets split by sex into Female and Male subjects. For easier data manipulation, two smaller datasets have also been made available, containing only those correlations that pass the False Discovery Rate (FDR) threshold.

    As experiments were also conducted for ensembles using multiple contrasts, similar datasets are provided for those.

    Finally, global datasets are also provided. These are the concatenation of the associations contained in the Male and Female datasets.
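
    The description above mentions filtering correlations by an FDR threshold but does not say which procedure was used. As an illustration only, the sketch below applies the Benjamini-Hochberg step-up procedure, which is a common choice; both the procedure and the 0.05 level are assumptions, not details taken from the source.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a p-value passes BH at level alpha."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Every p-value ranked at or below max_rank is declared significant.
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            passed[idx] = True
    return passed


# Made-up p-values for illustration.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```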

    Paper & Code

    The original paper for this article can be accessed here:

    To access the codes relevant for this project, please access the project GitHub Repos:

    If you use this work, please cite the paper above, or use the following BibTeX entry:

    @inproceedings{roibu2023brain,
     title={Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes},
     author={Roibu, Andrei-Claudiu and Adaszewski, Stanislaw and Schindler, Torsten and Smith, Stephen M and Namburete, Ana IL and Lange, Frederik J},
     booktitle={2023 10th IEEE Swiss Conference on Data Science (SDS)},
     pages={17--25},
     year={2023},
     organization={IEEE},
     doi={10.1109/SDS57534.2023.00010}
    }

    Data Access

    The data for this project is freely available upon application at the UK Biobank. For more information regarding the individual nIDPs, please access the UK Biobank Showcase website at: https://biobank.ctsu.ox.ac.uk/showcase/search.cgi

    Funding

    ACR is supported by EPSRC Grant EP/S024093/1, F. Hoffmann-La Roche AG and a 2021 Industrial Fellowship offered by the Royal Commission for the Exhibition of 1851. SMS is supported by a Wellcome Trust Collaborative Award 215573/Z/19/Z. AILN is grateful for support from the Academy of Medical Sciences under the Springboard Awards scheme (SBF005/1136), and the Bill and Melinda Gates Foundation. FJL is supported by a Wellcome Trust Collaborative Award (215573/Z/19/Z). The WIN is supported by core funding from the Wellcome Trust (203139/Z/16/Z). The computational aspects were supported by the Wellcome Trust (203141/Z/16/Z) and the NIHR Oxford BRC. Corresponding authors: ACR (andreiroibu@icloud.com), SA (stanislaw.adaszewski@roche.com) and AILN (ana.namburete@cs.ox.ac.uk).

  4. Data from: Participant characteristics.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Thomas J MacGillivray; James R. Cameron; Qiuli Zhang; Ahmed El-Medany; Carl Mulholland; Ziyan Sheng; Bal Dhillon; Fergus N. Doubal; Paul J. Foster; Emmanuel Trucco; Cathie Sudlow (2023). Participant characteristics. [Dataset]. http://doi.org/10.1371/journal.pone.0127914.t001
    Dataset provided by
    PLOS ONE
    Authors
    Thomas J MacGillivray; James R. Cameron; Qiuli Zhang; Ahmed El-Medany; Carl Mulholland; Ziyan Sheng; Bal Dhillon; Fergus N. Doubal; Paul J. Foster; Emmanuel Trucco; Cathie Sudlow
    License
    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    *based on the UK Biobank Data Showcase [11]. Participant characteristics.

  5. Supplementary datasets for: Polymorphic short tandem repeats make widespread...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Nov 13, 2023
    Cite
    Jonathan Margoliash; Shai Fuchs; Yang Li; Xuan Zhang; Arya Massarat; Alon Goren; Melissa Gymrek (2023). Supplementary datasets for: Polymorphic short tandem repeats make widespread contributions to blood and serum traits [Dataset]. http://doi.org/10.5061/dryad.z612jm6jk
    Dataset provided by
    Sheba Medical Center
    University of California, San Diego
    Authors
    Jonathan Margoliash; Shai Fuchs; Yang Li; Xuan Zhang; Arya Massarat; Alon Goren; Melissa Gymrek
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Short tandem repeats (STRs) are genomic regions consisting of repeated sequences of 1-6bp in succession. Single nucleotide polymorphism (SNP) based genome-wide association studies (GWAS) do not fully capture STR effects. To study these effects, we imputed 445,720 STRs into genotype arrays from 408,153 White British UK Biobank participants and tested for association with 44 blood phenotypes. Using two fine-mapping methods, we identify 119 candidate causal STR-trait associations and estimate that STRs account for 5.2–7.6% of causal variants identifiable from GWAS for these traits. These are among the strongest associations for multiple phenotypes, including a coding CTG repeat associated with apolipoprotein B levels, a promoter CGG repeat with platelet traits and an intronic poly-A repeat with mean platelet volume. Our study suggests that STRs make widespread contributions to complex traits, provides stringently selected candidate causal STRs, and demonstrates the need to consider a more complete view of genetic variation in GWAS.

    Methods

    Please see the Cell Genomics article or bioRxiv preprint for detailed methods. Alternatively, the relevant portion of the methods from the paper has been pasted below, but lacking references and with some distortions due to the difficulty of copying mathematical formulae. For references to tables, figures, notes and the key resources table, please refer to the paper.

    Selection of UK Biobank participants

    We downloaded the fam file and sample file for version 2 of the phased SNP array data (referred to in the UKB documentation as the ‘haplotype’ dataset) using the ukbgene utility (ver Jan 28 2019 14:09:15 - using Glibc2.28(stable)) described in UKB Data Showcase Resource ID 664 (see Key Resources Table).
The IDs from the sample file already excluded 968 individuals previously identified as having excessive principal component-adjusted SNP array heterozygosity or excessive SNP array missingness after call-level filtering indicating potential DNA contamination. We further removed withdrawn participants, indicated by non-positive IDs in the sample file as well as by IDs in email communications from the UKB access management team. After the additional filtering, data for 487,279 individuals remained. We downloaded the sample quality control (QC) file (described in the sample QC section of UKB Data Showcase Resource ID 531 (see Key Resources Table)) from the European Genome-Phenome Archive (accession EGAF00001844707) using pyEGA3. We subsetted the non-withdrawn individuals above to the 408,870 (83.91%) participants identified as White-British by column in.white.British.ancestry.subset of the sample QC file. This field was computed by the UKB team to only include individuals whose self-reported ethnic background was White British and whose genetic principal components were not outliers compared to the other individuals in that group. 
    In concordance with previous analyses of this cohort we additionally removed data for:

    • 2 individuals with an excessive number of inferred relatives, removed due to plausible SNP array contamination (participants listed in sample QC file column excluded.from.kinship.inference that had not already been removed by the UKB team prior to phasing)
    • 308 individuals whose self-reported sex did not match the genetically inferred sex, removed due to concern for sample mislabeling (participants where sample QC file columns Submitted.Gender and Inferred.Gender did not match)
    • 407 additional individuals with putative sex chromosome aneuploidies removed as their genetic signals might differ significantly from the rest of the population (listed in sample QC file column putative.sex.chromosome.aneuploidy)

    Following these additional filters the data for 408,153 individuals remained (99.82% of the White British individuals considered above).

    SNP and indel dataset preprocessing

    We obtained both phased hard-called and imputed SNP and short indel genotypes made available by the UKB. These variants were provided in reference genome hg19 coordinates, and all analyses in this study, unless otherwise specified, were performed with hg19 coordinates.

    Phased hard-called genotypes: We downloaded the bgen files containing the hard-called SNP and indel haplotypes (release version 2) and the corresponding sample and fam files using the ukbgene utility (UKB Data Showcase Resource 664 (see Key Resources Table)). These variants had been genotyped using microarrays and phased using SHAPEIT3 with the 1000 genomes phase 3 reference panel. Variants genotyped on the microarray were excluded from phasing and downstream analyses if they failed QC on more than one microarray genotyping batch, had overall call-missingness rate greater than 5% or had minor allele frequency less than 0.01%.
    Of the resulting 658,720 variants, 99.5% were single nucleotide variants, 0.2% were short indels (average length 1.9bp, maximal length 26bp), and 0.2% were short deletions (average length 1.9bp, maximal length 29bp).

    Imputed genotypes: We similarly downloaded imputed SNP data using the ukbgene utility (release version 3). Variants had been imputed with IMPUTE4 using the Haplotype Reference Consortium panel, with additional variants from the UK10K and 1000 Genomes phase 3 reference panels. The resulting imputed variants contain 93,095,623 variants, consisting of 96.0% single nucleotide variants, 1.3% short insertions (average length 2.5bp, maximum length 661bp), and 2.6% short deletions (average length 3.1bp, maximum length 129bp). This set does not include the 11 classic human leukocyte antigen alleles imputed separately. We used bgen-reader 4.0.8 to access the downloaded bgen files in python. We used plink2 v2.00a3LM (build AVX2 Intel 28 Oct 2020) to convert bgen files from both hard-called and imputed SNPs to the plink2 format for downstream analyses. For hard-called genotypes, we used plink to set the first allele to match the hg19 reference genome. Imputed genotypes already matched the reference. Unless otherwise noted, our pipeline worked with imputed genotypes as non-reference allele dosages, i.e. for each individual.

    STR imputation

    We previously published a reference panel containing phased haplotypes of SNP variants alongside 445,720 autosomal STR variants in 2,504 individuals from the 1000 Genomes Project (see Key Resources Table). This panel focuses on STRs ascertained to be highly polymorphic and well-imputed in European individuals. Notably, this excludes many STRs known to be implicated in repeat expansion diseases, STRs that are primarily polymorphic only in non-European populations, or STRs that are too mutable to be in strong linkage disequilibrium (LD) with nearby SNPs.
The IDs listed in the ‘str’ column of Supplemental Table 2 at that URL describe which variants in the reference panel are STRs and which are other types of variants. That produces a list of 445,715 unique variant IDs and 5 IDs which are each assigned to four separate variants in the reference panel VCFs. For the IDs with multiple assignments, we selected the variant that appeared first in the VCF and discarded the others, leaving 445,720 unique STR variants each with unique IDs. While our analyses with these STRs were performed using hg19 coordinates unless otherwise stated, we also provide hg38 reference coordinates for these STRs in the supplemental tables. We obtained those coordinates using LiftOver which resulted in identical coordinates as in HipSTR’s hg38 STR reference panel (see Key Resources Table). All STRs successfully lifted over to hg38 coordinates. To select shared variants for imputation, we note that 641,582 (97.4%) of SNP and indel variants that were hard-called and phased in the UKB participants were present in our SNP-STR reference panel. As a quality control step, we filtered variants that had highly discordant minor allele frequencies between the 1000 Genomes European subpopulations (see Key Resources Table) and White British individuals from the UKB. We first took a maximal unrelated set of the White British individuals (see Phenotype Methods below) and then visually inspected the alternate allele frequency of the overlapping variants (Figure S1) and chose to remove the 110 variants with an alternate allele frequency difference of more than 12%. We used Beagle v5.1 (build 25Nov19.28d) with the tool’s provided human genetic maps (see Key Resources Table) and non-default flag ap=true to impute STRs into the remaining 641,472 SNPs and indels from the SNP-STR panel into the hard-called SNP haplotypes. 
Though we performed the above comparison between reference panel Europeans and UKB White British individuals, we performed this STR imputation into all UKB participants using all the individuals in the reference panel. We chose Beagle because it can handle multi-allelic loci. Due to computational constraints, we ran Beagle per chromosome on batches of 1000 participants at a time with roughly 18GB of memory. We merged the resulting VCFs across batches and extracted only the STR variants. Lastly, we added back the INFO fields present in the SNP-STR reference panel that Beagle removed during imputation.

    Estimated allele frequencies (Figure 1b) were computed as follows: for each allele length for each STR, we summed the imputed probability of the STR on that chromosome to have that length over both chromosomes of all unrelated participants. That sum is divided by the total number of chromosomes considered to obtain the estimated frequency of each allele.

    Inferring repeat units

    Each STR in the SNP-STR reference panel was previously annotated with a repeat period - the length of its repeat unit - but not the repeat unit itself. We inferred the repeat unit of each STR in the panel as follows: we considered the STR’s reference allele and given period. We then took each k-mer in the reference allele where k is the repeat period, standardized those k-mers, and took their counts. We
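
    The estimated allele frequency computation described earlier in this section (summing imputed per-chromosome probabilities for each allele length and dividing by the number of chromosomes) can be sketched as follows. The input representation, one dictionary of length probabilities per chromosome copy, is a hypothetical choice made for illustration and is not the authors' data format.

```python
from collections import defaultdict


def estimated_allele_frequencies(chromosome_probs):
    """Estimate allele-length frequencies for one STR.

    chromosome_probs: list with one dict per chromosome copy (two per
    participant), mapping allele length -> imputed probability.
    """
    if not chromosome_probs:
        raise ValueError("at least one chromosome is required")
    totals = defaultdict(float)
    for probs in chromosome_probs:
        for length, p in probs.items():
            # Sum the imputed probability of each length across chromosomes.
            totals[length] += p
    n_chromosomes = len(chromosome_probs)
    # Divide by the total number of chromosomes considered.
    return {length: total / n_chromosomes for length, total in totals.items()}


# Two hypothetical participants, hence four chromosome copies.
freqs = estimated_allele_frequencies(
    [{10: 1.0}, {10: 0.5, 11: 0.5}, {11: 1.0}, {10: 0.25, 12: 0.75}]
)
```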

  6. GWAS summary statistics for Standing Height from the UK Biobank (5-fold...

    • zenodo.org
    application/gzip
    Updated Dec 3, 2024
    Cite
    Shadi Zabad (2024). GWAS summary statistics for Standing Height from the UK Biobank (5-fold cross-validation) [Dataset]. http://doi.org/10.5281/zenodo.14270953
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shadi Zabad
    License
    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains GWAS summary statistics for Standing Height in the UK Biobank.

    The GWAS study used data from "White British" samples (N = 337225), which were randomly divided into 5 folds for the purposes of cross-validation. The upload contains, for each fold, GWAS summary statistics for the training and test set. The test summary statistics can be used to evaluate PRS models via pseudo-validation methods. Association testing was done with plink2.

    The structure of the data is as follows:

    • train
      • fold_1
        • chr_1.PHENO1.glm.linear
        • chr_2.PHENO1.glm.linear
        • ...
      • fold_2
      • fold_3
      • ...
    • test
      • fold_1
      • fold_2
      • fold_3
      • ...
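
    Once extracted, a single summary-statistics file can be parsed with the Python standard library. The sketch below assumes a tab-delimited file with a header row containing the columns #CHROM, POS, ID, A1, BETA, SE and P, which is typical of plink2 linear regression output; the exact column set depends on the plink2 options used and is not documented in this listing, and the two data rows are made up.

```python
import csv
import io


def read_glm_linear(handle):
    """Parse rows of a plink2-style .glm.linear table from an open text handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        # Skip variants for which no effect estimate was reported.
        if row["BETA"] == "NA":
            continue
        rows.append(
            {
                "chrom": row["#CHROM"],
                "pos": int(row["POS"]),
                "id": row["ID"],
                "a1": row["A1"],
                "beta": float(row["BETA"]),
                "se": float(row["SE"]),
                "p": float(row["P"]),
            }
        )
    return rows


# Made-up example content; real files would be opened with open(path, newline="").
example = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tA1\tBETA\tSE\tP\n"
    "1\t1000\trs1\tA\tG\tG\t0.02\t0.01\t0.045\n"
    "1\t2000\trs2\tC\tT\tT\tNA\tNA\tNA\n"
)
records = read_glm_linear(example)
```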

    For more details about the GWAS study, Quality Control (QC) criteria, or other information, please consult our publication:

    Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009

    If you use this data in your work, please cite the publication above.

  7. CINECA_synthetic_cohort_EUROPE_UK1 referencing fake samples

    • ega-archive.org
    Cite
    CINECA_synthetic_cohort_EUROPE_UK1 referencing fake samples [Dataset]. https://ega-archive.org/datasets/EGAD00001006673
    License

    https://ega-archive.org/dacs/EGAC00001000514

    Area covered
    Europe
    Description

    Please note: This synthetic data set (with cohort “participants” / ”subjects” marked with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or results. The purpose of this dataset is to aid development of technical implementations for cohort data discovery, harmonization, access, and federated analysis. In support of FAIRness in data sharing, this dataset is made freely available under the Creative Commons Licence (CC-BY). Please ensure this preamble is included with this dataset and that the CINECA project (funding: EC H2020 grant 825775) is acknowledged. For any questions please contact isuru@ebi.ac.uk or cthomas@ebi.ac.uk

    This dataset (CINECA_synthetic_cohort_EUROPE_UK1) consists of 2521 samples which have genetic data based on 1000 Genomes data (https://www.nature.com/articles/nature15393), and synthetic subject attributes and phenotypic data derived from UKBiobank (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001779). These data were initially derived using the TOFU tool (https://github.com/spiros/tofu), which generates random values based on the UKBiobank data dictionary. Categorical values were randomly generated based on the data dictionary, continuous variables were generated based on the distribution of values reported by the UK Biobank showcase, and date / time values were random. Additionally we split the phenotypes and attributes into 4 main classes - general, cancer, diabetes mellitus, and cardiac. We assigned the general attributes to all the samples, and the cardiac / diabetes mellitus / cancer attributes to a proportion of the total samples. Once the initial set of phenotypes and attributes was generated, the data were checked for consistency and where possible dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight data, and age at death was derived from date of death and date of birth. These data were then loaded to the development instance of Biosamples (https://www.ebi.ac.uk/biosamples/) which accessioned each of the samples. The genetic data are derived from the 1000 Genomes Phase 3 release (https://www.internationalgenome.org/category/phase-3/). The genotype data consist of a single joint-call vcf file with called genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated via plink for these samples and genotypes. The genotype data have had a variety of errors introduced to mimic real data and as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling and low call rates for a randomly chosen subset of sample data as well as deviations from Hardy Weinberg equilibrium and low call rates for a random selection of variants. Additionally 40 samples have raw genetic data available in the form of both bam and cram files, including unmapped data. The gender of the samples in the 1000 genomes data has been matched to the synthetic phenotypic data generated for these samples. The genetic data was then linked to the synthetic data in BioSamples, and submitted to EGA.
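
    The dependent attributes mentioned above (BMI from height and weight, age at death from the two dates) can be derived as in the short sketch below. The units (kilograms and centimetres) and the convention of reporting age in completed years are assumptions made for illustration; the source does not state them.

```python
from datetime import date


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2, assuming weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def age_at_death(date_of_birth, date_of_death):
    """Age at death in completed years."""
    years = date_of_death.year - date_of_birth.year
    # Subtract one year if the birthday had not yet been reached in the final year.
    if (date_of_death.month, date_of_death.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Made-up values for a single synthetic participant.
example_bmi = round(bmi(70.0, 175.0), 1)
example_age = age_at_death(date(1950, 6, 15), date(2020, 6, 14))
```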
