95 datasets found
  1. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Sep 17, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

    • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available, with code provided in a related GitHub repo]
    • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available, with code provided in a related GitHub repo]

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used to test, or make inferences about, specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The synthetic datasets (each stored in two versions, in csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides the following files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
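
The second step of the generation process described above (simulating yearly death events from estimated risk relationships) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names, the baseline hazard, the log hazard ratio and the exposure values are all hypothetical, and the sketch assumes a constant hazard within each year.

```python
import math
import random


def annual_death_prob(baseline_hazard, log_hr_per_unit, exposure):
    """Probability of dying within one year under proportional hazards.

    Assumes the yearly hazard is baseline_hazard * exp(log_hr_per_unit * exposure)
    and is constant within the year, so P(death) = 1 - exp(-hazard).
    """
    hazard = baseline_hazard * math.exp(log_hr_per_unit * exposure)
    return 1.0 - math.exp(-hazard)


def simulate_death_year(exposures_by_year, baseline_hazard, log_hr_per_unit, rng):
    """Return the index of the year of death, or None if the subject survives."""
    for year, exposure in enumerate(exposures_by_year):
        p = annual_death_prob(baseline_hazard, log_hr_per_unit, exposure)
        if rng.random() < p:
            return year
    return None


# Hypothetical inputs: five years of annual PM2.5 exposure for one subject.
rng = random.Random(2025)
pm25 = [10.2, 9.8, 9.5, 9.1, 8.7]
death_year = simulate_death_year(
    pm25, baseline_hazard=0.005, log_hr_per_unit=0.01, rng=rng
)
```

Passing an explicit `random.Random` instance keeps the simulation reproducible, which matters when synthetic data are regenerated for code testing.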

  2. Data from: Accuracy of identifying incident stroke cases from linked...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Jun 21, 2025
    Cite
    Kristiina Rannikmae (2025). Accuracy of identifying incident stroke cases from linked healthcare data in UK Biobank [Dataset]. http://doi.org/10.5061/dryad.w9ghx3fk0
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kristiina Rannikmae
    Time period covered
    Jan 1, 2019
    Description

    Objective: In UK Biobank (UKB), a large population-based prospective study, cases of many diseases are ascertained through linkage to routinely collected, coded national health datasets. We assessed the accuracy of these for identifying incident strokes.

    Methods: In a regional UKB sub-population (n=17,249), we identified all participants with ≥1 code signifying a first stroke after recruitment (incident stroke-coded cases) in linked hospital admission, primary care or death record data. Stroke physicians reviewed their full electronic patient records (EPRs) and generated reference standard diagnoses. We evaluated the number and proportion of cases that were true positives (i.e. positive predictive value, PPV) for all codes combined and by code source and type.

    Results: Of 232 incident stroke-coded cases, 97% had EPR information available. Data sources were: 30% hospital admission only; 39% primary care only; 28% hospital and primary care; 3% death records only. While 42% of cases ...
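
The positive predictive value described in the Methods above is the share of coded cases confirmed by the reference standard. A minimal Python sketch follows; the counts are hypothetical placeholders rather than the study's results, and the Wilson score interval is one common choice of confidence interval, not necessarily the one the authors used.

```python
import math


def ppv(true_positives, false_positives):
    """Positive predictive value: confirmed cases / all coded cases."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no coded cases to evaluate")
    return true_positives / total


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 180 coded cases confirmed, 45 not confirmed.
tp, fp = 180, 45
estimate = ppv(tp, fp)
low, high = wilson_interval(tp, tp + fp)
```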

  3. Data from: UK Biobank release and systematic evaluation of optimised...

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Apr 18, 2023
    Cite
    Thompson, Deborah; Wells, Daniel; Selzam, Saskia; Peneva, Iliana; Moore, Rachel; Sharp, Kevin; Tarran, Will; Beard, Ed; Riveros-Mckay, Fernando; Palmer, Duncan; Seth, Priyanka; Harrison, James; Futema, Marta; Genomics England Research Consortium; McVean, Gil; Plagnol, Vincent; Donnelly, Peter; Weale, Michael (2023). UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6631951
    Dataset provided by
    Genomics Ltd
    Cardiology Research Centre, Molecular and Clinical Sciences Research Institute, St George's University of London, London, UK
    Authors
    Thompson, Deborah; Wells, Daniel; Selzam, Saskia; Peneva, Iliana; Moore, Rachel; Sharp, Kevin; Tarran, Will; Beard, Ed; Riveros-Mckay, Fernando; Palmer, Duncan; Seth, Priyanka; Harrison, James; Futema, Marta; Genomics England Research Consortium; McVean, Gil; Plagnol, Vincent; Donnelly, Peter; Weale, Michael
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary-level GWAS data for 53 traits generated by Genomics plc as presented in:

    Thompson D. et al. UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits (https://doi.org/10.1101/2022.06.16.22276246)

    If you have any questions or comments regarding these files, please contact Genomics plc at research@genomicsplc.com

    NOTES

    These analyses were carried out using the full UK Biobank (UKB) imputation data release (v3b). After removal of exclusions and withdrawals, a subset of 337,151 UKB individuals, the White British Unrelated (WBU) subgroup, was defined as the intersection of two sample groups created by Bycroft et al 2018 (Nature 562, 203-209): the ‘White British ancestry’ group (UKB Data Field 22006) and the ‘used in genetic principal components’ group (UKB Data Field 22020), the latter being high quality samples that were filtered to avoid closely related individuals. All GWAS analyses were performed on the WBU subgroup.

    Phenotypes were defined as described in Supplementary Table 1 ‘Phenotype definitions’ using a combination of Hospital Episode Statistics, Cancer Registry reports (where applicable) and self-report responses.

    All analyses included age at assessment, sex (for non-sex-specific traits), genotyping chip, and 10 principal components as covariates.

    GWAS summary statistics for each trait were generated by applying PLINK 2.0 to the WBU subgroup, using logistic regression for disease traits and linear regression for quantitative traits. For chromosome X variants, males were treated as having 0 or 2 alternative alleles.

    The results are not adjusted for genomic control.

    DATA FILE CONTENT DESCRIPTION (DISEASE TRAITS)

    • cpra: Variant ID in ‘CPRA’ format. Position reflects position in b37
    • chrom: Chromosome
    • pos: Position in base pairs (b37, 1-based)
    • alt: Alternative allele (effect allele)
    • beta: Effect size (log odds ratio)
    • standard_error: Standard error of beta
    • minus_log10_p: Minus log(base 10) of P-value
    • ref: Reference allele (non-effect allele)
    • ncase: Number of cases
    • ncontrol: Number of controls
    

    DATA FILE CONTENT DESCRIPTION (QUANTITATIVE TRAITS)

    • cpra: Variant ID in ‘CPRA’ format. Position reflects position in b37
    • chrom: Chromosome
    • pos: Position in base pairs (b37, 1-based)
    • alt: Alternative allele (effect allele)
    • beta: Effect size (log odds ratio)
    • standard_error: Standard error of beta
    • minus_log10_p: Minus log(base 10) of P-value
    • ref: Reference allele (non-effect allele)
    • ntotal: Total sample size
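
The columns listed above can be turned into more familiar quantities with a few lines of Python. The sketch below is illustrative only: the tab delimiter, the two sample rows and all their values are assumptions made for the example, not taken from the released files. The odds-ratio step applies to the disease-trait files, where beta is reported as a log odds ratio.

```python
import csv
import io
import math

# Hypothetical two-variant excerpt using a subset of the documented columns.
SAMPLE = (
    "chrom\tpos\tref\talt\tbeta\tstandard_error\tminus_log10_p\tncase\tncontrol\n"
    "1\t1000\tA\tG\t0.10\t0.02\t6.0\t5000\t300000\n"
    "2\t2000\tC\tT\t-0.05\t0.03\t1.3\t5000\t300000\n"
)


def summarise(row, z=1.96):
    """Convert one summary-statistics row into p-value and odds ratio."""
    beta = float(row["beta"])
    se = float(row["standard_error"])
    return {
        "variant": f'{row["chrom"]}:{row["pos"]}',
        # minus_log10_p stores -log10(P), so P = 10 ** (-value).
        "p_value": 10.0 ** (-float(row["minus_log10_p"])),
        # beta is a log odds ratio for the effect (alt) allele.
        "odds_ratio": math.exp(beta),
        "ci_low": math.exp(beta - z * se),
        "ci_high": math.exp(beta + z * se),
    }


def read_summaries(text, delimiter="\t"):
    """Parse delimited text (delimiter is an assumption) into summaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [summarise(row) for row in reader]


results = read_summaries(SAMPLE)
```

Storing -log10(P) rather than P itself avoids floating-point underflow for very strong associations, so it is usually safer to filter on `minus_log10_p` directly and convert only for display.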
    

    PHENOTYPE CODES

    The following is a list of traits and their phenotype codes (as used in file naming).

    DISEASE TRAITS

    • Age-related macular degeneration: AMD
    • Alzheimer's disease: AD
    • Asthma: AST
    • Atrial fibrillation: AF
    • Bipolar disorder: BD
    • Bowel cancer: CRC
    • Breast cancer: BC
    • Cardiovascular disease: CVD
    • Coeliac disease: CED
    • Coronary artery disease: CAD
    • Crohn's disease: CD
    • Epithelial ovarian cancer: EOC
    • Hypertension: HT
    • Ischaemic stroke: ISS
    • Melanoma: MEL
    • Multiple sclerosis: MS
    • Osteoporosis: OP
    • Prostate cancer: PC
    • Parkinson's disease: PD
    • Primary open angle glaucoma: POAG
    • Psoriasis: PSO
    • Rheumatoid arthritis: RA
    • Schizophrenia: SCZ
    • Systemic lupus erythematosus: SLE
    • Type 1 diabetes: T1D
    • Type 2 diabetes: T2D
    • Ulcerative colitis: UC
    • Venous thromboembolic disease: VTE
    

    QUANTITATIVE TRAITS

    • Age at menopause: AAM
    • Apolipoprotein A1: APOEA
    • Apolipoprotein B: APOEB
    • Body mass index: BMI
    • Calcium: ACALMD
    • Docosahexaenoic acid: DOA
    • Estimated bone mineral density T-score: EBMDT
    • Estimated glomerular filtration rate (creatinine based): EGCR
    • Estimated glomerular filtration rate (cystatin based): EGCY
    • Glycated haemoglobin: HBA1C_DF
    • High density lipoprotein cholesterol: HDL
    • Height: HEIGHT
    • Intraocular pressure: IOP
    • Low density lipoprotein cholesterol: LDL_SF
    • Omega-6 fatty acids: OSFA
    • Omega-3 fatty acids: OTFA
    • Phosphatidylcholines: PDCL
    • Phosphoglycerides: PHG
    • Polyunsaturated fatty acids: PFA
    • Resting heart rate: RHR
    • Remnant cholesterol (Non-HDL, Non-LDL cholesterol): RMNC
    • Sphingomyelins: SGM
    • Total cholesterol: TCH
    • Total fatty acids: TFA
    • Total triglycerides: TTG
    
  4. Data_Sheet_1_Association of vitamin and/or nutritional supplements with fall...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 21, 2023
    Cite
    Lingfang He; Tianqi Ma; Guogang Zhang; Xunjie Cheng; Yongping Bai (2023). Data_Sheet_1_Association of vitamin and/or nutritional supplements with fall among patients with diabetes: A prospective study based on ACCORD and UK Biobank.docx [Dataset]. http://doi.org/10.3389/fnut.2022.1082282.s001
    Dataset provided by
    Frontiers
    Authors
    Lingfang He; Tianqi Ma; Guogang Zhang; Xunjie Cheng; Yongping Bai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aims: To assess the associations of vitamin and/or nutritional supplements (VNS) with falls among patients with diabetes.

    Methods: 9,141 and 21,489 middle-aged participants with diabetes from the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial and UK Biobank, respectively, were included. Use of VNS was collected at baseline, and fall events were recorded using annual questionnaires in ACCORD and electronic records in UK Biobank during follow-up. The associations of VNS use with fall risk were analyzed using logistic regression models in ACCORD and Fine-Gray sub-distribution hazard models in UK Biobank. The role of specific supplements was also estimated in UK Biobank, adjusting for confounding factors and multiple comparisons.

    Results: 45.9% (4,193/9,141, 5.5 median follow-up years) of patients in ACCORD and 10.5% (2,251/21,489, 11.9 median follow-up years) in UK Biobank experienced fall and in-patient fall events during follow-up, respectively. In ACCORD, VNS use was associated with an increased risk of fall (fully adjusted odds ratio [OR]: 1.26, P < 0.05). In UK Biobank, despite no significant association between overall VNS use and in-patient fall, use of vitamin B, calcium, and iron increased the risk of falls significantly (fully adjusted hazard ratio range: 1.31–1.37, P < 0.05).

    Conclusions: Use of specific VNS increased the risk of fall among patients with diabetes. The non-indicative use of nutritional supplements for patients with diabetes might be inadvisable.

  5. Population characteristics and distribution of symptoms, blood tests and...

    • plos.figshare.com
    xls
    Updated Dec 15, 2023
    Cite
    Matthew Barclay; Cristina Renzi; Antonis Antoniou; Spiros Denaxas; Hannah Harrison; Samantha Ip; Nora Pashayan; Ana Torralbo; Juliet Usher-Smith; Angela Wood; Georgios Lyratzopoulos (2023). Population characteristics and distribution of symptoms, blood tests and primary care consultation patterns in CPRD and UK Biobank. [Dataset]. http://doi.org/10.1371/journal.pdig.0000383.t002
    Dataset provided by
    PLOS Digital Health
    Authors
    Matthew Barclay; Cristina Renzi; Antonis Antoniou; Spiros Denaxas; Hannah Harrison; Samantha Ip; Nora Pashayan; Ana Torralbo; Juliet Usher-Smith; Angela Wood; Georgios Lyratzopoulos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Population characteristics and distribution of symptoms, blood tests and primary care consultation patterns in CPRD and UK Biobank.

  6. Characteristics of the 240,477 UK Biobank participants included in analyses....

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Luke C. Pilling; Janice L. Atkins; George A. Kuchel; Luigi Ferrucci; David Melzer (2023). Characteristics of the 240,477 UK Biobank participants included in analyses. [Dataset]. http://doi.org/10.1371/journal.pone.0203504.t001
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Luke C. Pilling; Janice L. Atkins; George A. Kuchel; Luigi Ferrucci; David Melzer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Characteristics of the 240,477 UK Biobank participants included in analyses.

  7. Data from: Improving genome-wide association discovery and genomic...

    • data-staging.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 2, 2022
    Cite
    Matthew Robinson; Etienne J. Orliac; Daniel Trejo Banos; Sven E. Ojavee; Kristi Läll; Reedik Mägi; Peter M. Visscher; Matthew R. Robinson (2022). Improving genome-wide association discovery and genomic prediction accuracy in biobank data [Dataset]. http://doi.org/10.5061/dryad.gtht76hmz
    Dataset provided by
    The University of Queensland
    Institute of Science and Technology Austria
    University of Zurich
    University of Lausanne
    University of Tartu
    Authors
    Matthew Robinson; Etienne J. Orliac; Daniel Trejo Banos; Sven E. Ojavee; Kristi Läll; Reedik Mägi; Peter M. Visscher; Matthew R. Robinson
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Genetically informed, deep-phenotyped biobanks are an important research resource and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency–linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy R² was 47% in a UK Biobank holdout sample, which was 76% of the estimated h²_SNP. We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 62 and 65% increase, respectively. The average χ² value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.

    Methods

    From the measurements, tests, and electronic health record data available in the UK Biobank data, we selected 12 blood-based biomarkers, 3 of the most common heritable complex diseases, and 6 quantitative measures. The full list of traits, the UK Biobank coding of the data used, and the covariates adjusted for are given in Table S1. For the quantitative measures and blood-based biomarkers we adjusted the values by the covariates, removed any individuals with a phenotype more than 7 SD above or below the mean (assuming these are measurement errors), and standardized the values to have zero mean and variance 1.

    For the common complex diseases, we determined disease status using a combination of the information available. For high blood pressure (BP), we used self-report information of whether high blood pressure was diagnosed by a doctor (UK Biobank code 6150-0.0), the age high blood pressure was diagnosed (2966-0.0), and whether the individual reported taking blood pressure medication (6153-0.0, 6177-0.0). For type-2 diabetes (T2D), we used self-report information of whether diabetes was diagnosed by a doctor (2443-0.0), the age diabetes was diagnosed (2976-0.0), and whether the individual reported taking diabetes medication (6153-0.0, 6177-0.0). For cardiovascular disease (CAD), we used self-report information of whether a heart attack was diagnosed by a doctor (3894-0.0), the age angina was diagnosed (3627-0.0), whether the individual reported a heart problem diagnosed by a doctor (6150-0.0), and the date of myocardial infarction (42000-0.0). For each disease, we then combined this with primary death ICD10 codes (40001-0.0), causes of operative procedures (41201-0.0), and the main (41202-0.0), secondary (41204-0.0) and inpatient ICD10 codes (41270-0.0). For BP we selected ICD10 code I10, for T2D we selected ICD10 codes E11 to E14 and excluded from the analysis individuals with E10 (type-1 diabetes), and for CAD we selected ICD10 codes I20-I29. Thus, for the purposes of this analysis, we define these diseases broadly, simply to maximise the number of cases available for analysis. For each disease, individuals with neither a self-report indication nor a relevant ICD10 diagnosis were then assigned a zero value as a control.

    We restricted our discovery analysis of the UK Biobank to a sample of European-ancestry individuals. To infer ancestry, we used both self-reported ethnic background (21000-0), selecting coding 1, and genetic ethnicity (22006-0), selecting coding 1. We also took the 488,377 genotyped participants and projected them onto the first two genotypic principal components (PC) calculated from 2,504 individuals of the 1000 Genomes project with known ancestries. Using the obtained PC loadings, we then assigned each participant to the closest population in the 1000 Genomes data: European, African, East-Asian, South-Asian or Admixed, selecting individuals with an absolute PC1 projection < 4 and an absolute PC2 projection < 3. Samples were excluded if in the UK Biobank quality control procedures they (i) were identified as extreme heterozygosity or missing genotype outliers; (ii) had a genetically inferred gender that did not match the self-reported gender; (iii) were identified to have putative sex chromosome aneuploidy; (iv) were excluded from kinship inference; (v) had withdrawn their consent for their data to be used.

    We used the imputed autosomal genotype data of the UK Biobank provided as part of the data release. We used the genotype probabilities to hard-call the genotypes for variants with an imputation quality score above 0.3. The hard-call-threshold was 0.1, setting the genotypes with probability <= 0.9 as missing. From the good quality markers (with missingness less than 5% and p-value for Hardy-Weinberg test larger than 10⁻⁶, as determined in the set of unrelated Europeans) we selected those with minor allele frequency (MAF) > 0.0002 and an rs identifier, in the set of European-ancestry participants, providing a data set of 9,144,511 SNPs. From this we took the overlap with the Estonian Genome centre data described below to give a final set of 8,430,446 markers. For computational convenience we then removed markers in very high LD, selecting one marker from any set of markers with LD R² > 0.8 within a 1 Mb window. These filters resulted in a data set with 458,747 individuals and 2,174,071 markers.

    We apply our GMRM model to each UK Biobank trait, running two short chains for 5000 iterations and combining the last 2000 posterior samples together. Here, we provide the posterior mean effect size estimates for each SNP and the mixed-linear model association regression coefficient, SE, t-statistic, and association p-value.
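
Two of the preprocessing rules described in the Methods above are simple enough to sketch in Python: the 7 SD outlier filter followed by standardization, and hard-calling a genotype only when its probability exceeds 0.9. This is an illustration under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the covariate adjustment that precedes the filter is omitted, and the population standard deviation is an assumed choice.

```python
import statistics


def clean_and_standardize(values, max_sd=7.0):
    """Drop values more than max_sd SDs from the mean, then z-score the rest.

    Covariate adjustment, which the text applies first, is not shown here.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    kept = [v for v in values if abs(v - mean) <= max_sd * sd]
    kept_mean = statistics.fmean(kept)
    kept_sd = statistics.pstdev(kept)
    if kept_sd == 0:
        raise ValueError("cannot standardize a constant phenotype")
    return [(v - kept_mean) / kept_sd for v in kept]


def hard_call(genotype_probs, min_prob=0.9):
    """Return the most likely genotype index, or None (missing) if its
    probability is not above min_prob."""
    best = max(range(len(genotype_probs)), key=genotype_probs.__getitem__)
    return best if genotype_probs[best] > min_prob else None


# Hypothetical phenotype with one extreme value that the filter removes.
phenotype = [0.0, 1.0] * 50 + [1000.0]
z_scores = clean_and_standardize(phenotype)

# Hypothetical genotype probabilities for (hom-ref, het, hom-alt).
confident = hard_call((0.02, 0.95, 0.03))
uncertain = hard_call((0.50, 0.45, 0.05))
```

Note that with n observations no value can lie more than sqrt(n - 1) population SDs from the mean, so a 7 SD filter only has an effect in samples of more than 50 individuals.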

  8. Current and planned future data available in the UK Biobank resource.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins (2023). Current and planned future data available in the UK Biobank resource. [Dataset]. http://doi.org/10.1371/journal.pmed.1001779.t003
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hb: haemoglobin; SNPs: single nucleotide polymorphisms; CNV: copy number variations; MRI: magnetic resonance imaging; DXA: dual-energy X-ray absorptiometry; ICD: International Classification of Diseases; OPCS: Office of Population Censuses and Surveys Classification of Interventions and Procedures.

    ‡ Future dates are estimated. Data available may be all or part of the relevant dataset.

    * Available from an earlier date from health record systems in Scotland.

    Current and planned future data available in the UK Biobank resource.

  9. Summary statistics of UK Biobank blood pressure genome-wide association...

    • figshare.manchester.ac.uk
    application/gzip
    Updated Mar 22, 2024
    + more versions
    Cite
    Xiaoguang Xu (2024). Summary statistics of UK Biobank blood pressure genome-wide association studies (GWAS) using 337,422 unrelated white European individuals [Dataset]. http://doi.org/10.48420/24851436.v2
    Dataset provided by
    University of Manchester
    Authors
    Xiaoguang Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three blood pressure traits were analysed: systolic blood pressure (SBP), diastolic blood pressure (DBP) and pulse pressure (PP; the difference between SBP and DBP). Mean SBP and DBP were calculated from the automated values. After calculating blood pressure values, SBP and DBP were adjusted for medication use by adding 15 and 10 mm Hg to their values, respectively, for individuals reported to be taking blood pressure–lowering medication.

    For the UK Biobank genome-wide association studies (GWAS), we performed linear mixed model (LMM) association testing under an additive genetic model of the three continuous, medication-adjusted blood pressure traits (SBP, DBP, PP) for all measured and imputed genetic variants (Data Field-22828) with minor allele frequency (MAF) >= 1% and imputation score >= 0.3 in dosage format using the BOLT-LMM (v2.4.1) software. Covariates were age, age², sex, BMI, genotyping array and 10 PCs. Genomic inflation was not applied to the GWAS summary statistics.

    Sample QC is described below:

    We included up to 337,422 individuals from UK Biobank for the purpose of this project. We followed UK Biobank sample-based quality control criteria (Nature 2018;562:203-209); samples/individuals were excluded based on the following criteria: (i) outliers in heterozygosity and missingness, (ii) self-reported gender not consistent with genetic data inferred gender, (iii) sample call rate (computed using probesets internal to Affymetrix)
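
The medication adjustment described above amounts to a small arithmetic rule, sketched here in Python. The function name and the example values are hypothetical, and computing PP from the adjusted SBP and DBP is an assumption based on the description of all three traits as medication-adjusted.

```python
def adjust_blood_pressure(sbp, dbp, on_bp_medication):
    """Apply the medication adjustment and derive pulse pressure.

    Adds 15 mm Hg to SBP and 10 mm Hg to DBP for individuals taking blood
    pressure-lowering medication. PP is taken as adjusted SBP minus adjusted
    DBP (an assumption of this sketch).
    """
    if on_bp_medication:
        sbp += 15.0
        dbp += 10.0
    return {"SBP": sbp, "DBP": dbp, "PP": sbp - dbp}


# Hypothetical mean readings in mm Hg.
treated = adjust_blood_pressure(140.0, 90.0, on_bp_medication=True)
untreated = adjust_blood_pressure(140.0, 90.0, on_bp_medication=False)
```

Under this assumption, a treated individual's PP is 5 mm Hg higher than the unadjusted difference, because the two offsets differ.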

  10. A common NFKB1 variant detected through antibody analysis in UK Biobank...

    • data.niaid.nih.gov
    Updated Jan 10, 2024
    Cite
    Chong, Amanda Y; Brenner, Nicole; Jimenez-Kaufmann, Andres; Cortes, Adrian; Hill, Michael; Littlejohns, Thomas J; Gilchrist, James J; Fairfax, Benjamin P; Knight, Julian C; Hodel, Flavia; Fellay, Jacques; McVean, Gil; Moreno-Estrada, Andres; Waterboer, Tim; Hill, Adrian V S; Mentzer, Alexander J (2024). A common NFKB1 variant detected through antibody analysis in UK Biobank predicts risk of infection and allergy: Summary statistics - Health records [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7347791
    Dataset provided by
    Global Health Institute, School of Life Sciences, EPFL, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland; Precision Medicine Unit, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
    The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK; Department of Paediatrics, University of Oxford, Oxford, UK
    Division of Infections and Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
    The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
    The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
    Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
    The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK; The Jenner Institute, University of Oxford, Oxford, UK
    Global Health Institute, School of Life Sciences, EPFL, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
    NIHR Oxford Biomedical Research Centre, Oxford University Hospitals NHS Foundation Trust, Oxford, UK; Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
    Nuffield Department of Population Health, University of Oxford, Oxford, UK
    Advanced Genomics Unit, National Laboratory of Genomics for Biodiversity (LANGEBIO), CINVESTAV, Irapuato, Mexico
    Department of Oncology, University of Oxford, Oxford, UK
    MRC-Population Health Research Unit, University of Oxford, Oxford, UK
    Authors
    Chong, Amanda Y; Brenner, Nicole; Jimenez-Kaufmann, Andres; Cortes, Adrian; Hill, Michael; Littlejohns, Thomas J; Gilchrist, James J; Fairfax, Benjamin P; Knight, Julian C; Hodel, Flavia; Fellay, Jacques; McVean, Gil; Moreno-Estrada, Andres; Waterboer, Tim; Hill, Adrian V S; Mentzer, Alexander J
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Infectious agents contribute significantly to the global burden of diseases, through both acute infection and their chronic sequelae. We leveraged the UK Biobank to identify genetic loci that influence humoral immune response to multiple infections. From 45 genome-wide association studies in 9,611 participants from UK Biobank, we identified NFKB1 as a locus associated with quantitative antibody responses to multiple pathogens including those from the herpes, retro- and polyoma-virus families. An insertion-deletion variant thought to affect NFKB1 expression (rs28362491), was mapped as the likely causal variant. This variant has persisted throughout hominid evolution and could play a key role in regulation of the immune response. Using 121 infection and inflammation related traits in 487,297 UK Biobank participants, we show that the deletion allele was associated with an increased risk of infection from diverse pathogens but had a protective effect against allergic disease. We propose that altered expression of NFKB1, as a result of the deletion, modulates haematopoietic pathways, and likely impacts cell survival, antibody production, and inflammation. Taken together, we show that disruptions to the tightly regulated immune processes may tip the balance between exacerbated immune responses and allergy, or increased risk of infection and impaired resolution of inflammation.

    This dataset contains GWAS summary statistics for infection, inflammation, and allergy related traits in 487,297 individuals.

  11. Large-scale UK Biobank Whole Body Atlases

    • zenodo.org
    application/gzip
    Updated Nov 6, 2024
    + more versions
    Cite
    Sophie Starck; Vasiliki Sideri-Lampretsa; Jessica J. M. Ritter; Veronika Zimmer; Rickmer Braren; Tamara T. Mueller; Daniel Rueckert (2024). Large-scale UK Biobank Whole Body Atlases [Dataset]. http://doi.org/10.5281/zenodo.13136891
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sophie Starck; Vasiliki Sideri-Lampretsa; Jessica J. M. Ritter; Veronika Zimmer; Rickmer Braren; Tamara T. Mueller; Daniel Rueckert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Population atlases are commonly utilised in medical imaging to facilitate the investigation of variability across populations.
    Such atlases enable the mapping of medical images into a common coordinate system, promoting comparability and enabling the study of inter-subject differences.
    Constructing such atlases becomes particularly challenging when working with highly heterogeneous datasets, such as whole-body images, where subjects show significant anatomical variations.
    In this work, we propose a pipeline for generating a standardised whole-body atlas for a highly heterogeneous population by partitioning the population into anatomically meaningful subgroups.
    Using magnetic resonance (MR) images from the UK Biobank dataset, we create six whole-body atlases representing a healthy population average.
    We furthermore unbias them, and this way obtain a realistic representation of the population.
    In addition to the anatomical atlases, we generate probabilistic atlases that capture the distributions of abdominal fat (visceral and subcutaneous) and five abdominal organs across the population (liver, spleen, pancreas, left and right kidneys).
    We demonstrate a clinical application of these atlases by using them to examine, in atlas space, the differences between healthy subjects and subjects with medical conditions such as diabetes and cardiovascular diseases.
    With this work, we make the constructed anatomical and label atlases publicly available and expect them to support medical research conducted on whole-body MR images.

  12. European (British) LD files for GhostKnockoffGWAS

    • zenodo.org
    zip
    Updated Apr 10, 2025
    Cite
    benjamin chu (2025). European (British) LD files for GhostKnockoffGWAS [Dataset]. http://doi.org/10.5281/zenodo.15191305
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    benjamin chu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 10, 2025
    Area covered
    United Kingdom
    Description

    This dataset contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on unrelated British samples of the UK-Biobank (n = 306,604). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.

    • This is the output of applying solveblock executable directly on 306,604 unrelated British samples of the UK-Biobank.
    • Quasi-independent blocks are computed by applying the snp_ldsplit function with parameters thr_r2=0.01, max_r2=0.3, min_size = 500, and max_size = {1000, 1500, 3000, 6000, 10000}.
    • SNPs with minor allele frequency less than 0.01 or Hardy-Weinberg equilibrium p-value less than 1e-6 are removed.
    • Only HG19 coordinates are available.
    • Knockoff optimization was carried out by the Knockoffs.jl Julia package: https://github.com/biona001/Knockoffs.jl
    • The results (i.e. the files available on this site) are saved as .csv and .h5 files for easier access, and are directly readable by GhostKnockoffGWAS.
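
    The SNP filter listed above can be expressed as a short Python sketch. The two thresholds are taken from the list; the Snp record and its field names are hypothetical and do not reflect how the solveblock executable or GhostKnockoffGWAS represent variants.

```python
from dataclasses import dataclass

# Thresholds stated in the dataset description.
MAF_MIN = 0.01      # SNPs with MAF below this are removed
HWE_P_MIN = 1e-6    # SNPs with HWE p-value below this are removed


@dataclass
class Snp:
    # Hypothetical record layout, for illustration only.
    snp_id: str
    maf: float
    hwe_p: float


def passes_qc(snp):
    """Keep a SNP only if it clears both thresholds."""
    return snp.maf >= MAF_MIN and snp.hwe_p >= HWE_P_MIN


def filter_snps(snps):
    """Return the SNPs that survive the MAF and HWE filters."""
    return [snp for snp in snps if passes_qc(snp)]
```

    A SNP is dropped when either condition fails, which matches the "or" in the removal rule.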

    Note: We previously released another set of EUR LD files. This set of LD files should be preferred over the previous one. The main difference is that the previous entry used quasi-independent blocks from LDetect computed on the 1000 Genomes Project. Here we compute the independent blocks using snp_ldsplit directly on the UK-Biobank British samples.

  13. 'Hill_CB_2016' - Data supporting paper 'Molecular genetic contributions to...

    • find.data.gov.scot
    • dtechtive.com
    txt
    Updated Jun 4, 2019
    Cite
    University of Edinburgh. Centre for Cognitive Ageing and Cognitive Epidemiology (2019). 'Hill_CB_2016' - Data supporting paper 'Molecular genetic contributions to social deprivation and household income in UK Biobank'. Current Biology (2016). [Dataset]. http://doi.org/10.7488/ds/2562
    Explore at:
    Available download formats: txt (792.3 MB), txt (0.0166 MB), txt (792.7 MB), txt (0.0005 MB), txt (793.1 MB)
    Dataset updated
    Jun 4, 2019
    Dataset provided by
    University of Edinburgh. Centre for Cognitive Ageing and Cognitive Epidemiology
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting paper 'Molecular genetic contributions to social deprivation and household income in UK Biobank'. Current Biology (2016). doi: 10.1016/j.cub.2016.09.035

    ## Note re working with data ##
    Each of the three data files contains over seventeen million rows. Users will encounter difficulties if they attempt to view the content using Notepad++ or Microsoft Notepad. Microsoft Excel 2016 will not display all rows. These space-delimited text files contain seven columns, with a header row; the columns are listed in the readme file.

    ## Note re other copy ##
    The data files are identical to the files of the same name previously made available on the website of the Centre for Cognitive Ageing and Cognitive Epidemiology (CCACE) http://www.ccace.ed.ac.uk/node/335 as the zip archive 'Hill_CB_2016.zip'.
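
    Because files of this size do not open well in editors or spreadsheets, one option is to stream them row by row. The sketch below uses only the Python standard library and assumes nothing about the column names beyond the stated header row and space delimiter; the function names are hypothetical.

```python
from itertools import islice


def iter_rows(lines):
    """Yield one dict per data row from space-delimited lines with a header.

    `lines` is any iterable of strings, for example an open text file,
    so the whole file is never held in memory.
    """
    iterator = iter(lines)
    header = next(iterator).split()
    for line in iterator:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))


def preview(lines, n=5):
    """Return only the first n data rows."""
    return list(islice(iter_rows(lines), n))


def count_rows(lines):
    """Count data rows without loading them all at once."""
    return sum(1 for _ in iter_rows(lines))
```

    A typical call would be `with open(path) as handle: print(preview(handle))`, where `path` points to one of the downloaded text files.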

  14. Baseline characteristics.

    • plos.figshare.com
    xls
    Updated Jun 13, 2024
    Cite
    Robert C. Schell; William H. Dow; Lia C. H. Fernald; Patrick T. Bradshaw; David H. Rehkopf (2024). Baseline characteristics. [Dataset]. http://doi.org/10.1371/journal.pone.0304653.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Robert C. Schell; William H. Dow; Lia C. H. Fernald; Patrick T. Bradshaw; David H. Rehkopf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Previous research demonstrates the joint association of self-reported physical activity and genotype with coronary artery disease. However, an existing research gap is whether accelerometer-measured overall physical activity or physical activity intensity can offset genetic predisposition to coronary artery disease. This study explores the independent and joint associations of accelerometer-measured physical activity and genetic predisposition with incident coronary artery disease. Incident coronary artery disease based on hospital inpatient records and death register data serves as the outcome of this study. Polygenic risk score and overall physical activity, measured as Euclidean Norm Minus One, and intensity, measured as minutes per day of moderate-to-vigorous intensity physical activity (MVPA), are examined both linearly and by decile. The UK Biobank population-based cohort recruited over 500,000 individuals aged 40 to 69 between 2006 and 2010, with 103,712 volunteers participating in a weeklong wrist-worn accelerometer study from 2013 to 2015. Individuals of White British ancestry (n = 65,079) meeting the genotyping and accelerometer-based inclusion criteria and with no missing covariates were included in the analytic sample. In the sample of 65,079 individuals, the mean (SD) age was 62.51 (7.76) and 61% were female. During a median follow-up of 6.8 years, 1,382 cases of coronary artery disease developed. At the same genetic risk, physical activity intensity had a hazard ratio (HR) of 0.41 (95% CI: 0.29–0.60) at the 90th compared to the 10th percentile, equivalent to 120.96 and 31.68 minutes of moderate-to-vigorous physical activity per day, respectively, versus an HR of 0.61 (95% CI: 0.52–0.72) for overall physical activity. The combination of high genetic risk and low physical activity intensity showed the greatest risk, with an individual at the 10th percentile of genetic risk and 90th percentile of intensity facing an HR of 0.14 (95% CI: 0.09–0.21) compared to an individual at the 90th percentile of genetic risk and 10th percentile of intensity. Physical activity, especially physical activity intensity, is associated with an attenuation of some of the risk of coronary artery disease, but this pattern does not vary by genetic risk. This accelerometer-based study provides the clearest evidence to date regarding the joint influence of genetics, overall physical activity, and physical activity intensity on coronary artery disease.

  15. GCTB sparse shrunk LD matrices from 2.8M common variants from the UK Biobank...

    • search.datacite.org
    • zenodo.org
    Updated Aug 23, 2019
    + more versions
    Cite
    Luke Lloyd-Jones (2019). GCTB sparse shrunk LD matrices from 2.8M common variants from the UK Biobank - Part AA - START HERE [Dataset]. http://doi.org/10.5281/zenodo.3375372
    Explore at:
    Dataset updated
    Aug 23, 2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Zenodo (http://zenodo.org/)
    Authors
    Luke Lloyd-Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GCTB sparse shrunk LD matrices from 2.8M common variants from the UK Biobank. Part AA of AA, AB, AC, AD and AE.

    TO JOIN AND UNZIP THESE MATRICES

    Download all parts to one folder from:
    Part AA - 10.5281/zenodo.3375373
    Part AB - 10.5281/zenodo.3376357
    Part AC - 10.5281/zenodo.3376456
    Parts AD and AE - 10.5281/zenodo.3376628

    Use cat to join:
    cat ukb_50k_bigset_2.8M.zip.part* > ukb_50k_bigset_2.8M.zip

    Then unzip (see README for further details):
    unzip ukb_50k_bigset_2.8M.zip
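
    On systems without cat and unzip, the same join-and-extract steps can be sketched in Python with the standard library. This is an illustrative equivalent, not part of the dataset's README; the function names are hypothetical, and the parts are assumed to sort correctly by file name, as the shell wildcard does.

```python
import glob
import shutil
import zipfile


def join_parts(pattern, output_path):
    """Concatenate all files matching `pattern`, in sorted name order."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large files are never fully in memory.
                shutil.copyfileobj(src, out)
    return parts


def unzip(archive_path, target_dir="."):
    """Extract the joined archive into target_dir."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
```

    For this dataset the calls would be `join_parts("ukb_50k_bigset_2.8M.zip.part*", "ukb_50k_bigset_2.8M.zip")` followed by `unzip("ukb_50k_bigset_2.8M.zip")`, run from the folder that holds all downloaded parts.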

  16. Summary statistics for Alzheimer's disease GWAS on APOE-e4 homozygous...

    • research-data.cardiff.ac.uk
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Oct 30, 2024
    Cite
    Matthew Bracher-Smith; Ganna Leonenko; Emily Baker; Karen Crawford; Andrew Graham; Dervis Salih; Brian Howell; John Hardy; Valentina Escott-Price (2024). Summary statistics for Alzheimer's disease GWAS on APOE-e4 homozygous individuals in UK Biobank [Dataset]. http://doi.org/10.17035/d.2022.0216755828
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    Cardiff University
    Authors
    Matthew Bracher-Smith; Ganna Leonenko; Emily Baker; Karen Crawford; Andrew Graham; Dervis Salih; Brian Howell; John Hardy; Valentina Escott-Price
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We conducted a genome-wide association study in a sample of 5,390 APOE-ε4 homozygous (ε4ε4) individuals (288 cases and 5,102 controls) aged 65 or over in the UK Biobank. Results are the summary statistics from a GWAS conducted in PLINK using age, sex and the first 15 principal components as covariates. There are 5,349,830 rows and 9 columns. Each row corresponds to a variant. Columns are: CHR, BP, SNP, A1, A2_counted_allele, OR, SE_beta, Z_STAT and P. As indicated by the column names, A2 is the risk allele counted in the regression. The dataset is gzipped and tab-delimited.
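
    A file with this layout can be read directly from its gzipped form. The sketch below is a minimal example using the Python standard library; it assumes the file starts with a header row carrying the nine column names listed above, which the description implies but does not state. The function names are hypothetical, and the default threshold of 5e-8 is the conventional genome-wide significance level, not a value from the dataset.

```python
import csv
import gzip
import math

# Column names as listed in the dataset description.
COLUMNS = ["CHR", "BP", "SNP", "A1", "A2_counted_allele",
           "OR", "SE_beta", "Z_STAT", "P"]


def iter_variants(path):
    """Stream variants from a gzipped, tab-delimited file with a header row."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield row


def log_odds(row):
    """Convert the odds ratio to a log-odds effect size."""
    return math.log(float(row["OR"]))


def genome_wide_significant(path, threshold=5e-8):
    """Collect variants with P below the threshold.

    Assumes every P value parses as a number; rows with missing
    values would need extra handling.
    """
    return [row for row in iter_variants(path) if float(row["P"]) < threshold]
```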

  17. PLINK association test statistics of UK Biobank blood traits

    • zenodo.org
    application/gzip
    Updated Jun 28, 2023
    Cite
    Yuxin Zou; Peter Carbonetto; Dongyue Xie; Gao Wang; Matthew Stephens (2023). PLINK association test statistics of UK Biobank blood traits [Dataset]. http://doi.org/10.5281/zenodo.8088041
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuxin Zou; Peter Carbonetto; Dongyue Xie; Gao Wang; Matthew Stephens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PLINK association test statistics of UK Biobank blood traits generated for mvSuSiE fine-mapping analyses; see https://www.biorxiv.org/content/10.1101/2023.04.14.536893.

  18. Supplementary file 1_Frailty and risk of gastrointestinal bleeding: a...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jul 4, 2025
    Cite
    Su, Hong; Song, Rong; Song, Jian; Zhang, Chenao; Mei, Qiao; Wang, Jiren; Wang, Junyan; Liu, Xingyu; Huang, Qiming (2025). Supplementary file 1_Frailty and risk of gastrointestinal bleeding: a prospective cohort study based on UK biobank.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002093785
    Explore at:
    Dataset updated
    Jul 4, 2025
    Authors
    Su, Hong; Song, Rong; Song, Jian; Zhang, Chenao; Mei, Qiao; Wang, Jiren; Wang, Junyan; Liu, Xingyu; Huang, Qiming
    Description

    Introduction: Frailty has been associated with various diseases. However, its impact on gastrointestinal bleeding (GIB) remains largely unexplored. This study investigates the relationship between frailty and the incidence of gastrointestinal bleeding events.

    Methods: A total of 352,060 participants from the UK Biobank with no history of gastrointestinal bleeding were included. Baseline frailty status was assessed using the Fried phenotype and categorized as non-frail, pre-frail, or frail. The primary outcome was gastrointestinal bleeding, identified through hospitalization records and death registries. Cox proportional hazard models were used to evaluate the association between frailty and gastrointestinal bleeding incidence.

    Results: Among the 352,060 participants (mean age 56.1 years), 3.6% (N = 12,747) were classified as frail, and 43.6% (N = 153,424) as pre-frail at baseline. Over a median follow-up of 14.7 years, 20,105 gastrointestinal bleeding events were recorded. Compared to non-frail individuals, frail (HR = 1.53, 95% CI: 1.44–1.62) and pre-frail (HR = 1.15, 95% CI: 1.11–1.18) individuals exhibited a significantly higher risk of gastrointestinal bleeding after multivariate adjustment (P for trend < 0.001). Subgroup and sensitivity analyses yielded consistent findings.

    Conclusion: Frailty significantly elevates the risk of gastrointestinal bleeding. Early identification and targeted multidimensional interventions addressing frailty may reduce gastrointestinal bleeding events and improve patient prognosis.

  19. Summary statistics of the UK Biobank + Estonian Biobank sex-combined GWAS...

    • zenodo.org
    tar
    Updated May 14, 2025
    Cite
    Dennis van der Meer; Alexey Shadrin (2025). Summary statistics of the UK Biobank + Estonian Biobank sex-combined GWAS meta-analysis for the Nightingale panel of 249 circulating plasma metabolic markers from "Pleiotropic and sex-specific genetic architecture of circulating metabolic markers" [https://doi.org/10.1101/2024.07.30.24311254]. [Dataset]. http://doi.org/10.5281/zenodo.15420219
    Explore at:
    Available download formats: tar
    Dataset updated
    May 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dennis van der Meer; Alexey Shadrin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "meta.tar" archive contains 249 files representing UK Biobank + Estonian Biobank sex-combined GWAS meta-analysis summary statistics for the Nightingale panel of 249 circulating plasma metabolic markers presented in "Pleiotropic and sex-specific genetic architecture of circulating metabolic markers" [https://doi.org/10.1101/2024.07.30.24311254].

    Each file contains nine columns:

    SNP: ID of the genetic marker;

    CHR: chromosome code (GRCh37 genomic build);

    BP: base-pair coordinate (GRCh37 genomic build);

    PVAL: regression p-value;

    A1: effect allele;

    A2: other allele;

    N: sample size;

    BETA: regression coefficient for effect allele (A1);

    SE: standard error of regression coefficient (BETA).
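
    As a quick sanity check on files with these nine columns, the z-statistic can be recomputed as BETA divided by SE and compared with PVAL under a normal approximation. The sketch below assumes whitespace-delimited rows with the columns in the order listed above; both the delimiter and the order are assumptions, and the function names are hypothetical. Reported p-values may differ slightly from this approximation depending on how the meta-analysis software computed them.

```python
import math

# Column order as listed in the description (assumed to match the files).
COLUMNS = ["SNP", "CHR", "BP", "PVAL", "A1", "A2", "N", "BETA", "SE"]


def parse_row(line, columns=COLUMNS):
    """Split one whitespace-delimited data line into a dict of strings."""
    return dict(zip(columns, line.split()))


def z_score(row):
    """z = BETA / SE for the effect allele (A1)."""
    return float(row["BETA"]) / float(row["SE"])


def two_sided_p(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

    For example, a row with BETA = 0.2 and SE = 0.1 gives z = 2.0, and `two_sided_p(2.0)` is about 0.0455.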

  20. 'DAVIES_MP_2016.zip' - Data supporting Davies et al. Genome-wide association...

    • dtechtive.com
    • find.data.gov.scot
    txt
    Updated Jun 4, 2019
    Cite
    University of Edinburgh. Centre for Cognitive Ageing and Cognitive Epidemiology (2019). 'DAVIES_MP_2016.zip' - Data supporting Davies et al. Genome-wide association study of cognitive functions and educational attainment in UK Biobank (N=112 151). [Dataset]. http://doi.org/10.7488/ds/2560
    Explore at:
    Available download formats: txt (779.7 MB), txt (0.0166 MB), txt (793.2 MB), txt (783.3 MB), txt (0.0005 MB), txt (783.4 MB)
    Dataset updated
    Jun 4, 2019
    Dataset provided by
    University of Edinburgh. Centre for Cognitive Ageing and Cognitive Epidemiology
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    UNITED KINGDOM
    Description

    Data supporting the paper Davies et al. Genome-wide association study of cognitive functions and educational attainment in UK Biobank (N=112 151). Molecular Psychiatry (2016).

    ## Note re working with data ##
    Each of the four data files contains over seventeen million rows. Users will encounter difficulties if they attempt to view the contents of these files using Notepad++ or Notepad. Microsoft Excel 2016 will not display all rows, only the first one million or so rows. These are space-delimited files.

Cite
Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170

Synthetic datasets of the UK Biobank cohort

Explore at:
Available download formats: bin, csv, zip, pdf
Dataset updated
Sep 17, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Antonio Gasparrini; Jacopo Vanoli
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

  • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
  • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

Content

The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

  • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
  • synthbdbasevar: baseline variables, mostly collected at recruitment.
  • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
  • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

In addition, this repository provides these additional files:

  • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
  • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
  • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

Generation of the synthetic data

The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
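
The second step, simulating yearly death events from estimated hazard relationships, can be pictured with a toy Python sketch. This is not the authors' R code, which is in the GitHub repo. Everything here is a labelled assumption: a single exposure, a piecewise-constant yearly baseline hazard, and a yearly death probability of 1 - exp(-h0 * exp(log_hr * exposure)). All names and parameters are hypothetical.

```python
import math
import random


def yearly_death_probability(baseline_hazard, log_hr, exposure):
    """Probability of dying within one year, given survival to its start.

    Assumes a constant hazard within the year, scaled by the
    proportional-hazards term exp(log_hr * exposure).
    """
    hazard = baseline_hazard * math.exp(log_hr * exposure)
    return 1.0 - math.exp(-hazard)


def simulate_death_year(baseline_hazards, log_hr, exposures, rng):
    """Return the index of the simulated death year, or None if the
    subject survives the whole follow-up.

    baseline_hazards and exposures hold one value per follow-up year.
    """
    for year, (h0, x) in enumerate(zip(baseline_hazards, exposures)):
        if rng.random() < yearly_death_probability(h0, log_hr, x):
            return year
    return None
```

Calling `simulate_death_year(hazards, log_hr, exposures, random.Random(1))` with a seeded generator keeps a simulation reproducible; the actual procedure and its covariates are described in the article and the repo.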

This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
