73 datasets found

Synthetic datasets of the UK Biobank cohort
zenodo.org
bin, csv, pdf, zip
Updated Sep 17, 2025
Cite
Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
Available download formats: bin, csv, zip, pdf
Unique identifier
https://doi.org/10.5281/zenodo.13983170
Dataset updated
Sep 17, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Antonio Gasparrini; Jacopo Vanoli
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available, with code provided in a GitHub repo]

Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available, with code provided in a GitHub repo]

Note: while the synthetic versions of the datasets resemble the real ones in several respects, users should be aware that these data are fake and must not be used to test or draw inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

Content

The synthetic datasets (each stored in two versions, in csv and RDS formats) are the following:

synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.

synthbdbasevar: baseline variables, mostly collected at recruitment.

synthpmdata: annual average exposure to PM2.5 for each participant, reconstructed using their residential history.

synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

In addition, this repository provides the following files:

codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.

asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).

Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

Generation of the synthetic data

The datasets resemble the real data used in the analysis and were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps: the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure), followed by the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with one row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate the risks associated with various predictors (including the main exposure, PM2.5), and these relationships are then used to simulate death events in each year. Details on the modelling aspects are provided in the article.
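
The second step can be illustrated with a minimal sketch in Python. This is not the authors' code (their R scripts are in the GitHub repo): it assumes a baseline hazard that is constant within each year, and the function names, the baseline hazard, and the log hazard ratio used here are all hypothetical. Under a proportional hazards model, the yearly hazard is the baseline hazard multiplied by exp(sum of coefficient times covariate), and the probability of dying within one year is 1 - exp(-hazard).

```python
import math
import random


def annual_death_probability(baseline_hazard, log_hazard_ratios, covariates):
    """One-year death probability under a proportional hazards relationship.

    Assumes the hazard is constant within the year (an illustrative simplification).
    """
    linear_predictor = sum(b * x for b, x in zip(log_hazard_ratios, covariates))
    hazard = baseline_hazard * math.exp(linear_predictor)
    return 1.0 - math.exp(-hazard)


def simulate_death_year(baseline_hazard, log_hazard_ratios, yearly_covariates, rng):
    """Return the index of the first year with a simulated death, or None if the subject survives."""
    for year, covariates in enumerate(yearly_covariates):
        p = annual_death_probability(baseline_hazard, log_hazard_ratios, covariates)
        if rng.random() < p:
            return year
    return None


# Hypothetical values, for illustration only: one covariate (annual PM2.5),
# a made-up baseline hazard and a made-up log hazard ratio.
rng = random.Random(42)
pm25_by_year = [[10.2], [9.8], [9.5], [9.1]]
death_year = simulate_death_year(0.005, [0.01], pm25_by_year, rng)
```

Drawing one Bernoulli outcome per year keeps the time-varying exposure explicit: each year's probability depends only on that year's covariate values.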

This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
UK Biobank
dtechtive.com
Updated May 30, 2023
Cite
UK Biobank (2023). UK Biobank [Dataset]. https://dtechtive.com/datasets/26022
Dataset updated
May 30, 2023
Dataset provided by
UK Biobank
Area covered
United Kingdom
Description
UK Biobank is a large-scale biomedical database and research resource that provides researchers access to detailed longitudinal phenotype, medical and genetic data from 500,000 volunteer participants.
Source Data 2. The dataset derived from the UK Biobank for the cohort study....
figshare.com
application/gzip
Updated Dec 12, 2023
Cite
Han Zhang (2023). Source Data 2. The dataset derived from the UK Biobank for the cohort study. [Dataset]. http://doi.org/10.6084/m9.figshare.24154980.v2
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.24154980.v2
Dataset updated
Dec 12, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Han Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These data are used to conduct a cohort study evaluating the association between smoking and the risk of inflammatory bowel disease.
Data from: Brain Ages Derived from Different MRI Modalities are Associated...
data.niaid.nih.gov
Updated Aug 9, 2023
Cite
Roibu, Andrei-Claudiu; Adaszewski, Stanislaw; Schindler, Torsten; Smith, Stephen M.; Namburete, Ana I.L.; Lange, Frederik J. (2023). Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8110875
Dataset updated
Aug 9, 2023
Dataset provided by
Roche Holding AG (http://roche.com/)
Wellcome Centre for Integrative Neuroimaging (WIN), University of Oxford, Oxford, U.K.
Oxford Machine Learning in NeuroImaging Lab (OMNI), University of Oxford, Oxford, U.K.
Authors
Roibu, Andrei-Claudiu; Adaszewski, Stanislaw; Schindler, Torsten; Smith, Stephen M.; Namburete, Ana I.L.; Lange, Frederik J.
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

Brain ageing is a highly variable, spatially and temporally heterogeneous process, marked by numerous structural and functional changes. These can cause discrepancies between individuals’ chronological age and the apparent age of their brain, as inferred from neuroimaging data. Machine learning models, and particularly Convolutional Neural Networks (CNNs), have proven adept in capturing patterns relating to ageing-induced changes in the brain. The differences between the predicted and chronological ages, referred to as brain age deltas, have emerged as useful biomarkers for exploring those factors which promote accelerated ageing or resilience, such as pathologies or lifestyle factors. However, previous studies rely only on structural neuroimaging for predictions, overlooking potentially informative functional and microstructural changes. Here we show that multiple contrasts derived from different MRI modalities can predict brain age, each encoding bespoke brain ageing information. By using 3D CNNs and UK Biobank data, we found that 57 contrasts derived from structural, susceptibility-weighted, diffusion, and functional MRI can successfully predict brain age. For each contrast, different patterns of association with non-imaging phenotypes were found, resulting in a total of 191 unique, statistically significant associations. Furthermore, we found that ensembling data from multiple contrasts results in both higher prediction accuracies and stronger correlations to non-imaging measurements. Our results demonstrate that other 3D contrasts and modalities, which have not been considered so far for the task of brain age prediction, encode different information about the ageing brain. We envision our work as being the starting point for future investigations into the causal links underpinning the observed brain age deltas and non-imaging measurement associations. For instance, drug effects can be monitored, given that certain medications correlated with accelerated brain ageing. Furthermore, continued development of brain age models could facilitate their deployment in clinical trials for recruitment and monitoring, and hospitals for diagnostic and screening tasks.

Data Description

This dataset contains the full correlation results with all nIDPs in the UK Biobank. These are presented in datasets split by sex into Female and Male subjects. For easier data manipulation, two smaller datasets have also been made available, containing only those correlations that pass the False Discovery Rate (FDR) threshold.
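
As an illustration of what passing an FDR threshold can mean, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in Python. The description does not state which FDR procedure or level was used for these datasets, so both the choice of Benjamini-Hochberg and the default level of 0.05 are assumptions made here for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a p-value passes BH FDR control at level alpha.

    The procedure and the level are illustrative assumptions, not taken from the dataset.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every p-value ranked at or below the cutoff passes, even if it missed its own bound.
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            passed[idx] = True
    return passed
```

Filtering a table of correlations with the returned mask would give a reduced dataset of the kind described above.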

As experiments were also conducted for ensembles using multiple contrasts, similar datasets are provided for those.

Finally, global datasets are also provided. These are the concatenation of the associations contained in the Male and Female datasets.

Paper & Code

The original paper can be accessed here:

https://ieeexplore.ieee.org/abstract/document/10196736

The code for this project is available in the project GitHub repository:

https://github.com/AndreiRoibu/AgeMapper

If using this work, please cite the above paper, or use the following BibTeX entry:

@inproceedings{roibu2023brain,
  title={Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes},
  author={Roibu, Andrei-Claudiu and Adaszewski, Stanislaw and Schindler, Torsten and Smith, Stephen M and Namburete, Ana IL and Lange, Frederik J},
  booktitle={2023 10th IEEE Swiss Conference on Data Science (SDS)},
  pages={17--25},
  year={2023},
  organization={IEEE},
  doi={10.1109/SDS57534.2023.00010}
}

Data Access

The data for this project is freely available upon application at the UK Biobank. For more information regarding the individual nIDPs, please access the UK Biobank Showcase website at: https://biobank.ctsu.ox.ac.uk/showcase/search.cgi

Funding

ACR is supported by EPSRC Grant EP/S024093/1, F. Hoffmann-La Roche AG and a 2021 Industrial Fellowship offered by the Royal Commission for the Exhibition of 1851. SMS is supported by a Wellcome Trust Collaborative Award 215573/Z/19/Z. AILN is grateful for support from the Academy of Medical Sciences under the Springboard Awards scheme (SBF005/1136), and the Bill and Melinda Gates Foundation. FJL is supported by a Wellcome Trust Collaborative Award (215573/Z/19/Z). The WIN is supported by core funding from the Wellcome Trust (203139/Z/16/Z). The computational aspects were supported by the Wellcome Trust (203141/Z/16/Z) and the NIHR Oxford BRC. Corresponding authors: ACR (andreiroibu@icloud.com), SA (stanislaw.adaszewski@roche.com) and AILN (ana.namburete@cs.ox.ac.uk).
UK Biobank
atlaslongitudinaldatasets.ac.uk
url
Updated Feb 10, 2025
Cite
University of Manchester (2025). UK Biobank [Dataset]. https://atlaslongitudinaldatasets.ac.uk/datasets/ukb
Available download formats: url
Dataset updated
Feb 10, 2025
Dataset provided by
Atlas of Longitudinal Datasets
Authors
University of Manchester
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United Kingdom
Variables measured
Anxiety disorders, Standard measures, Alcohol Use Disorder, Non-standard measures, Unspecified Anxiety Disorder, Major Depressive Disorder (MDD), Unspecified Depressive Disorder, Generalized Anxiety Disorder (GAD), Depression and depressive disorders, Post Traumatic Stress Disorder (PTSD), and 1 more
Measurement technique
Biobank, Computer, paper or task testing (e.g. cognitive testing, theory of mind doll task, attention computer tasks), Functional magnetic resonance imaging (fMRI), Magnetic Resonance Imaging (MRI), Interview – face-to-face, Physical or biological assessment (e.g. blood, saliva, gait, grip strength, anthropometry), Arterial Spin Labelling (ASL), Cohort, Wearable devices, Health services, and 1 more
Dataset funded by
Amazon Web Services (AWS)
Scottish Government
Medical Research Council (http://mrc.ukri.org/)
Department of Health and Social Care (https://gov.uk/dhsc)
Wellcome Trust (https://wellcome.org/)
Northwest Regional Development Agency (NWDA)
Schmidt Sciences
Diabetes UK
United Kingdom Government
Welsh Government
Cancer Research UK (CRUK)
Griffin Catalyst
British Heart Foundation
Description
UK Biobank is a large-scale biomedical database and detailed prospective study containing de-identified genetic, lifestyle and health information and biological samples from over 500,000 participants in the United Kingdom. Between 2006 and 2010, participants aged 40 to 69 years were recruited from National Health Service (NHS) central registers across the United Kingdom. Participants have been followed up regularly since 2006.
Source Data 5. The dataset derived from the UK Biobank for the G-E...
figshare.com
application/gzip
Updated Dec 12, 2023
Cite
Han Zhang (2023). Source Data 5. The dataset derived from the UK Biobank for the G-E interaction analysis. [Dataset]. http://doi.org/10.6084/m9.figshare.24154983.v2
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.24154983.v2
Dataset updated
Dec 12, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Han Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These data contain information on the mQTLs of smoking-related methylation and are used to perform the G-E interaction analysis (for CD).
Informing Educational Interventions using Genome-Wide Data, 2016-2019
datacatalogue.ukdataservice.ac.uk
Updated Jun 7, 2021
Cite
Davies, N, University of Bristol (2021). Informing Educational Interventions using Genome-Wide Data, 2016-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854982
Unique identifier
https://doi.org/10.5255/UKDA-SN-854982
Dataset updated
Jun 7, 2021
Authors
Davies, N, University of Bristol
Area covered
United Kingdom
Description
Data sources include UK Biobank https://www.ukbiobank.ac.uk/enable-your-research, ALSPAC http://www.bristol.ac.uk/alspac/researchers/access, and the NCDS https://cls.ucl.ac.uk/data-access-training/. This collection provides the links to the code for the analysis used in this project available under Related Resources.
During this fellowship, I will use the wealth of genetic data from longitudinal cohort studies in the UK and abroad to conduct innovative research into three core issues in modern economics, psychology and sociology: academic attainment, non-cognitive skills, and assortative matching in relationships. These are some of the most heavily researched topics across various social sciences (1-4). Many researchers have argued that the genome plays an important role in each of these topics, yet we have relatively little direct evidence about this. A major limitation of much of the existing research in this area is that it has struggled to account for intrinsic differences between individuals. I will overcome this limitation by combining the growing wealth of biosocial and genome-wide data, from eight longitudinal cohort studies from the UK and others worldwide, with cutting edge econometric and statistical methods for causal inference. These novel data and methods offer an opportunity for new evidence and discoveries about research questions that were previously difficult or impossible to address (5, 6).

My research objectives are to investigate the following three research questions:

1) How are the effects of three genetic variants associated with educational attainment mediated? What are their long-term effects on labour market outcomes?

To date, we know of three individual genetic variants that are associated with educational attainment. However, we do not know which biosocial mechanisms mediate these effects. During this fellowship, I will investigate this using data from the UK Biobank. This cohort study has genome-wide data on 500,000 individuals. Due to its size, the UK Biobank will offer unparalleled statistical power to investigate the aetiology of these associations and their long-term consequences. In addition, I will seek to replicate my findings and investigate these relationships in more detail using the rich and highly detailed information in the English Longitudinal Study of Aging (N=8,000) and Understanding Society (N=10,000).

2) What is the genetic architecture of cognitive and non-cognitive skills and educational outcomes across the life course?

Non-cognitive skills are a set of psychological character traits that influence success in school and at work, for example, motivation, perseverance, emotional intelligence, resilience, and self-control (7). Research about the importance of non-cognitive skills has led to policy interventions that aim to improve children's non-cognitive skills (2, 8-10). However, whilst we know that these skills are associated with outcomes, we do not know if they cause success in school or work. I will add to the evidence about this question using genome-wide data from the Avon Longitudinal Study of Parents and Children (ALSPAC) offspring (N=8,365). I will seek to replicate these results in the National Child Development Study (N=5,595) and The Twins Early Development Study (N=3,500).

3) How does assortative mating affect the human genome? What are the consequences of assortative mating for interpreting the results of social-science studies using genome-wide datasets?

Despite the saying 'opposites attract', spouses tend to be more alike than two randomly chosen individuals from the population. In this project, I will investigate whether this is because spouses come from similar backgrounds or if spouses are also more likely to have similar genetic variants than would be expected by chance. This has implications for interpreting the results of studies using genome-wide data. I will use data from UK Biobank, ALSPAC mothers and fathers (N=10,107 and 2,000 respectively), the Health and Retirement Study (N=15,620) and the Generation Scotland study (N=10,399).
Data from: Brain Ages Derived from Different MRI Modalities are Associated...
zenodo.org
csv
Updated Apr 24, 2025
Cite
Andrei-Claudiu Roibu; Stanislaw Adaszewski; Torsten Schindler; Stephen M. Smith; Ana I.L. Namburete; Frederik J. Lange (2025). Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes [Dataset]. http://doi.org/10.5281/zenodo.8110876
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.8110876
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrei-Claudiu Roibu; Stanislaw Adaszewski; Torsten Schindler; Stephen M. Smith; Ana I.L. Namburete; Frederik J. Lange
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GWAS on self-reported hearing difficulty in the UK Biobank
datasetcatalog.nlm.nih.gov
Updated Oct 15, 2019
Cite
Moore, D. R.; Wells, H. R. R.; Dawson, Sally J; Williams, Frances M. K.; Abidin, Fatin N. Zainul; Payton, A.; Freidin, Maxim B.; Munro, Kevin J.; Morton, C. C.; Dawes, P (2019). GWAS on self-reported hearing difficulty in the UK Biobank [Dataset]. http://doi.org/10.5281/zenodo.3490750
Unique identifier
https://doi.org/10.5281/zenodo.3490750
Dataset updated
Oct 15, 2019
Authors
Moore, D. R.; Wells, H. R. R.; Dawson, Sally J; Williams, Frances M. K.; Abidin, Fatin N. Zainul; Payton, A.; Freidin, Maxim B.; Munro, Kevin J.; Morton, C. C.; Dawes, P
Description
The dataset contains results of two genome-wide association studies for age-related hearing impairment (ARHI)-related traits, as described in the following publication: Wells HRR, Freidin MB, Zainul Abidin FN, Payton A, Dawes P, Munro KJ, Morton CC, Moore DR, Dawson SJ, Williams FMK. GWAS Identifies 44 Independent Associated Genomic Loci for Self-Reported Adult Hearing Difficulty in UK Biobank. Am J Hum Genet. 2019 Oct 3;105(4):788-802. doi: 10.1016/j.ajhg.2019.09.008. Epub 2019 Sep 26. Please cite the article if using this dataset.

Two files provide summary statistics for discovery analysis of Hearing difficulty (HD) and Hearing aid use (HAID) phenotypes for individuals of European descent from UK Biobank.

Acknowledgements

The research was carried out using the UK Biobank Resource under application number 11516. H.R.R.W. is funded by a PhD Studentship Grant, S44, from Action on Hearing Loss. The study was also supported by funding from NIHR UCLH BRC Deafness and Hearing Problems Theme, a grant from MED_EL, and the NIHR Manchester Biomedical Research Centre. The English Longitudinal Study of Aging is jointly run by University College London, Institute for Fiscal Studies, University of Manchester, and National Centre for Social Research. Genetic analyses have been carried out by UCL Genomics and funded by the Economic and Social Research Council and the National Institute on Aging. Data governance was provided by the METADAC data access committee, funded by ESRC, Wellcome, and MRC (2015-2018: Grant Number MR/N01104X/1; 2018-2020: Grant Number ES/S008349/1). TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility, and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London. We would like to thank all the participants of UK Biobank, English Longitudinal Study of Aging, and TwinsUK.

Column headers:

SNP: SNP rsID

CHR: chromosome

BP: genomic position (GRCh37 build)

ALLELE1: effect allele (coded as "1")

ALLELE0: reference allele (coded as "0")

A1FREQ: effect allele frequency

INFO: imputation quality

BETA: effect size of effect allele

SE: standard error of effect size

P: P-value of association (without GC correction)
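
A file with these columns could be read along the following lines. This is a minimal sketch in Python, not part of the dataset: the tab delimiter, the function names, the example rows (with made-up identifiers and values), and the conventional genome-wide significance threshold of 5e-8 are all assumptions introduced for illustration. The Z statistic is computed as BETA divided by SE.

```python
import csv
import io

GENOME_WIDE_P = 5e-8  # conventional threshold; an assumption, not stated in the source


def read_summary_stats(handle, delimiter="\t"):
    """Parse rows with the documented column headers into typed dictionaries."""
    rows = []
    for record in csv.DictReader(handle, delimiter=delimiter):
        rows.append({
            "SNP": record["SNP"],
            "CHR": record["CHR"],  # kept as text, since chromosome labels may be non-numeric
            "BP": int(record["BP"]),
            "ALLELE1": record["ALLELE1"],
            "ALLELE0": record["ALLELE0"],
            "A1FREQ": float(record["A1FREQ"]),
            "INFO": float(record["INFO"]),
            "BETA": float(record["BETA"]),
            "SE": float(record["SE"]),
            "P": float(record["P"]),
        })
    return rows


def significant_hits(rows, threshold=GENOME_WIDE_P):
    """Keep rows with P below the threshold and attach a Z statistic (BETA / SE)."""
    return [dict(row, Z=row["BETA"] / row["SE"]) for row in rows if row["P"] < threshold]


# Made-up example rows, for illustration only; these are not real results.
EXAMPLE = (
    "SNP\tCHR\tBP\tALLELE1\tALLELE0\tA1FREQ\tINFO\tBETA\tSE\tP\n"
    "rs_example_1\t1\t1000\tA\tG\t0.25\t0.99\t0.02\t0.004\t1e-9\n"
    "rs_example_2\t2\t2000\tC\tT\t0.40\t0.95\t0.001\t0.004\t0.8\n"
)
hits = significant_hits(read_summary_stats(io.StringIO(EXAMPLE)))
```

For the real files, the delimiter and any compression should be checked before parsing, since neither is stated in the description.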
UK Biobank GWAS result of Fatty Liver Index
dataverse.harvard.edu
Updated Nov 11, 2023
Cite
Yanni Li; Eline van den Berg; Jingyuan Fu; Rinse Weersma (2023). UK Biobank GWAS result of Fatty Liver Index [Dataset]. http://doi.org/10.7910/DVN/4YM1BG
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/4YM1BG
Dataset updated
Nov 11, 2023
Dataset provided by
Harvard Dataverse
Authors
Yanni Li; Eline van den Berg; Jingyuan Fu; Rinse Weersma
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Summary statistics for the associations between genetic loci and fatty liver index in the UK Biobank cohort. The analysis sample comprised 408,870 unrelated individuals from the UKBB who self-reported as White-British and had similar genetic ancestry based on a principal component analysis of genotypes. This research has been conducted using data obtained via UKBB Access Application number 52728.
Variable mapped to malnutrition, frailty and sarcopenia.
plos.figshare.com
xls
Updated Jun 6, 2023
Cite
Nada AlMohaisen; Matthew Gittins; Chris Todd; Sorrel Burden (2023). Variable mapped to malnutrition, frailty and sarcopenia. [Dataset]. http://doi.org/10.1371/journal.pone.0278371.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0278371.t002
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Nada AlMohaisen; Matthew Gittins; Chris Todd; Sorrel Burden
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Variable mapped to malnutrition, frailty and sarcopenia.
ROMK variants from the UK Biobank analyzed in this study.
datasetcatalog.nlm.nih.gov
Updated Nov 13, 2023
Cite
Sarangi, Srikant; Porter, Aidan W.; Pitluk, Zachary W.; Nguyen, Nga H.; McChesney, Erin M.; Sheng, Shaohu; Kleyman, Thomas R.; Durrant, Jacob D.; Brodsky, Jeffrey L. (2023). ROMK variants from the UK Biobank analyzed in this study. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000958749
Dataset updated
Nov 13, 2023
Authors
Sarangi, Srikant; Porter, Aidan W.; Pitluk, Zachary W.; Nguyen, Nga H.; McChesney, Erin M.; Sheng, Shaohu; Kleyman, Thomas R.; Durrant, Jacob D.; Brodsky, Jeffrey L.
Description
Table shows 511 KCNJ1 variants available in the whole-exome sequencing (WES) database, which contains data from ~200k participants from the UK Biobank [61,62]. Columns represent the chromosomal location and the nucleotide change for each substitution, as well as their minor and alternative allele frequencies (denoted as maf and aaf, respectively). (XLSX)
Sociability GWAS in a population-based sample : summary statistics of a...
narcis.nl
pdf
Updated Mar 12, 2021
Cite
Bralten, J.B. (Radboud University); Roth Mota, N. (Radboud University); Klemann, C.J.H.M. (Radboud University); Witte, W. de (2021). Sociability GWAS in a population-based sample : summary statistics of a genome-wide association study of an aggregated sociability score in the UK Biobank [Dataset]. http://doi.org/10.17026/dans-ztj-zga6
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.17026/dans-ztj-zga6
Dataset updated
Mar 12, 2021
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Bralten, J.B. (Radboud University); Roth Mota, N. (Radboud University); Klemann, C.J.H.M. (Radboud University); Witte, W. de
Area covered
United Kingdom (northlimit=59.62358300012501; eastlimit=2.374666058072581; southlimit=49.568413008749225; westlimit=-8.205608345022652)
Description
Levels of sociability are continuously distributed in the general population, and decreased sociability represents an early manifestation of several brain disorders. Here, we investigated the genetic underpinnings of sociability in the population.

Main question of our research: 1. Are there common genetic variants that are associated with sociability in the general population? 2. Are genetic variants that are associated with sociability also associated with neuropsychiatric disorders?

Type of data uploaded in this repository: The UK Biobank project (see https://www.ukbiobank.ac.uk/) is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The database is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases. The raw data that this project is based on come from the publicly available UK Biobank set, which is very large and is therefore not provided here. Here we provide only the results of our analysis, which is also described at https://www.biorxiv.org/content/10.1101/781195v2 and is currently in revision at a scientific journal. The dataset contains the association of 9,327,396 genetic variants with the sociability phenotype. It is not suitable for opening in Excel and is best opened on a cluster computer or with specific software.

Subjects: The UK Biobank (UKBB) is a major population-based cohort from the United Kingdom that includes individuals aged between 37 and 73 years. We constructed a sociability measure based on the aggregation of scores per participant on four questions from the UKBB database that link to sociability, including (1) a question about the frequency of friend/family visits, (2) a question on the number and type of social venues that are visited, (3) a question about worrying after social embarrassment and (4) a question about feeling lonely, leading to a sociability score ranging from 0 to 4. Participants were excluded if they had somatic problems that could be related to social withdrawal (BMI < 15 or BMI > 40, narcolepsy (all the time), stroke, severe tinnitus, deafness or brain-related cancers) or if they answered that they had “No friends/family outside household” or “Do not know” or “Prefer not to answer” to any of the questions.

SNP genotyping and quality control: Details about the available genome-wide genotyping data for UKBB participants have been reported previously (PMID: 30305743). We used third-release genotyping data (see https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=100319). Briefly, 49,950 participants were genotyped using the UK BiLEVE Axiom Array and 438,427 participants were genotyped using the UK Biobank Axiom Array. Genotypes were imputed into the dataset using the Haplotype Reference Consortium (HRC) and the UK10K haplotype resource. To account for ethnicity, we included only those individuals who identified themselves as "white" by self-report and plotted the Principal Components (PC) provided by the UKBB, excluding individuals considered to be outliers according to PCs 1 and 2. Genetic relatedness calculated with KING kinship and provided by the UKBB (https://kenhanscombe.github.io/ukbtools/articles/explore-ukb-data.html ; http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UKBiobank_genotyping_QC_documentation-web.pdf) was used to identify first- and second-degree relatives. Subsequently, ‘families’ (i.e. clusters of related individuals above an IBD>0.125 threshold) were created and only one individual from each of these created ‘families’ was included in the analysis. If self-reported sex and SNP-based sex differed, individuals were excluded from further analysis. Single nucleotide polymorphisms (SNPs) with minor allele frequency <0.005, Hardy-Weinberg equilibrium test P value <1e−6, missing genotype rate >0.05, and imputation quality of INFO <0.8 were excluded. In the current study, all analyses are based on 342,461 participants of European ancestry for whom both genotype data and sociability scores were available.
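
The four variant-level exclusion rules quoted above can be read as one keep/drop predicate. The following is a minimal sketch in Python, not the authors' pipeline: the `Variant` record, its field names, and the example values are hypothetical and chosen only for illustration, while the four thresholds are the ones stated in the description.

```python
from dataclasses import dataclass

# Thresholds quoted in the description above.
MAF_MIN = 0.005      # minor allele frequency below this is excluded
HWE_P_MIN = 1e-6     # Hardy-Weinberg test P value below this is excluded
MISSING_MAX = 0.05   # missing genotype rate above this is excluded
INFO_MIN = 0.8       # imputation INFO score below this is excluded


@dataclass
class Variant:
    # Hypothetical record layout; real QC tools use their own column names.
    snp_id: str
    maf: float
    hwe_p: float
    missing_rate: float
    info: float


def passes_qc(v: Variant) -> bool:
    """Return True if the variant survives all four exclusion rules."""
    return (
        v.maf >= MAF_MIN
        and v.hwe_p >= HWE_P_MIN
        and v.missing_rate <= MISSING_MAX
        and v.info >= INFO_MIN
    )


# Made-up example variants, one per failure mode plus one that passes.
variants = [
    Variant("rs_example_1", maf=0.12, hwe_p=0.40, missing_rate=0.01, info=0.97),
    Variant("rs_example_2", maf=0.001, hwe_p=0.40, missing_rate=0.01, info=0.97),  # too rare
    Variant("rs_example_3", maf=0.20, hwe_p=1e-9, missing_rate=0.01, info=0.97),   # fails HWE
    Variant("rs_example_4", maf=0.20, hwe_p=0.40, missing_rate=0.01, info=0.50),   # poorly imputed
]
kept = [v.snp_id for v in variants if passes_qc(v)]
```

Only `rs_example_1` survives here; each of the other three records trips exactly one rule.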

Genome-wide association analysis: Genome-wide association analysis with the imputed marker dosages was performed in PLINK1.9, using a linear regression model with the sociability measure as the dependent variable and including sex, age, the first 10 PCs, assessment center, and genotype batch as covariates. SNPs were considered significantly associated if they had p-value < 5e-8. Associated loci were considered independent of each other at r² 0.6 and lead SNPs were classified as the SNP with the smallest association p-value and at r² 0.1, using a 250kb window. The summary statistics come from the plink2 linear regression analysis.
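
Because the summary-statistics file covers millions of variants, one way to inspect it without loading it into memory is to stream it row by row and keep only rows below the genome-wide threshold of 5e-8 stated above. This is a sketch under assumptions: the tab delimiter and the column names `SNP` and `P` are hypothetical, and should be checked against the actual file header before use.

```python
import csv
import io

GENOME_WIDE_P = 5e-8  # significance threshold stated in the description


def significant_variants(handle, p_column="P", id_column="SNP"):
    """Yield (id, p) for rows below the threshold, reading one row at a time.

    The column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        try:
            p = float(row[p_column])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with a missing or non-numeric p-value (e.g. "NA")
        if p < GENOME_WIDE_P:
            yield row[id_column], p


# Tiny in-memory stand-in for a real summary-statistics file (made-up values).
demo = io.StringIO(
    "SNP\tBETA\tP\n"
    "rs_demo_a\t0.010\t3e-9\n"
    "rs_demo_b\t0.002\t0.20\n"
    "rs_demo_c\t-0.015\t4.9e-8\n"
    "rs_demo_d\t0.001\tNA\n"
)
hits = list(significant_variants(demo))
```

For a file on disk, the same generator can be fed an open file handle in place of the `io.StringIO` object, so memory use stays flat regardless of file size.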
TwinsUK
healthdatagateway.org
unknown
Updated Oct 8, 2024
Cite
TwinsUK is funded by the Wellcome Trust, Medical Research Council, Versus Arthritis, European Union Horizon 2020, Chronic Disease Research Foundation (CDRF), Zoe Global Ltd and the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London. (2024). TwinsUK [Dataset]. https://healthdatagateway.org/dataset/728
Explore at:
unknownAvailable download formats
Dataset updated
Oct 8, 2024
Dataset provided by
TwinsUKhttp://www.twinsuk.ac.uk/
Medical Research Councilhttp://mrc.ukri.org/
Wellcome Trusthttps://wellcome.org/
Authors
TwinsUK is funded by the Wellcome Trust, Medical Research Council, Versus Arthritis, European Union Horizon 2020, Chronic Disease Research Foundation (CDRF), Zoe Global Ltd and the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London.
License
https://twinsuk.ac.uk/researchers/access-data-and-samples/request-access/
Description
The TwinsUK cohort (https://twinsuk.ac.uk/), set up in 1992, is a major volunteer-based genomic epidemiology resource with longitudinal deep genomic and phenomics data from over 15,000 adult twins (18+) from across the UK who are highly engaged and recallable. The cohort is predominantly female (80%) for historical reasons. It is one of the most deeply characterised adult twin cohorts in the world, providing a rich platform for scientists to research health and ageing longitudinally. There are over 700,000 biological samples stored and data collected on twins with repeat measures at multiple timepoints. Extremely large datasets (billions of data points) have been generated for each TwinsUK participant over 30 years, including phenotypes from questionnaires, multiple clinical visits, and record linkage, and genetic and ‘omic data from biological samples. TwinsUK ensures derived datasets from raw data are returned by collaborators to enhance the resource. TwinsUK also holds a wide range of laboratory samples, including plasma, serum, DNA, faecal microbiome and tissue (skin, fat, colonic biopsies) within HTA-regulated facilities at King's College London.

More recently, postal and at-home collection strategies have allowed sample collections from frail twins, from our whole cohort for COVID-19 studies, and from new twin recruits. The cohort is recallable either on a four-year longitudinal sweep visit or based on diagnosis or genotype.

More than 1,000 data access collaborations and 250,000 samples have been shared with external researchers, resulting in over 800 publications since 2012.

TwinsUK is now working to link to twins’ official health, education and environmental records for health research purposes, which will further enhance the resource.
Anthropometric data, the prevalence of metformin use, back pain status, BMI,...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Ana Paula Carvalho-e-Silva; Paulo H. Ferreira; Alison R. Harmer; Jan Hartvigsen; Manuela L. Ferreira (2023). Anthropometric data, the prevalence of metformin use, back pain status, BMI, and physical activity levels among people with type 2 diabetes. [Dataset]. http://doi.org/10.1371/journal.pone.0282205.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0282205.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Ana Paula Carvalho-e-Silva; Paulo H. Ferreira; Alison R. Harmer; Jan Hartvigsen; Manuela L. Ferreira
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Anthropometric data, the prevalence of metformin use, back pain status, BMI, and physical activity levels among people with type 2 diabetes.
Linkage-Disequilibrium (LD) matrices for six continental ancestry groups...
zenodo.org
application/gzip
Updated Jan 8, 2025
Cite
Shadi Zabad; Shadi Zabad (2025). Linkage-Disequilibrium (LD) matrices for six continental ancestry groups from the UK Biobank [Dataset]. http://doi.org/10.5281/zenodo.14614207
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14614207
Dataset updated
Jan 8, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shadi Zabad; Shadi Zabad
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains Linkage Disequilibrium (LD) matrices for six ancestry groups from the UK Biobank.

LD matrices record the SNP-by-SNP correlations in a given sample of individuals from the general population. In this case, we threshold the matrices so that we only record the correlations between variants in the same LD block (defined by LDetect). The continental ancestry groups are defined by the Pan-UKB initiative as:

EUR = European ancestry (N=362446)

CSA = Central/South Asian ancestry (N=8284)

AFR = African ancestry (N=6255)

EAS = East Asian ancestry (N=2700)

MID = Middle Eastern ancestry (N=1567)

AMR = Admixed American ancestry (N=987)

The sample sizes here are restricted to unrelated individuals in the UK Biobank. The matrices were computed using magenpy and quantized to int8 data type for better compressibility. The standard matrices (EUR.tar.gz, AFR.tar.gz, ...) contain pairwise correlations for 1.4 million HapMap3+ variants. For European samples, we also provide LD matrices that record pairwise correlations for up to 18 million variants (EUR_18m_variants.tar.gz).

For more details on how these matrices were computed, please consult our manuscript:

Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms
Shadi Zabad, Chirayu Anant Haryan, Simon Gravel, Sanchit Misra, Yue Li

To access these matrices, consult the codebase of magenpy, our custom python package with special data structures for processing these LD matrices.
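
To illustrate what quantizing correlations to int8 means in practice, here is a minimal sketch of a symmetric linear scheme in plain Python. The scaling by 127 is an assumption made for illustration; the encoding actually used by magenpy may differ, so its codebase remains the reference for reading the real files. The example correlations are made up.

```python
INT8_MAX = 127  # largest magnitude used, so the range stays symmetric around 0


def quantize(r: float) -> int:
    """Map a correlation in [-1, 1] to an integer code in [-127, 127]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return round(r * INT8_MAX)


def dequantize(q: int) -> float:
    """Recover an approximate correlation from its integer code."""
    return q / INT8_MAX


# One made-up row of SNP-by-SNP correlations.
row = [1.0, 0.8312, -0.2504, 0.0031]
codes = [quantize(r) for r in row]
restored = [dequantize(q) for q in codes]

# Rounding to the nearest code bounds the error by half a step, 0.5 / 127.
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Under this scheme each entry takes one byte instead of four or eight, at the cost of an absolute error of at most about 0.004 per correlation.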
Modules of the UK Biobank MHQ2.
datasetcatalog.nlm.nih.gov
Updated May 28, 2025
Cite
Davies, Helena L.; Richards, Marcus; Skelton, Megan; McCabe, Rose; Thornton, Laura M.; McIntosh, Andrew M.; Adams, Mark; Hotopf, Matthew; Fox, Elaine; Coleman, Jonathan R. I.; Davis, Katrina A. S.; Maina, Jared; John, Ann; Starkey, Fenella; Yu, Zhaoying; Oram, Sian; Kassam, Aliyah S.; Holliday, Jo; Kempton, Matthew J.; Zvrskovec, Johan; Breen, Gerome; Dregan, Alexandru; Wang, Rujia; Li, Danyang; Cai, Na; Lee, William; Eley, Thalia C.; Hübel, Christopher; Kuile, Abigail R. ter; Davies, Kelly (2025). Modules of the UK Biobank MHQ2. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002081249
Explore at:
Dataset updated
May 28, 2025
Authors
Davies, Helena L.; Richards, Marcus; Skelton, Megan; McCabe, Rose; Thornton, Laura M.; McIntosh, Andrew M.; Adams, Mark; Hotopf, Matthew; Fox, Elaine; Coleman, Jonathan R. I.; Davis, Katrina A. S.; Maina, Jared; John, Ann; Starkey, Fenella; Yu, Zhaoying; Oram, Sian; Kassam, Aliyah S.; Holliday, Jo; Kempton, Matthew J.; Zvrskovec, Johan; Breen, Gerome; Dregan, Alexandru; Wang, Rujia; Li, Danyang; Cai, Na; Lee, William; Eley, Thalia C.; Hübel, Christopher; Kuile, Abigail R. ter; Davies, Kelly
Description
Background: This paper introduces the UK Biobank (UKB) second mental health questionnaire (MHQ2), describes its design, the respondents and some notable findings. UKB is a large cohort study with over 500,000 volunteer participants aged 40–69 years when recruited in 2006–2010. It is an important resource of extensive health, genetic and biomarker data. Enhancements to UKB enrich the data available. MHQ2 is an enhancement designed to enable and facilitate research with psychosocial and mental health aspects.

Methods: UKB sent participants a link to MHQ2 by email in October-November 2022. The MHQ2 was designed by a multi-institutional consortium to build on MHQ1. It characterises lifetime depression further, adds data on panic disorder and eating disorders, repeats ‘current’ mental health measures and updates information about social circumstances. It includes established measures, such as the PHQ-9 for current depression and CIDI-SF for lifetime panic, as well as bespoke questions. Algorithms and R code were developed to facilitate analysis.

Results: At the time of analysis, MHQ2 results were available for 169,253 UKB participants, of whom 111,275 had also completed the earlier MHQ1. Characteristics of respondents and the whole UKB cohort are compared. The major phenotypes are lifetime: depression (18%); panic disorder (4.0%); a specific eating disorder (2.8%); and bipolar affective disorder I (0.4%). All mental disorders are found less with older age and also seem to be related to selected social factors. In those participants who answered both MHQ1 (2016) and MHQ2 (2022), current mental health measure showed that fewer respondents have harmful alcohol use than in 2016 (relative risk 0.84), but current depression (RR 1.07) and anxiety (RR 0.98) have not fallen, as might have been expected given the relationship with age. We also compare lifetime concepts for test-retest reliability.

Conclusions: There are some drawbacks to UKB due to its lack of population representativeness, but where the research question does not depend on this, it offers exceptional resources that any researcher can apply to access. This paper has just scratched the surface of the results from MHQ2 and how this can be combined with other tranches of UKB data, but we predict it will enable many future discoveries about mental health and health in general.
Data from: Collaborative learning from distributed data with differentially...
resodate.org
Updated Jun 15, 2024
Cite
Antti Honkela; Joonas Jälkö; Lukas Prediger; Samuel Kaski (2024). Collaborative learning from distributed data with differentially private synthetic data [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9yZXNlYXJjaC5maS9lbi9yZXN1bHRzL2RhdGFzZXQvZDE3MjYwMTAtOTdmZi00MDRiLWEyODItMzczYTI4YzY0YmM0
Explore at:
Dataset updated
Jun 15, 2024
Dataset provided by
ACRIS catalog
Research.fi
Authors
Antti Honkela; Joonas Jälkö; Lukas Prediger; Samuel Kaski
Description
Abstract

Background: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.

Methods: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study’s Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores.

Results: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.

Conclusions: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.
Table_1_Associations between genetically predicted sex and growth hormones...
datasetcatalog.nlm.nih.gov
Updated Oct 17, 2023
Cite
Lv, Huiyun; Qin, Hongzhi; Zhao, Hongliang; Zhang, Yunshu; Zhao, Mingjian (2023). Table_1_Associations between genetically predicted sex and growth hormones and facial aging in the UK Biobank: a two−sample Mendelian randomization study.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001088681
Explore at:
Dataset updated
Oct 17, 2023
Authors
Lv, Huiyun; Qin, Hongzhi; Zhao, Hongliang; Zhang, Yunshu; Zhao, Mingjian
Description
Background: Aging is an inescapable process, but it can be slowed down, particularly facial aging. Sex and growth hormones have been shown to play an important role in the process of facial aging. We investigated this association further, using a two-sample Mendelian randomization study.

Methods: We analyzed genome-wide association study (GWAS) data from the UK Biobank database comprising facial aging data from 432,999 samples, using two-sample Mendelian randomization. In addition, single-nucleotide polymorphism (SNP) data on sex hormone-binding globulin (SHBG) and sex steroid hormones were obtained from a GWAS in the UK Biobank [SHBG, N = 189,473; total testosterone (TT), N = 230,454; bioavailable testosterone (BT), N = 188,507; and estradiol (E2), N = 2,607]. The inverse-variance weighted (IVW) method was the major algorithm used in this study, and random-effects models were used in cases of heterogeneity. To avoid errors caused by a single algorithm, we selected MR-Egger, weighted median, and weighted mode as supplementary algorithms. Horizontal pleiotropy was detected based on the intercept in the MR-Egger regression. The leave-one-out method was used for sensitivity analysis.

Results: SHBG plays a promoting role, whereas sex steroid hormones (TT, BT, and E2) play an inhibitory role in facial aging. Growth hormone (GH) and insulin-like growth factor-1 (IGF-1) levels had no significant effect on facial aging, which is inconsistent with previous findings in vitro.

Conclusion: Regulating the levels of SHBG, BT, TT, and E2 may be an important means to delay facial aging.
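
For readers unfamiliar with the inverse-variance weighted (IVW) method named in the Methods, the sketch below shows the standard fixed-effect form with first-order weights: each SNP contributes a ratio of its outcome effect to its exposure effect, weighted by the precision of that ratio. It is an illustration only, not the authors' code; it does not cover the random-effects variant or the supplementary methods, and the input numbers are made up.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    Equivalent to a weighted mean of the per-SNP ratios by/bx
    with weights bx**2 / se_y**2.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Made-up SNP effects in which every per-SNP ratio equals 0.5.
bx = [0.10, 0.20, 0.15]    # SNP effects on the exposure
by = [0.05, 0.10, 0.075]   # SNP effects on the outcome
se = [0.01, 0.02, 0.015]   # standard errors of the outcome effects

estimate, std_error = ivw_estimate(bx, by, se)
```

Because all three ratios agree, the pooled estimate is 0.5; with real instruments the ratios differ and the weights decide how much each SNP contributes.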
European LD files for GhostKnockoffGWAS
zenodo.org
zip
Updated Feb 20, 2024
Cite
Benjamin B Chu; Benjamin B Chu (2024). European LD files for GhostKnockoffGWAS [Dataset]. http://doi.org/10.5281/zenodo.10433663
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10433663
Dataset updated
Feb 20, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Benjamin B Chu; Benjamin B Chu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 2024
Description
This contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on the EUR cohort of Pan-UKB LD data. It is intended to be used as an input to the GhostKnockoffGWAS pipeline.

We restricted our attention to the EUR panel.

We filtered the original HailBlockMatrix LD panel to genotypes that are typed (i.e. imputed SNPs were removed).

Coordinates in both hg19 and hg38 are available. Conversion from hg19 to hg38 was achieved with the R package liftOver.

Downloading and processing of the original HailBlockMatrix formatted data is accomplished by the EasyLD.jl software: https://biona001.github.io/EasyLD.jl

Knockoff optimization was carried out with the Knockoffs.jl Julia package: https://github.com/biona001/Knockoffs.jl

The results (i.e. the files available on this site) are saved as .csv and .h5 files for easier access, and are directly readable by GhostKnockoffGWAS.

Cite

Antonio Gasparrini; Antonio Gasparrini; Jacopo Vanoli; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170

Synthetic datasets of the UK Biobank cohort

Explore at:

bin, csv, zip, pdfAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13983170

Dataset updated

Sep 17, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Antonio Gasparrini; Antonio Gasparrini; Jacopo Vanoli; Jacopo Vanoli

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

Content

The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
synthbdbasevar: baseline variables, mostly collected at recruitment.
synthpmdata: annual average exposure to PM_2.5 for each participant reconstructed using their residential history.
synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

In addition, this repository provides these additional files:

codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

Generation of the synthetic data

The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM_2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

The first part merges all the data, including the annual PM_2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM_2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.

This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
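
The second step described above, simulating death events year by year from fitted hazard relationships, can be sketched as follows. This is not the authors' R code (which is in the GitHub repo); it assumes a constant baseline hazard within each year and a single exposure, and the function names, baseline hazard, log hazard ratio, and PM2.5 values are all hypothetical.

```python
import math
import random


def yearly_death_probability(baseline_hazard, log_hr, exposure):
    """P(death within one year) under a proportional-hazards model,
    assuming the hazard is constant within the year."""
    return 1.0 - math.exp(-baseline_hazard * math.exp(log_hr * exposure))


def simulate_death_year(exposures, baseline_hazard, log_hr, rng):
    """Return the index of the year of death, or None if the subject survives."""
    for year, exposure in enumerate(exposures):
        if rng.random() < yearly_death_probability(baseline_hazard, log_hr, exposure):
            return year
    return None


rng = random.Random(42)                     # fixed seed for reproducibility
pm25_by_year = [10.2, 9.8, 9.5, 9.1, 8.7]   # made-up annual PM2.5 values

# Simulate 1,000 synthetic subjects sharing the same exposure history.
outcomes = [
    simulate_death_year(pm25_by_year, baseline_hazard=0.01, log_hr=0.02, rng=rng)
    for _ in range(1000)
]
n_deaths = sum(o is not None for o in outcomes)
```

In a fuller version the linear predictor would include all covariates from the fitted model, and each subject would carry their own exposure history and follow-up window.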
