Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with one row per subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure, PM2.5), and these relationships are then used to simulate death events in each year. Details on the modelling aspects are provided in the article.
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
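The second synthesis step described above (simulating death events year by year from estimated hazards) can be illustrated with a minimal sketch. This is not the authors' R code: the function name, the baseline hazards, the PM2.5 coefficient and the exposure values are all hypothetical, and the conversion from a yearly hazard to a yearly death probability assumes the hazard is constant within each year.

```python
import math
import random


def simulate_death_year(baseline_hazards, log_hazard_ratios, rng):
    """Return the index of the first year with a simulated death, or None.

    baseline_hazards  -- hypothetical yearly baseline hazards
    log_hazard_ratios -- hypothetical yearly linear predictors
                         (coefficient * covariate, e.g. for PM2.5)
    rng               -- a random.Random instance, for reproducibility
    """
    for year, (h0, lp) in enumerate(zip(baseline_hazards, log_hazard_ratios)):
        # Proportional hazards: the covariates scale the baseline hazard.
        hazard = h0 * math.exp(lp)
        # Probability of dying within the year, given alive at its start,
        # assuming a constant hazard over the year.
        p_death = 1.0 - math.exp(-hazard)
        if rng.random() < p_death:
            return year
    return None  # the subject survives the whole follow-up


# Illustrative values only; none of these come from the UKB analysis.
rng = random.Random(2024)
beta_pm25 = 0.05
pm25_by_year = [10.2, 9.8, 9.5]
log_hrs = [beta_pm25 * x for x in pm25_by_year]
death_year = simulate_death_year([0.01, 0.01, 0.01], log_hrs, rng)
```

Drawing events from model-based probabilities rather than copying observed outcomes is what lets the synthetic mortality pattern resemble the original without reproducing any individual record.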
UK Biobank is a large-scale biomedical database and research resource that provides researchers access to detailed longitudinal phenotype, medical and genetic data from 500,000 volunteer participants.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data are used to conduct a cohort study evaluating the association between smoking and the risk of inflammatory bowel disease.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Brain ageing is a highly variable, spatially and temporally heterogeneous process, marked by numerous structural and functional changes. These can cause discrepancies between individuals’ chronological age and the apparent age of their brain, as inferred from neuroimaging data. Machine learning models, and particularly Convolutional Neural Networks (CNNs), have proven adept in capturing patterns relating to ageing induced changes in the brain. The differences between the predicted and chronological ages, referred to as brain age deltas, have emerged as useful biomarkers for exploring those factors which promote accelerated ageing or resilience, such as pathologies or lifestyle factors. However, previous studies rely only on structural neuroimaging for predictions, overlooking potentially informative functional and microstructural changes. Here we show that multiple contrasts derived from different MRI modalities can predict brain age, each encoding bespoke brain ageing information. By using 3D CNNs and UK Biobank data, we found that 57 contrasts derived from structural, susceptibility-weighted, diffusion, and functional MRI can successfully predict brain age. For each contrast, different patterns of association with non-imaging phenotypes were found, resulting in a total of 191 unique, statistically significant associations. Furthermore, we found that ensembling data from multiple contrasts results in both higher prediction accuracies and stronger correlations to non-imaging measurements. Our results demonstrate that other 3D contrasts and modalities, which have not been considered so far for the task of brain age prediction, encode different information about the ageing brain. We envision our work as being the starting point for future investigations into the causal links underpinning the observed brain age deltas and non-imaging measurement associations. 
For instance, drug effects can be monitored, given that certain medications correlated with accelerated brain ageing. Furthermore, continued development of brain age models could facilitate their deployment in clinical trials for recruitment and monitoring, and hospitals for diagnostic and screening tasks.
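The brain age delta defined in the abstract is simply the predicted age minus the chronological age. The sketch below shows that definition together with one simple way of ensembling per-contrast predictions. The unweighted mean is an assumption made for illustration; the abstract does not state how the ensembles were built, and the function names are hypothetical.

```python
from statistics import mean


def brain_age_delta(predicted_age, chronological_age):
    """Brain age delta: positive values suggest an older-looking brain."""
    return predicted_age - chronological_age


def ensemble_prediction(per_contrast_predictions):
    """Combine predictions from several MRI contrasts.

    Assumption: a plain unweighted mean, used here only as an example.
    """
    return mean(per_contrast_predictions)


# Illustrative numbers only.
combined = ensemble_prediction([64.0, 66.0, 68.0])
delta = brain_age_delta(combined, 65.0)
```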
Data Description
This dataset contains the full correlation results with all nIDPs in the UK Biobank. These are presented in datasets split by sex into Female and Male subjects. For easier data manipulation, two smaller datasets have also been made available, containing only those correlations that pass the False Discovery Rate (FDR) threshold.
As experiments were also conducted for ensembles using multiple contrasts, similar datasets are provided for those.
Finally, global datasets are also provided. These are the concatenation of the associations contained in the Male and Female datasets.
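To make the FDR filtering mentioned above concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, a common way of controlling the false discovery rate. Whether the authors used this exact procedure, and at which level, is not stated here, so both the method and the default `alpha` are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test passes the FDR threshold."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Every test ranked at or below max_rank is declared significant.
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            passed[idx] = True
    return passed
```

Applied to one of the full correlation tables, a filter of this kind would yield the smaller "FDR-passing" subset, under the stated assumptions.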
Paper & Code
The original paper for this article can be accessed here:
https://ieeexplore.ieee.org/abstract/document/10196736
To access the code for this project, see the project GitHub repository:
https://github.com/AndreiRoibu/AgeMapper
If using this work, please cite it based on the above paper, or using the following BibTeX:
@inproceedings{roibu2023brain,
title={Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes},
author={Roibu, Andrei-Claudiu and Adaszewski, Stanislaw and Schindler, Torsten and Smith, Stephen M and Namburete, Ana IL and Lange, Frederik J},
booktitle={2023 10th IEEE Swiss Conference on Data Science (SDS)},
pages={17--25},
year={2023},
organization={IEEE},
doi={10.1109/SDS57534.2023.00010}
}
Data Access
The data for this project is freely available upon application at the UK Biobank. For more information regarding the individual nIDPs, please access the UK Biobank Showcase website at: https://biobank.ctsu.ox.ac.uk/showcase/search.cgi
Funding
ACR is supported by EPSRC Grant EP/S024093/1, F. Hoffmann-La Roche AG and a 2021 Industrial Fellowship offered by the Royal Commission for the Exhibition of 1851. SMS is supported by a Wellcome Trust Collaborative Award 215573/Z/19/Z. AILN is grateful for support from the Academy of Medical Sciences under the Springboard Awards scheme (SBF005/1136), and the Bill and Melinda Gates Foundation. FJL is supported by a Wellcome Trust Collaborative Award (215573/Z/19/Z). The WIN is supported by core funding from the Wellcome Trust (203139/Z/16/Z). The computational aspects were supported by the Wellcome Trust (203141/Z/16/Z) and the NIHR Oxford BRC. Corresponding authors: ACR (andreiroibu@icloud.com), SA (stanislaw.adaszewski@roche.com) and AILN (ana.namburete@cs.ox.ac.uk).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UK Biobank is a large-scale biomedical database and detailed prospective study containing de-identified genetic, lifestyle and health information and biological samples from over 500,000 participants in the United Kingdom. Between 2006 and 2010, participants aged 40 to 69 years were recruited from National Health Service (NHS) central registers across the United Kingdom. Participants have been followed up regularly since 2006.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data contain information on the mQTLs of smoking-related methylation and are used to perform the G-E interaction analysis (for CD).
Data sources include UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research), ALSPAC (http://www.bristol.ac.uk/alspac/researchers/access), and the NCDS (https://cls.ucl.ac.uk/data-access-training/). Links to the code for the analysis used in this project are available under Related Resources.
During this fellowship, I will use the wealth of genetic data from longitudinal cohort studies in the UK and abroad to conduct innovative research into three core issues in modern economics, psychology and sociology: academic attainment, non-cognitive skills, and assortative matching in relationships. These are some of the most heavily researched topics across various social sciences (1-4). Many researchers have argued that the genome plays an important role in each of these topics, yet we have relatively little direct evidence about this. A major limitation of much of the existing research in this area is that it has struggled to account for intrinsic differences between individuals. I will overcome this limitation by combining the growing wealth of biosocial and genome-wide data, from eight longitudinal cohort studies from the UK and others worldwide, with cutting edge econometric and statistical methods for causal inference. These novel data and methods offer an opportunity for new evidence and discoveries about research questions that were previously difficult or impossible to address (5, 6).
My research objectives are to investigate the following three research questions:
1) How are the effects of three genetic variants associated with educational attainment mediated? What are their long-term effects on labour market outcomes?
To date, we know of three individual genetic variants that are associated with educational attainment. However, we do not know which biosocial mechanisms mediate these effects. During this fellowship, I will investigate this using data from the UK Biobank. This cohort study has genome-wide data on 500,000 individuals. Due to its size, the UK Biobank will offer unparalleled statistical power to investigate the aetiology of these associations and their long-term consequences. In addition, I will seek to replicate my findings and investigate these relationships in more detail using the rich and highly detailed information in the English Longitudinal Study of Aging (N=8,000) and Understanding Society (N=10,000).
2) What is the genetic architecture of cognitive and non-cognitive skills and educational outcomes across the life course?
Non-cognitive skills are a set of psychological character traits that influence success in school and at work, for example, motivation, perseverance, emotional intelligence, resilience, and self-control (7). Research about the importance of non-cognitive skills has led to policy interventions that aim to improve children's non-cognitive skills (2, 8-10). However, whilst we know that these skills are associated with outcomes, we do not know if they cause success in school or work. I will add to the evidence about this question using genome-wide data from the Avon Longitudinal Study of Parents and Children (ALSPAC) offspring (N=8,365). I will seek to replicate these results in the National Child Development Study (N=5,595) and The Twins Early Development Study (N=3,500).
3) How does assortative mating affect the human genome? What are the consequences of assortative mating for interpreting the results of social-science studies using genome-wide datasets?
Despite the saying 'opposites attract', spouses tend to be more alike than two randomly chosen individuals from the population. In this project, I will investigate whether this is because spouses come from similar backgrounds or whether spouses are also more likely to have similar genetic variants than would be expected by chance. This has implications for interpreting the results of studies using genome-wide data. I will use data from UK Biobank, ALSPAC mothers and fathers (N=10,107 and 2,000 respectively), the Health and Retirement Study (N=15,620) and the Generation Scotland study (N=10,399).
The dataset contains results of two genome-wide association studies for age-related hearing impairment (ARHI)-related traits, as described in the following publication:
Wells HRR, Freidin MB, Zainul Abidin FN, Payton A, Dawes P, Munro KJ, Morton CC, Moore DR, Dawson SJ, Williams FMK. GWAS Identifies 44 Independent Associated Genomic Loci for Self-Reported Adult Hearing Difficulty in UK Biobank. Am J Hum Genet. 2019 Oct 3;105(4):788-802. doi: 10.1016/j.ajhg.2019.09.008. Epub 2019 Sep 26.
Please cite the article if using this dataset. Two files provide summary statistics for the discovery analysis of the Hearing difficulty (HD) and Hearing aid use (HAID) phenotypes for individuals of European descent from UK Biobank.
Acknowledgements
The research was carried out using the UK Biobank Resource under application number 11516. H.R.R.W. is funded by a PhD Studentship Grant, S44, from Action on Hearing Loss. The study was also supported by funding from NIHR UCLH BRC Deafness and Hearing Problems Theme, a grant from MED_EL, and the NIHR Manchester Biomedical Research Centre. The English Longitudinal Study of Aging is jointly run by University College London, Institute for Fiscal Studies, University of Manchester, and National Centre for Social Research. Genetic analyses have been carried out by UCL Genomics and funded by the Economic and Social Research Council and the National Institute on Aging. Data governance was provided by the METADAC data access committee, funded by ESRC, Wellcome, and MRC (2015-2018: Grant Number MR/N01104X/1; 2018-2020: Grant Number ES/S008349/1). TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility, and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London.
We would like to thank all the participants of UK Biobank, English Longitudinal Study of Aging, and TwinsUK.
Column headers:
SNP: SNP rsID
CHR: chromosome
BP: genomic position (GRCh37 build)
ALLELE1: effect allele (coded as "1")
ALLELE0: reference allele (coded as "0")
A1FREQ: effect allele frequency
INFO: imputation quality
BETA: effect size of effect allele
SE: standard error of effect size
P: P-value of association (without GC correction)
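As an illustration of how a file with the column headers listed above might be read, the following sketch parses a small in-memory table and keeps variants below a significance threshold. The whitespace-delimited layout with a header row is an assumption about the file format, the 5e-8 threshold is the conventional genome-wide level rather than something stated for this dataset, and the two data rows are invented placeholders, not real results.

```python
import io

NUMERIC_COLUMNS = ("A1FREQ", "INFO", "BETA", "SE", "P")


def read_sumstats(handle):
    """Parse a whitespace-delimited summary-statistics table into dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        row["BP"] = int(row["BP"])
        for col in NUMERIC_COLUMNS:
            row[col] = float(row[col])
        rows.append(row)
    return rows


def genome_wide_significant(rows, threshold=5e-8):
    """Keep rows whose association P-value is below the threshold."""
    return [row for row in rows if row["P"] < threshold]


# Placeholder rows for illustration only.
example = io.StringIO(
    "SNP CHR BP ALLELE1 ALLELE0 A1FREQ INFO BETA SE P\n"
    "rs0000001 1 1000 A G 0.25 0.99 0.012 0.002 1.0e-9\n"
    "rs0000002 2 2000 C T 0.40 0.95 0.001 0.002 0.6\n"
)
rows = read_sumstats(example)
hits = genome_wide_significant(rows)
```

For the real files, which hold results for many variants, the same logic would be applied while streaming the file line by line instead of loading it whole.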
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary statistics for genetic loci and the fatty liver index in the UK Biobank (UKBB) cohort. The analysed sample comprised 408,870 unrelated individuals from the UKBB who self-reported as White-British and had similar genetic ancestry based on a principal component analysis of genotypes. This research has been conducted using data obtained via UKBB Access Application number 52728.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variable mapped to malnutrition, frailty and sarcopenia.
Table shows 511 KCNJ1 variants available in the whole-exome sequencing (WES) database, which contains data from ~200k participants from the UK Biobank [61,62]. Columns represent the chromosomal location and the nucleotide change for each substitution, as well as their minor and alternative allele frequencies (denoted as maf and aaf, respectively). (XLSX)
Levels of sociability are continuously distributed in the general population, and decreased sociability represents an early manifestation of several brain disorders. Here, we investigated the genetic underpinnings of sociability in the population.
Main questions of our research:
1. Are there common genetic variants that are associated with sociability in the general population?
2. Are genetic variants that are associated with sociability also associated with neuropsychiatric disorders?
Type of data uploaded in this repository: The UK Biobank project (see https://www.ukbiobank.ac.uk/) is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The database is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases. The raw data that this project is based on come from the publicly available UK Biobank dataset, which is very large and is therefore not provided here. Here we provide only the results of our analysis, which is also described at https://www.biorxiv.org/content/10.1101/781195v2 and is currently in revision at a scientific journal. In the dataset you will find the association of 9,327,396 genetic variants with the phenotype sociability. This dataset is not suitable for opening in Excel and is best opened on a cluster computer or with specific software.
Subjects
The UK Biobank (UKBB) is a major population-based cohort from the United Kingdom that includes individuals aged between 37 and 73 years. We constructed a sociability measure based on the aggregation of scores per participant on four questions from the UKBB database that link to sociability, including (1) a question about the frequency of friend/family visits, (2) a question on the number and type of social venues that are visited, (3) a question about worrying after social embarrassment and (4) a question about feeling lonely, leading to a sociability score ranging from 0 to 4. Participants were excluded if they had somatic problems that could be related to social withdrawal (BMI < 15 or BMI > 40, narcolepsy (all the time), stroke, severe tinnitus, deafness or brain-related cancers) or if they answered “No friends/family outside household”, “Do not know” or “Prefer not to answer” to any of the questions.
SNP genotyping and quality control
Details about the available genome-wide genotyping data for UKBB participants have been reported previously (PMID: 30305743). We used third-release genotyping data (see https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=100319). Briefly, 49,950 participants were genotyped using the UK BiLEVE Axiom Array and 438,427 participants were genotyped using the UK Biobank Axiom Array. Genotypes were imputed into the dataset using the Haplotype Reference Consortium (HRC) and the UK10K haplotype resource. To account for ethnicity, we included only those individuals who identified themselves as "white" by self-report and plotted the Principal Components (PC) provided by the UKBB, excluding individuals considered to be outliers according to PCs 1 and 2. Genetic relatedness calculated with KING kinship and provided by the UKBB (https://kenhanscombe.github.io/ukbtools/articles/explore-ukb-data.html ; http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UKBiobank_genotyping_QC_documentation-web.pdf) was used to identify first- and second-degree relatives. Subsequently, ‘families’ (i.e. clusters of related individuals above an IBD>0.125 threshold) were created and only one individual from each of these created ‘families’ was included in the analysis. If self-reported sex and SNP-based sex differed, individuals were excluded from further analysis. Single nucleotide polymorphisms (SNPs) with minor allele frequency <0.005, Hardy-Weinberg equilibrium test P value <1e−6, missing genotype rate >0.05, or imputation quality of INFO <0.8 were excluded. In the current study, all analyses are based on 342,461 participants of European ancestry for whom both genotype data and sociability scores were available.
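The variant-level exclusion criteria listed above can be written as a single predicate. The thresholds are taken from the text; the function and argument names are hypothetical, and this sketch does not reproduce the PLINK-based pipeline that was actually used.

```python
def passes_variant_qc(maf, hwe_p, missing_rate, info):
    """Return True if a SNP survives the four exclusion criteria.

    A SNP is excluded when any of these hold:
      minor allele frequency < 0.005
      Hardy-Weinberg equilibrium test P value < 1e-6
      missing genotype rate > 0.05
      imputation quality (INFO) < 0.8
    """
    return (
        maf >= 0.005
        and hwe_p >= 1e-6
        and missing_rate <= 0.05
        and info >= 0.8
    )
```

Values that sit exactly on a threshold are retained, since the text excludes only variants strictly beyond each cut-off.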
Genome-wide association analysis
Genome-wide association analysis with the imputed marker dosages was performed in PLINK1.9, using a linear regression model with the sociability measure as the dependent variable and including sex, age, the first 10 PCs, assessment center, and genotype batch as covariates. SNPs were considered significantly associated if they had a p-value < 5e-8. Associated loci were considered independent of each other at an r² threshold of 0.6, and lead SNPs were defined as the SNP with the smallest association p-value at an r² threshold of 0.1, using a 250kb window. The summary statistics come from the plink2 linear regression analysis.
https://twinsuk.ac.uk/researchers/access-data-and-samples/request-access/
The TwinsUK cohort (https://twinsuk.ac.uk/), set up in 1992, is a major volunteer-based genomic epidemiology resource with longitudinal deep genomic and phenomics data from over 15,000 adult twins (18+) from across the UK who are highly engaged and recallable. The cohort is predominantly female (80%) for historical reasons. It is one of the most deeply characterised adult twin cohorts in the world, providing a rich platform for scientists to research health and ageing longitudinally. There are over 700,000 biological samples stored and data collected on twins with repeat measures at multiple timepoints. Extremely large datasets (billions of data points) have been generated for each TwinsUK participant over 30 years, including phenotypes from questionnaires, multiple clinical visits, and record linkage, and genetic and ‘omic data from biological samples. TwinsUK ensures derived datasets from raw data are returned by collaborators to enhance the resource. TwinsUK also holds a wide range of laboratory samples, including plasma, serum, DNA, faecal microbiome and tissue (skin, fat, colonic biopsies) within HTA-regulated facilities at King's College London.
More recently, postal and at-home collection strategies have allowed sample collections from frail twins, from our whole cohort for COVID-19 studies, and from new twin recruits. The cohort is recallable either on a four-year longitudinal sweep visit or based on diagnosis or genotype.
More than 1,000 data access collaborations and 250,000 samples have been shared with external researchers, resulting in over 800 publications since 2012.
TwinsUK is now working to link to twins’ official health, education and environmental records for health research purposes, which will further enhance the resource.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Anthropometric data, the prevalence of metformin use, back pain status, BMI, and physical activity levels among people with type 2 diabetes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Linkage Disequilibrium (LD) matrices for six ancestry groups from the UK Biobank.
LD matrices record the SNP-by-SNP correlations in a given sample of individuals from the general population. In this case, we threshold the matrices so that we only record the correlations between variants in the same LD block (defined by LDetect). The continental ancestry groups are defined by the Pan-UKB initiative as:
EUR = European ancestry (N=362446)
CSA = Central/South Asian ancestry (N=8284)
AFR = African ancestry (N=6255)
EAS = East Asian ancestry (N=2700)
MID = Middle Eastern ancestry (N=1567)
AMR = Admixed American ancestry (N=987)
The sample sizes here are restricted to unrelated individuals in the UK Biobank. The matrices were computed using magenpy and quantized to int8 data type for better compressibility. The standard matrices (EUR.tar.gz, AFR.tar.gz, ...) contain pairwise correlations for 1.4 million HapMap3+ variants. For European samples, we also provide LD matrices that record pairwise correlations for up to 18 million variants (EUR_18m_variants.tar.gz).
For more details on how these matrices were computed, please consult our manuscript:
Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms
Shadi Zabad, Chirayu Anant Haryan, Simon Gravel, Sanchit Misra, Yue Li
To access these matrices, consult the codebase of magenpy, our custom python package with special data structures for processing these LD matrices.
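The int8 quantization mentioned above can be illustrated with a minimal sketch: a correlation in [-1, 1] is scaled to the signed 8-bit range and rounded, then scaled back on reading. The symmetric scaling by 127 and the function names are assumptions made for illustration; they are not taken from the magenpy codebase, which should be consulted for the scheme actually used.

```python
INT8_MAX = 127  # largest magnitude used in this symmetric scheme


def quantize_correlation(r):
    """Map a correlation in [-1, 1] to an integer in [-127, 127]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return round(r * INT8_MAX)


def dequantize_correlation(q):
    """Recover an approximate correlation from its int8 code."""
    return q / INT8_MAX


# Round trip for an illustrative value; the error is bounded by the
# quantization step.
approx = dequantize_correlation(quantize_correlation(0.5))
```

Storing one byte per entry instead of a 4- or 8-byte float is what makes the matrices more compressible, at the cost of a small rounding error per correlation.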
Background
This paper introduces the UK Biobank (UKB) second mental health questionnaire (MHQ2), describes its design, the respondents and some notable findings. UKB is a large cohort study with over 500,000 volunteer participants aged 40–69 years when recruited in 2006–2010. It is an important resource of extensive health, genetic and biomarker data. Enhancements to UKB enrich the data available. MHQ2 is an enhancement designed to enable and facilitate research with psychosocial and mental health aspects.
Methods
UKB sent participants a link to MHQ2 by email in October-November 2022. The MHQ2 was designed by a multi-institutional consortium to build on MHQ1. It characterises lifetime depression further, adds data on panic disorder and eating disorders, repeats ‘current’ mental health measures and updates information about social circumstances. It includes established measures, such as the PHQ-9 for current depression and CIDI-SF for lifetime panic, as well as bespoke questions. Algorithms and R code were developed to facilitate analysis.
Results
At the time of analysis, MHQ2 results were available for 169,253 UKB participants, of whom 111,275 had also completed the earlier MHQ1. Characteristics of respondents and the whole UKB cohort are compared. The major phenotypes are lifetime: depression (18%); panic disorder (4.0%); a specific eating disorder (2.8%); and bipolar affective disorder I (0.4%). All mental disorders are found less with older age and also seem to be related to selected social factors. In those participants who answered both MHQ1 (2016) and MHQ2 (2022), current mental health measures showed that fewer respondents have harmful alcohol use than in 2016 (relative risk 0.84), but current depression (RR 1.07) and anxiety (RR 0.98) have not fallen, as might have been expected given the relationship with age. We also compare lifetime concepts for test-retest reliability.
Conclusions
There are some drawbacks to UKB due to its lack of population representativeness, but where the research question does not depend on this, it offers exceptional resources that any researcher can apply to access. This paper has just scratched the surface of the results from MHQ2 and how this can be combined with other tranches of UKB data, but we predict it will enable many future discoveries about mental health and health in general.
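The relative risks quoted in the Results above compare the proportion of respondents meeting a criterion in 2022 with the proportion in 2016. A minimal sketch of that ratio follows; the function name is hypothetical, the counts in the usage line are invented for illustration, and the paper may have used a different estimator (for example one that accounts for the paired design).

```python
def relative_risk(cases_later, n_later, cases_earlier, n_earlier):
    """Ratio of the later proportion to the earlier proportion."""
    if n_later <= 0 or n_earlier <= 0 or cases_earlier <= 0:
        raise ValueError("denominators and earlier case count must be positive")
    return (cases_later / n_later) / (cases_earlier / n_earlier)


# Invented counts: 84 of 1,000 later versus 100 of 1,000 earlier.
rr = relative_risk(84, 1000, 100, 1000)
```

A value below 1 indicates that the outcome is less common at the later time point, and a value above 1 that it is more common.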
Background: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.
Methods: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study’s Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores.
Results: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.
Conclusions: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.
Background: Aging is an inescapable process, but it can be slowed down, particularly facial aging. Sex and growth hormones have been shown to play an important role in the process of facial aging. We investigated this association further, using a two-sample Mendelian randomization study.
Methods: We analyzed genome-wide association study (GWAS) data from the UK Biobank database comprising facial aging data from 432,999 samples, using two-sample Mendelian randomization. In addition, single-nucleotide polymorphism (SNP) data on sex hormone-binding globulin (SHBG) and sex steroid hormones were obtained from a GWAS in the UK Biobank [SHBG, N = 189,473; total testosterone (TT), N = 230,454; bioavailable testosterone (BT), N = 188,507; and estradiol (E2), N = 2,607]. The inverse-variance weighted (IVW) method was the main algorithm used in this study, and random-effects models were used in cases of heterogeneity. To avoid errors caused by relying on a single algorithm, we selected MR-Egger, weighted median, and weighted mode as supplementary algorithms. Horizontal pleiotropy was assessed based on the intercept in the MR-Egger regression. The leave-one-out method was used for sensitivity analysis.
Results: SHBG plays a promoting role, whereas sex steroid hormones (TT, BT, and E2) play an inhibitory role in facial aging. Growth hormone (GH) and insulin-like growth factor-1 (IGF-1) levels had no significant effect on facial aging, which is inconsistent with previous findings in vitro.
Conclusion: Regulating the levels of SHBG, BT, TT, and E2 may be an important means to delay facial aging.
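As an illustration of the IVW method named in the Methods above, the following is a minimal sketch of the standard fixed-effect IVW estimator computed from per-SNP summary statistics. It is not the study's code: the function name `ivw_estimate` and the three-SNP input values are hypothetical, and the random-effects variant and the supplementary methods (MR-Egger, weighted median, weighted mode) are not shown.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) causal estimate.

    Each SNP j contributes a Wald ratio beta_outcome[j] / beta_exposure[j],
    weighted by beta_exposure[j]**2 / se_outcome[j]**2.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    std_error = math.sqrt(1.0 / total_weight)
    return estimate, std_error

# Hypothetical summary statistics for three SNPs (not taken from the study).
bx = [0.10, 0.20, 0.15]    # SNP-exposure effects
by = [0.05, 0.10, 0.075]   # SNP-outcome effects
se = [0.01, 0.02, 0.015]   # standard errors of the SNP-outcome effects
estimate, std_error = ivw_estimate(bx, by, se)
```

Because every SNP in this toy input has the same Wald ratio, the pooled estimate equals that ratio; with real data the weights favour SNPs whose outcome associations are estimated more precisely.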
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on the EUR cohort of the Pan-UKB LD data. It is intended to be used as input to the GhostKnockoffGWAS pipeline.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with one row per subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure, represented by PM2.5), and these relationships are then used to simulate death events in each year. Details on the modelling aspects are provided in the article.
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are intended only for illustrative purposes, and they must not be used to test other research hypotheses.
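The second step described above (simulating yearly death events from estimated risk relationships) can be sketched roughly as follows. This is only an illustration in Python under strong simplifying assumptions, not the authors' R code: it assumes a constant baseline hazard, a single predictor (annual PM2.5), and a yearly death probability of 1 - exp(-hazard). The function name `simulate_death_year` and all numeric values are hypothetical; the actual procedure is documented in the GitHub repo and the article.

```python
import math
import random

def simulate_death_year(pm25_by_year, baseline_hazard, log_hr_per_unit, rng):
    """Return the index of the first year with a simulated death, or None.

    In each year the hazard is baseline_hazard * exp(log_hr_per_unit * pm25),
    and the probability of dying within the year is 1 - exp(-hazard).
    """
    for year_index, pm25 in enumerate(pm25_by_year):
        hazard = baseline_hazard * math.exp(log_hr_per_unit * pm25)
        p_death = 1.0 - math.exp(-hazard)
        if rng.random() < p_death:
            return year_index
    return None  # subject survives the whole follow-up (censored)

rng = random.Random(42)  # fixed seed so the simulation is reproducible
# Hypothetical inputs: annual PM2.5 levels, baseline hazard, log hazard ratio.
exposure = [10.2, 9.8, 9.5, 9.1, 8.7]
deaths = [
    simulate_death_year(exposure, 0.01, math.log(1.05), rng)
    for _ in range(1000)
]
n_deaths = sum(d is not None for d in deaths)
```

Each simulated subject either dies in one of the follow-up years or is censored at the end, which mirrors the year-by-year sampling of death events described in the text.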