Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: The UK Biobank (UKB) is a resource that includes detailed health-related data on about 500,000 individuals and is available to the research community. However, several obstacles limit immediate analysis of the data: data files vary in format, may be very large, and have numerical codes for column names. Results: ukbtools removes all the upfront data wrangling required to get a single dataset for statistical analysis. All associated data files are merged into a single dataset with descriptive column names. The package also provides tools to assist in quality control by exploring the primary demographics of subsets of participants; to query disease diagnoses for one or more individuals and estimate disease frequency relative to a reference variable; and to retrieve genetic metadata. Conclusion: Having a dataset with meaningful variable names, a set of UKB-specific exploratory data analysis tools, disease query functions, and a set of helper functions to explore and write genetic metadata to file will rapidly enable UKB users to undertake their research.
LD blocks based on 20,000 European individuals from the UK Biobank (split by chromosome), with about 1.5 million SNPs based on HapMap3 and MEGA chips
Results for 2,230 UK Biobank binary and continuous traits. We applied the gene-based tests (Gene1D, Gene3D, GeneScan1D and GeneScan3D) to 1,403 UK Biobank binary phecodes and 827 continuous phenotypes (797 continuous traits + 30 biomarkers) using GWAS summary statistics on 28 million imputed variants. The results are in 3 different zipped folders: 'GeneScan3D_UKBB_1403binary_results.zip', 'GeneScan3D_UKBB_797continuous_results.zip' and 'GeneScan3D_UKBB_30biomarkers_results.zip'. A list of all 2,230 binary and continuous phenotypes is available in the Excel file 'UKBB_phenotype_description.xlsx'. Reference: Ma, S., Dalgleish, J. L., Lee, J., Wang, C., Liu, L., Gill, R., Buxbaum, J. D., Chung, W., Aschard, H., Silverman, E. K., Cho, M. H., He, Z. and Ionita-Laza, I. "Improved gene-based testing by integrating long-range chromatin interactions and knockoff statistics", 2021
This dataset has been superseded. Updated version available at DOI: 10.5523/bris.1ovaau5sxunp2cv8rcy88688v
This is a full description of the quality control procedure undertaken and the derived files produced by the MRC-IEU associated with the full UK Biobank (July 2017) genetic data.
Results of the BIGKnock analyses from the manuscript 'Fine-mapping gene-based associations via knockoff analysis of biobank-scale data with applications to UK Biobank'.
Pleiotropy and genetic correlation are widespread features in GWAS, but they are often difficult to interpret at the molecular level. Here, we perform GWAS of 16 metabolites clustered at the intersection of amino acid catabolism, glycolysis, and ketone body metabolism in a subset of UK Biobank. We utilize the well-documented biochemistry jointly impacting these metabolites to analyze pleiotropic effects in the context of their pathways. Among the 213 lead GWAS hits, we find a strong enrichment for genes encoding pathway-relevant enzymes and transporters. We demonstrate that the effect directions of variants acting on biology between metabolite pairs often contrast with those of upstream or downstream variants as well as the polygenic background. Thus, we find that these outlier variants often reflect biology local to the traits. Finally, we explore the implications for interpreting disease GWAS, underscoring the potential of unifying biochemistry with dense metabolomics data to understa...
Levels of sex differences for human body size and shape phenotypes are hypothesized to have adaptively reduced following the agricultural transition as part of an evolutionary response to relatively more equal divisions of labor and new technology adoption. In this study, we tested this hypothesis by studying genetic variants associated with five sexually differentiated human phenotypes: height, body mass, hip circumference, body fat percentage, and waist circumference. We first analyzed genome-wide association study (GWAS) results for UK Biobank individuals (~197,000 females and ~167,000 males) to identify a total of 119,023 single nucleotide polymorphisms (SNPs) significantly associated with at least one of the studied phenotypes in females, males, or both sexes (P < 5 × 10⁻⁸). From these loci we then identified 3,016 SNPs (2.5%) with significant differences in the strength of association between the female- and male-specific GWAS results at a low false-discovery rate (FDR < 0.001). Genes w...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary-level data generated by Genomics plc as presented in: Diogo, D. et al. Phenome-wide association studies across large population cohorts support drug target validation. Nat. Commun. 9, 4285 (2018). https://doi.org/10.1038/s41467-018-06540-3
If you have any questions or comments regarding these files, please contact Genomics plc at research@genomicsplc.com
These analyses were carried out using the interim UK Biobank imputation data release. Analyses were restricted to a subset of "white-British" unrelated samples with a maximum sample size of 112,337 individuals.
Case-control phenotypes were defined based on categorical data fields as listed in the accompanying file. Quantitative phenotypes were either rank-normalised before analysis, or beta/se values were standardised after analysis using the variance of the phenotype. The normalisation value is indicated in the accompanying file.
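As an illustration of what rank-normalising a quantitative phenotype can look like, here is a minimal Python sketch of a rank-based inverse normal transform. The description above does not say which variant of the transform was used, so the function name `rank_inverse_normal`, the `offset` of 0.5, and the average-rank handling of ties are all assumptions made for this example.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.5):
    """Map values to standard normal quantiles by rank.

    Assumption: ties share their average rank and the quantile for rank r
    is (r - offset) / n. The source does not state the exact scheme used.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j across a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / n) for r in ranks]


if __name__ == "__main__":
    # Illustrative values only, not real phenotype data.
    print(rank_inverse_normal([10.0, 2.0, 5.0]))
```

After such a transform, the reported effect sizes are in standard deviation units of the transformed phenotype, which is why the accompanying file records how each trait was normalised.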
All analyses included age at assessment, sex, genotyping chip, and 10 principal components as covariates.
We used PLINK 1.9 linear or logistic regression as appropriate. For chromosome X variants, males were treated as having 0 or 2 alternative alleles.
The results are not adjusted for genomic control.
CHR - Chromosome
SNP - Variant rsID
ALT - Alternative allele (effect allele)
REF - Reference allele (non-effect allele)
BP - Position in base pairs (b37, 1-based)
NMISS - Number of samples with non-missing genotypes
BETA - Effect size (log odds ratio or standardised effect size)
SE - Standard error
P - P-value
F_MISS - Genotype missing rate
P_hwe - Hardy-Weinberg p-value
MAF - ALT allele frequency
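The column layout above can be read with a few lines of Python. The sketch below assumes the files are whitespace-delimited with a header row, which the description does not state; the helper names `parse_sumstats` and `odds_ratio_with_ci` and the example row are hypothetical. The odds ratio conversion only applies to case-control phenotypes, where BETA is a log odds ratio.

```python
import math

# Columns to convert from text, using the names from the file description.
NUMERIC_FIELDS = {
    "BP": int, "NMISS": int, "BETA": float, "SE": float,
    "P": float, "F_MISS": float, "P_hwe": float, "MAF": float,
}

# Made-up row for illustration only; not taken from the real files.
EXAMPLE = """CHR SNP ALT REF BP NMISS BETA SE P F_MISS P_hwe MAF
1 rs0000001 A G 12345 112000 0.05 0.01 5.7e-07 0.001 0.42 0.23
"""


def parse_sumstats(lines):
    """Yield one dict per variant (assumes whitespace-delimited text with a header)."""
    rows = iter(lines)
    header = next(rows).split()
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        record = dict(zip(header, line.split()))
        for name, cast in NUMERIC_FIELDS.items():
            if name in record:
                record[name] = cast(record[name])
        yield record


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log odds ratio and its standard error to an OR with an approximate 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    for variant in parse_sumstats(EXAMPLE.splitlines()):
        print(variant["SNP"], odds_ratio_with_ci(variant["BETA"], variant["SE"]))
```

For the real files, the same generator could be fed an open file handle instead of `EXAMPLE.splitlines()`, so that variants are processed one at a time rather than loaded into memory.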
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sex-stratified GWAS can help shed light on sexual differences in genetic architecture. In Bernabeu et al (2021) we fit sex-stratified linear mixed models (using DISSECT) across a total of 530 phenotypes to assess the effects of sex on genetic effect estimates, and compared estimates between males and females in a search for genetic variants that presented significant differences in association to the traits considered. Here, the summary statistics of said efforts, pertaining to clinical binary traits, are included (note: does not include UK Biobank cancer traits - these are found in DataShare item pertaining to non-clinical binary traits). Each file contains the results for a single clinical binary trait, as stated in the file name, using its corresponding UK Biobank trait code. Trait descriptions, including their respective UK Biobank codes, are stated in the 'trait_description.tsv' file. For each trait (each .gz file), GWAS summary statistics obtained for over 4 million genetic variants across the genome (both autosomal, and X chromosome, MAF 10% filtered) and circa 450K individuals, as well as the results of the t-test comparing genetic effect estimates between the sexes, are included.
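To make the comparison of effect estimates between the sexes concrete, the following Python sketch computes a difference statistic for one variant from sex-specific betas and standard errors. It is a generic form of such a test, not necessarily the exact statistic used by Bernabeu et al.: the function name `sex_difference_test`, the optional `correlation` term (default 0, meaning the two estimates are treated as independent), and the normal approximation for the p-value are assumptions of this example.

```python
import math


def sex_difference_test(beta_f, se_f, beta_m, se_m, correlation=0.0):
    """Return (statistic, two-sided p) for the difference beta_f - beta_m.

    Assumptions: `correlation` is a hypothetical correlation between the two
    estimates, and the p-value uses a normal approximation, which is
    reasonable for a t statistic when sample sizes are very large.
    """
    variance = se_f ** 2 + se_m ** 2 - 2.0 * correlation * se_f * se_m
    statistic = (beta_f - beta_m) / math.sqrt(variance)
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the released files.
    print(sex_difference_test(0.20, 0.03, 0.00, 0.04))
```

A variant with identical estimates in both sexes gives a statistic of 0 and a p-value of 1, while a large difference relative to the combined standard error gives a small p-value.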
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the manuscript 'Associations between alcohol use and accelerated biological ageing'. Specifically: Genome Wide Association Study of brain age.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genome-wide association study summary statistics of email contact and Mental Health Questionnaire participation in UK Biobank. Data in support of the manuscript: 'Factors associated with sharing email information and mental health survey participation in large population cohorts'.
Abstract
Background: People who opt to participate in scientific studies tend to be healthier, wealthier, and more educated than the broader population. While selection bias does not always pose a problem for analysing the relationships between exposures and diseases or other outcomes, it can lead to biased effect size estimates. Biased estimates may weaken the utility of genetic findings because the goal is often to make inferences in a new sample (such as in polygenic risk score analysis).
Methods: We used data from UK Biobank, Generation Scotland, and Partners Biobank and conducted phenotypic and genome-wide association analyses on two phenotypes that reflected mental health data availability: (1) whether participants were contactable by email for follow-up and (2) whether participants responded to follow-up surveys of mental health.
Results: In UK Biobank, we identified nine genetic loci associated (P < 5 × 10⁻⁸) with email contact and 25 loci associated with mental health survey completion. Both phenotypes were positively genetically correlated with higher educational attainment and better health and negatively genetically correlated with psychological distress and schizophrenia. One SNP association replicated along with the overall direction of effect of all association results.
Conclusions: Recontact availability and follow-up participation can act as further genetic filters for data on mental health phenotypes.
To identify genetic variation underlying atrial fibrillation, the most common cardiac arrhythmia, we performed a genome-wide association study of >1,000,000 people, including 60,620 atrial fibrillation cases and 970,216 controls. We identified 142 independent risk variants at 111 loci and prioritized 151 functional candidate genes likely to be involved in atrial fibrillation. Many of the identified risk variants fall near genes where more deleterious mutations have been reported to cause serious heart defects in humans (GATA4, MYH6, NKX2-5, PITX2, TBX5)1, or near genes important for striated muscle function and integrity (for example, CFL2, MYH7, PKP2, RBM20, SGCG, SSPN). Pathway and functional enrichment analyses also suggested that many of the putative atrial fibrillation genes act via cardiac structural remodeling, potentially in the form of an ‘atrial cardiomyopathy’2, either during fetal heart development or as a response to stress in the adult heart. The data format is .tbl
This project aims to leverage the power of UK Biobank to detect rare genetic variants associated with lung function.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Drug treatment for nociceptive musculoskeletal pain (NMP) follows a three-step analgesic ladder, starting from non-steroidal anti-inflammatory drugs (NSAIDs), followed by weak or strong opioids until the pain is under control. Here, we conducted a genome-wide association study (GWAS) of a binary phenotype comparing NSAID users and opioid users as a proxy of treatment response to NSAIDs using data from the UK Biobank. We aim to find the common genetic variants associated with pain treatment response in the general population.
Type of data uploaded in this repository
UK Biobank is a large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants (https://www.ukbiobank.ac.uk/). The database is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases. As the raw data are large and only available upon application to UKB, we provide only the results of our analysis, which is also described on medrxiv and is currently in revision at a scientific journal. The dataset contains the associations of 9,435,994 SNPs with the pain treatment response (PTR) phenotype. The file is not suitable for opening in Excel and is best handled on a cluster computer or with specific software.
Subjects
The UK Biobank is a general population cohort with over 0.5 million participants aged 40–69 recruited across the United Kingdom (UK). We derived a phenotype as a proxy for the pain treatment response to NSAIDs by using recently released primary care (general practitioners', GPs') data, which contain longitudinal structured diagnosis and prescription data. To define the PTR phenotype, we first extracted all NMP treatments and diagnoses from the GP data. NMP diagnoses were primarily selected from the chapters on musculoskeletal and connective tissue diseases and relevant symptoms or signs from other chapters in the Read codes (versions 2 and 3). See Supplementary data 1 on medrxiv for the diagnosis codes included in this study. Secondly, pain prescriptions (NSAID and opioid) were extracted from the GP data using the British National Formulary (BNF), the dictionary of medicines and devices (dm+d), and Read codes (version 2). An overview of the extracted medication codes is provided in Supplementary data 2 on medrxiv. Only participants with an NMP diagnosis record and a pain prescription record occurring on the same date were included for analysis, to ensure that we would only include pain treatment for NMP.
Phenotype
Based on the NMP and pain prescription information from the UK Biobank, a dichotomous score was used for the binary (case/control) PTR phenotype: NSAID users were defined as controls and opioid users as cases. Two additional quality control (QC) steps were applied. First, participants with only one treatment event were removed, so that only participants with relatively long-term treatment were included. Second, a chronological check was applied to the first prescription of each ladder step to ensure that the treatment ladder was correctly followed, i.e., initial NSAID use was followed by weak or strong opioids. Participants who were not treated according to this order were removed.
SNP genotyping and quality control
Genotyping procedures have been described in detail elsewhere [PMID: 30305743]. The third-release genotyping data were used for analysis (see https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=100319). Participants passing quality control were included for analysis. QC steps for the samples included removal of participants with (1) inconsistent self-reported and genetically determined sex, (2) a missing genotype rate of more than 0.1, or (3) putative sex-chromosome aneuploidy. Participants were also excluded from the analysis if they were flagged as outliers for heterozygosity or missingness, were not of white British ancestry based on genotype, or had missing covariate data. When we fitted the linear mixed model in GCTA, the software reported that the number of closely related participants was low. We therefore did not remove related individuals from the sample. Routine QC steps for genetic markers on autosomes included removal of single nucleotide polymorphisms (SNPs) with (1) an imputation quality score less than 0.8, (2) a minor allele frequency (MAF) less than 0.005, (3) a Hardy-Weinberg equilibrium (HWE) test P-value less than 1 × 10⁻⁶, or (4) a genotyping call rate less than 0.95.
Genome-wide association analysis
A GWAS for the binary PTR phenotype was conducted using a linear function in GCTA [38] for markers on the autosomal chromosomes, adjusting for age, sex, BMI, depression history, smoking status, drinking frequency, assessment center, genotyping array, and the first ten principal components (PCs). The following variables from the UK Biobank data set...
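The four marker QC thresholds listed under "SNP genotyping and quality control" can be expressed as a simple filter. The Python sketch below is an illustration only: the `Marker` record, its field names, and `passes_marker_qc` are hypothetical, and the authors' actual pipeline is not described at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    """Hypothetical per-SNP QC record; field names are assumptions."""
    snp_id: str
    info: float       # imputation quality score
    maf: float        # minor allele frequency
    hwe_p: float      # Hardy-Weinberg equilibrium test P-value
    call_rate: float  # genotyping call rate


def passes_marker_qc(marker, min_info=0.8, min_maf=0.005,
                     min_hwe_p=1e-6, min_call_rate=0.95):
    """Keep a marker unless it falls below any threshold stated in the description."""
    return (
        marker.info >= min_info
        and marker.maf >= min_maf
        and marker.hwe_p >= min_hwe_p
        and marker.call_rate >= min_call_rate
    )


if __name__ == "__main__":
    # Made-up markers for illustration only.
    markers = [
        Marker("snp_a", info=0.95, maf=0.12, hwe_p=0.3, call_rate=0.99),
        Marker("snp_b", info=0.70, maf=0.12, hwe_p=0.3, call_rate=0.99),
    ]
    print([m.snp_id for m in markers if passes_marker_qc(m)])
```

Because the description removes SNPs with values "less than" each threshold, a marker sitting exactly on a threshold is kept in this sketch.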
Summary-level CAD GWAS data generated by Genomics plc as presented in: Riveros-Mckay F. et al. An integrated polygenic tool substantially enhances coronary artery disease prediction. Circulation: Genomics and Precision Medicine (in press).
If you have any questions or comments regarding these files, please contact Genomics plc at research@genomicsplc.com
NOTES
-----------------------------
These analyses were carried out using the full UK Biobank imputation data release (v3b). Analyses were restricted to a subset of UK Biobank, described as “Group I” in the published paper. Group I, “no PCE/QRISK3 available”, included 114,196 European-ancestry individuals with missing data that prevented PCE or QRISK3 calculation.
CAD case phenotypes were defined as described in the “Phenotype definitions” section of the paper’s Supplementary Materials, using both prevalent (pre-baseline) and incident (post-baseline) events.
All analyses included age at assessment, sex, genotyping chip, and 10 principal components as covariates.
We used PLINK 2.0 logistic regression. For chromosome X variants, males were treated as having 0 or 2 alternative alleles.
The results are not adjusted for genomic control.
DATA FILE CONTENT DESCRIPTION
-----------------------------
cpra - Variant ID in ‘CPRA’ format. Position reflects position in b37.
chrom - Chromosome
pos - Position in base pairs (b37, 1-based)
alt - Alternative allele (effect allele)
beta - Effect size (log odds ratio)
standard_error - Standard error of beta
minus_log10_p - Minus log(base 10) of P-value
ref - Reference allele (non-effect allele)
ncase - Number of cases
ncontrol - Number of controls
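Since this file stores significance as minus_log10_p rather than a raw P-value, a small conversion helper can be handy. The Python sketch below is illustrative: the function names are hypothetical, and the 5 × 10⁻⁸ threshold is the conventional genome-wide significance level, used here as an assumption rather than something stated in the file description.

```python
import math


def p_from_minus_log10(minus_log10_p):
    """Recover the P-value from its minus log10 representation."""
    return 10.0 ** (-minus_log10_p)


def is_genome_wide_significant(minus_log10_p, threshold=5e-8):
    """Compare on the log scale, which stays exact in cases where
    converting back to a raw P-value would underflow to zero.

    Assumption: `threshold` defaults to the conventional 5e-8.
    """
    return minus_log10_p > -math.log10(threshold)


if __name__ == "__main__":
    # Illustrative values only.
    for value in (3.0, 7.0, 8.0):
        print(value, p_from_minus_log10(value), is_genome_wide_significant(value))
```

Working on the log scale is the likely reason the file uses this representation: very strong associations can have P-values too small to store as ordinary floating-point numbers.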
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Colorectal cancer is one of the leading causes of cancer-related mortality in the world. Incidence and mortality are predicted to rise globally during the next several decades. When detected early, colorectal cancer is treatable with surgery and medications. This creates a need for prognostic and diagnostic biomarker development. Our study integrates machine learning models and protein network analysis to identify protein biomarkers for colorectal cancer. Our methodology leverages an extensive collection of proteome profiles from both healthy individuals and individuals with colorectal cancer. To identify potential biomarkers with high predictive ability, we used three machine learning models. To enhance the interpretability of our models, we quantified each protein’s contribution to the model’s predictions using SHapley Additive exPlanations (SHAP) values. Three classifiers (LASSO, XGBoost, and LightGBM) were evaluated for predictive performance, with the hyperparameters of each model tuned by grid search. LASSO achieved the highest AUC in the UK Biobank dataset (75%), while LightGBM and XGBoost reached AUCs of 69.61% and 71.42%, respectively. Using SHAP values, TFF3, LCN2, and CEACAM5 were found to be key biomarkers associated with cell adhesion and inflammation. Protein quantitative trait loci analyses provided further evidence for the involvement of TFF1, CEACAM5, and SELE in colorectal cancer, with possible connections to the PI3K/Akt and MAPK signaling pathways. By offering insights into colorectal cancer diagnostics and targeted therapeutics, our findings set the stage for further biomarker validation.
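For readers less familiar with the AUC figures quoted above, the following Python sketch shows one standard way to compute the area under the ROC curve from predicted scores: the fraction of case/control pairs in which the case receives the higher score, with ties counted as half. It is a generic definition, not the authors' evaluation code, and the function name `roc_auc` and the toy inputs are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    `labels` holds 1 for cases and 0 for controls; `scores` holds the
    model's predicted scores in the same order.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Toy inputs for illustration only.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to scores that rank cases no better than chance, and 1.0 to a perfect ranking. The pairwise loop is quadratic, so for large cohorts a rank-based implementation from an established library would be the practical choice.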
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Brain ageing is a highly variable, spatially and temporally heterogeneous process, marked by numerous structural and functional changes. These can cause discrepancies between individuals’ chronological age and the apparent age of their brain, as inferred from neuroimaging data. Machine learning models, and particularly Convolutional Neural Networks (CNNs), have proven adept in capturing patterns relating to ageing induced changes in the brain. The differences between the predicted and chronological ages, referred to as brain age deltas, have emerged as useful biomarkers for exploring those factors which promote accelerated ageing or resilience, such as pathologies or lifestyle factors. However, previous studies rely only on structural neuroimaging for predictions, overlooking potentially informative functional and microstructural changes. Here we show that multiple contrasts derived from different MRI modalities can predict brain age, each encoding bespoke brain ageing information. By using 3D CNNs and UK Biobank data, we found that 57 contrasts derived from structural, susceptibility-weighted, diffusion, and functional MRI can successfully predict brain age. For each contrast, different patterns of association with non-imaging phenotypes were found, resulting in a total of 191 unique, statistically significant associations. Furthermore, we found that ensembling data from multiple contrasts results in both higher prediction accuracies and stronger correlations to non-imaging measurements. Our results demonstrate that other 3D contrasts and modalities, which have not been considered so far for the task of brain age prediction, encode different information about the ageing brain. We envision our work as being the starting point for future investigations into the causal links underpinning the observed brain age deltas and non-imaging measurement associations. 
For instance, drug effects can be monitored, given that certain medications correlate with accelerated brain ageing. Furthermore, continued development of brain age models could facilitate their deployment in clinical trials for recruitment and monitoring, and in hospitals for diagnostic and screening tasks.
Data Description
This dataset contains the full correlation results with all nIDPs in the UK Biobank. These are presented in datasets split by sex into female and male subjects. For easier data manipulation, two smaller datasets have also been made available, containing just those correlations that pass the False Discovery Rate (FDR) threshold.
As experiments were also conducted for ensembles using multiple contrasts, similar datasets are provided for those.
Finally, global datasets are also provided. These are the concatenation of the associations contained in the Male and Female datasets.
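To illustrate how a set of correlation P-values can be reduced to those passing an FDR threshold, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure. The description does not say which FDR procedure or level was used, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the default `alpha` of 0.05 are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which tests pass FDR control.

    Assumption: Benjamini-Hochberg at level `alpha`; the procedure actually
    used for the released datasets is not stated in the description.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff passes (step-up rule).
    passed = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            passed[index] = True
    return passed


if __name__ == "__main__":
    # Illustrative P-values only.
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

Note the step-up behaviour: a P-value that misses its own rank-specific threshold still passes if a larger P-value further down the sorted list meets its threshold.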
Paper & Code
The original paper for this article can be accessed here:
To access the code for this project, please see the project GitHub repositories:
If using this work, please cite it based on the above paper, or using the following BibTeX:
@inproceedings{roibu2023brain,
title={Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes},
author={Roibu, Andrei-Claudiu and Adaszewski, Stanislaw and Schindler, Torsten and Smith, Stephen M and Namburete, Ana IL and Lange, Frederik J},
booktitle={2023 10th IEEE Swiss Conference on Data Science (SDS)},
pages={17--25},
year={2023},
organization={IEEE},
doi={10.1109/SDS57534.2023.00010}
}
Data Access
The data for this project is freely available upon application at the UK Biobank. For more information regarding the individual nIDPs, please access the UK Biobank Showcase website at: https://biobank.ctsu.ox.ac.uk/showcase/search.cgi
Funding
ACR is supported by EPSRC Grant EP/S024093/1, F. Hoffmann-La Roche AG and a 2021 Industrial Fellowship offered by the Royal Commission for the Exhibition of 1851. SMS is supported by a Wellcome Trust Collaborative Award 215573/Z/19/Z. AILN is grateful for support from the Academy of Medical Sciences under the Springboard Awards scheme (SBF005/1136), and the Bill and Melinda Gates Foundation. FJL is supported by a Wellcome Trust Collaborative Award (215573/Z/19/Z). The WIN is supported by core funding from the Wellcome Trust (203139/Z/16/Z). The computational aspects were supported by the Wellcome Trust (203141/Z/16/Z) and the NIHR Oxford BRC. Corresponding authors: ACR (andreiroibu@icloud.com), SA (stanislaw.adaszewski@roche.com) and AILN (ana.namburete@cs.ox.ac.uk).
The dataset contains results of two genome-wide association studies for age-related hearing impairment (ARHI)-related traits as described in the following publication: Wells HRR, Freidin MB, Zainul Abidin FN, Payton A, Dawes P, Munro KJ, Morton CC, Moore DR, Dawson SJ, Williams FMK. GWAS Identifies 44 Independent Associated Genomic Loci for Self-Reported Adult Hearing Difficulty in UK Biobank. Am J Hum Genet. 2019 Oct 3;105(4):788-802. doi: 10.1016/j.ajhg.2019.09.008. Epub 2019 Sep 26. Please cite the article if using this dataset. Two files provide summary statistics for discovery analysis of Hearing difficulty (HD) and Hearing aid use (HAID) phenotypes for individuals of European descent from UK Biobank.
Acknowledgements
The research was carried out using the UK Biobank Resource under application number 11516. H.R.R.W. is funded by a PhD Studentship Grant, S44, from Action on Hearing Loss. The study was also supported by funding from NIHR UCLH BRC Deafness and Hearing Problems Theme, a grant from MED_EL, and the NIHR Manchester Biomedical Research Centre. The English Longitudinal Study of Aging is jointly run by University College London, Institute for Fiscal Studies, University of Manchester, and National Centre for Social Research. Genetic analyses have been carried out by UCL Genomics and funded by the Economic and Social Research Council and the National Institute on Aging. Data governance was provided by the METADAC data access committee, funded by ESRC, Wellcome, and MRC (2015-2018: Grant Number MR/N01104X/1; 2018-2020: Grant Number ES/S008349/1). TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility, and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London. We would like to thank all the participants of UK Biobank, English Longitudinal Study of Aging, and TwinsUK.
Column headers
SNP - SNP rsID
CHR - Chromosome
BP - Genomic position (GRCh37 build)
ALLELE1 - Effect allele (coded as "1")
ALLELE0 - Reference allele (coded as "0")
A1FREQ - Effect allele frequency
INFO - Imputation quality
BETA - Effect size of effect allele
SE - Standard error of effect size
P - P-value of association (without GC correction)
BOLT-LMM summary statistics for 45 UK Biobank diseases/traits analyzed by TGFM. See README for more details.