Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides the following files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
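The second step can be illustrated with a minimal Python sketch (the repository's own scripts are written in R). It assumes a hypothetical constant yearly baseline hazard and a single hypothetical log hazard ratio for PM2.5; neither value comes from the fitted model, and the function name `simulate_deaths` is invented for this illustration.

```python
import numpy as np

def simulate_deaths(pm25, log_hr, baseline_hazard, seed=0):
    """Simulate the year of death for each subject under proportional hazards.

    pm25: array of shape (n_subjects, n_years) with annual exposure levels.
    log_hr: hypothetical log hazard ratio per unit of PM2.5 (an assumption).
    baseline_hazard: hypothetical constant yearly baseline hazard (an assumption).
    Returns the index of the year of death per subject, or -1 for survivors.
    """
    rng = np.random.default_rng(seed)
    pm25 = np.asarray(pm25, dtype=float)
    n_subjects, n_years = pm25.shape
    # Proportional hazards: the exposure scales the baseline hazard.
    hazard = baseline_hazard * np.exp(log_hr * pm25)
    # Probability of dying during a year, given survival to its start.
    p_death = 1.0 - np.exp(-hazard)
    death_year = np.full(n_subjects, -1)
    alive = np.ones(n_subjects, dtype=bool)
    for year in range(n_years):
        dies = alive & (rng.random(n_subjects) < p_death[:, year])
        death_year[dies] = year
        alive &= ~dies
    return death_year
```

Sampling year by year, conditional on survival so far, lets the risk follow the time-varying exposure of each subject.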
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Linkage Disequilibrium (LD) matrices for six ancestry groups from the UK Biobank.
LD matrices record the SNP-by-SNP correlations in a given sample of individuals from the general population. In this case, we threshold the matrices so that we only record the correlations between variants in the same LD block (defined by LDetect). The continental ancestry groups are defined by the Pan-UKB initiative as:
EUR = European ancestry (N=362446)
CSA = Central/South Asian ancestry (N=8284)
AFR = African ancestry (N=6255)
EAS = East Asian ancestry (N=2700)
MID = Middle Eastern ancestry (N=1567)
AMR = Admixed American ancestry (N=987)
The sample sizes here are restricted to unrelated individuals in the UK Biobank. The matrices were computed using magenpy and quantized to int8 data type for better compressibility. The standard matrices (EUR.tar.gz, AFR.tar.gz, ...) contain pairwise correlations for 1.4 million HapMap3+ variants. For European samples, we also provide LD matrices that record pairwise correlations for up to 18 million variants (EUR_18m_variants.tar.gz).
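As a rough illustration of what int8 quantization of correlations involves, the sketch below maps values in [-1, 1] to integers in [-127, 127] and back. The scale factor of 127 and the function names are assumptions made for this example; the exact scheme used by magenpy is not stated here and may differ.

```python
import numpy as np

def quantize_int8(corr):
    """Map correlations in [-1, 1] to int8 values in [-127, 127]."""
    corr = np.clip(np.asarray(corr, dtype=np.float64), -1.0, 1.0)
    return np.rint(corr * 127).astype(np.int8)

def dequantize_int8(q):
    """Recover approximate correlations from the int8 representation."""
    return q.astype(np.float32) / 127.0

r = np.array([-1.0, -0.5, 0.0, 0.25, 1.0])
q = quantize_int8(r)
r_approx = dequantize_int8(q)
```

Under this assumed scheme each entry takes one byte instead of four or eight, at the cost of a rounding error of at most about 0.004 per correlation.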
For more details on how these matrices were computed, please consult our manuscript:
Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms
Shadi Zabad, Chirayu Anant Haryan, Simon Gravel, Sanchit Misra, Yue Li
To access these matrices, consult the codebase of magenpy, our custom Python package with special data structures for processing these LD matrices.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains GWAS summary statistics for Standing Height in the UK Biobank.
The GWAS study used data from "White British" samples (N = 337225), which were randomly divided into 5 folds for the purposes of cross-validation. The upload contains, for each fold, GWAS summary statistics for the training and test set. The test summary statistics can be used to evaluate PRS models via pseudo-validation methods. Association testing was done with plink2.
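The random division into 5 folds can be sketched as follows. This is an illustration only: the seed, the function name `assign_folds`, and the reading that fold k serves as the test set while the remaining four folds form the training set are assumptions, not details taken from the publication.

```python
import numpy as np

def assign_folds(n_samples, n_folds=5, seed=0):
    """Assign each sample to one of n_folds folds of near-equal size, at random."""
    rng = np.random.default_rng(seed)
    folds = np.arange(n_samples) % n_folds  # balanced labels 0..n_folds-1
    rng.shuffle(folds)                      # randomise which sample gets which label
    return folds

folds = assign_folds(337225)
for k in range(5):
    test_idx = np.flatnonzero(folds == k)   # samples held out in fold k
    train_idx = np.flatnonzero(folds != k)  # samples used for the training GWAS
```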
The structure of the data is as follows:
For more details about the GWAS study, Quality Control (QC) criteria, or other information, please consult our publication:
Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009
If you use this data in your work, please cite the publication above.
https://saildatabank.com/data/apply-to-work-with-the-data/
The NURTuRE project was devised to create a national kidney biobank as recommended in the UK Renal Research Strategy 2016. Strategic Aims: To work towards achieving this, NURTuRE will:
Biological samples availability - Samples are available via the NURTuRE biobank - https://nurturebiobank.org/
DreamBiome is built entirely on real human data, with no synthetic or invented corpora.
The system integrates three research-backed data pillars:
Sources:
- DreamBank (Hall & Van de Castle / UC Santa Cruz): https://dreambank.net/
- Dryad RSOS “textual dream analysis” dataset: https://doi.org/10.5061/dryad.4t880
Source:
- PhysioNet Sleep-EDF Expanded Dataset: https://physionet.org/content/sleep-edfx/1.0.0/
Sources:
- Ohayon (2002). Epidemiology of insomnia. Sleep Medicine Reviews.
- Lane et al. (2019). UK Biobank sleep data.
- Jiang et al. (2015). East Asia insomnia prevalence. Journal of Sleep Research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistically significant signals from the metaUSAT analysis are shown in the left-hand column. The central column shows the association p-values for those SNPs in the six original GWAS analyses, with the direction of effect indicated by a + or - sign. Candidate genes are those selected from the prioritised genes (using the four mapping strategies described previously for all GWAS-discovered loci) or genes in proximity as identified within the UCSC genome browser.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Brain ageing is a highly variable, spatially and temporally heterogeneous process, marked by numerous structural and functional changes. These can cause discrepancies between individuals’ chronological age and the apparent age of their brain, as inferred from neuroimaging data. Machine learning models, and particularly Convolutional Neural Networks (CNNs), have proven adept in capturing patterns relating to ageing induced changes in the brain. The differences between the predicted and chronological ages, referred to as brain age deltas, have emerged as useful biomarkers for exploring those factors which promote accelerated ageing or resilience, such as pathologies or lifestyle factors. However, previous studies rely only on structural neuroimaging for predictions, overlooking potentially informative functional and microstructural changes. Here we show that multiple contrasts derived from different MRI modalities can predict brain age, each encoding bespoke brain ageing information. By using 3D CNNs and UK Biobank data, we found that 57 contrasts derived from structural, susceptibility-weighted, diffusion, and functional MRI can successfully predict brain age. For each contrast, different patterns of association with non-imaging phenotypes were found, resulting in a total of 191 unique, statistically significant associations. Furthermore, we found that ensembling data from multiple contrasts results in both higher prediction accuracies and stronger correlations to non-imaging measurements. Our results demonstrate that other 3D contrasts and modalities, which have not been considered so far for the task of brain age prediction, encode different information about the ageing brain. We envision our work as being the starting point for future investigations into the causal links underpinning the observed brain age deltas and non-imaging measurement associations. 
For instance, drug effects can be monitored, given that certain medications were correlated with accelerated brain ageing. Furthermore, continued development of brain age models could facilitate their deployment in clinical trials for recruitment and monitoring, and in hospitals for diagnostic and screening tasks.
Data Description
This dataset contains the full correlation results with all nIDPs in the UK Biobank. These are presented in separate datasets for Female and Male subjects. For easier data manipulation, two smaller datasets have also been made available, containing just those correlations that pass the False Discovery Rate (FDR) threshold.
As experiments were also conducted for ensembles using multiple contrasts, similar datasets are provided for those.
Finally, global datasets are also provided. These are the concatenation of the associations contained in the Male and Female datasets.
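For readers unfamiliar with FDR filtering, the sketch below applies the Benjamini-Hochberg step-up procedure to a vector of p-values. The choice of Benjamini-Hochberg and the level of 0.05 are assumptions made for illustration; the description above does not state which FDR procedure or level was used.

```python
import numpy as np

def bh_significant(p_values, alpha=0.05):
    """Return a boolean mask of p-values passing the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices sorted by ascending p-value
    ranked = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m  # BH critical value for each rank
    below = ranked <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.flatnonzero(below))    # largest rank meeting its threshold
        keep[order[:cutoff + 1]] = True           # keep all tests up to that rank
    return keep
```

Filtering a full correlation table with such a mask would yield the kind of reduced dataset described above.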
Paper & Code
The original paper for this article can be accessed here:
To access the code for this project, see the project GitHub repositories:
If using this work, please cite it based on the above paper, or using the following BibTeX:
@inproceedings{roibu2023brain,
title={Brain Ages Derived from Different MRI Modalities are Associated with Distinct Biological Phenotypes},
author={Roibu, Andrei-Claudiu and Adaszewski, Stanislaw and Schindler, Torsten and Smith, Stephen M and Namburete, Ana IL and Lange, Frederik J},
booktitle={2023 10th IEEE Swiss Conference on Data Science (SDS)},
pages={17--25},
year={2023},
organization={IEEE},
doi={10.1109/SDS57534.2023.00010}
}
Data Access
The data for this project is freely available upon application at the UK Biobank. For more information regarding the individual nIDPs, please access the UK Biobank Showcase website at: https://biobank.ctsu.ox.ac.uk/showcase/search.cgi
Funding
ACR is supported by EPSRC Grant EP/S024093/1, F. Hoffmann-La Roche AG and a 2021 Industrial Fellowship offered by the Royal Commission for the Exhibition of 1851. SMS is supported by a Wellcome Trust Collaborative Award 215573/Z/19/Z. AILN is grateful for support from the Academy of Medical Sciences under the Springboard Awards scheme (SBF005/1136), and the Bill and Melinda Gates Foundation. FJL is supported by a Wellcome Trust Collaborative Award (215573/Z/19/Z). The WIN is supported by core funding from the Wellcome Trust (203139/Z/16/Z). The computational aspects were supported by the Wellcome Trust (203141/Z/16/Z) and the NIHR Oxford BRC. Corresponding authors: ACR (andreiroibu@icloud.com), SA (stanislaw.adaszewski@roche.com) and AILN (ana.namburete@cs.ox.ac.uk).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four loci significantly associated with overlap hernia in 5,219 cases and 26,095 controls in UK Biobank.
The dataset contains lung cancer CT images (DICOM format) with segmentations (DICOM SEG) of tumors. Clinical and research data associated with images are available through IUCPQ-UL Biobank.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on unrelated British samples of the UK-Biobank (n = 306604). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.
Note: We previously released another set of EUR LD files. This set of LD files should be preferred over the previous one. The main difference with this entry is that the previous entry used quasi-independent blocks from LDetect computed on the 1000 genomes project. Here we compute the independent blocks using snp_ldsplit directly on the UK-Biobank British samples.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ASPirin in Reducing Events in the Elderly (ASPREE) is a clinical trial and longitudinal study of healthy ageing, with 16,703 Australians aged over 70 years and 2,411 Americans aged over 65 enrolled. The primary ASPREE trial has received >$60M in NIH funding since 2012. Each ASPREE participant’s health is tracked longitudinally through extensive phenotyping and collection of clinical outcome data. The ASPREE Healthy Ageing Biobank is an associated biorepository of blood, saliva and urine samples from >15,000 ASPREE participants, 10,000 of whom have now provided a matched 3-year follow-up sample. Biospecimens are consented for genetic and biomarker studies, enabling ASPREE to conduct molecular epidemiology and healthy ageing research. ASPREE facilitates over 12 sub-studies funded through NHMRC.
Genome-wide association analysis has been undertaken in multiple areas:
https://bioplatforms.com/projects/aspree-framework-initiative/
Data requests go through an approval process; to start one, contact Paul Lacaze at Paul.Lacaze@monash.edu
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present a database of representative left and right ventricular meshes constructed from patient-specific models based on a large cohort of ~55k participants from UK Biobank. It comprises 1423 representative tetrahedral finite element meshes across sex (male, female), body mass index (range: 16 - 42 kg/m²) and age (range: 49 - 80 years).
For each mesh, it also includes:
We also present trained network weights and nnUNet plan and hyperparameter selection files for cine MR segmentation models trained separately for the following views: 2 chamber, 3 chamber, 4 chamber and short axis. These are supplied as a zip of relevant nnUNet files for each view: Dataset101_UKBB_LAX_2Ch.zip, Dataset102_UKBB_LAX_3Ch.zip, Dataset103_UKBB_LAX_4Ch.zip, Dataset100_UKBB_Petersen_SAX.zip.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all model weights and corresponding datasets generated by Betti et al. in the manuscript Genetically regulated enhancer RNA expression predicts enhancer-promoter contact frequency and reveals genetic mechanisms at complex trait-associated loci. The following are the contents of the sub-directories in this dataset:
Please cite:
Betti, M.J., Aldrich, M.C., Lin, P., & Gamazon, E.R. (2024). Genetically regulated enhancer RNA expression predicts enhancer-promoter contact frequency and reveals genetic mechanisms at complex trait-associated loci. Preprint.
Betti, M.J., Aldrich, M.C., Lin, P., & Gamazon, E.R. (2024). eRNA GReX (Version 1.0). Zenodo. 10.5281/zenodo.11212496
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on Caribbean samples of the UK-Biobank (n = 4517). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on Chinese samples of the UK-Biobank (n = 1574). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.
The human sex ratio (fraction of males) at birth is close to 0.5 at the population level, an observation commonly explained by Fisher's principle. However, past human studies yielded conflicting results regarding the existence of sex ratio-influencing mutations (a prerequisite to Fisher’s principle), raising the question of whether the nearly even population sex ratio is instead dictated by the random X/Y chromosome segregation in male meiosis. Here we show that, because a person’s offspring sex ratio (OSR) has an enormous measurement error, a gigantic sample is required to detect OSR-influencing genetic variants. Conducting a UK Biobank-based genome-wide association study that is more powerful than previous studies, we detect an OSR-associated genetic variant, which awaits verification in independent samples. Given the abysmal precision in measuring OSR, it is unsurprising that the estimated heritability of OSR is effectively zero. We further show that OSR’s estimated heritability would ...
GWAS: When conducting the GWAS in the UKB, we did not simply use the sibling sex ratio as the trait, because of the difficulty in accounting for different estimation errors of the sibling sex ratio for different families as a result of the variation in family size. For example, individual A has one brother and zero sisters, while individual B has four brothers and one sister. Although A has a higher sibling sex ratio than B, B’s siblings obviously provide stronger evidence for a male-biased sibling sex ratio than A’s siblings. To properly weigh the data by the family size, we considered the birth of each sibling as an independent event. In the above example, we would associate A’s genotype with one male birth and associate B’s genotype with four male births and one female birth. In GWAS, a male birth is coded as 1 and a female birth is coded as 0. The UKB participants have a total of 873,715 full siblings, leading to an unprecedented statistical power.
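The per-birth coding described above can be sketched in a few lines. The function name `births_from_siblings` is hypothetical and not part of the published scripts; only the coding rule (male birth = 1, female birth = 0, one record per sibling) is taken from the description.

```python
def births_from_siblings(n_brothers, n_sisters):
    """Expand sibling counts into one record per birth (male = 1, female = 0)."""
    return [1] * n_brothers + [0] * n_sisters

# Individual A: one brother, zero sisters.
births_a = births_from_siblings(1, 0)
# Individual B: four brothers, one sister.
births_b = births_from_siblings(4, 1)
```

Each record would then be paired with the genotype of the corresponding individual, so that B contributes five observations to the association test while A contributes one.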
In our GWAS in the UKB, we i...
In search of the genetic variants of human sex ratio at birth: Was Fisher wrong about sex ratio evolution?
https://doi.org/10.5061/dryad.vdncjsz43
GWAS summary statistics and simulation data of the paper "In search of the genetic variants of human sex ratio at birth: Was Fisher wrong about sex ratio evolution?"
Description: Scripts for the project. For a description of each script file, see README.md in the zip file or https://github.com/song88180/Human_sex_ratio
Description: GWAS summary statistics of offspring sex ratio. Cells with "NA" mean the value is not available.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the genome-wide association study (GWAS) statistics in the UK Biobank and Source Data files for our paper Chen ZJ, Das SS, Kar A, Lee SHT, Abuhanna KD, Alvarez M, Sukhatme MG, Wang Z, Gelev KZ, Heffel MG, Zhang Y, Avram O, Rahmani E, Sankararaman S, Heinonen S, Peltoniemi H, Halperin E, Pietiläinen KH, Luo C, Pajukanta P. Single-cell DNA methylome and 3D genome atlas of human subcutaneous adipose tissue.
Further details of these analyses can be found in the Methods and Results part of this paper.
Repository contents
GWAS summary statistics in the UK Biobank for C-reactive protein (CRP), body mass index (BMI), metabolic-dysfunction associated steatotic liver disease (MASLD), and waist-to-hip ratio adjusted for BMI (WHRadjBMI):
Figure source data: