Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
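The second step (simulating yearly death events from fitted risk relationships) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name, the constant annual baseline hazard, and the piecewise-constant hazard assumption are all hypothetical choices made for illustration only.

```python
import math
import random


def simulate_death_year(linear_predictors, baseline_hazard, rng):
    """Return the index of the first follow-up year with a simulated death.

    linear_predictors: one Cox-style linear predictor (log hazard ratio)
        per follow-up year, e.g. reflecting that year's PM2.5 exposure.
    baseline_hazard: assumed constant annual baseline hazard (hypothetical).
    rng: a random.Random instance, passed in so results are reproducible.
    Returns None if the subject survives every year (censored).
    """
    for year, lp in enumerate(linear_predictors):
        # Probability of dying within the year if the hazard is constant
        # over that year: 1 - exp(-h0 * exp(lp)).
        p_death = 1.0 - math.exp(-baseline_hazard * math.exp(lp))
        if rng.random() < p_death:
            return year
    return None


rng = random.Random(2024)
# Ten follow-up years with a slowly increasing (made-up) linear predictor.
print(simulate_death_year([0.05 * y for y in range(10)], 0.01, rng))
```

Sampling year by year keeps the simulated deaths tied to time-varying predictors, which is the property the description above relies on.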
Objective: In UK Biobank (UKB), a large population-based prospective study, cases of many diseases are ascertained through linkage to routinely collected, coded national health datasets. We assessed the accuracy of these for identifying incident strokes.
Methods: In a regional UKB sub-population (n=17,249), we identified all participants with ≥1 code signifying a first stroke after recruitment (incident stroke-coded cases) in linked hospital admission, primary care or death record data. Stroke physicians reviewed their full electronic patient records (EPRs) and generated reference standard diagnoses. We evaluated the number and proportion of cases that were true positives (i.e. positive predictive value, PPV) for all codes combined and by code source and type.
Results: Of 232 incident stroke-coded cases, 97% had EPR information available. Data sources were: 30% hospital admission only; 39% primary care only; 28% hospital and primary care; 3% death records only. While 42% of cases ...
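The positive predictive value described in the Methods is the share of coded cases confirmed by the reference standard. A minimal sketch follows; the function name and the counts in the usage line are hypothetical and are not results from this study.

```python
def positive_predictive_value(true_positives, coded_cases):
    """PPV = confirmed true positives / all cases flagged by a code."""
    if coded_cases <= 0:
        raise ValueError("coded_cases must be positive")
    if not 0 <= true_positives <= coded_cases:
        raise ValueError("true_positives must lie between 0 and coded_cases")
    return true_positives / coded_cases


# Made-up counts, for illustration only.
print(f"PPV = {positive_predictive_value(90, 100):.0%}")
```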
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary-level GWAS data for 53 traits generated by Genomics plc as presented in:
Thompson D. et al. UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits (https://doi.org/10.1101/2022.06.16.22276246)
If you have any questions or comments regarding these files, please contact Genomics plc at research@genomicsplc.com
NOTES
These analyses were carried out using the full UK Biobank (UKB) imputation data release (v3b). After removal of exclusions and withdrawals, a subset of 337,151 UKB individuals, the White British Unrelated (WBU) subgroup, was defined as the intersection of two sample groups created by Bycroft et al 2018 (Nature 562, 203-209): the ‘White British ancestry’ group (UKB Data Field 22006) and the ‘used in genetic principal components’ group (UKB Data Field 22020), the latter being high quality samples that were filtered to avoid closely related individuals. All GWAS analyses were performed on the WBU subgroup.
Phenotypes were defined as described in Supplementary Table 1 ‘Phenotype definitions’ using a combination of Hospital Episode Statistics, Cancer Registry reports (where applicable) and self-report responses.
All analyses included age at assessment, sex (for non-sex-specific traits), genotyping chip, and 10 principal components as covariates.
GWAS summary statistics for each trait were generated by applying PLINK 2.0 to the WBU subgroup, using a logistic regression model for disease traits and a linear regression model for quantitative traits. For chromosome X variants, males were treated as having 0 or 2 alternative alleles.
The results are not adjusted for genomic control.
DATA FILE CONTENT DESCRIPTION (DISEASE TRAITS)
cpra: Variant ID in ‘CPRA’ format. Position reflects position in b37
chrom: Chromosome
pos: Position in base pairs (b37, 1-based)
alt: Alternative allele (effect allele)
beta: Effect size (log odds ratio)
standard_error: Standard error of beta
minus_log10_p: Minus log(base 10) of P-value
ref: Reference allele (non-effect allele)
ncase: Number of cases
ncontrol: Number of controls
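Because the disease-trait files store the effect as a log odds ratio and the P-value as minus log10, a small conversion step is often useful. The sketch below assumes only the column meanings listed above; the function names and the 1.96 multiplier for an approximate 95% interval are illustrative choices.

```python
import math


def p_value_from_minus_log10(minus_log10_p):
    """Convert the minus_log10_p column back to a P-value."""
    return 10.0 ** (-minus_log10_p)


def odds_ratio_with_ci(beta, standard_error, z=1.96):
    """Return (odds ratio, lower bound, upper bound) from a log odds ratio.

    The interval is a Wald-type approximation: exp(beta -/+ z * SE).
    """
    return (
        math.exp(beta),
        math.exp(beta - z * standard_error),
        math.exp(beta + z * standard_error),
    )


# Made-up values for one variant.
print(p_value_from_minus_log10(8.0))
print(odds_ratio_with_ci(0.10, 0.02))
```

Working from minus_log10_p rather than a raw P-value avoids underflow for very strong associations, so converting back is best done only when the value is needed.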
DATA FILE CONTENT DESCRIPTION (QUANTITATIVE TRAITS)
cpra: Variant ID in ‘CPRA’ format. Position reflects position in b37
chrom: Chromosome
pos: Position in base pairs (b37, 1-based)
alt: Alternative allele (effect allele)
beta: Effect size (linear regression coefficient)
standard_error: Standard error of beta
minus_log10_p: Minus log(base 10) of P-value
ref: Reference allele (non-effect allele)
ntotal: Total sample size
PHENOTYPE CODES
The following is a list of traits and their phenotype codes (as used in file naming).
DISEASE TRAITS
Age-related macular degeneration: AMD
Alzheimer's disease: AD
Asthma: AST
Atrial fibrillation: AF
Bipolar disorder: BD
Bowel cancer: CRC
Breast cancer: BC
Cardiovascular disease: CVD
Coeliac disease: CED
Coronary artery disease: CAD
Crohn's disease: CD
Epithelial ovarian cancer: EOC
Hypertension: HT
Ischaemic stroke: ISS
Melanoma: MEL
Multiple sclerosis: MS
Osteoporosis: OP
Prostate cancer: PC
Parkinson's disease: PD
Primary open angle glaucoma: POAG
Psoriasis: PSO
Rheumatoid arthritis: RA
Schizophrenia: SCZ
Systemic lupus erythematosus: SLE
Type 1 diabetes: T1D
Type 2 diabetes: T2D
Ulcerative colitis: UC
Venous thromboembolic disease: VTE
QUANTITATIVE TRAITS
Age at menopause: AAM
Apolipoprotein A1: APOEA
Apolipoprotein B: APOEB
Body mass index: BMI
Calcium: ACALMD
Docosahexaenoic acid: DOA
Estimated bone mineral density T-score: EBMDT
Estimated glomerular filtration rate (creatinine based): EGCR
Estimated glomerular filtration rate (cystatin based): EGCY
Glycated haemoglobin: HBA1C_DF
High density lipoprotein cholesterol: HDL
Height: HEIGHT
Intraocular pressure: IOP
Low density lipoprotein cholesterol: LDL_SF
Omega-6 fatty acids: OSFA
Omega-3 fatty acids: OTFA
Phosphatidylcholines: PDCL
Phosphoglycerides: PHG
Polyunsaturated fatty acids: PFA
Resting heart rate: RHR
Remnant cholesterol (Non-HDL, Non-LDL cholesterol): RMNC
Sphingomyelins: SGM
Total cholesterol: TCH
Total fatty acids: TFA
Total triglycerides: TTG
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aims: To assess the associations of vitamin and/or nutritional supplements (VNS) with falls among patients with diabetes.
Methods: A total of 9,141 and 21,489 middle-aged participants with diabetes from the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial and UK Biobank, respectively, were included. Use of VNS was collected at baseline, and fall events were recorded using annual questionnaires in ACCORD and electronic records in UK Biobank during follow-up. The associations of VNS use with fall risk were analyzed using logistic regression models in ACCORD and Fine-Gray sub-distribution hazard models in UK Biobank. The role of specific supplements was also estimated in UK Biobank, adjusting for confounding factors and multiple comparisons.
Results: 45.9% (4,193/9,141; median follow-up 5.5 years) of patients in ACCORD and 10.5% (2,251/21,489; median follow-up 11.9 years) in UK Biobank experienced fall and in-patient fall events during follow-up, respectively. In ACCORD, VNS use was associated with an increased risk of falls (fully adjusted odds ratio [OR]: 1.26, P < 0.05). In UK Biobank, despite no significant association between overall VNS use and in-patient falls, use of vitamin B, calcium, and iron was associated with a significantly increased risk of falls (fully adjusted hazard ratio range: 1.31–1.37, P < 0.05).
Conclusions: Use of specific VNS increased the risk of falls among patients with diabetes. The non-indicated use of nutritional supplements for patients with diabetes might be inadvisable.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Population characteristics and distribution of symptoms, blood tests and primary care consultation patterns in CPRD and UK Biobank.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of the 240,477 UK Biobank participants included in analyses.
https://spdx.org/licenses/CC0-1.0.html
Genetically informed, deep-phenotyped biobanks are an important research resource and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency–linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy R² was 47% in a UK Biobank holdout sample, which was 76% of the estimated SNP heritability (h²_SNP). We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 62 and 65% increase, respectively. The average χ² value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.
Methods
From the measurements, tests, and electronic health record data available in the UK Biobank data, we selected 12 blood-based biomarkers, 3 of the most common heritable complex diseases, and 6 quantitative measures. 
The full list of traits, the UK Biobank coding of the data used, and the covariates adjusted for are given in Table S1. For the quantitative measures and blood-based biomarkers we adjusted the values by the covariates, removed any individuals with a phenotype more than 7 SD from the mean (assuming these are measurement errors), and standardized the values to have zero mean and variance 1. For the common complex diseases, we determined disease status using a combination of the information available. For high blood pressure (BP), we used self-report information of whether high blood pressure was diagnosed by a doctor (UK Biobank code 6150-0.0), the age high blood pressure was diagnosed (2966-0.0), and whether the individual reported taking blood pressure medication (6153-0.0, 6177-0.0). For type-2 diabetes (T2D), we used self-report information of whether diabetes was diagnosed by a doctor (2443-0.0), the age diabetes was diagnosed (2976-0.0), and whether the individual reported taking diabetes medication (6153-0.0, 6177-0.0). For cardiovascular disease (CAD), we used self-report information of whether a heart attack was diagnosed by a doctor (3894-0.0), the age angina was diagnosed (3627-0.0), whether the individual reported a heart problem diagnosed by a doctor (6150-0.0), and the date of myocardial infarction (42000-0.0). For each disease, we then combined this with primary death ICD10 codes (40001-0.0), causes of operative procedures (41201-0.0), and the main (41202-0.0), secondary (41204-0.0) and inpatient ICD10 codes (41270-0.0). For BP we selected ICD10 code I10, for T2D we selected ICD10 codes E11 to E14 and excluded from the analysis individuals with E10 (type-1 diabetes), and for CAD we selected ICD10 codes I20-I29. Thus, for the purposes of this analysis, we define these diseases broadly, simply to maximise the number of cases available for analysis. 
For each disease, individuals with neither a self-report indication nor a relevant ICD10 diagnosis were then assigned a zero value as a control. We restricted our discovery analysis of the UK Biobank to a sample of European-ancestry individuals. To infer ancestry, we used both self-reported ethnic background (21000-0), selecting coding 1, and genetic ethnicity (22006-0), selecting coding 1. We also took the 488,377 genotyped participants and projected them onto the first two genotypic principal components (PCs) calculated from 2,504 individuals of the 1000 Genomes Project with known ancestries. Using the obtained PC loadings, we then assigned each participant to the closest population in the 1000 Genomes data: European, African, East-Asian, South-Asian or Admixed, selecting individuals with an absolute PC1 projection < 4 and an absolute PC2 projection < 3. Samples were excluded if in the UK Biobank quality control procedures they (i) were identified as extreme heterozygosity or missing genotype outliers; (ii) had a genetically inferred gender that did not match the self-reported gender; (iii) were identified to have putative sex chromosome aneuploidy; (iv) were excluded from kinship inference; (v) had withdrawn their consent for their data to be used. We used the imputed autosomal genotype data of the UK Biobank provided as part of the data release. We used the genotype probabilities to hard-call the genotypes for variants with an imputation quality score above 0.3. The hard-call threshold was 0.1, setting the genotypes with probability <= 0.9 as missing. From the good quality markers (with missingness less than 5% and p-value for the Hardy-Weinberg test larger than 10^-6, as determined in the set of unrelated Europeans) we selected those with minor allele frequency (MAF) > 0.0002 and an rs identifier, in the set of European-ancestry participants, providing a data set of 9,144,511 SNPs. 
From this we took the overlap with the Estonian Genome Centre data described below to give a final set of 8,430,446 markers. For computational convenience we then removed markers in very high LD, selecting one marker from any set of markers with LD R² > 0.8 within a 1 Mb window. These filters resulted in a data set with 458,747 individuals and 2,174,071 markers. We applied our GMRM model to each UK Biobank trait, running two short chains for 5000 iterations and combining the last 2000 posterior samples together. Here, we provide the posterior mean effect size estimates for each SNP and the mixed-linear model association regression coefficient, SE, t-statistic, and association p-value.
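The phenotype preparation described in the Methods (dropping values more than 7 SD from the mean, then standardizing to zero mean and unit variance) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function name is hypothetical, the covariate adjustment step is omitted, and the use of the population standard deviation is an assumption.

```python
import statistics


def clean_and_standardise(values, max_sd=7.0):
    """Drop values more than max_sd SDs from the mean, then z-score the rest.

    Assumes `values` has already been adjusted for covariates and that the
    retained values are not all identical (otherwise the SD would be zero).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    kept = [v for v in values if abs(v - mean) <= max_sd * sd]

    # Re-estimate mean and SD on the retained values before standardizing.
    kept_mean = statistics.fmean(kept)
    kept_sd = statistics.pstdev(kept)
    return [(v - kept_mean) / kept_sd for v in kept]


# Made-up phenotype: 100 ordinary values plus one gross outlier.
phenotype = [0.0, 1.0] * 50 + [1000.0]
standardised = clean_and_standardise(phenotype)
print(len(standardised))
```

Re-estimating the mean and SD after the filter keeps the final values at zero mean and unit variance even when outliers were removed.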
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Current and planned future data available in the UK Biobank resource.
Hb: haemoglobin; SNPs: single nucleotide polymorphisms; CNV: copy number variations; MRI: magnetic resonance imaging; DXA: dual-energy X-ray absorptiometry; ICD: International Classification of Diseases; OPCS: Office of Population Censuses and Surveys Classification of Interventions and Procedures.
‡ Future dates are estimated. Data available may be all or part of the relevant dataset.
* Available from an earlier date from health record systems in Scotland.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three blood pressure traits were analysed: systolic blood pressure (SBP), diastolic blood pressure (DBP) and pulse pressure (PP; the difference between SBP and DBP). Mean SBP and DBP values were calculated from the automated readings. SBP and DBP were then adjusted for medication use by adding 15 and 10 mm Hg, respectively, for individuals reported to be taking blood pressure-lowering medication.
For the UK Biobank genome-wide association studies (GWAS), we performed linear mixed model (LMM) association testing under an additive genetic model of the three continuous, medication-adjusted blood pressure traits (SBP, DBP, PP) for all measured and imputed genetic variants (Data Field 22828) with minor allele frequency (MAF) >= 1% and imputation score >= 0.3 in dosage format, using the BOLT-LMM (v2.4.1) software. Covariates were age, age², sex, BMI, genotyping array and 10 PCs. No genomic control correction was applied to the GWAS summary statistics.
Sample QC is described below:
We included up to 337,422 individuals from UK Biobank for the purpose of this project. We followed UK Biobank sample-based quality control criteria (Nature 2018;562:203-209); samples/individuals were excluded based on the following criteria: (i) outliers in heterozygosity and missingness, (ii) self-reported gender not consistent with genetically inferred gender, (iii) sample call rate (computed using probesets internal to Affymetrix)
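The medication adjustment described above is a fixed offset, which a short sketch makes concrete. The function name is hypothetical, and computing PP from the adjusted SBP and DBP is an assumption that the description does not spell out.

```python
def adjust_blood_pressure(sbp, dbp, on_medication):
    """Return medication-adjusted (SBP, DBP, PP) in mm Hg.

    Adds 15 mm Hg to SBP and 10 mm Hg to DBP for participants reported to
    be taking blood pressure-lowering medication. PP is taken here as
    adjusted SBP minus adjusted DBP (an assumption).
    """
    if on_medication:
        sbp += 15.0
        dbp += 10.0
    return sbp, dbp, sbp - dbp


# Made-up mean readings for one treated participant.
print(adjust_blood_pressure(130.0, 80.0, on_medication=True))
```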
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Infectious agents contribute significantly to the global burden of diseases, through both acute infection and their chronic sequelae. We leveraged the UK Biobank to identify genetic loci that influence humoral immune response to multiple infections. From 45 genome-wide association studies in 9,611 participants from UK Biobank, we identified NFKB1 as a locus associated with quantitative antibody responses to multiple pathogens including those from the herpes, retro- and polyoma-virus families. An insertion-deletion variant thought to affect NFKB1 expression (rs28362491), was mapped as the likely causal variant. This variant has persisted throughout hominid evolution and could play a key role in regulation of the immune response. Using 121 infection and inflammation related traits in 487,297 UK Biobank participants, we show that the deletion allele was associated with an increased risk of infection from diverse pathogens but had a protective effect against allergic disease. We propose that altered expression of NFKB1, as a result of the deletion, modulates haematopoietic pathways, and likely impacts cell survival, antibody production, and inflammation. Taken together, we show that disruptions to the tightly regulated immune processes may tip the balance between exacerbated immune responses and allergy, or increased risk of infection and impaired resolution of inflammation.
This dataset contains GWAS summary statistics for infection, inflammation, and allergy related traits in 487,297 individuals
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Population atlases are commonly utilised in medical imaging to facilitate the investigation of variability across populations.
Such atlases enable the mapping of medical images into a common coordinate system, promoting comparability and enabling the study of inter-subject differences.
Constructing such atlases becomes particularly challenging when working with highly heterogeneous datasets, such as whole-body images, where subjects show significant anatomical variations.
In this work, we propose a pipeline for generating a standardised whole-body atlas for a highly heterogeneous population by partitioning the population into anatomically meaningful subgroups.
Using magnetic resonance (MR) images from the UK Biobank dataset, we create six whole-body atlases representing a healthy population average.
We furthermore unbias them, and this way obtain a realistic representation of the population.
In addition to the anatomical atlases, we generate probabilistic atlases that capture the distributions of abdominal fat (visceral and subcutaneous) and five abdominal organs across the population (liver, spleen, pancreas, left and right kidneys).
We demonstrate a clinical application of these atlases, using the differences between subjects with medical conditions such as diabetes and cardiovascular diseases and healthy subjects from the atlas space.
With this work, we make the constructed anatomical and label atlases publicly available and anticipate that they will support medical research conducted on whole-body MR images.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This entry contains pre-processed LD files (Sigma matrix, S matrix, etc.) computed on unrelated British samples of the UK Biobank (n = 306,604). It is intended to be used as an input to the GhostKnockoffGWAS pipeline.
Note: We previously released another set of EUR LD files. This set of LD files should be preferred over the previous one. The main difference with this entry is that the previous entry used quasi-independent blocks from LDetect computed on the 1000 genomes project. Here we compute the independent blocks using snp_ldsplit directly on the UK-Biobank British samples.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the paper 'Molecular genetic contributions to social deprivation and household income in UK Biobank'. Current Biology (2016). doi: 10.1016/j.cub.2016.09.035
## Note re working with data ##
Each of the three data files contains over seventeen million rows. Users will encounter difficulties if they attempt to view the content using Notepad++ or Microsoft Notepad. Microsoft Excel 2016 will not display all rows. These space-delimited text files contain seven columns, with a header row; the columns are listed in the readme file.
## Note re other copy ##
The data files are identical to the files of the same name previously made available on the website of the Centre for Cognitive Ageing and Cognitive Epidemiology (CCACE) http://www.ccace.ed.ac.uk/node/335 as the zip archive 'Hill_CB_2016.zip'.
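Since the files are too large for a text editor or spreadsheet, reading them as a stream is one workable approach. The sketch below assumes a single-space delimiter and takes the column names from the header row; the function names are hypothetical, and the real column names are those listed in the readme file.

```python
import csv


def iter_rows(path):
    """Yield one dict per data row without loading the whole file."""
    with open(path, newline="") as handle:
        # Column names come from the header row; a single-space
        # delimiter is assumed.
        reader = csv.DictReader(handle, delimiter=" ")
        for row in reader:
            yield row


def count_rows(path):
    """Count data rows (header excluded) in constant memory."""
    return sum(1 for _ in iter_rows(path))
```

Filtering inside the loop (for example, keeping only rows below a P-value threshold) gives a file small enough to open in ordinary tools.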
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Previous research demonstrates the joint association of self-reported physical activity and genotype with coronary artery disease. However, an existing research gap is whether accelerometer-measured overall physical activity or physical activity intensity can offset genetic predisposition to coronary artery disease. This study explores the independent and joint associations of accelerometer-measured physical activity and genetic predisposition with incident coronary artery disease. Incident coronary artery disease based on hospital inpatient records and death register data serves as the outcome of this study. Polygenic risk score and overall physical activity, measured as Euclidean Norm Minus One, and intensity, measured as minutes per day of moderate-to-vigorous intensity physical activity (MVPA), are examined both linearly and by decile. The UK Biobank population-based cohort recruited over 500,000 individuals aged 40 to 69 between 2006 and 2010, with 103,712 volunteers participating in a weeklong wrist-worn accelerometer study from 2013 to 2015. Individuals of White British ancestry (n = 65,079) meeting the genotyping and accelerometer-based inclusion criteria and with no missing covariates were included in the analytic sample. In the sample of 65,079 individuals, the mean (SD) age was 62.51 (7.76) and 61% were female. During a median follow-up of 6.8 years, 1,382 cases of coronary artery disease developed. At the same genetic risk, physical activity intensity had a hazard ratio (HR) of 0.41 (95% CI: 0.29–0.60) at the 90th compared to 10th percentile, equivalent to 31.68 and 120.96 minutes of moderate-to-vigorous physical activity per day, respectively, versus an HR of 0.61 (95% CI: 0.52–0.72) for overall physical activity. 
The combination of high genetic risk and low physical activity intensity showed the greatest risk, with an individual at the 10th percentile of genetic risk and 90th percentile of intensity facing an HR of 0.14 (95% CI: 0.09–0.21) compared to an individual at the 90th percentile of genetic risk and 10th percentile of intensity. Physical activity, especially physical activity intensity, is associated with an attenuation of some of the risk of coronary artery disease but this pattern does not vary by genetic risk. This accelerometer-based study provides the clearest evidence to date regarding the joint influence of genetics, overall physical activity, and physical activity intensity on coronary artery disease.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GCTB sparse shrunk LD matrices from 2.8M common variants from the UK Biobank. Part AA of AA, AB, AC, AD and AE.
TO JOIN AND UNZIP THESE MATRICES
Download all parts to one folder from:
Part AA - 10.5281/zenodo.3375373
Part AB - 10.5281/zenodo.3376357
Part AC - 10.5281/zenodo.3376456
Parts AD and AE - 10.5281/zenodo.3376628
Use cat to join:
cat ukb_50k_bigset_2.8M.zip.part* > ukb_50k_bigset_2.8M.zip
Then unzip:
unzip ukb_50k_bigset_2.8M.zip
See README for further details.
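Where cat is not available (for example on Windows), the same join can be done in Python. This is a sketch using only the standard library; the function name is hypothetical, and it assumes the part files sort alphabetically into the right order (AA, AB, ...), as the wildcard in the cat command above also does.

```python
import glob
import shutil


def join_parts(pattern, joined_path):
    """Concatenate all files matching `pattern`, in sorted order.

    Mirrors `cat <pattern> > <joined_path>`. Returns the list of parts
    that were joined so the caller can check that none is missing.
    """
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(joined_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)
    return parts


# Example call (run in the folder that holds the downloaded parts):
# join_parts("ukb_50k_bigset_2.8M.zip.part*", "ukb_50k_bigset_2.8M.zip")
```

The joined archive can then be extracted with any unzip tool, or with `zipfile.ZipFile(path).extractall()` from the standard library.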
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We conducted a genome-wide association study in a sample of 5,390 APOE-ε4 homozygous (ε4ε4) individuals (288 cases and 5,102 controls) aged 65 or over in the UK Biobank. Results are the summary statistics from a GWAS conducted in PLINK using age, sex and the first 15 principal components as covariates. There are 5,349,830 rows and 9 columns. Each row corresponds to a variant. Columns are: CHR, BP, SNP, A1, A2_counted_allele, OR, SE_beta, Z_STAT and P. As indicated by the column names, A2 is the risk allele counted in the regression. The dataset is gzipped and tab-delimited.
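A gzipped, tab-delimited file with these nine columns can be read directly without decompressing it first. The sketch below uses the column names given above; the function name and the added LOG_OR field are hypothetical, and treating unparseable OR values (such as "NA") as missing is an assumption.

```python
import csv
import gzip
import math

# Column names as listed in the dataset description.
COLUMNS = ["CHR", "BP", "SNP", "A1", "A2_counted_allele",
           "OR", "SE_beta", "Z_STAT", "P"]


def read_gwas(path):
    """Yield one dict per variant, adding LOG_OR = ln(OR) where possible."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            try:
                row["LOG_OR"] = math.log(float(row["OR"]))
            except (TypeError, ValueError):
                # Missing or non-numeric odds ratio (e.g. "NA").
                row["LOG_OR"] = None
            yield row
```

Taking the log of OR puts the effect on the same scale as the SE_beta column, which is what most downstream tools expect.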
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PLINK association test statistics of UK Biobank blood traits generated for mvSuSiE fine-mapping analyses; see https://www.biorxiv.org/content/10.1101/2023.04.14.536893.
Introduction: Frailty has been associated with various diseases. However, its impact on gastrointestinal bleeding (GIB) remains largely unexplored. This study investigates the relationship between frailty and the incidence of gastrointestinal bleeding events.
Methods: A total of 352,060 participants from the UK Biobank with no history of gastrointestinal bleeding were included. Baseline frailty status was assessed using the Fried phenotype and categorized as non-frail, pre-frail, or frail. The primary outcome was gastrointestinal bleeding, identified through hospitalization records and death registries. Cox proportional hazard models were used to evaluate the association between frailty and gastrointestinal bleeding incidence.
Results: Among the 352,060 participants (mean age 56.1 years), 3.6% (N = 12,747) were classified as frail, and 43.6% (N = 153,424) as pre-frail at baseline. Over a median follow-up of 14.7 years, 20,105 gastrointestinal bleeding events were recorded. Compared to non-frail individuals, frail (HR = 1.53, 95% CI: 1.44–1.62) and pre-frail (HR = 1.15, 95% CI: 1.11–1.18) individuals exhibited a significantly higher risk of gastrointestinal bleeding after multivariate adjustment (P for trend < 0.001). Subgroup and sensitivity analyses yielded consistent findings.
Conclusion: Frailty significantly elevates the risk of gastrointestinal bleeding. Early identification and targeted multidimensional interventions addressing frailty may reduce gastrointestinal bleeding events and improve patient prognosis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"meta.tar" archive contains 249 files representing UK Biobank + Estonian Biobank sex-combined GWAS meta-analysis summary statistics for the Nightingale panel of 249 circulating plasma metabolic markers presented in "Pleiotropic and sex-specific genetic architecture of circulating metabolic markers" [https://doi.org/10.1101/2024.07.30.24311254].
Each file contains nine columns:
SNP: ID of the genetic marker;
CHR: chromosome code (GRCh37 genomic build);
BP: base-pair coordinate (GRCh37 genomic build);
PVAL: regression p-value;
A1: effect allele;
A2: other allele;
N: sample size;
BETA: regression coefficient for effect allele (A1);
SE: standard error of regression coefficient (BETA).
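The BETA and SE columns can be combined into a Wald z-statistic and a two-sided P-value as a consistency check. The sketch below uses a normal approximation; whether the PVAL column in these files was computed in exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def wald_p_value(beta, se):
    """Two-sided P-value for z = beta / se under a standard normal.

    Uses the identity P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    """
    if se <= 0:
        raise ValueError("se must be positive")
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up values for one marker.
print(wald_p_value(0.05, 0.01))
```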
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the paper Davies et al. Genome-wide association study of cognitive functions and educational attainment in UK Biobank (N=112 151). Molecular Psychiatry (2016).
## Note re working with data ##
Each of the four data files contains over seventeen million rows. Users will encounter difficulties if they attempt to view the contents of these files using Notepad++ or Notepad. Microsoft Excel 2016 will not display all rows, only the first one million or so rows. These are space-delimited files.