Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GTEx dataset is a public resource that has generated a broad collection of gene expression data collected from a diverse set of human tissues. Here we share the processed GTEx data used in Hypergraph factorisation for multi-tissue gene expression imputation (Vinas Torne et al., 2023). We processed the data following the GTEx eQTL discovery pipeline.
If you use this data for your research, please cite the GTEx consortium paper: GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. DOI: 10.1126/science.aaz1776
Database and browser that provides a central resource to archive and display associations between genetic variation and high-throughput molecular-level phenotypes. This effort originated with the NIH GTEx roadmap project; however, the scope of this resource will be extended to include any available genotype/molecular phenotype datasets.
Project to study human gene expression and regulation in multiple tissues, providing valuable insights into mechanisms of gene regulation and its disease-related perturbations. Genetic variation between individuals will be examined for correlation with differences in gene expression level to identify regions of the genome that influence whether and how much a gene is expressed. Includes initiatives: Novel Statistical Methods for Human Gene Expression Quantitative Trait Loci (eQTL) Analysis; Laboratory, Data Analysis, and Coordinating Center (LDACC); caHUB Acquisition of Normal Tissues in Support of GTEx Project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: GTEx. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Genotype-Tissue Expression (GTEx) Project established a data resource and tissue bank to study the relationship between genetic variants and gene expression in multiple human tissues and across individuals. The project included contributions from numerous groups with diverse expertise in biospecimen collection and processing, pathology review, molecular analysis, and data management. The contributors are collectively called the GTEx Consortium.
GTEx collected a total of 26,468 unique tissue samples from 50+ different tissue types, from 956 healthy postmortem donors. The standardized biospecimen collection and analysis practices applied during the study served to minimize preanalytical variability associated with specimen-related factors and their potential impact on analytic endpoints. Each GTEx tissue was divided into two tissue blocks, one for histology and one for molecular analysis; both tissue blocks were preserved in PAXgene Tissue Fixative (Qiagen) solution for 6 to 24 hours, followed by PAXgene Tissue Stabilizer (Qiagen) as specified in the project-specific standard operating procedures. Tissue blocks were processed and embedded in paraffin at the GTEx central repository at the Van Andel Institute (MI) and hematoxylin and eosin–stained slides were generated from all GTEx donors. Digitally scanned whole slide images of PAXgene-fixed/stabilized, paraffin-embedded tissue sections were created using Aperio Scanscope software (Leica Biosystems). The digital images were then reviewed and annotated by one of four board-certified pathologists assigned to the GTEx study. There are a total of 25,503 digital histology images in the GTEx collection.
GTEx was supported by the NIH Common Fund (2010 – 2019). Additional resources include the GTEx Biobank, the GTEx Portal, and the full dataset at dbGaP (accession number phs000424).
Please refer to the listed GTEx publications below for more details [2-7].
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
gtex-idc_v19-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
gtex-idc_v19-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
gtex-idc_v19-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
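The naming convention described above can be unpacked programmatically. The following is a minimal sketch, assuming that every manifest name follows the pattern seen in the listed files (collection id, then `idc_v` plus a release number, then a storage suffix); the regular expression and the `ManifestName` type are illustrative and are not part of any IDC tool.

```python
import re
from typing import NamedTuple


class ManifestName(NamedTuple):
    collection_id: str
    idc_version: int
    target: str  # "aws", "gcs" or "dcf"


# Pattern inferred from the file names listed in this record
# (an assumption, not an official IDC specification).
_MANIFEST_PATTERN = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<version>\d+)-(?P<target>aws|gcs|dcf)\.(?:s5cmd|dcf)$"
)


def parse_manifest_name(filename: str) -> ManifestName:
    """Split a manifest file name into collection id, IDC release and target."""
    match = _MANIFEST_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognised manifest name: {filename!r}")
    return ManifestName(match["collection"], int(match["version"]), match["target"])


parsed = parse_manifest_name("gtex-idc_v19-aws.s5cmd")
print(parsed.collection_id, parsed.idc_version, parsed.target)
```

A helper like this can be used to pick the manifest that matches the cloud provider you intend to download from.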
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
install the idc-index package: pip install --upgrade idc-index
download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using .dcf manifest, see manifest header.
The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (commonfund.nih.gov/GTEx). Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI/Leidos Biomedical Research, Inc. subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to the Broad Institute of MIT and Harvard. Biorepository operations were funded through a Leidos Biomedical Research, Inc. subcontract to Van Andel Research Institute (10ST1035). Additional data repository and project management were provided by Leidos Biomedical Research, Inc. (HHSN261200800001E). The Brain Bank was supported with supplements to University of Miami grant DA006227. Statistical Methods development grants were made to the University of Geneva (MH090941& MH101814), the University of Chicago (MH090951, MH090937, MH101825, & MH101820), the University of North Carolina - Chapel Hill (MH090936), North Carolina State University (MH101819), Harvard University (MH090948), Stanford University (MH101782), Washington University (MH101810), and to the University of Pennsylvania (MH101822).
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National cancer institute imaging data commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 43, (2023).
[2] Sobin, L., Barcus, M., Branton, P. A., Engel, K. B., Keen, J., Tabor, D., Ardlie, K. G., Greytak, S. R., Roche, N., Luke, B., Vaught, J., Guan, P. & Moore, H. M. Histologic and quality assessment of genotype-Tissue Expression (GTEx) research samples: A large postmortem tissue collection. Arch. Pathol. Lab. Med. (2024). doi:10.5858/arpa.2023-0467-OA
[3] GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
[4] GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
[5] GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
[6] Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A., Britton, A., Buia, S. A., Compton, C. C., DeLuca, D. S., Peter-Demchok, J., Gelfand, E. T., Guan, P., Korzeniewski, G. E., Lockhart, N. C., Rabiner, C. A., Rao, A. K., Robinson, K. L., Roche, N. V., Sawyer, S. J., Segrè, A. V., Shive, C. E., Smith, A. M., Sobin, L. H., Undale, A. H., Valentino, K. M., Vaught, J., Young, T. R., Moore, H. M. & GTEx Consortium. A novel approach to high-quality postmortem tissue procurement: The GTEx project. Biopreserv. Biobank. 13, 311–319 (2015).
[7] Branton, P. A., Sobin, L., Barcus, M., Engel, K. B., Greytak, S. R., Guan, P., Vaught, J. & Moore, H. M. Notable histologic findings in a ‘normal’ cohort: The National Institutes of Health Genotype-Tissue Expression (GTEx) project. Arch. Pathol. Lab. Med. (2024). doi:10.5858/arpa.2023-0468-OA
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset description:
49 folders, each corresponding to one tissue from GTEx v6p and containing the following files:
geneCounts: gene-level counts
k_j: split counts spanning from one exon to another.
k_theta: non-split counts covering a splice site
n_psi3: total split counts from a given acceptor site
n_psi5: total split counts from a given donor site
n_theta: total split and non-split counts for a given splice site
Sample annotation describing each sample from the dataset
Description file with global information from the dataset
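As a rough illustration of how the split-count files relate to each other, the sketch below computes the donor- and acceptor-centred splicing ratios for a single junction in a single sample, assuming the usual definitions psi5 = k_j / n_psi5 and psi3 = k_j / n_psi3. The count values are made-up toy numbers and the helper name `splice_ratio` is hypothetical.

```python
from typing import Optional


def splice_ratio(k: int, n: int) -> Optional[float]:
    """Return k / n, or None when the site has no supporting reads (n == 0)."""
    return k / n if n > 0 else None


# Toy counts for one junction in one sample (hypothetical values).
k_j = 30      # split reads spanning this junction
n_psi5 = 40   # all split reads from the same donor site
n_psi3 = 120  # all split reads from the same acceptor site

psi5 = splice_ratio(k_j, n_psi5)
psi3 = splice_ratio(k_j, n_psi3)
print(psi5, psi3)  # 0.75 0.25
```

Sites without any reads are returned as None rather than 0 so that missing coverage is not mistaken for a junction that is never used.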
The gene counts were generated using the GTF file from release 29 of GENCODE, and the split and non-split counts contain only the annotated junctions from the same release. Statistics are reported only for GENCODE-annotated introns and splice sites, in compliance with the regulations of the GTEx consortium. For a description of the samples, methods, and protocols, see the GTEx publication specified below.
Use: The count matrices are intended to help researchers who want to use RNA-Seq data for diagnostics. Researchers can merge their own dataset with the downloaded ones, provided the tissue, genome build, strand, and paired-end specifications match. Afterwards, the Detection of RNA outliers Pipeline (DROP) can be used to compute gene expression and splicing outliers.
Organism: Homo sapiens
Genome assembly: hg19
Gene annotation: gencode29
Strand specific: FALSE
Paired end: TRUE
Protocol: poly(A) enrichment
Contact: Vicente A. Yepez, yepez at in.tum.de; Christian Mertes, mertes at in.tum.de; Julien Gagneur, gagneur at in.tum.de
Citation: Write the following in the "Data availability" section of the manuscript or similar replacing the three citations by the ones from the References section below:
The count matrices for the GTEx samples were downloaded from Zenodo (doi: 10.5281/zenodo.5596755) and were generated through DROP using the release 29 of the GENCODE annotation .
Also, write the following in the Acknowledgements section:
The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The raw data used for the analyses described in this manuscript were obtained from the GTEx Portal on June 12, 2017, under accession number dbGaP phs000424.v6.p1.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When using this data, you must acknowledge the source by citing the publication "Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits" (https://doi.org/10.1101/814350).
This package contains DAP-G results on GTEx v8 eQTL and sQTL data.
See (DAP-G software) for details.
We used only European individuals and variants with MAF>0.01, on genes that are annotated as protein_coding or lncRNA.
DAP-G ld_control parameter was 0.75.
The results were analyzed in this preprint
finemapping/
|-- README_finemapping.md
|-- dapg_eqtl.tar
`-- dapg_sqtl.tar
Unpack each tarball with a command like tar -xvpf dapg_sqtl.tar
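For scripted workflows, the same unpacking step can be done from Python. This is a minimal sketch using only the standard library `tarfile` module; the function name `unpack_tarball` and the destination directory are illustrative choices, not part of the release.

```python
import tarfile
from pathlib import Path
from typing import List


def unpack_tarball(archive: str, destination: str) -> List[str]:
    """Extract a plain or compressed tar archive and return its member names."""
    Path(destination).mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (none, gzip, ...) automatically.
    with tarfile.open(archive, mode="r:*") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support extraction filters.
            tar.extractall(path=destination, filter="data")
        else:
            tar.extractall(path=destination)
    return names


# Example call (assumes the tarball has been downloaded to the working directory):
# members = unpack_tarball("dapg_sqtl.tar", "finemapping/dapg_sqtl")
```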
For every tissue:
{tissue}.variants_pip.txt.gz contains the variants' posterior inclusion probabilities of being causal for every gene.
{tissue}.models_variants.txt.gz contains, for every model contemplated by DAP-G, the list of variants involved. Most of them have a single variant.
{tissue}.model_summary.txt.gz contains, for every analyzed gene, a summary of the models, such as the expected number of causal variants.
{tissue}.models.txt.gz for every analyzed gene:
{tissue}.clusters.txt.gz for every analyzed gene:
{tissue}.cluster_correlations.txt.gz: upper triangular matrix of correlations among clusters
The data is provided "as is", and the authors assume no responsibility for errors or omissions.
The User assumes the entire risk associated with its use of these data.
The authors shall not be held liable for any use or misuse of the data described and/or contained herein.
The User bears all responsibility in determining whether these data are fit for the User's intended use.
The information contained in these data is not better than the original sources from which they were derived,
and both scale and accuracy may vary across the data set.
These data may not have the accuracy, resolution, completeness, timeliness, or other characteristics
appropriate for applications that potential users of the data may contemplate.
The user is responsible to comply with any data usage policy from the original GWAS studies; refer to the list of traits described here to identify their respective Consortia's requirements.
THE DATA IS PROVIDED WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR THE USE OR OTHER DEALINGS IN THE DATA.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BioBombe analysis applied to randomly permuted gene expression data from The Genotype-Tissue Expression (GTEx) project. Method and results described in https://github.com/greenelab/BioBombe
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains cleaned GTEx data from the dataset "TCGA TARGET GTEx" of UCSC Xena.
All samples have survival and expression data. The patient ID matches the expression, survival, and phenotype data.
The script for data cleaning is also included.
GTEx Single-Cell RNA-seq Dataset
This repository provides tools to create a Hugging Face dataset from GTEx single-nucleus RNA-seq data, transforming the hierarchical H5AD format into a flat, ML-ready structure.
Overview
Data Source
The data comes from GTEx's snRNA-seq atlas:
Source: GTEx Portal
Publication: Eraslan et al., Science 2022 - "Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function"
Content: 209,126…
See the full description on the dataset page: https://huggingface.co/datasets/ai-department-lpnu/gtex-single-cell-rnaseq.
Analysis of RNA-seq data. Raw data were obtained from the Genotype-Tissue Expression project (GTEx). A total of 363 samples of frontal cortex, dorsolateral prefrontal cortex, and hippocampus from 180 non-demented human brain donors were analysed. For donors with more than one sample in the same brain region, only the one with the highest level of MAPT was analysed. FASTQ files were obtained from the SRA files and reads were re-mapped to human genome GRCh38 by means of STAR 2.5.2a. Gene expression was quantified using RSEM 1.3.1, as Transcripts per Million (TPM). The annotation file was obtained from GENCODE v23 and was modified to include the TIR12-MAPT gene (coordinates chr17:45894382–46018851), which contains part of intron 12 (coordinates chr17:46018731–46018851) as the 3’ end of the gene.
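The sample-selection rule described above (one sample per donor and brain region, chosen by highest MAPT level) can be sketched as follows. The field names `donor`, `region`, `sample` and `mapt_tpm` and the toy records are hypothetical; the actual metadata layout of the dataset is not described here.

```python
from typing import Dict, List, Tuple


def select_top_mapt_samples(samples: List[dict]) -> List[dict]:
    """Keep, for each (donor, region) pair, the sample with the highest MAPT level."""
    best: Dict[Tuple[str, str], dict] = {}
    for sample in samples:
        key = (sample["donor"], sample["region"])
        if key not in best or sample["mapt_tpm"] > best[key]["mapt_tpm"]:
            best[key] = sample
    return list(best.values())


# Toy records (hypothetical identifiers and values).
toy_samples = [
    {"donor": "D1", "region": "hippocampus", "sample": "S1", "mapt_tpm": 120.0},
    {"donor": "D1", "region": "hippocampus", "sample": "S2", "mapt_tpm": 150.0},
    {"donor": "D1", "region": "frontal cortex", "sample": "S3", "mapt_tpm": 90.0},
    {"donor": "D2", "region": "hippocampus", "sample": "S4", "mapt_tpm": 80.0},
]

selected = select_top_mapt_samples(toy_samples)
print(sorted(s["sample"] for s in selected))  # ['S2', 'S3', 'S4']
```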
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BioBombe analysis applied to gene expression data from The Genotype-Tissue Expression (GTEx) project. Method and results described in https://github.com/greenelab/BioBombe
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used in http://biorxiv.org/content/early/2017/04/18/125450
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset description:
49 folders, each corresponding to one tissue from GTEx v8 and containing the following files:
geneCounts: gene-level counts
k_j: split counts spanning from one exon to another.
k_theta: non-split counts covering a splice site
n_psi3: total split counts from a given acceptor site
n_psi5: total split counts from a given donor site
n_theta: total split and non-split counts for a given splice site
Sample annotation describing each sample from the dataset
Description file with global information from the dataset
The gene counts were generated using the GTF file from release 29 of GENCODE, and the split and non-split counts contain only the annotated junctions from the same release. Statistics are reported only for GENCODE-annotated introns and splice sites, in compliance with the regulations of the GTEx consortium. For a description of the samples, methods, and protocols, see the GTEx publication specified below.
Use: The count matrices are intended to help researchers who want to use RNA-Seq data for diagnostics. Researchers can merge their own dataset with the downloaded ones, provided the tissue, genome build, strand, and paired-end specifications match. Afterwards, the Detection of RNA outliers Pipeline (DROP) can be used to compute gene expression and splicing outliers.
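The merging step mentioned above might look like the following sketch. It assumes, for illustration only, that each gene count table is tab-separated with a gene identifier in the first column and one column per sample; the helper names and the toy tables are hypothetical, and the real files should be inspected before reusing this.

```python
import csv
import io
from typing import Dict, List, Tuple

CountTable = Tuple[List[str], Dict[str, List[int]]]


def read_counts(text: str) -> CountTable:
    """Parse a tab-separated count table: first column gene id, then one column per sample."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    counts = {row[0]: [int(value) for value in row[1:]] for row in reader if row}
    return header[1:], counts


def merge_counts(a: CountTable, b: CountTable) -> CountTable:
    """Join two count tables on gene id, keeping only genes present in both."""
    samples_a, counts_a = a
    samples_b, counts_b = b
    duplicated = set(samples_a) & set(samples_b)
    if duplicated:
        raise ValueError(f"Sample ids present in both tables: {sorted(duplicated)}")
    merged = {gene: counts_a[gene] + counts_b[gene] for gene in counts_a if gene in counts_b}
    return samples_a + samples_b, merged


# Toy tables (hypothetical gene and sample identifiers).
gtex_text = "geneID\tGTEX-A\tGTEX-B\nENSG01\t10\t12\nENSG02\t0\t3\n"
own_text = "geneID\tPATIENT-1\nENSG01\t7\nENSG03\t5\n"

samples, merged = merge_counts(read_counts(gtex_text), read_counts(own_text))
print(samples, merged)
```

Restricting the result to shared genes avoids introducing artificial zeros for genes that were simply not quantified in one of the two datasets.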
Organism: Homo sapiens
Genome assembly: hg38
Gene annotation: gencode29
Strand specific: FALSE
Paired end: TRUE
Protocol: poly(A) enrichment
Contact: Vicente A. Yepez, yepez at in.tum.de; Christian Mertes, mertes at in.tum.de; Julien Gagneur, gagneur at in.tum.de
Citation: Write the following in the "Data availability" section of the manuscript or similar replacing the three citations by the ones from the References section below:
The count matrices for the GTEx samples were downloaded from Zenodo (doi: 10.5281/zenodo.6078397) and were generated through DROP using the release 29 of the GENCODE annotation .
Also, write the following in the Acknowledgements section:
The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The raw data used for the analyses described in this manuscript were obtained from the GTEx Portal on June 12, 2017, under accession number dbGaP phs000424.v8.p2.
https://doi.org/10.4121/resource:terms_of_use
This is a normalized dataset derived from the original RNA-seq dataset downloaded from the Genotype-Tissue Expression (GTEx) project (www.gtexportal.org): RNA-SeQCv1.1.8 gene rpkm Pilot V3 patch1. The data was used to analyze how tissue samples are related to each other in terms of gene expression. The data can be used to gain insight into how gene expression levels behave in the different human tissues.
The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVA...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Data usage policy
When using this data, you must acknowledge the source by citing the publication "Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits" (https://doi.org/10.1101/814350).
# GTEx GWAS integration

This package contains the application of several GWAS-QTL integration methods.
The results were analyzed in [this preprint](https://www.biorxiv.org/content/10.1101/814350v1) about GTEx v8 application to several GWAS traits.

```
.
|-- colocalization
|   |-- coloc
|   |   `-- coloc_enloc_priors_eqtl.tar.gz
|   |-- enloc
|   |   |-- enloc_eqtl_eur.tar.gz
|   |   `-- enloc_sqtl_eur.tar.gz
|   `-- eur_ld.bed.gz
|-- prediction_models
|   |-- gtex_v8_expression_mashr_snp_smultixcan_covariance.txt.gz
|   |-- gtex_v8_splicing_mashr_snp_smultixcan_covariance.txt.gz
|   |-- mashr_eqtl.tar
|   `-- mashr_sqtl.tar
|-- smr
|   |-- SMR_gtex_v8_README.txt
|   `-- SMRresults_GTEx_v8_peQTL5e-08.tar.gz
|-- smultixcan
|   |-- smultixcan_eqtl.tar.gz
|   `-- smultixcan_sqtl.tar.gz
`-- spredixcan
    |-- spredixcan_eqtl.tar.gz
    `-- spredixcan_sqtl.tar.gz
```

You can uncompress gzipped tarball packages `*.tar.gz` in a UNIX command line with an instruction such as:

```bash
tar -xzvpf smultixcan_eqtl.tar.gz
```

and the tar packages (`*.tar`) with an analogous instruction:

```bash
tar -xvpf mashr_eqtl.tar
```

## Preliminaries

**Finemapping** results are contained in a separate release due to size constraints.

GWAS summary statistics for 114 traits were harmonized and imputed to GTEx v8 variants with MAF>0.01 using only European samples (summary imputation software [here](https://github.com/hakyimlab/summary-gwas-imputation)).
Some of the following analyses used the full set of 114 traits, while some focused only on 87 traits whose imputed associations showed no deflation
(the imputation algorithm is conservative, and studies with too few available variants have a depleted distribution of association p-values after imputation).
The harmonized and imputed GWAS summary statistics are contained in a separate release due to size constraints.
For completeness' sake, the imputed summary statistics look like:

```
variant_id panel_variant_id chromosome position effect_allele non_effect_allele current_build frequency sample_size zscore pvalue effect_size standard_error imputation_status n_cases
rs554008981 chr1_13550_G_A_b38 chr1 13550 A G hg38 0.017316017316017316 336474 -2.2919929353647097 0.021906050841240293 NA NA imputed NA
rs201055865 chr1_14671_G_C_b38 chr1 14671 C G hg38 0.012987012987012988 336474 -0.9559192804440632 0.33911301727494103 NA NA imputed NA
...
```

The GWAS were split in approximately independent LD regions (Berisa-Pickrell).
GWAS regions are defined in `eur_ld.bed.gz` (note that a few of them are ill-defined in hg38 and were ignored; only completely defined regions were used).

## Colocalization

### Enloc

ENLOC ([see software here](https://github.com/xqwen/integrative)) was run for sQTLs and eQTLs using individuals of European ancestry and DAP-G QTL enrichment results on 87 traits.
Result files are included in `enloc_eqtl_eur.tar.gz` and `enloc_sqtl_eur.tar.gz`.
Each file contains a particular tissue-trait combination.
Each row details colocalization between a GWAS region (Berisa-Pickrell) and a gene's or intron's cis-window.
A region might overlap multiple genes/introns or vice versa.

Each ENLOC file contains the following columns:

* gwas_locus: GWAS LD region
* molecular_qtl_trait: gene or intron
* locus_gwas_pip: posterior inclusion probability of variants in the GWAS LD region
* locus_rcp: regional colocalization probability (main colocalization measure)
* lead_coloc_SNP: snp with highest RCP
* lead_snp_rcp: rcp of the lead coloc snp

### Coloc

Coloc ([see software here](https://cran.r-project.org/web/packages/coloc/index.html)) was run using prior probabilities estimated from QTL enrichment of GWAS variants (computed via ENLOC).
Results for eQTL are available in `coloc_enloc_priors_eqtl.tar.gz`.
Each file contains results for a trait-tissue combination.
Columns are:

* gene_id: gene or intron id
* p0: probability that neither QTL nor GWAS contain a causal variant
* p1: probability that only GWAS contains a causal variant
* p2: probability that only QTL has a causal variant
* p3: probability that GWAS and QTL have a causal variant and it's distinct
* p4: probability that GWAS and QTL have a causal variant and it's the same (main colocalization measure)

## PrediXcan

`mashr_eqtl.tar` and `mashr_sqtl.tar` contain prediction models (trained on expression or splicing data respectively, for 49 GTEx tissues)
and LD compilations to be used with PrediXcan, S-PrediXcan, MultiXcan and S-MultiXcan.
For every tissue, the `mashr_{tissue}.db` file is a SQLite file with the prediction model definitions.
`mashr_{tissue}.txt.gz` is a gzipped-text file with the upper triangular matrices of covariance between snps within a gene/intron prediction model.

Many variants in these models don't have an rsid. To fully leverage the information in these models,
it is advised to at least harmonize to GTEx variants, and if possible impute as we did [here](https://github.com/hakyimlab/summary-gwas-imputation).

### S-PrediXcan

S-PrediXcan was run for the 114 harmonized and imputed traits, on eQTL and sQTL mashr prediction models.
All of the GWAS traits had the same format, so that the following format parameters were used with S-PrediXcan:

```
--snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore \
--keep_non_rsid --additional_output --model_db_snp_key varID \
```

Each file is a CSV, with each row containing a gene/intron association at a given trait-tissue combination:

* gene: Ensembl ID or intron id
* gene_name: HUGO name or intron id
* zscore: predicted association z-score
* effect_size: estimated effect size
* pvalue: association p-value
* var_g: estimated variance of predicted expression or splicing
* pred_perf_r2: prediction model cross-validated performance
* pred_perf_pval: prediction model cross-validated performance
* pred_perf_qval: deprecated, empty field left for compatibility
* n_snps_used: number of snps in the intersection of GWAS and model
* n_snps_in_cov: number of snps in the LD compilation
* n_snps_in_model: number of snps in the model
* best_gwas_p: smallest p-value across GWAS snps used in this model
* largest_weight: largest prediction model weight

### S-MultiXcan

S-MultiXcan results were generated from the above S-PrediXcan results.
Each file contains multi-tissue associations for a given trait:

* gene: Ensembl ID or intron id
* gene_name: HUGO name or intron id
* pvalue: multi-tissue association p-value
* n: number of models available for this gene/intron
* n_indep: number of independent components of variation in predicted expression/splicing (surviving principal components)
* p_i_best: best (most significant) single-tissue p-value (S-PrediXcan)
* t_i_best: tissue of the best p-value
* p_i_worst: worst (least significant) single-tissue p-value (S-PrediXcan)
* t_i_worst: tissue of the worst p-value
* eigen_max: maximum eigenvalue of SVD
* eigen_min: minimum eigenvalue of SVD
* eigen_min_kept: smallest eigenvalue retained after discarding smallest variations
* z_min: minimum single-tissue z-score
* z_max: maximum single-tissue z-score
* z_mean: mean single-tissue z-score
* z_sd: standard deviation of the single-tissue z-scores
* tmi: trace of M * M_i where M is predicted expression/splicing covariance across tissues for a gene, and M_i is its SVD pseudo-inverse
* status: computation status, 0 if no errors

## SMR

See `SMR_gtex_v8_README.txt` for details.
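Returning to the imputed summary statistics shown in the Preliminaries section, the sketch below parses one of the example rows, splits its `panel_variant_id` into components, and recomputes a two-sided p-value from the z-score. It assumes that the id follows the `chromosome_position_ref_alt_build` layout and that the rows are whitespace-separated; the helper names are illustrative and do not come from the release.

```python
import math
from typing import NamedTuple


class PanelVariant(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str
    build: str


def parse_panel_variant_id(panel_variant_id: str) -> PanelVariant:
    """Split an id such as chr1_13550_G_A_b38 (assumed layout: chr_pos_ref_alt_build)."""
    chromosome, position, ref, alt, build = panel_variant_id.split("_")
    return PanelVariant(chromosome, int(position), ref, alt, build)


def two_sided_pvalue(zscore: float) -> float:
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(zscore) / math.sqrt(2.0))


header = (
    "variant_id panel_variant_id chromosome position effect_allele non_effect_allele "
    "current_build frequency sample_size zscore pvalue effect_size standard_error "
    "imputation_status n_cases"
).split()
row = (
    "rs554008981 chr1_13550_G_A_b38 chr1 13550 A G hg38 0.017316017316017316 336474 "
    "-2.2919929353647097 0.021906050841240293 NA NA imputed NA"
).split()

record = dict(zip(header, row))
variant = parse_panel_variant_id(record["panel_variant_id"])
recomputed = two_sided_pvalue(float(record["zscore"]))
print(variant, recomputed, record["pvalue"])
```

Recomputing the p-value from the z-score is a cheap consistency check before feeding a file into downstream tools.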
# Disclaimer
The data is provided "as is", and the authors assume no responsibility for errors or omissions.
The User assumes the entire risk associated with its use of these data.
The authors shall not be held liable for any use or misuse of the data described and/or contained herein.
The User bears all responsibility in determining whether these data are fit for the User's intended use.
The information contained in these data is not better than the original sources from which they were derived,
and both scale and accuracy may vary across the data set.
These data may not have the accuracy, resolution, completeness, timeliness, or other characteristics
appropriate for applications that potential users of the data may contemplate.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the aberrant gene expression prediction benchmark data as well as the necessary expected gene expression across tissues and tissue-specific isoform contribution scores for AbExp prediction.
The aberrant gene expression prediction benchmark data (aberrant_expression_prediction_benchmark.parquet) contains the following columns:
individual: GTEx individual
gene: Ensembl gene identifier
tissue: GTEx tissue
tissue_type: GTEx tissue type
mu: OUTRIDER-estimated expected gene expression
theta: OUTRIDER-estimated gene dispersion
counts: Raw gene expression count
normalized_counts: OUTRIDER-normalized gene expression count
l2fc: log2 fold change between observed and expected gene expression count
zscore: z-score of gene expression, obtained by quantile-mapping the OUTRIDER-estimated distribution to the standard normal distribution
nominal_pvalue: OUTRIDER-estimated p-value of being an expression outlier
FDR: FDR-adjusted p-value of being an expression outlier
is_in_benchmark: Whether this observation is part of the aberrant gene expression prediction benchmark
is_underexpressed_outlier: Whether this observation is an underexpression outlier at FDR < 5%. This is the benchmark prediction label.
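To make the benchmark label concrete, here is a minimal sketch of how it could be recomputed from the other columns. The FDR threshold of 5% comes from the description above; using a negative `zscore` to decide that an outlier is an underexpression outlier is an assumption of this sketch, and the toy records are invented.

```python
def is_underexpressed_outlier(fdr: float, zscore: float, alpha: float = 0.05) -> bool:
    """Flag observations that are significant at FDR < alpha and expressed below expectation.

    The direction test (zscore < 0) is an assumption made for illustration.
    """
    return fdr < alpha and zscore < 0


# Toy observations using column names from the benchmark table (values are made up).
observations = [
    {"gene": "ENSG_A", "zscore": -6.1, "FDR": 0.001},
    {"gene": "ENSG_B", "zscore": 5.4, "FDR": 0.002},
    {"gene": "ENSG_C", "zscore": -1.2, "FDR": 0.800},
]

labels = [is_underexpressed_outlier(o["FDR"], o["zscore"]) for o in observations]
print(labels)  # [True, False, False]
```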
The isoform proportions table (gtex_v8_isoform_proportions.tsv) contains the following columns:
gene: Ensembl gene identifier
tissue_type: GTEx tissue type
tissue: GTEx tissue
transcript: Ensembl transcript identifier
mean_transcript_proportions: mean transcript proportions across individuals in GTEx v8
median_transcript_proportions: median transcript proportions across individuals in GTEx v8
sd_transcript_proportions: standard deviation of transcript proportions across individuals in GTEx v8
The expected gene expression table (gtex_v8_expected_expression.tsv) contains the following columns:
gene: Ensembl gene identifier
tissue_type: GTEx tissue type
tissue: GTEx tissue
gene_is_expressed: Whether the gene is expressed in the tissue
median_expression: median OUTRIDER-estimated expected gene expression (mu) across individuals
expression_dispersion: OUTRIDER-estimated gene dispersion (theta)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
isoTWAS models for 48 GTEx tissues, adult frontal cortex tissue from the CommonMind Consortium (subset of PsychENCODE project; Gandal et al 2018, Science), and fetal frontal cortex from Walker et al 2019, Cell.
Each folder corresponds to a separate tissue and contains 1 .tsv.gz file per gene that contains the isoTWAS model. Refer to https://bhattacharya-a-bt.github.io/isotwas/ on how to use these models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate annotation of genes in the human genome is fundamental for biomedical research and genomic data interpretation. The Ensembl, RefSeq, and GENCODE consortia continuously update the human genome annotations based on new computational and experimental evidence, and new proteins are constantly being identified. The Genotype-Tissue Expression (GTEx) project has generated more than 15,000 RNA sequencing datasets from multiple tissues of more than 800 donors, which makes it possible to model almost all transcripts and proteins in the human genome. Using proteins translated from the GTEx transcript model, more than 21 million in-silico trypsin-digested peptides were generated. To identify high-confidence novel proteins with proteomic support, we screened more than 2,000 proteomic projects in the PRIDE database and selected more than 50,000 mass spectrometry (MS) runs from 923 projects. These MS data were used to validate the predicted novel peptides. With a stringent standard, we identified almost 20,000 novel peptides.
This dataset includes files used in the above analysis. More details can be found in the GitHub page (https://github.com/ATPs/human_novo_protein_2022).
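As an illustration of what in-silico trypsin digestion involves, the sketch below applies the commonly used cleavage rule (cut after lysine K or arginine R, except when the next residue is proline P). It is not the authors' pipeline: missed cleavages, peptide length limits and other settings used in the analysis are not described here, and the `min_length` parameter and toy sequence are hypothetical.

```python
import re
from typing import List


def trypsin_digest(protein: str, min_length: int = 1) -> List[str]:
    """Cleave a protein sequence after K or R unless the next residue is P."""
    # Zero-width split points: preceded by K/R and not followed by P.
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [peptide for peptide in peptides if len(peptide) >= min_length]


print(trypsin_digest("MKRPAKT"))  # ['MK', 'RPAK', 'T']
```

Raising `min_length` is a simple way to discard fragments that are too short to be identified by mass spectrometry.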