100+ datasets found

Genomic Data Commons Data Portal (GDC Data Portal)
neuinfo.org
rrid.site
Updated Oct 18, 2024
Cite
(2024). Genomic Data Commons Data Portal (GDC Data Portal) [Dataset]. http://identifiers.org/RRID:SCR_014514/resolver/mentions
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_014514 https://identifiers.org/RRID:SCR_014514/resolver/mentions
Dataset updated
Oct 18, 2024
Description
A unified data repository of the National Cancer Institute (NCI)'s Genomic Data Commons (GDC) that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI). The GDC Data Portal provides a platform for efficiently querying and downloading high quality and complete data. The GDC also provides a GDC Data Transfer Tool and a GDC API for programmatic access.
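As a rough illustration of the programmatic access mentioned above, the sketch below builds a query URL for the public GDC API `projects` endpoint. The helper name `build_projects_query`, the chosen fields, and the default page size are assumptions made for this example; the block only constructs the URL and does not contact the service.

```python
import json
from urllib.parse import urlencode

GDC_API = "https://api.gdc.cancer.gov"  # public GDC API base URL


def build_projects_query(program: str, size: int = 10) -> str:
    """Return a GET URL that lists projects for one program (e.g. "TCGA").

    Hypothetical helper: the field selection below is an illustrative choice.
    """
    # GDC filters are JSON objects made of an operator and a content block.
    filters = {
        "op": "in",
        "content": {"field": "program.name", "value": [program]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "project_id,name,primary_site",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_API}/projects?{urlencode(params)}"


if __name__ == "__main__":
    # Paste the printed URL into a browser or an HTTP client to run the query.
    print(build_projects_query("TCGA"))
```

Keeping URL construction separate from the network call makes the query easy to inspect before any data is requested.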
Genomic Data Commons Data Portal
bioregistry.io
Updated Apr 23, 2021
Cite
(2021). Genomic Data Commons Data Portal [Dataset]. https://bioregistry.io/gdc
Explore at:
Dataset updated
Apr 23, 2021
Description
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Historical NCI Genomic Data Commons data (09-14-2017)
zenodo.org
data-staging.niaid.nih.gov
tsv
Updated Jan 24, 2020
Cite
Inge Seim (2020). Historical NCI Genomic Data Commons data (09-14-2017) [Dataset]. http://doi.org/10.5281/zenodo.1186945
Explore at:
Available download formats: tsv
Unique identifier
https://doi.org/10.5281/zenodo.1186945
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Inge Seim
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Historical NCI Genomic Data Commons data (v09-14-2017). Clinical ('phenotype') and gene expression (HTSeq FPKM-UQ).

TCGA-COAD.GDC_phenotype.tsv

dataset: phenotype - Phenotype

cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv.gz
samples: 570
version: 11-27-2017
hub: https://gdc.xenahubs.net
type of data: phenotype
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-90
raw data: https://api.gdc.cancer.gov/data/
input data format: ROWs (samples) x COLUMNs (identifiers) (i.e. clinicalMatrix)
570 samples x 151 identifiers
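Because the phenotype file is distributed as a gzipped TSV with one row per sample, it can be parsed with the standard library alone. The following is a minimal sketch, assuming the payload has already been downloaded into memory; the function name `read_gzipped_tsv` and the demo column names are hypothetical, and the demo values are made up rather than taken from the real file.

```python
import csv
import gzip
import io


def read_gzipped_tsv(raw: bytes) -> list[dict[str, str]]:
    """Decompress a .tsv.gz payload and return one dict per data row."""
    text = gzip.decompress(raw).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Tiny in-memory stand-in for a downloaded clinicalMatrix (values are made up).
demo = gzip.compress(b"sample_id\tgender\nSAMPLE-1\tfemale\nSAMPLE-2\tmale\n")
rows = read_gzipped_tsv(demo)
print(len(rows), "samples;", "columns:", list(rows[0].keys()))
```

In the samples-by-identifiers layout described above, each returned dict corresponds to one sample and each key to one identifier column.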

TCGA-COAD.htseq_fpkm-uq.tsv

dataset: gene expression RNAseq - HTSeq - FPKM-UQ

cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv.gz
samples: 512
version: 09-14-2017
hub: https://gdc.xenahubs.net
type of data: gene expression RNAseq
unit: log2(fpkm-uq+1)
platform: Illumina
ID/Gene Mapping: https://gdc.xenahubs.net/download/probeMaps/gencode.v22.annotation.gene.probeMap.gz
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-80
raw data: https://api.gdc.cancer.gov/data/
wrangling: Data from the same sample but from different vials/portions/analytes/aliquots is averaged; data from different samples is combined into genomicMatrix; all data is then log2(x+1) transformed.
input data format: ROWs (identifiers) x COLUMNs (samples) (i.e. genomicMatrix)
60,484 identifiers x 512 samples
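The wrangling step described above (average the aliquot-level values that belong to one sample, then apply log2(x+1)) can be sketched as follows for a single gene. The function names, the aliquot-to-sample mapping, and the numbers are all hypothetical and only illustrate the arithmetic; this is not the hub's actual pipeline.

```python
import math
from collections import defaultdict


def average_and_log2(values_by_aliquot, sample_of):
    """Average aliquot-level values per sample, then apply log2(x + 1)."""
    grouped = defaultdict(list)
    for aliquot, value in values_by_aliquot.items():
        grouped[sample_of[aliquot]].append(value)
    return {
        sample: math.log2(sum(vals) / len(vals) + 1)
        for sample, vals in grouped.items()
    }


def undo_log2(x):
    """Invert log2(x + 1) to recover the linear-scale value."""
    return 2 ** x - 1


# Made-up FPKM-UQ values for one gene; aliquot and sample IDs are placeholders.
values = {"aliquot-A1": 2.0, "aliquot-A2": 4.0, "aliquot-B1": 7.0}
sample_of = {
    "aliquot-A1": "sample-A",
    "aliquot-A2": "sample-A",
    "aliquot-B1": "sample-B",
}
transformed = average_and_log2(values, sample_of)
print(transformed)
```

Here the two aliquots of sample-A average to 3, so the stored value is log2(3 + 1) = 2; `undo_log2` is the inverse needed when linear-scale values are wanted.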
The Cancer Genome Atlas Breast Invasive Carcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Feb 2, 2014
Cite
The Cancer Imaging Archive (2014). The Cancer Genome Atlas Breast Invasive Carcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP
Explore at:
Available download formats: n/a, dicom
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP
Dataset updated
Feb 2, 2014
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Breast Phenotype Research Group.
The Cancer Genome Atlas Rectum Adenocarcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Jan 5, 2016
Cite
The Cancer Imaging Archive (2016). The Cancer Genome Atlas Rectum Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.F7PPNPNU
Explore at:
Available download formats: dicom, n/a
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.F7PPNPNU
Dataset updated
Jan 5, 2016
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
The Cancer Genome Atlas (TCGA) RNA-seq meta-analysis
figshare.com
xlsx
Updated Feb 2, 2018
Cite
Namshik Han (2018). The Cancer Genome Atlas (TCGA) RNA-seq meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.5851743.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.5851743.v1
Dataset updated
Feb 2, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Namshik Han
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
TCGA RNA-seq V2 Level3 data were downloaded from the TCGA Genomic Data Commons Data Portal (https://gdc-portal.nci.nih.gov), consisting of 11,303 samples in 34 cancer projects (33 cancer types). Nine cancer types that do not have corresponding non-tumour samples were filtered out, and the analysis focused on the tumour versus non-tumour comparison. 24 cancer types were used in this meta-analysis: BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, THCA, THYM, UCEC (https://gdc-portal.nci.nih.gov). The nine filtered cancer types were ACC, DLBC, LAML, LGG, MESO, OV, TGCT, UCS and UVM. To extract expression values from TCGA RNA-seq data, we used genomic coordinates to retrieve UCSC Transcript IDs that correspond to the identifiers in TCGA RNA-seq V2 Level3 data (isoform level). The GAF (General Annotation Format) file was used to map the coordinates to UCSC Transcript IDs, and it was downloaded from https://tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/outputs/TCGA.hg19.June2011.gaf. This file contains genomic annotations shared by all TCGA projects. More details of the GAF file format can be found at https://tcga-data.nci.nih.gov/docs/GAF/GAF3.0/GAF_v3_file_description.docx. We filtered out any coding exons overlapping UCSC Transcript IDs to eliminate the expression values of coding genes and evaluate lncRNA expression.
We could find the expression values of 443 pcRNAs and 203 tapRNAs in TCGA data, as many non-coding regions are not yet fully annotated in the TCGA RNA-seq V2 Level3 data. The expression values of pcRNAs and tapRNAs were extracted and clustered by an unsupervised Pearson correlation method (Supplementary Figure 18A). The expression values of tapRNA-associated coding genes were also extracted and used to generate the heat-map (Supplementary Figure 18B), which shows a pattern of expression similar to that of the tapRNAs across the cancer types.
To show that tapRNAs and associated coding genes have similar expression profiles in cancers, we generated a Spearman's rank-order correlation heatmap (Figure 6A) between tapRNAs and their associated coding genes based on the TCGA RNA-seq data. We used the MATLAB function corr to calculate Spearman's rho. This function takes two matrices X (197-by-8,850 expression profiling matrix of tapRNAs) and Y (197-by-8,850 expression profiling matrix of tapRNA-associated coding genes) and returns an 8,850-by-8,850 matrix containing the pairwise correlation coefficient between each pair of 8,850 columns (TCGA cancer samples in Supplementary Figure 18A and B). Thus, the rank-order correlation matrix that we computed from the matrices of expression profiling data (Supplementary Figure S18A and B) allowed us to compare the correlation between two column vectors, i.e. cancer samples. This function also returns a matrix of p-values for testing the hypothesis of no correlation against the alternative that there is a nonzero correlation. Each element of the p-value matrix is the p-value for the corresponding element of Spearman's rho. The p-values for Spearman's rho are calculated using large-sample approximations. To check the significance level of the correlation between a tapRNA and its associated coding gene, the diagonal of the p-value matrix was extracted and used. The median is 1.31 × 10^-11 and the mean is 1.03 × 10^-4, with a standard deviation of 0.0029.
To identify cancer-specific tapRNAs, we considered not only the global expression pattern of a given tapRNA in each cancer type, but also the expression pattern of any specific sub-group that is significantly distinct, to take into account cancer sample heterogeneity. Thus, two conditions were applied: (1) the average expression level of a tapRNA in a given cancer type is in the top 10% or bottom 10% and (2) a tapRNA has at least 10% of samples in a given cancer type that are significantly up-regulated (Z-score > 2) or down-regulated (Z-score < -2).
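The Spearman's rank-order correlation step described earlier in this description can be illustrated with a small pure-Python sketch. This is a stand-in for the MATLAB corr call, not a reproduction of it: it computes rho only for matching columns of X and Y (the diagonal of the full column-by-column matrix), it does not compute p-values, and it does not handle constant columns. All function names and the toy matrices are assumptions made for this example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def matched_column_rho(X, Y):
    """rho between column j of X and column j of Y, for every column j."""
    n_cols = len(X[0])
    return [
        spearman_rho([row[j] for row in X], [row[j] for row in Y])
        for j in range(n_cols)
    ]


# Toy matrices: 3 rows (features) by 2 columns (samples); values are made up.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 1.0]]
Y = [[10.0, 1.0], [20.0, 2.0], [40.0, 3.0]]
diagonal = matched_column_rho(X, Y)
print(diagonal)
```

In the toy data the first column pair is perfectly concordant and the second perfectly discordant, which makes the sign convention easy to check by eye.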
The Cancer Genome Atlas Lung Adenocarcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Jan 30, 2017
Cite
The Cancer Imaging Archive (2017). The Cancer Genome Atlas Lung Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.JGNIHEP5
Explore at:
Available download formats: n/a, dicom
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.JGNIHEP5
Dataset updated
Jan 30, 2017
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Lung Phenotype Research Group.
Cancer Categories and clinical research figures
kaggle.com
zip
Updated Oct 16, 2025
Cite
DrAHung (2025). Cancer Categories and clinical research figures [Dataset]. https://www.kaggle.com/datasets/drahung/cancer-categories-and-clinical-research-figures
Explore at:
Available download formats: zip (996208 bytes)
Dataset updated
Oct 16, 2025
Authors
DrAHung
Description
This dataset integrates open public data from multiple biomedical sources to provide a structured, queryable database of cancer classifications and clinical data from The Cancer Genome Atlas (TCGA).

All data are de-identified and publicly available via the U.S. National Cancer Institute (NCI) Genomic Data Commons (GDC) API, ensuring full compliance with NIH open-access guidelines.

Included tables:

cancer_category: Disease Ontology (DOID) categories and hierarchical labels (including English and Chinese translations).

patient_tcga_clinical: De-identified patient clinical records per TCGA project (demographics, stage, grade, survival, treatment).

tcga_project_summary: Per-project summary statistics (case counts, survival averages, tumor stage/grade coverage, and mapped cancer type).

tcga_project: TCGA project metadata with links to DOID cancer categories.

The data source is The Cancer Genome Atlas (TCGA).

A snapshot of clinical data. [Image: sample.png]
TCGA-12K-litdata
huggingface.co
Updated Nov 20, 2025
Cite
MedARC (2025). TCGA-12K-litdata [Dataset]. https://huggingface.co/datasets/medarc/TCGA-12K-litdata
Explore at:
Dataset updated
Nov 20, 2025
Dataset authored and provided by
MedARC
Description
Attribution

This dataset contains 224 x 224 JPEG patches from whole-slide images originally downloaded from The Cancer Genome Atlas (TCGA) that are available in the NCI Genomic Data Commons (GDC) Open Access tier. We mirror and repackage a commonly used ~12k WSI subset in LitData format for ease of training. We exclude patches that did not pass HSV thresholding, following the procedure in Kaiko.AI's Midnight paper. Patches were randomly sampled across magnification levels. There are… See the full description on the dataset page: https://huggingface.co/datasets/medarc/TCGA-12K-litdata.
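To give a feel for what HSV thresholding of a patch can look like, here is a minimal sketch using only the standard library. It does not reproduce the procedure from the Midnight paper: the function names, the saturation and value cutoffs, and the minimum tissue fraction are all assumptions chosen for illustration.

```python
import colorsys


def tissue_fraction(pixels, min_saturation=0.1, min_value=0.1):
    """Fraction of RGB pixels whose HSV saturation and value meet the cutoffs.

    The cutoffs are illustrative assumptions, not published thresholds.
    """
    kept = 0
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if s >= min_saturation and v >= min_value:
            kept += 1
    return kept / len(pixels)


def passes_hsv_threshold(pixels, min_fraction=0.5):
    """Keep a patch only if enough of its pixels look like stained tissue."""
    return tissue_fraction(pixels) >= min_fraction


# Toy patches as flat lists of (R, G, B) tuples; colours are made up.
background = [(255, 255, 255)] * 4                      # white slide glass
stained = [(200, 100, 150)] * 3 + [(255, 255, 255)]     # mostly pink tissue
print(passes_hsv_threshold(background), passes_hsv_threshold(stained))
```

White background has zero saturation, so a patch dominated by it fails the check, while a patch dominated by saturated stain colours passes.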
The Cancer Genome Atlas Sarcoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Jan 5, 2016
Cite
The Cancer Imaging Archive (2016). The Cancer Genome Atlas Sarcoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.CX6YLSUX
Explore at:
Available download formats: dicom, n/a
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.CX6YLSUX
Dataset updated
Jan 5, 2016
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
The Cancer Genome Atlas Sarcoma (TCGA-SARC) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
Comparison of the top 10 differentially expressed genes inferred from...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Mar 4, 2025
Cite
Ling-Hong Hung; Bryce Fukuda; Robert Schmitz; Varik Hoang; Wes Lloyd; Ka Yee Yeung (2025). Comparison of the top 10 differentially expressed genes inferred from concatenation of published counts (“published vs published”) versus those inferred from harmonized uniform GDC re-processing (“reprocessed vs reprocessed”). [Dataset]. http://doi.org/10.1371/journal.pone.0318676.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0318676.t002
Dataset updated
Mar 4, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Ling-Hong Hung; Bryce Fukuda; Robert Schmitz; Varik Hoang; Wes Lloyd; Ka Yee Yeung
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of the top 10 differentially expressed genes inferred from concatenation of published counts (“published vs published”) versus those inferred from harmonized uniform GDC re-processing (“reprocessed vs reprocessed”).
TCGA-KIRP
kaggle.com
zip
Updated Apr 27, 2022
Cite
Chun Yu Chen (2022). TCGA-KIRP [Dataset]. https://www.kaggle.com/datasets/junyussh/tcga-kirp/versions/5
Explore at:
Available download formats: zip (5197766487 bytes)
Dataset updated
Apr 27, 2022
Authors
Chun Yu Chen
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
The Cancer Genome Atlas Kidney Renal Papillary Cell Carcinoma (TCGA-KIRP) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.

Source: https://wiki.cancerimagingarchive.net/display/Public/TCGA-KIRP
Transportation Energy Resources from Renewable Agriculture Phenotyping...
datacommons.cyverse.org
Updated Mar 1, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TERRA-REF (2016). Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) Genomic Data Repository for Sorghum [Dataset]. https://datacommons.cyverse.org/browse/iplant/home/shared/terraref
Explore at:
Dataset updated
Mar 1, 2016
Dataset provided by
CyVerse Data Commons
Authors
TERRA-REF
Description
The terraref directory contains raw and derived sorghum genome sequencing data from the Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) project (http://terraref.org/). Raw data includes DNA sequence files in compressed FASTQ format. Derived data is available for whole-genome resequencing and genotyping-by-sequencing. See the README.txt and DATA_USE_POLICY.md for more information. TERRA-REF is under development, and November 15, 2017 marks the beta release. To request early access to these data sources and to receive notifications and updates, please fill out the beta-user application at http://terraref.org/data/.
DataSheet_1_Survival Analysis of Multi-Omics Data Identifies Potential...
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
Cite
Nitish Kumar Mishra; Siddesh Southekal; Chittibabu Guda (2023). DataSheet_1_Survival Analysis of Multi-Omics Data Identifies Potential Prognostic Markers of Pancreatic Ductal Adenocarcinoma.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.00624.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fgene.2019.00624.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Nitish Kumar Mishra; Siddesh Southekal; Chittibabu Guda
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pancreatic ductal adenocarcinoma (PDAC) is the most common and among the deadliest of pancreatic cancers. Its 5-year survival is only ∼8%. Pancreatic cancers are a heterogeneous group of diseases, of which PDAC is particularly aggressive. Like many other cancers, PDAC starts as a pre-invasive precursor lesion (known as pancreatic intraepithelial neoplasia, PanIN), which offers an opportunity for both early detection and early treatment. Even advanced PDAC can benefit from prognostic biomarkers. However, reliable biomarkers for early diagnosis or for prognosis of therapy remain an unfulfilled goal for PDAC. In this study, we selected 153 PDAC patients from the TCGA database and used their clinical, DNA methylation, gene expression, micro-RNA (miRNA), and long non-coding RNA (lncRNA) expression data for multi-omics analysis. Differential methylation at about 12,000 CpG sites was observed in PDAC tumor genomes, with about 61% of the sites hypermethylated, predominantly in promoter regions and in CpG islands. We correlated promoter methylation and gene expression for mRNAs and identified 17 genes that were previously recognized as PDAC biomarkers. Similarly, several genes (B3GNT3, DMBT1, DEPDC1B) and lncRNAs (PVT1 and GATA6-AS) that have not been reported in PDAC before are strongly correlated with survival. Other genes, such as EFR3B, whose biological roles are not well known in mammals, are also found to be strongly associated with survival. We further identified 406 promoter methylation target loci associated with patient survival, including known esophageal squamous cell carcinoma biomarkers, cg03234186 (ZNF154), and cg02587316, cg18630667, and cg05020604 (ZNF382). Overall, this is one of the first studies that identified survival-associated genes using multi-omics data from PDAC patients.
TCGA-12K-parquet
huggingface.co
Updated Nov 20, 2025
Cite
MedARC (2025). TCGA-12K-parquet [Dataset]. https://huggingface.co/datasets/medarc/TCGA-12K-parquet
Explore at:
Dataset updated
Nov 20, 2025
Dataset authored and provided by
MedARC
Description
TCGA-12K Parquet

Attribution

This dataset contains 224 x 224 JPEG patches from whole-slide images originally downloaded from The Cancer Genome Atlas (TCGA) that are available in the NCI Genomic Data Commons (GDC) Open Access tier. We mirror and repackage a commonly used ~12k WSI subset in parquet format for ease of training. We exclude patches that did not pass HSV thresholding, following the procedure in Kaiko.AI's Midnight paper. Patches were randomly sampled across… See the full description on the dataset page: https://huggingface.co/datasets/medarc/TCGA-12K-parquet.
DLBCL RNA-Seq Gene Expression Dataset
kaggle.com
zip
Updated Dec 19, 2025
Cite
Meenal Sinha (2025). DLBCL RNA-Seq Gene Expression Dataset [Dataset]. https://www.kaggle.com/datasets/meenalsinha/dlbcl-rna-seq-gene-expression-dataset
Explore at:
Available download formats: zip (509385530 bytes)
Dataset updated
Dec 19, 2025
Authors
Meenal Sinha
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About This Dataset

This dataset contains open-access bulk RNA-Seq gene expression data for Diffuse Large B-Cell Lymphoma (DLBCL). It was derived from the NCICCR-DLBCL project available through the NCI Genomic Data Commons (GDC) and includes only non-controlled, anonymized data suitable for public research and education.

Dataset Content

Gene-level RNA-Seq expression quantified using the STAR workflow

Data provided in long (tidy) tabular format

Includes:

Raw read counts

Strand-specific counts

Normalized expression values (TPM, FPKM, FPKM-UQ)

Gene annotations such as gene symbols and gene biotypes are included

Each row represents the expression of a single gene in a single sample. Alignment summary rows (e.g., N_unmapped, N_multimapping) are retained for quality-control and transparency.
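Since the data are in long format with alignment summary rows retained, a common first step is to drop those rows and pivot to a gene-by-sample matrix. The sketch below shows one way to do that; the column names `gene_id`, `sample_id`, and `tpm` are hypothetical stand-ins (the actual headers should be checked in the files), as are the function name and the demo values.

```python
from collections import defaultdict


def long_to_wide(rows, value_key="tpm"):
    """Pivot long rows into {gene: {sample: value}}, skipping summary rows."""
    matrix = defaultdict(dict)
    for row in rows:
        # Alignment summary rows such as N_unmapped are QC counters, not genes.
        if row["gene_id"].startswith("N_"):
            continue
        matrix[row["gene_id"]][row["sample_id"]] = float(row[value_key])
    return dict(matrix)


# Made-up long-format records; one row per gene per sample.
long_rows = [
    {"gene_id": "GENE_A", "sample_id": "S1", "tpm": "1.5"},
    {"gene_id": "GENE_A", "sample_id": "S2", "tpm": "2.5"},
    {"gene_id": "GENE_B", "sample_id": "S1", "tpm": "0.0"},
    {"gene_id": "N_unmapped", "sample_id": "S1", "tpm": ""},
]
wide = long_to_wide(long_rows)
print(wide)
```

Filtering before the numeric conversion matters here, because summary rows may carry empty values in the normalized-expression columns.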

Intended Use

This dataset is designed for:

- Exploratory Data Analysis (EDA)
- Gene expression profiling
- Dimensionality reduction and clustering
- Feature engineering and preprocessing for machine learning
- Educational and research purposes in bioinformatics and computational biology

It is not intended for clinical diagnosis or medical decision-making.

Source and Attribution

The original data were generated and curated by the National Cancer Institute (NCI) and accessed via the Genomic Data Commons (GDC). This dataset represents a processed and consolidated form of that open-access data for ease of use.

License

This dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users are free to use and adapt the data with proper attribution to the original source.

Citation suggestion:
NCICCR-DLBCL project, NCI Genomic Data Commons (GDC)
Pan-Cancer Atlas (PanCanAtlas)
datacatalog.mskcc.org
Updated Nov 19, 2019
Cite
United States - National Institutes of Health (NIH) - National Cancer Institute (NCI) (2019). Pan-Cancer Atlas (PanCanAtlas) [Dataset]. https://datacatalog.mskcc.org/dataset/10404
Explore at:
Dataset updated
Nov 19, 2019
Dataset provided by
National Cancer Institute (http://www.cancer.gov/)
MSK Library
Description
The Pan-Cancer Atlas (PanCanAtlas) initiative aims to answer big, overarching questions about cancer by examining the full set of tumors characterized in the robust TCGA dataset. The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences and emergent themes across tumor lineages. The Pan-Cancer Atlas initiative compares the 33 tumor types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumor types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile. The datasets gathered as part of the PanCan Atlas are included in the larger Genomic Data Commons (GDC) repository.
Table 1 from NCI’s Proteomic Data Commons: A Cloud-Based Proteomics...
datasetcatalog.nlm.nih.gov
aacr.figshare.com
Updated Sep 20, 2024
Cite
Rodriguez, Henry; Connolly, Brian; Ma, Lei; Pilozzi, Alexander; Chaudhary, Rekha; Rudnick, Paul A.; Nyce, Kristen; Ketchum, Karen A.; Riffle, Michael; Edwards, Nathan; Thangudu, Ratna R.; Domagalski, Marcin J.; McGarvey, Peter B.; Xin, Yi; Zhang, Xu; MacLean, Brendan; Chambers, Matthew C.; Otridge, John; Casas-Silva, Esmeralda; Maurais, Aaron; MacCoss, Michael J.; Singhal, Deepak; Le, Toan; Chilappagari, Padmini; Basu, Anand; Venkatachari, Sudha; Holck, Michael (2024). Table 1 from NCI’s Proteomic Data Commons: A Cloud-Based Proteomics Repository Empowering Comprehensive Cancer Analysis through Cross-Referencing with Genomic and Imaging Data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001394937
Dataset updated
Sep 20, 2024
Authors
Rodriguez, Henry; Connolly, Brian; Ma, Lei; Pilozzi, Alexander; Chaudhary, Rekha; Rudnick, Paul A.; Nyce, Kristen; Ketchum, Karen A.; Riffle, Michael; Edwards, Nathan; Thangudu, Ratna R.; Domagalski, Marcin J.; McGarvey, Peter B.; Xin, Yi; Zhang, Xu; MacLean, Brendan; Chambers, Matthew C.; Otridge, John; Casas-Silva, Esmeralda; Maurais, Aaron; MacCoss, Michael J.; Singhal, Deepak; Le, Toan; Chilappagari, Padmini; Basu, Anand; Venkatachari, Sudha; Holck, Michael
Description
Available data types in the Proteomic Data Commons.
The Cancer Genome Atlas Prostate Adenocarcinoma Collection
cancerimagingarchive.net
dicom, n/a
Updated Feb 2, 2014
Cite
The Cancer Imaging Archive (2014). The Cancer Genome Atlas Prostate Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.YXOGLM4Y
Available download formats: dicom, n/a
Unique identifier
https://doi.org/10.7937/K9/TCIA.2016.YXOGLM4Y
Dataset updated
Feb 2, 2014
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 29, 2020
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal, while the radiological data are stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
CIP TCGA Radiology Initiative
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
TCGA RNA Datasets
kaggle.com
zip
Updated Apr 2, 2024
Cite
Tianjie Chen (2024). TCGA RNA Datasets [Dataset]. https://www.kaggle.com/datasets/tianjiechen/tcga-rna-datasets/discussion
Available download formats: zip (133551151 bytes)
Dataset updated
Apr 2, 2024
Authors
Tianjie Chen
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Used for the paper "Explainable machine learning approach for cancer prediction through binarization of RNA sequencing data", published in PLOS ONE (doi: 10.1371/journal.pone.0302947).

The feature "sample_type_id" is the label (target variable). A value of 0.0 means the patient does not have cancer; a value of 1.0 means the patient has cancer.

The data were collected from the Genomic Data Commons (GDC), which hosts The Cancer Genome Atlas (TCGA) data.
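
As a minimal sketch of how the "sample_type_id" label described above might be separated from the expression features, the following uses only the Python standard library. The CSV layout, the gene column names, and the helper name split_features_and_labels are assumptions made for illustration; the listing does not document the actual file structure inside the zip archive.

```python
import csv
import io

# Label column named in the dataset description.
LABEL_COLUMN = "sample_type_id"


def split_features_and_labels(csv_text, label_column=LABEL_COLUMN):
    """Parse CSV text and return (feature_rows, labels).

    Hypothetical helper: assumes one row per sample, numeric feature
    columns, and a label column holding 0.0 or 1.0.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        # Labels are given as 0.0 / 1.0, so parse as float before casting to int.
        labels.append(int(float(row.pop(label_column))))
        # Every remaining column is treated as a numeric feature.
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


# Toy input with made-up gene columns, not taken from the real dataset.
toy_csv = "gene_a,gene_b,sample_type_id\n1,0,1.0\n0,0,0.0\n"
features, labels = split_features_and_labels(toy_csv)
```

Here labels holds one integer per sample (1 for a cancer patient, 0 otherwise), and features holds the remaining columns as floats, ready to be passed to a classifier of the reader's choice.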

Genomic Data Commons Data Portal (GDC Data Portal)

RRID:SCR_014514, Genomic Data Commons Data Portal (GDC Data Portal) (RRID:SCR_014514), Genomic Data Commons Data Portal, GDC Data Portal

94 scholarly articles cite this dataset
