100+ datasets found
  1. Genomic Data Commons Data Portal (GDC Data Portal)

    • neuinfo.org
    • rrid.site
    • +2 more
    Updated Oct 18, 2024
    Cite
    (2024). Genomic Data Commons Data Portal (GDC Data Portal) [Dataset]. http://identifiers.org/RRID:SCR_014514/resolver/mentions
    Dataset updated
    Oct 18, 2024
    Description

    A unified data repository of the National Cancer Institute (NCI)'s Genomic Data Commons (GDC) that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI). The GDC Data Portal provides a platform for efficiently querying and downloading high quality and complete data. The GDC also provides a GDC Data Transfer Tool and a GDC API for programmatic access.
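
    As a rough illustration of the programmatic access mentioned above, the sketch below builds a query URL for the GDC API's files endpoint. The host api.gdc.cancer.gov appears elsewhere on this page; the endpoint path, the filter field cases.project.project_id, the requested fields, and the helper names build_files_query and fetch_json are assumptions taken from the public GDC API conventions, not from this listing, and should be checked against the current GDC documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GDC_API = "https://api.gdc.cancer.gov"  # host as cited on this page


def build_files_query(project_id, size=10):
    """Return a URL listing files for one project (hypothetical helper).

    The filter layout (op / content / field / value) and the field names
    are assumptions based on the GDC API's documented query format.
    """
    filters = {
        "op": "in",
        "content": {
            "field": "cases.project.project_id",
            "value": [project_id],
        },
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_API}/files?{urlencode(params)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(url) to actually query the API.
    print(build_files_query("TCGA-COAD", size=5))
```

    Building the URL separately from fetching it keeps the query inspectable before any request is sent.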

  2. Genomic Data Commons Data Portal

    • bioregistry.io
    Updated Apr 23, 2021
    Cite
    (2021). Genomic Data Commons Data Portal [Dataset]. https://bioregistry.io/gdc
    Dataset updated
    Apr 23, 2021
    Description

    The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

  3. Historical NCI Genomic Data Commons data (09-14-2017)

    • zenodo.org
    • data-staging.niaid.nih.gov
    tsv
    Updated Jan 24, 2020
    Cite
    Inge Seim (2020). Historical NCI Genomic Data Commons data (09-14-2017) [Dataset]. http://doi.org/10.5281/zenodo.1186945
    Available download formats: tsv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Inge Seim
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Historical NCI Genomic Data Commons data (v09-14-2017). Clinical ('phenotype') and gene expression (HTSeq FPKM-UQ).

    TCGA-COAD.GDC_phenotype.tsv

    dataset: phenotype - Phenotype

    cohort: GDC TCGA Colon Cancer (COAD)
    dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv
    download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv.gz
    samples: 570
    version: 11-27-2017
    hub: https://gdc.xenahubs.net
    type of data: phenotype
    author: Genomic Data Commons
    raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-90
    raw data: https://api.gdc.cancer.gov/data/
    input data format: ROWs (samples) x COLUMNs (identifiers) (i.e. clinicalMatrix)
    570 samples x 151 identifiers

    TCGA-COAD.htseq_fpkm-uq.tsv

    dataset: gene expression RNAseq - HTSeq - FPKM-UQ

    cohort: GDC TCGA Colon Cancer (COAD)
    dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv
    download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv.gz
    samples: 512
    version: 09-14-2017
    hub: https://gdc.xenahubs.net
    type of data: gene expression RNAseq
    unit: log2(fpkm-uq+1)
    platform: Illumina
    ID/Gene Mapping: https://gdc.xenahubs.net/download/probeMaps/gencode.v22.annotation.gene.probeMap.gz
    author: Genomic Data Commons
    raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-80
    raw data: https://api.gdc.cancer.gov/data/
    wrangling: Data from the same sample but from different vials/portions/analytes/aliquots are averaged; data from different samples are combined into a genomicMatrix; all data are then log2(x+1) transformed.
    input data format: ROWs (identifiers) x COLUMNs (samples) (i.e. genomicMatrix)
    60,484 identifiers x 512 samples
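
    The wrangling step described above (average per sample, combine, then log2(x+1)) can be sketched as follows. This is a minimal illustration, not the pipeline actually used to build the file: the function name build_genomic_matrix and the input layout (one (sample_id, values) pair per aliquot) are assumptions made for the example.

```python
from collections import defaultdict
from math import log2


def build_genomic_matrix(records):
    """Average repeated measurements per sample, then apply log2(x + 1).

    records: iterable of (sample_id, {identifier: value}) pairs, one pair
    per vial/portion/analyte/aliquot (hypothetical input layout).
    Returns {identifier: {sample_id: transformed_value}}, i.e. rows are
    identifiers and columns are samples.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for sample_id, values in records:
        for identifier, value in values.items():
            sums[identifier][sample_id] += value
            counts[identifier][sample_id] += 1

    matrix = {}
    for identifier, per_sample in sums.items():
        matrix[identifier] = {
            sample_id: log2(total / counts[identifier][sample_id] + 1.0)
            for sample_id, total in per_sample.items()
        }
    return matrix


if __name__ == "__main__":
    demo = [
        ("S1", {"geneA": 1.0}),  # two aliquots of the same sample
        ("S1", {"geneA": 5.0}),
        ("S2", {"geneA": 0.0}),
    ]
    print(build_genomic_matrix(demo))
```

    Averaging happens before the transform here, matching the order in which the description lists the steps.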

  4. The Cancer Genome Atlas Breast Invasive Carcinoma Collection

    • cancerimagingarchive.net
    dicom, n/a
    Updated Feb 2, 2014
    Cite
    The Cancer Imaging Archive (2014). The Cancer Genome Atlas Breast Invasive Carcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP
    Available download formats: dicom, n/a
    Dataset updated
    Feb 2, 2014
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 29, 2020
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

    Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.

    CIP TCGA Radiology Initiative

    Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Breast Phenotype Research Group.

  5. The Cancer Genome Atlas Rectum Adenocarcinoma Collection

    • cancerimagingarchive.net
    dicom, n/a
    Updated Jan 5, 2016
    Cite
    The Cancer Imaging Archive (2016). The Cancer Genome Atlas Rectum Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.F7PPNPNU
    Available download formats: dicom, n/a
    Dataset updated
    Jan 5, 2016
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 29, 2020
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

    Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.

    CIP TCGA Radiology Initiative

    Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.

  6. The Cancer Genome Atlas (TCGA) RNA-seq meta-analysis

    • figshare.com
    xlsx
    Updated Feb 2, 2018
    Cite
    Namshik Han (2018). The Cancer Genome Atlas (TCGA) RNA-seq meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.5851743.v1
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Namshik Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TCGA RNA-seq V2 Level3 data were downloaded from the TCGA Genomic Data Commons Data Portal (https://gdc-portal.nci.nih.gov), consisting of 11,303 samples in 34 cancer projects (33 cancer types). Nine cancer types that do not have corresponding non-tumour samples were filtered out, and the analysis focused on the tumour versus non-tumour comparison. 24 cancer types were used in this meta-analysis: BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, THCA, THYM, UCEC (https://gdc-portal.nci.nih.gov). The nine filtered cancer types were ACC, DLBC, LAML, LGG, MESO, OV, TGCT, UCS and UVM.

    To extract expression values from TCGA RNA-seq data, we used genomic coordinates to retrieve UCSC Transcript IDs that correspond to the identifiers in TCGA RNA-seq V2 Level3 data (isoform level). The GAF (General Annotation Format) file was used to map the coordinates to UCSC Transcript IDs, and it was downloaded from https://tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/outputs/TCGA.hg19.June2011.gaf. This file contains genomic annotations shared by all TCGA projects. More details of the GAF file format can be found at https://tcga-data.nci.nih.gov/docs/GAF/GAF3.0/GAF_v3_file_description.docx. We filtered out any coding exons overlapping UCSC Transcript IDs to eliminate expression values of coding genes and evaluate lncRNA expression.

    We could find the expression values of 443 pcRNAs and 203 tapRNAs in TCGA data, as many non-coding regions are not yet fully annotated in the TCGA RNA-seq V2 Level3 data. The expression values of pcRNAs and tapRNAs were extracted and clustered by an unsupervised Pearson correlation method (Supplementary Figure 18A). The expression values of tapRNA-associated coding genes were also extracted and used to generate the heat-map (Supplementary Figure 18B), which shows a pattern of expression similar to that of tapRNAs across the cancer types.

    To show that tapRNAs and associated coding genes have similar expression profiles in cancers, we generated a Spearman's Rank-Order Correlation heatmap (Figure 6A) between tapRNAs and their associated coding genes based on the TCGA RNA-seq data. We used the MATLAB function corr to calculate Spearman's rho. This function takes two matrices X (197-by-8,850 expression profiling matrix of tapRNA) and Y (197-by-8,850 expression profiling matrix of tapRNA-associated coding gene) and returns an 8,850-by-8,850 matrix containing the pairwise correlation coefficient between each pair of 8,850 columns (TCGA cancer samples in Supplementary Figure 18A and B). Thus, the rank-order correlation matrix that we computed from the matrices of expression profiling data (Supplementary Figure S18A and B) allowed us to compare the correlation between two column vectors, i.e. cancer samples. This function also returns a matrix of p-values for testing the hypothesis of no correlation against the alternative that there is a nonzero correlation. Each element of the matrix of p-values is the p-value for the corresponding element of Spearman's rho. The p-values for Spearman's rho are calculated using large-sample approximations. To check the significance level of the correlation between a tapRNA and its associated coding gene, the diagonal of the p-value matrix was extracted and used. The median is 1.31 x 10^-11 and the mean is 1.03 x 10^-4 with standard deviation 0.0029.

    To identify cancer-specific tapRNAs, we considered not only the global expression pattern of a given tapRNA in each cancer type, but also the expression pattern of a specific sub-group that is significantly distinct, to take into account cancer sample heterogeneity. Thus, two conditions were applied: (1) the average expression level of a tapRNA in a given cancer type is in the top 10% or bottom 10% and (2) a tapRNA has at least 10% of samples in a given cancer type that are significantly up-regulated (Z-score > 2) or down-regulated (Z-score < -2).
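
    The column-wise Spearman comparison described above can be illustrated with a small pure-Python sketch. It is not the authors' MATLAB code: it computes only the diagonal entries (column j of X against column j of Y) rather than the full pairwise matrix, it omits p-values, and the function names average_ranks, spearman_rho and paired_column_rho are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman_rho(a, b):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def paired_column_rho(X, Y):
    """Rho between column j of X and column j of Y, for every j.

    X and Y are lists of rows with identical shapes (rows are features,
    columns are samples), mirroring the matrix layout in the text.
    """
    n_cols = len(X[0])
    return [
        spearman_rho([row[j] for row in X], [row[j] for row in Y])
        for j in range(n_cols)
    ]


if __name__ == "__main__":
    X = [[1, 5], [2, 3], [3, 1]]
    Y = [[10, 1], [20, 2], [30, 3]]
    print(paired_column_rho(X, Y))
```

    Computing only the paired columns avoids materialising the full 8,850-by-8,850 matrix when just the diagonal is needed.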

  7. The Cancer Genome Atlas Lung Adenocarcinoma Collection

    • cancerimagingarchive.net
    dicom, n/a
    Updated Jan 30, 2017
    Cite
    The Cancer Imaging Archive (2017). The Cancer Genome Atlas Lung Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.JGNIHEP5
    Available download formats: dicom, n/a
    Dataset updated
    Jan 30, 2017
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 29, 2020
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

    Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.

    CIP TCGA Radiology Initiative

    Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Lung Phenotype Research Group.

  8. Cancer Categories and clinical research figures

    • kaggle.com
    zip
    Updated Oct 16, 2025
    Cite
    DrAHung (2025). Cancer Categories and clinical research figures [Dataset]. https://www.kaggle.com/datasets/drahung/cancer-categories-and-clinical-research-figures
    Available download formats: zip (996208 bytes)
    Dataset updated
    Oct 16, 2025
    Authors
    DrAHung
    Description

    This dataset integrates open public data from multiple biomedical sources to provide a structured, queryable database of cancer classifications and clinical data from The Cancer Genome Atlas (TCGA).

    All data are de-identified and publicly available via the U.S. National Cancer Institute (NCI) Genomic Data Commons (GDC) API, ensuring full compliance with NIH open-access guidelines.

    Included tables:

    • cancer_category: Disease Ontology (DOID) categories and hierarchical labels (including English + Chinese translations).
    • patient_tcga_clinical: De-identified patient clinical records per TCGA project (demographics, stage, grade, survival, treatment).
    • tcga_project_summary: Per-project summary statistics (case counts, survival averages, tumor stage/grade coverage, and mapped cancer type).
    • tcga_project: TCGA project metadata with links to DOID cancer categories.

    Data source is from The Cancer Genome Atlas (TCGA).

    A snapshot of clinical data: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F29334708%2F0049f6224420593507bfc8072df3e0e4%2Fsample.png?generation=1760586452165254&alt=media

  9. TCGA-12K-litdata

    • huggingface.co
    Updated Nov 20, 2025
    Cite
    MedARC (2025). TCGA-12K-litdata [Dataset]. https://huggingface.co/datasets/medarc/TCGA-12K-litdata
    Dataset updated
    Nov 20, 2025
    Dataset authored and provided by
    MedARC
    Description

    Attribution

    This dataset contains 224 x 224 JPEG patches from whole-slide images originally downloaded from The Cancer Genome Atlas (TCGA) that are available in the NCI Genomic Data Commons (GDC) Open Access tier. We mirror and repackage a commonly used ~12k WSI subset in LitData format for ease of training. We exclude patches that did not pass HSV thresholding, following the procedure in Kaiko.AI's Midnight paper. Patches were randomly sampled across magnification levels. There are… See the full description on the dataset page: https://huggingface.co/datasets/medarc/TCGA-12K-litdata.

  10. The Cancer Genome Atlas Sarcoma Collection

    • cancerimagingarchive.net
    dicom, n/a
    Updated Jan 5, 2016
    Cite
    The Cancer Imaging Archive (2016). The Cancer Genome Atlas Sarcoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.CX6YLSUX
    Available download formats: dicom, n/a
    Dataset updated
    Jan 5, 2016
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 29, 2020
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Cancer Genome Atlas Sarcoma (TCGA-SARC) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

    Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.

    CIP TCGA Radiology Initiative

    Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.

  11. Comparison of the top 10 differentially expressed genes inferred from...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Mar 4, 2025
    Cite
    Ling-Hong Hung; Bryce Fukuda; Robert Schmitz; Varik Hoang; Wes Lloyd; Ka Yee Yeung (2025). Comparison of the top 10 differentially expressed genes inferred from concatenation of published counts (“published vs published”) versus those inferred from harmonized uniform GDC re-processing (“reprocessed vs reprocessed”). [Dataset]. http://doi.org/10.1371/journal.pone.0318676.t002
    Available download formats: xls
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ling-Hong Hung; Bryce Fukuda; Robert Schmitz; Varik Hoang; Wes Lloyd; Ka Yee Yeung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the top 10 differentially expressed genes inferred from concatenation of published counts (“published vs published”) versus those inferred from harmonized uniform GDC re-processing (“reprocessed vs reprocessed”).

  12. TCGA-KIRP

    • kaggle.com
    zip
    Updated Apr 27, 2022
    Cite
    Chun Yu Chen (2022). TCGA-KIRP [Dataset]. https://www.kaggle.com/datasets/junyussh/tcga-kirp/versions/5
    Available download formats: zip (5197766487 bytes)
    Dataset updated
    Apr 27, 2022
    Authors
    Chun Yu Chen
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The Cancer Genome Atlas Kidney Renal Papillary Cell Carcinoma (TCGA-KIRP) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

    Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.

    Source: https://wiki.cancerimagingarchive.net/display/Public/TCGA-KIRP

  13. Transportation Energy Resources from Renewable Agriculture Phenotyping...

    • datacommons.cyverse.org
    Updated Mar 1, 2016
    Cite
    TERRA-REF (2016). Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) Genomic Data Repository for Sorghum [Dataset]. https://datacommons.cyverse.org/browse/iplant/home/shared/terraref
    Dataset updated
    Mar 1, 2016
    Dataset provided by
    CyVerse Data Commons
    Authors
    TERRA-REF
    Description

    The terraref directory contains raw and derived sorghum genome sequencing data from the Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) project (http://terraref.org/). Raw data includes DNA sequence files in compressed FASTQ format. Derived data is available for whole-genome resequencing and genotyping-by-sequencing. See the README.txt and DATA_USE_POLICY.md for more information. TERRA-REF is under development, and November 15, 2017 marks the beta release. To request early access to these data sources and to receive notifications and updates, please fill out the beta-user application at http://terraref.org/data/.

  14. DataSheet_1_Survival Analysis of Multi-Omics Data Identifies Potential...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Nitish Kumar Mishra; Siddesh Southekal; Chittibabu Guda (2023). DataSheet_1_Survival Analysis of Multi-Omics Data Identifies Potential Prognostic Markers of Pancreatic Ductal Adenocarcinoma.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.00624.s001
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Nitish Kumar Mishra; Siddesh Southekal; Chittibabu Guda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pancreatic ductal adenocarcinoma (PDAC) is the most common and among the deadliest of pancreatic cancers. Its 5-year survival is only ∼8%. Pancreatic cancers are a heterogeneous group of diseases, of which PDAC is particularly aggressive. Like many other cancers, PDAC also starts as a pre-invasive precursor lesion (known as pancreatic intraepithelial neoplasia, PanIN), which offers an opportunity for both early detection and early treatment. Even advanced PDAC can benefit from prognostic biomarkers. However, reliable biomarkers for early diagnosis or those for prognosis of therapy remain an unfulfilled goal for PDAC. In this study, we selected 153 PDAC patients from the TCGA database and used their clinical, DNA methylation, gene expression, and micro-RNA (miRNA) and long non-coding RNA (lncRNA) expression data for multi-omics analysis. Differential methylation at about 12,000 CpG sites was observed in PDAC tumor genomes, with about 61% of them hypermethylated, predominantly in the promoter regions and in CpG islands. We correlated promoter methylation and gene expression for mRNAs and identified 17 genes that were previously recognized as PDAC biomarkers. Similarly, several genes (B3GNT3, DMBT1, DEPDC1B) and lncRNAs (PVT1 and GATA6-AS) that have not been reported in PDAC before are strongly correlated with survival. Other genes such as EFR3B, whose biological roles are not well known in mammals, were also found to be strongly associated with survival. We further identified 406 promoter methylation target loci associated with patient survival, including known esophageal squamous cell carcinoma biomarkers, cg03234186 (ZNF154), and cg02587316, cg18630667, and cg05020604 (ZNF382). Overall, this is one of the first studies that identified survival-associated genes using multi-omics data from PDAC patients.

  15. TCGA-12K-parquet

    • huggingface.co
    Updated Nov 20, 2025
    Cite
    MedARC (2025). TCGA-12K-parquet [Dataset]. https://huggingface.co/datasets/medarc/TCGA-12K-parquet
    Dataset updated
    Nov 20, 2025
    Dataset authored and provided by
    MedARC
    Description

    TCGA-12K Parquet

    Attribution

    This dataset contains 224 x 224 JPEG patches from whole-slide images originally downloaded from The Cancer Genome Atlas (TCGA) that are available in the NCI Genomic Data Commons (GDC) Open Access tier. We mirror and repackage a commonly used ~12k WSI subset in parquet format for ease of training. We exclude patches that did not pass HSV thresholding, following the procedure in Kaiko.AI's Midnight paper. Patches were randomly sampled across… See the full description on the dataset page: https://huggingface.co/datasets/medarc/TCGA-12K-parquet.

  16. DLBCL RNA-Seq Gene Expression Dataset

    • kaggle.com
    zip
    Updated Dec 19, 2025
    Cite
    Meenal Sinha (2025). DLBCL RNA-Seq Gene Expression Dataset [Dataset]. https://www.kaggle.com/datasets/meenalsinha/dlbcl-rna-seq-gene-expression-dataset
    Available download formats: zip (509385530 bytes)
    Dataset updated
    Dec 19, 2025
    Authors
    Meenal Sinha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About This Dataset

    This dataset contains open-access bulk RNA-Seq gene expression data for Diffuse Large B-Cell Lymphoma (DLBCL). It was derived from the NCICCR-DLBCL project available through the NCI Genomic Data Commons (GDC) and includes only non-controlled, anonymized data suitable for public research and education.

    Dataset Content

    • Gene-level RNA-Seq expression quantified using the STAR workflow
    • Data provided in long (tidy) tabular format
    • Includes:
      • Raw read counts
      • Strand-specific counts
      • Normalized expression values (TPM, FPKM, FPKM-UQ)
    • Gene annotations such as gene symbols and gene biotypes are included

    Each row represents the expression of a single gene in a single sample. Alignment summary rows (e.g., N_unmapped, N_multimapping) are retained for quality-control and transparency.
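
    One way to work with the long (tidy) layout described above is to pivot it into a gene-by-sample matrix while setting the alignment summary rows aside. The sketch below is only an illustration: the column names gene_id, sample_id and tpm, the "N_" prefix test, and the function name tidy_to_matrix are assumptions, so check them against the actual file header before use.

```python
def tidy_to_matrix(rows, value_key="tpm"):
    """Pivot tidy rows into {gene: {sample: value}}.

    rows: iterable of dicts, one per (gene, sample) pair, for example as
    produced by csv.DictReader. Rows whose gene identifier starts with
    "N_" (such as N_unmapped or N_multimapping) are treated as alignment
    summaries and skipped; that prefix rule is an assumption.
    """
    matrix = {}
    for row in rows:
        gene = row["gene_id"]  # hypothetical column name
        if gene.startswith("N_"):
            continue
        sample = row["sample_id"]  # hypothetical column name
        matrix.setdefault(gene, {})[sample] = float(row[value_key])
    return matrix


if __name__ == "__main__":
    demo_rows = [
        {"gene_id": "ENSG1", "sample_id": "S1", "tpm": "1.5"},
        {"gene_id": "N_unmapped", "sample_id": "S1", "tpm": "0"},
        {"gene_id": "ENSG1", "sample_id": "S2", "tpm": "2"},
    ]
    print(tidy_to_matrix(demo_rows))
```

    Passing value_key lets the same helper build matrices for other value columns, provided the file names them consistently.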

    Intended Use

    This dataset is designed for:

    • Exploratory Data Analysis (EDA)
    • Gene expression profiling
    • Dimensionality reduction and clustering
    • Feature engineering and preprocessing for machine learning
    • Educational and research purposes in bioinformatics and computational biology

    It is not intended for clinical diagnosis or medical decision-making.

    Source and Attribution

    The original data were generated and curated by the National Cancer Institute (NCI) and accessed via the Genomic Data Commons (GDC). This dataset represents a processed and consolidated form of that open-access data for ease of use.

    License

    This dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users are free to use and adapt the data with proper attribution to the original source.

    Citation suggestion:
    NCICCR-DLBCL project, NCI Genomic Data Commons (GDC)

  17. Pan-Cancer Atlas (PanCanAtlas)

    • datacatalog.mskcc.org
    Updated Nov 19, 2019
    Cite
    United States - National Institutes of Health (NIH) - National Cancer Institute (NCI) (2019). Pan-Cancer Atlas (PanCanAtlas) [Dataset]. https://datacatalog.mskcc.org/dataset/10404
    Explore at:
    Dataset updated
    Nov 19, 2019
    Dataset provided by
    National Cancer Institute (http://www.cancer.gov/)
    MSK Library
    Description

    The Pan-Cancer Atlas (PanCanAtlas) initiative aims to answer big, overarching questions about cancer by examining the full set of tumors characterized in the robust TCGA dataset. The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences and emergent themes across tumor lineages. The Pan-Cancer Atlas initiative compares the 33 tumor types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumor types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile. The datasets gathered as part of the PanCan Atlas are included in the larger Genomic Data Commons (GDC) repository.

  18. Table 1 from NCI’s Proteomic Data Commons: A Cloud-Based Proteomics...

    • datasetcatalog.nlm.nih.gov
    • aacr.figshare.com
    Updated Sep 20, 2024
    Cite
    Rodriguez, Henry; Connolly, Brian; Ma, Lei; Pilozzi, Alexander; Chaudhary, Rekha; Rudnick, Paul A.; Nyce, Kristen; Ketchum, Karen A.; Riffle, Michael; Edwards, Nathan; Thangudu, Ratna R.; Domagalski, Marcin J.; McGarvey, Peter B.; Xin, Yi; Zhang, Xu; MacLean, Brendan; Chambers, Matthew C.; Otridge, John; Casas-Silva, Esmeralda; Maurais, Aaron; MacCoss, Michael J.; Singhal, Deepak; Le, Toan; Chilappagari, Padmini; Basu, Anand; Venkatachari, Sudha; Holck, Michael (2024). Table 1 from NCI’s Proteomic Data Commons: A Cloud-Based Proteomics Repository Empowering Comprehensive Cancer Analysis through Cross-Referencing with Genomic and Imaging Data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001394937
    Explore at:
    Dataset updated
    Sep 20, 2024
    Authors
    Rodriguez, Henry; Connolly, Brian; Ma, Lei; Pilozzi, Alexander; Chaudhary, Rekha; Rudnick, Paul A.; Nyce, Kristen; Ketchum, Karen A.; Riffle, Michael; Edwards, Nathan; Thangudu, Ratna R.; Domagalski, Marcin J.; McGarvey, Peter B.; Xin, Yi; Zhang, Xu; MacLean, Brendan; Chambers, Matthew C.; Otridge, John; Casas-Silva, Esmeralda; Maurais, Aaron; MacCoss, Michael J.; Singhal, Deepak; Le, Toan; Chilappagari, Padmini; Basu, Anand; Venkatachari, Sudha; Holck, Michael
    Description

    Available data types in the Proteomic Data Commons

  19. The Cancer Genome Atlas Prostate Adenocarcinoma Collection

    • cancerimagingarchive.net
    dicom, n/a
    Updated Feb 2, 2014
    Cite
    The Cancer Imaging Archive (2014). The Cancer Genome Atlas Prostate Adenocarcinoma Collection [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.YXOGLM4Y
    Explore at:
    Available download formats: dicom, n/a
    Dataset updated
    Feb 2, 2014
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 29, 2020
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).

    Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
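The matching described above can be sketched as follows. This is a hedged illustration rather than a TCIA or GDC tool: it assumes that the patient-level identifier is the first three hyphen-separated fields of a TCGA barcode, and every barcode and helper name below (`patient_id`, `match_patients`) is made up for the example.

```python
def patient_id(barcode):
    """Return the patient-level prefix (first three fields) of a TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA-style barcode: {barcode!r}")
    return "-".join(parts[:3])


def match_patients(genomic_barcodes, imaging_ids):
    """Patients that appear in both the genomic and the imaging listings."""
    genomic_patients = {patient_id(b) for b in genomic_barcodes}
    return sorted(genomic_patients & set(imaging_ids))


# Hypothetical identifiers, invented for illustration only.
genomic_samples = [
    "TCGA-ZZ-0001-01A-11R-0000-07",
    "TCGA-ZZ-0002-01A-11R-0000-07",
]
imaging_subjects = ["TCGA-ZZ-0001", "TCGA-ZZ-0003"]

matched = match_patients(genomic_samples, imaging_subjects)
print(matched)
```

Reducing sample-level barcodes to patient level before intersecting avoids missing a match simply because the genomic side carries extra sample and aliquot fields.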

    CIP TCGA Radiology Initiative

    Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.

  20. TCGA RNA Datasets

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Tianjie Chen (2024). TCGA RNA Datasets [Dataset]. https://www.kaggle.com/datasets/tianjiechen/tcga-rna-datasets/discussion
    Explore at:
    Available download formats: zip (133551151 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Tianjie Chen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Used for the paper "Explainable machine learning approach for cancer prediction through binarization of RNA sequencing data", published in PLOS ONE (doi: 10.1371/journal.pone.0302947).

    The feature "sample_type_id" is the label / target variable. Value 0.0 means the patient is not a cancer patient, whereas value 1.0 means the patient is a cancer patient.
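A short sketch of how the label might be separated from the features before modeling. Only the `sample_type_id` column name and its 0.0/1.0 encoding come from the description; the feature names (`GENE_A`, `GENE_B`), the records, and the helper `split_features_labels` are hypothetical.

```python
from collections import Counter

# Hypothetical records; only "sample_type_id" and its 0.0/1.0 coding are
# taken from the dataset description.
records = [
    {"sample_type_id": 1.0, "GENE_A": 1, "GENE_B": 0},
    {"sample_type_id": 0.0, "GENE_A": 0, "GENE_B": 0},
    {"sample_type_id": 1.0, "GENE_A": 1, "GENE_B": 1},
]


def split_features_labels(rows, label_key="sample_type_id"):
    """Split each record into a feature dict and an integer label (0 or 1)."""
    features = [{k: v for k, v in r.items() if k != label_key} for r in rows]
    labels = [int(r[label_key]) for r in rows]
    return features, labels


features, labels = split_features_labels(records)
class_counts = Counter(labels)  # class balance: 1 = cancer, 0 = not cancer
print(class_counts)
```

Checking the class counts first is a cheap way to see whether the two classes are balanced before choosing an evaluation metric.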

    Collected from The Cancer Genome Atlas (TCGA) data hosted by the NCI Genomic Data Commons (GDC).


Genomic Data Commons Data Portal (GDC Data Portal)

RRID:SCR_014514, Genomic Data Commons Data Portal (GDC Data Portal) (RRID:SCR_014514), Genomic Data Commons Data Portal, GDC Data Portal

Explore at:
94 scholarly articles cite this dataset (View in Google Scholar)
