https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal, while the radiological data are stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Interactions of the extracellular matrix (ECM) and cellular receptors constitute one of the crucial pathways involved in colorectal cancer progression and metastasis. Using bioinformatics analysis, we comprehensively evaluated the prognostic information concentrated in the genes from this pathway. First, we constructed an ECM–receptor regulatory network by integrating the transcription factor (TF) and 5’-isomiR interaction databases with mRNA/miRNA-seq data from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD). Notably, one-third of the interactions mediated by 5’-isomiRs were represented by noncanonical isomiRs (isomiRs whose 5’-end sequence did not match the canonical miRBase version). Then, exhaustive search-based feature selection was used to fit prognostic signatures composed of nodes from the network for overall survival prediction. Two reliable prognostic signatures were identified and validated on the independent The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) cohort. The first signature was made up of six genes directly involved in ECM–receptor interaction: AGRN, DAG1, FN1, ITGA5, THBS3, and TNC (concordance index 0.61, logrank test p = 0.0164, 3-year ROC AUC = 0.68). The second, hybrid signature was composed of three regulators: hsa-miR-32-5p, NR1H2, and SNAI1 (concordance index 0.64, logrank test p = 0.0229, 3-year ROC AUC = 0.71). While hsa-miR-32-5p exclusively regulated ECM-related genes (COL1A2 and ITGA5), NR1H2 and SNAI1 also targeted other pathways (adhesion, cell cycle, and cell division). Concordant distributions of the respective risk scores across the four stages of colorectal cancer and adjacent normal mucosa additionally confirmed the reliability of the models.
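The concordance index reported for both signatures can be illustrated with a minimal sketch of Harrell's C. This is not the authors' code: the function name, the toy data, and the choice to skip pairs with tied follow-up times are assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times       -- follow-up time per patient
    events      -- 1/True if death was observed, 0/False if censored
    risk_scores -- model output; higher means worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: pairs with tied times are skipped. A pair is
        # comparable only if the shorter time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0      # earlier event had the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5      # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented): four patients, the third one censored.
c = concordance_index([5, 10, 15, 20], [1, 1, 0, 1], [0.9, 0.6, 0.7, 0.1])
print(round(c, 2))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.61 and 0.64 in context.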
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This collection contains subjects from the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) Colon Adenocarcinoma cohort. CPTAC is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. Radiology and pathology images from CPTAC patients are being collected and made publicly available by The Cancer Imaging Archive to enable researchers to investigate cancer phenotypes that may correlate with corresponding proteomic, genomic and clinical data.
Imaging from each cancer type will be contained in its own TCIA Collection, with the collection name "CPTAC-cancertype". Radiology imaging is collected from standard of care imaging performed on patients immediately before the pathological diagnosis, and from follow-up scans where available. For this reason the radiology image data sets are heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. Pathology imaging is collected as part of the CPTAC qualification workflow.
All CPTAC cohorts are released as either a single combined cohort, or split into Discovery and Confirmatory where applicable. There are two main types of proteomic studies: discovery proteomics and targeted proteomics. The term "discovery proteomics" is in reference to "untargeted" identification and quantification of a maximal number of proteins in a biological or clinical sample. The term “targeted proteomics” refers to quantitative measurements on a defined subset of total proteins in a biological or clinical sample, often following the completion of discovery proteomics studies to confirm interesting targets selected. Commonly used proteomic technologies and platforms are different types of mass spectrometry and protein microarrays depending on the needs, throughput and sample input requirement of an analysis, with further development on nanotechnologies and automation in the pipeline in order to improve the detection of low abundance proteins, increase throughput, and selectively reach a target protein in vivo. Once the protein targets of interest are identified, high-throughput targeted assays are developed for confirmatory studies: tests to affirm that the initial tests were accurate. A summary of CPTAC imaging efforts can be found on the CPTAC Imaging Proteomics page.
You can join the CPTAC Imaging Special Interest Group to be notified of webinars & data releases, collaborate on common data wrangling tasks and seek out partners to explore research hypotheses! Artifacts from previous webinars such as slide decks and video recordings can be found on the CPTAC SIG Webinars page.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Historical NCI Genomic Data Commons data (v09-14-2017). Clinical ('phenotype') and gene expression (HTSeq FPKM-UQ).
dataset: phenotype - Phenotype
cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv.gz
samples: 570
version: 11-27-2017
hub: https://gdc.xenahubs.net
type of data: phenotype
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-90
raw data: https://api.gdc.cancer.gov/data/
input data format: ROWs (samples) x COLUMNs (identifiers) (i.e. clinicalMatrix)
570 samples x 151 identifiers
dataset: gene expression RNAseq - HTSeq - FPKM-UQ
cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv.gz
samples: 512
version: 09-14-2017
hub: https://gdc.xenahubs.net
type of data: gene expression RNAseq
unit: log2(fpkm-uq+1)
platform: Illumina
ID/Gene Mapping: https://gdc.xenahubs.net/download/probeMaps/gencode.v22.annotation.gene.probeMap.gz
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-80
raw data: https://api.gdc.cancer.gov/data/
wrangling: Data from the same sample but from different vials/portions/analytes/aliquots is averaged; data from different samples is combined into genomicMatrix; all data is then log2(x+1) transformed.
input data format: ROWs (identifiers) x COLUMNs (samples) (i.e. genomicMatrix)
60,484 identifiers x 512 samples
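The wrangling steps described for the expression matrix (average measurements of the same sample, combine samples into one matrix, then apply log2(x+1)) can be sketched as follows. This is an illustrative reading of that description, not the Xena pipeline itself; the function name, the input layout, and the toy values are assumptions, and the caller is expected to have already assigned each measurement to its sample.

```python
import math
from collections import defaultdict


def build_genomic_matrix(measurements):
    """Average per-sample measurements, then log2(x + 1) transform.

    measurements -- iterable of (sample_id, {identifier: value}) pairs; several
                    entries may share a sample_id (e.g. different aliquots).
    Returns {identifier: {sample_id: transformed value}}, i.e. rows are
    identifiers and columns are samples, like a genomicMatrix.
    """
    by_sample = defaultdict(list)
    for sample_id, profile in measurements:
        by_sample[sample_id].append(profile)

    matrix = defaultdict(dict)
    for sample_id, profiles in by_sample.items():
        for identifier in profiles[0]:
            # Step 1: average values measured for the same sample.
            mean = sum(p[identifier] for p in profiles) / len(profiles)
            # Step 2: log2(x + 1) keeps zero expression at zero.
            matrix[identifier][sample_id] = math.log2(mean + 1)
    return dict(matrix)


# Toy input (invented values): sample S1 was measured twice.
toy = [("S1", {"GENE_A": 3.0}), ("S1", {"GENE_A": 5.0}), ("S2", {"GENE_A": 0.0})]
print(build_genomic_matrix(toy))
```

Averaging before the log transform matches the order given in the wrangling note; transforming first and averaging afterwards would yield different values.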
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
TCGA COAD paired sample isoform level read counts from Level 3 RNASeq-v2 data.
MIT License: https://opensource.org/licenses/MIT
TCGA Cancer Variant and Clinical Data
Dataset Description
This dataset combines genetic variant information at the protein level with clinical data from The Cancer Genome Atlas (TCGA) project, curated by the International Cancer Genome Consortium (ICGC). It provides a comprehensive view of protein-altering mutations and clinical characteristics across various cancer types.
Dataset Summary
The dataset includes:
Protein sequence data for both mutated and… See the full description on the dataset page: https://huggingface.co/datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Colon adenocarcinoma (COAD) is the most common type of colon cancer and carries high mortality. Because of their association with cancer progression, long noncoding RNAs (lncRNAs) have emerged as prognostic biomarkers. Using clinical information and lncRNA expression profiles from The Cancer Genome Atlas database, this study aims to construct a prognostic lncRNA signature for estimating patient prognosis. In the training cohort, prognosis-related lncRNAs were selected from differentially expressed lncRNAs by univariate Cox analysis. Least absolute shrinkage and selection operator (LASSO) regression and multivariate Cox analysis were then used to identify the prognostic lncRNAs from which the signature was constructed. The prognostic model calculates a risk score for each COAD patient and splits patients into low-risk and high-risk groups. Compared with the low-risk group, the high-risk group had a significantly poorer prognosis. The signature was then validated in the validation cohort and in the entire cohort, and the receiver operating characteristic (ROC) curve and c-index were computed in all cohorts. Moreover, the prognostic lncRNA signature was combined with clinicopathological risk factors to construct a nomogram for predicting COAD prognosis in the clinic. Finally, 7 lncRNAs (CTC-273B12.10, AC009404.2, AC073283.7, RP11-167H9.4, AC007879.7, RP4-816N1.7, RP11-400N13.2) were identified and validated in different cohorts. Kyoto Encyclopedia of Genes and Genomes analysis of the mRNAs co-expressed with the 7 prognostic lncRNAs suggested 4 significantly up-regulated pathways: AGE-RAGE signaling pathway, focal adhesion, ECM-receptor interaction, and PI3K/Akt signaling pathway. In summary, our study verified that these 7 lncRNAs can serve as biomarkers to predict the prognosis of COAD patients and to inform personalized treatment.
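The risk-score step described above can be sketched as a linear combination of expression values weighted by Cox coefficients, followed by a split into two groups. The abstract does not state the coefficients or the cutoff, so everything concrete here is hypothetical: the placeholder lncRNA names, the coefficient values, and the use of the cohort median as the threshold.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over signature genes."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the cohort median.

    The median cutoff is an assumption; the source does not name its threshold.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical two-lncRNA signature and three invented patients.
coefficients = {"lncRNA_1": 0.5, "lncRNA_2": -0.2}
cohort = {
    "patient_1": {"lncRNA_1": 2.0, "lncRNA_2": 1.0},
    "patient_2": {"lncRNA_1": 0.0, "lncRNA_2": 1.0},
    "patient_3": {"lncRNA_1": 1.0, "lncRNA_2": 0.0},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in cohort.items()}
print(split_by_median(scores))
```

A positive coefficient raises the score as expression increases, so such a gene is associated with worse predicted outcome; a negative coefficient does the opposite.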
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
TCGA COAD paired sample gene level read counts from Level 3 RNASeq-v2 data.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-LUAD. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Cancer Imaging Program (CIP) is working directly with primary investigators from institutes participating in TCGA to obtain and load images relating to the genomic, clinical, and pathological data being stored within the TCGA Data Portal. Currently, this large CT multi-sequence image collection of lung adenocarcinoma (LUAD) patients can be matched by each unique case identifier with the extensive gene and expression data of the same case from The Cancer Genome Atlas Data Portal to research the link between clinical phenome and tissue genome.
Please see the TCGA-LUAD page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_luad-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_luad-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_luad-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
Install the idc-index package: pip install --upgrade idc-index
Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
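The manifest naming convention described above (collection identifier, IDC release, storage backend) can be unpacked with a small helper. This is a sketch for illustration only: the function name and the regular expression are assumptions derived from the file names listed in this record, not part of the idc-index package.

```python
import re

# Pattern inferred from names such as tcga_luad-idc_v8-aws.s5cmd;
# it is an assumption, not an official IDC specification.
MANIFEST_RE = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<release>\d+)-(?P<storage>aws|gcs|dcf)"
    r"\.(?P<ext>s5cmd|dcf)$"
)


def parse_manifest_name(filename):
    """Split a manifest file name into collection, release, and storage."""
    match = MANIFEST_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized manifest name: {filename!r}")
    return {
        "collection": match.group("collection"),
        "release": int(match.group("release")),   # IDC release that introduced it
        "storage": match.group("storage"),        # aws, gcs, or dcf (Gen3)
    }


print(parse_manifest_name("tcga_luad-idc_v8-aws.s5cmd"))
```

Parsing the release number as an integer makes it straightforward to compare manifests across versions of the same collection.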
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Objective: To investigate the mRNA expression profile in colon cancer and determine the optimal prognostic markers. Method: The colon cancer dataset TCGA-COAD was downloaded from The Cancer Genome Atlas (TCGA) database as the training cohort. A random survival forest (RSF) model was used to select candidate gene markers, and the selected markers were analyzed with a Cox model to construct risk scores. The colon cancer dataset GSE17536 was downloaded from the Gene Expression Omnibus (GEO) as the validation cohort, and the model was compared with similar studies from the past year. The relationship between the gene markers and immune cells was explored through immune cell infiltration analysis. Result: A total of 11 gene markers were screened, and the risk score constructed by the multivariate Cox model was an independent prognostic indicator for colon cancer patients. In comparison, the model outperformed those of previous studies. Immune cell infiltration analysis revealed a significant correlation (P<0.05) between monocytes and the gene markers used in this study. Conclusion: This study identified 11 gene markers with prognostic value for colon cancer, and monocytes may serve as potential therapeutic targets for colon cancer.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Detection, segmentation and classification of nuclei are fundamental analysis operations in digital pathology. Existing state-of-the-art approaches demand extensive amounts of supervised training data from pathologists and may still perform poorly in images from unseen tissue types. We propose an unsupervised approach for histopathology image segmentation that synthesizes heterogeneous sets of training image patches, of every tissue type. Although our synthetic patches are not always of high quality, we harness the motley crew of generated samples through a generally applicable importance sampling method. This proposed approach, for the first time, re-weighs the training loss over synthetic data so that the ideal (unbiased) generalization loss over the true data distribution is minimized. This enables us to use a random polygon generator to synthesize approximate cellular structures (i.e., nuclear masks) for which no real examples are given in many tissue types, and hence, GAN-based methods are not suited. In addition, we propose a hybrid synthesis pipeline that utilizes textures in real histopathology patches and GAN models, to tackle heterogeneity in tissue textures. Compared with existing state-of-the-art supervised models, our approach generalizes significantly better on cancer types without training data. Even in cancer types with training data, our approach achieves the same performance without supervision cost. In this dataset we release code and nucleus segmentations in whole slide tissue images with quality control results for Whole Slide Images (WSI) in The Cancer Genome Atlas (TCGA) repository from 5,204 subjects (6,142 slide images). Within this total, there are two subsets of data: (1) automatic nucleus segmentation data of 5,060 whole slide tissue images of 10 cancer types, with quality control results, and (2) manual nucleus segmentation data of 1,356 image patches from the same 10 cancer types plus additional 4 cancer types.
Pre-processed TCGA COAD data used for PIVOT analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
TCGA COAD samples somatic mutation data in BED format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
COAD/READ/COADREAD_rnaseq_fpkm.txt files contain TCGA RNA-Seq data in FPKM normalisation form for colon adenocarcinoma (COAD), rectum adenocarcinoma (READ) or both combined (COADREAD).
COAD/READ/COADREAD_rnaseq_tpm.txt files contain TCGA RNA-Seq data in TPM normalisation form for colon adenocarcinoma (COAD), rectum adenocarcinoma (READ) or both combined (COADREAD).
COAD/READ/COADREAD_clinical_raw.xlsx files contain TCGA clinical data for patients with colon adenocarcinoma (COAD), rectum adenocarcinoma (READ) or both combined (COADREAD).
COAD/READ/COADREAD_rnaseq_clinical_raw.xlsx files contain the matched TCGA clinical data and RNA-Seq data for patients with colon adenocarcinoma (COAD), rectum adenocarcinoma (READ) or both combined (COADREAD).
Local_cohort_tumour/adenoma_qPCR_rawdata.xlsx files contain our experimental results of qPCR CT values for SORD and GAPDH (as internal ref), shown as separate values for duplicate wells and average values.
Local_cohort_tumour_clinical_rawdata.xlsx contains clinical information and calculated SORD relative expression of our recruited patients.
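The FPKM and TPM files listed above hold the same RNA-Seq measurements under two normalisations. For one sample, TPM can be derived from FPKM by rescaling so that the values sum to one million. The sketch below shows that standard relationship; the function name and the toy values are assumptions, and the record does not say how its TPM files were actually produced.

```python
def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM values so that they sum to 1e6 (TPM)."""
    total = sum(fpkm_by_gene.values())
    if total == 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm_by_gene.items()}


# Toy sample with two genes (invented values).
tpm = fpkm_to_tpm({"GENE_A": 1.0, "GENE_B": 3.0})
print(tpm)
```

Because every sample sums to the same total after conversion, TPM values are easier to compare across samples than raw FPKM values.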
MIT License: https://opensource.org/licenses/MIT
Lung Cancer CT Scan Dataset
Dataset Description
This dataset contains CT scan images for lung cancer detection and classification. It includes images of four different categories: adenocarcinoma, large cell carcinoma, squamous cell carcinoma, and normal (non-cancerous) lung tissue.
Classes
Adenocarcinoma, Large Cell Carcinoma, Normal (non-cancerous), Squamous Cell Carcinoma
Dataset Statistics
Total number of images: 315
Number of classes: 4
Class… See the full description on the dataset page: https://huggingface.co/datasets/dorsar/lung-cancer.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: Pan-Cancer-Nuclei-Seg-DICOM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, pan_cancer_nuclei_seg_dicom-collection_id-idc_v19-aws.s5cmd corresponds to the annotations for the images in the collection_id collection introduced in IDC data release v19. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
For each of the collections, the following manifest files are provided:
pan_cancer_nuclei_seg_dicom-: manifest of files available for download from public IDC Amazon Web Services buckets
pan_cancer_nuclei_seg_dicom-: manifest of files available for download from public IDC Google Cloud Storage buckets
pan_cancer_nuclei_seg_dicom-: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
Install the idc-index package: pip install --upgrade idc-index
Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
## Overview
Colon Cancer is a dataset for classification tasks - it contains Colon Cancer annotations for 618 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HISTOPANTUM is a comprehensive pan-cancer dataset of histology images categorized into Tumor and Non-Tumor classes over 4 different cancer types (domains). This dataset is designed to facilitate domain generalization analysis for tumor detection tasks, serving as a benchmark for foundation models and domain generalization algorithms.
The dataset comprises histology images sourced from The Cancer Genome Atlas (TCGA), spanning the following four cancer types:
The dataset is provided in four zipped files, each corresponding to one cancer type. Within each zip file, images are organized into two subfolders:
tumour
non-tumour
Each image filename encodes the originating slide and the patch position within the slide, following this naming convention:
If you use this dataset in your research, please cite the following publication:
@article{zamanitajeddin2024benchmarking,
title={Benchmarking Domain Generalization Algorithms in Computational Pathology},
author={Zamanitajeddin, Neda and Jahanifar, Mostafa and Xu, Kesi and Siraj, Fouzia and Rajpoot, Nasir},
journal={arXiv preprint arXiv:2409.17063},
year={2024}
}
For further details, please refer to the linked publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
## Overview
Lung Cancer Dataset is a dataset for object detection tasks - it contains Test annotations for 8,590 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: CPTAC-COAD. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
This collection contains subjects from the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) Colon Adenocarcinoma cohort. CPTAC is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics.
Please see the CPTAC-COAD wiki page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
cptac_coad-idc_v10-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
cptac_coad-idc_v10-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
cptac_coad-idc_v10-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
Install the idc-index package: pip install --upgrade idc-index
Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180