https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Interactions of the extracellular matrix (ECM) and cellular receptors constitute one of the crucial pathways involved in colorectal cancer progression and metastasis. Using bioinformatics analysis, we comprehensively evaluated the prognostic information concentrated in the genes of this pathway. First, we constructed an ECM–receptor regulatory network by integrating the transcription factor (TF) and 5’-isomiR interaction databases with mRNA/miRNA-seq data from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD). Notably, one-third of the interactions mediated by 5’-isomiRs were represented by noncanonical isomiRs (isomiRs whose 5’-end sequence did not match the canonical miRBase version). Then, exhaustive search-based feature selection was used to fit prognostic signatures composed of nodes from the network for overall survival prediction. Two reliable prognostic signatures were identified and validated on the independent Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) cohort. The first signature was made up of six genes directly involved in ECM–receptor interaction: AGRN, DAG1, FN1, ITGA5, THBS3, and TNC (concordance index 0.61, logrank test p = 0.0164, 3-year ROC AUC = 0.68). The second, hybrid signature was composed of three regulators: hsa-miR-32-5p, NR1H2, and SNAI1 (concordance index 0.64, logrank test p = 0.0229, 3-year ROC AUC = 0.71). While hsa-miR-32-5p exclusively regulated ECM-related genes (COL1A2 and ITGA5), NR1H2 and SNAI1 also targeted other pathways (adhesion, cell cycle, and cell division). Concordant distributions of the respective risk scores across four stages of colorectal cancer and adjacent normal mucosa additionally confirmed the reliability of the models.
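Both signatures above are reported with a concordance index. As a rough illustration of what that number measures, the following sketch computes a simplified Harrell-style concordance index in plain Python. It is not the authors' code: the function name and the toy inputs are hypothetical, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell-style C-index.

    times  : follow-up time per subject
    events : 1 if the event (death) was observed, 0 if censored
    risks  : model risk score per subject (higher = worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk died earlier
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk counts as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): risk decreases as survival time increases.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [4, 3, 2, 1]))
print(concordance_index(times, events, [1, 2, 3, 4]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported 0.61 and 0.64 in context.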
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Historical NCI Genomic Data Commons data (v09-14-2017). Clinical ('phenotype') and gene expression (HTSeq FPKM-UQ).
TCGA-COAD.GDC_phenotype.tsv
dataset: phenotype - Phenotype
cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv.gz
samples: 570
version: 11-27-2017
hub: https://gdc.xenahubs.net
type of data: phenotype
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-90
raw data: https://api.gdc.cancer.gov/data/
input data format: ROWs (samples) x COLUMNs (identifiers) (i.e. clinicalMatrix)
570 samples x 151 identifiers
TCGA-COAD.htseq_fpkm-uq.tsv
dataset: gene expression RNAseq - HTSeq - FPKM-UQ
cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv.gz
samples: 512
version: 09-14-2017
hub: https://gdc.xenahubs.net
type of data: gene expression RNAseq
unit: log2(fpkm-uq+1)
platform: Illumina
ID/Gene Mapping: https://gdc.xenahubs.net/download/probeMaps/gencode.v22.annotation.gene.probeMap.gz
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-80
raw data: https://api.gdc.cancer.gov/data/
wrangling: Data from the same sample but from different vials/portions/analytes/aliquots is averaged; data from different samples is combined into genomicMatrix; all data is then log2(x+1) transformed.
input data format: ROWs (identifiers) x COLUMNs (samples) (i.e. genomicMatrix)
60,484 identifiers x 512 samples
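Because the expression matrix is stored as log2(fpkm-uq+1) with identifiers in rows and samples in columns, a reader usually needs to parse the tab-separated layout and, for some analyses, undo the transform. The sketch below shows one way to do that with the Python standard library only. The header label `Ensembl_ID`, the sample barcodes, and the helper names are assumptions made for illustration; for the real file, pass a handle such as `gzip.open(path, "rt", newline="")`.

```python
import csv
import io


def read_genomic_matrix(handle):
    """Parse a tab-separated genomicMatrix: rows are identifiers, columns are samples."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]          # first header cell labels the identifier column
    matrix = {}
    for row in reader:
        if not row:
            continue              # tolerate blank lines
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


def undo_log2_plus_one(value):
    """Invert the log2(x + 1) transform applied during wrangling."""
    return 2.0 ** value - 1.0


# Tiny in-memory stand-in for the real file (all values are made up).
demo = (
    "Ensembl_ID\tTCGA-XX-0001-01A\tTCGA-XX-0002-01A\n"
    "ENSG00000000003.13\t0.0\t3.0\n"
)
samples, matrix = read_genomic_matrix(io.StringIO(demo))
linear = [undo_log2_plus_one(v) for v in matrix["ENSG00000000003.13"]]
print(samples)
print(linear)
```

The phenotype file uses the opposite orientation (samples in rows), so the same reader would return sample identifiers as dictionary keys there.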
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-TGCT. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
Testicular germ cell cancer is rare, comprising 1-2% of all tumors in males. However, it is the most common cancer in men ages 15 to 35. The incidence of testicular germ cell cancer has been rising continuously in many parts of the world, including Europe and the U.S. In 2013, about 8,000 American men were estimated to be diagnosed with the cancer, and 370 of them were predicted to die from the disease. Men who are Caucasian, have an undescended testicle, abnormally developed testicles, or a family history of testicular cancer have a greater risk of developing testicular cancer. Fortunately, testicular germ cell cancer is highly treatable.
Please see the TCGA-TGCT information page to learn more about the images and to obtain any supporting metadata for this collection.
Citation guidelines can be found on the Citing TCGA in Publications and Presentations information page.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_tgct-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_tgct-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_tgct-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each manifest includes instructions in its header on how to download the included files.
To download the files using .s5cmd manifests:
install the idc-index package: pip install --upgrade idc-index
download the files referenced by a manifest by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
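The manifest naming convention described above (collection identifier, IDC release, storage target) can be read back programmatically, which is handy when sorting manifests from several Zenodo records. The sketch below is only an illustration: the regular expression is inferred from the file names listed in this record, and `ManifestName` and `parse_manifest_name` are hypothetical helper names, not part of idc-index.

```python
import re
from typing import NamedTuple


class ManifestName(NamedTuple):
    collection_id: str
    idc_release: int
    target: str  # "aws", "gcs", or "dcf"


# Pattern inferred from names such as tcga_tgct-idc_v8-aws.s5cmd (an assumption).
_PATTERN = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<release>\d+)-(?P<target>aws|gcs|dcf)\.(?:s5cmd|dcf)$"
)


def parse_manifest_name(filename: str) -> ManifestName:
    """Split a manifest file name into collection, release number, and target."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized manifest name: {filename!r}")
    return ManifestName(
        collection_id=match["collection"],
        idc_release=int(match["release"]),
        target=match["target"],
    )


print(parse_manifest_name("tcga_tgct-idc_v8-aws.s5cmd"))
print(parse_manifest_name("tcga_tgct-idc_v8-dcf.dcf"))
```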
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: Pan-Cancer-Nuclei-Seg-DICOM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, pan_cancer_nuclei_seg_dicom-collection_id-idc_v19-aws.s5cmd corresponds to the annotations for the images in the collection_id collection introduced in IDC data release v19. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
For each of the collections, the following manifest files are provided:
pan_cancer_nuclei_seg_dicom-: manifest of files available for download from public IDC Amazon Web Services buckets
pan_cancer_nuclei_seg_dicom-: manifest of files available for download from public IDC Google Cloud Storage buckets
pan_cancer_nuclei_seg_dicom-: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each manifest includes instructions in its header on how to download the included files.
To download the files using .s5cmd manifests:
install the idc-index package: pip install --upgrade idc-index
download the files referenced by a manifest by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
Pre-processed TCGA COAD data used for PIVOT analysis.
Background: Colorectal cancer (CRC) has a high incidence and the third highest mortality among tumors. DNA damage and repair influence a variety of tumors. However, the role of these genes in colon cancer prognosis has been less systematically investigated. Here, we aim to establish a corresponding prognostic signature providing new therapeutic opportunities for CRC.
Method: After related genes were collected from GSEA, univariate Cox regression was performed to evaluate each gene’s prognostic relevance in the TCGA-COAD dataset. Stepwise Cox regression was used to establish a risk prediction model on the training sets randomly separated from the TCGA cohort, and the model was validated in the remaining testing sets and two GEO datasets (GSE17538 and GSE38832). A 12-DNA-damage-and-repair-related gene-based signature able to classify COAD patients into high- and low-risk groups was developed. The predictive ability of the risk model and nomogram was evaluated by different bioinformatics methods. Gene functional enrichment analysis was performed to analyze the co-expressed genes of the risk-based genes.
Result: A 12-gene prognostic signature established from 160 significant survival-related genes in DNA damage and repair related gene sets performed well, with a 5-year ROC AUC of 0.80 in the TCGA-COAD dataset. The signature includes CCNB3, ISY1, CDC25C, SMC1B, MC1R, LSP1P4, RIN2, TPM1, ELL3, POLG, CD36, and NEK4. Kaplan-Meier survival curves showed that risk status separated prognosis more significantly than the T, M, N, and stage prognostic parameters. A nomogram was constructed by LASSO regression analysis with T, M, N, age, and risk as prognostic parameters. ROC curve, C-index, calibration analysis, and decision curve analysis showed that the risk model and nomogram performed best at 1, 3, and 5 years. KEGG, GO, and GSEA enrichment analyses suggest that the risk signature is involved in a variety of important biological processes and well-known cancer-related pathways. These differences may be the key factors affecting the final prognosis.
Conclusion: The established gene signature for CRC prognosis provides a new molecular tool for clinical evaluation of prognosis, individualized diagnosis, and treatment. Therapies based on targeted DNA damage and repair mechanisms may formulate more sensitive and potential chemotherapy regimens, thereby expanding treatment options and potentially improving the clinical outcome of CRC patients.
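The abstract above describes a gene signature that assigns each patient a risk score and splits the cohort into high- and low-risk groups. A common way to do this with a Cox model is a weighted sum of expression values followed by a cutoff. The sketch below illustrates that idea only: the gene names, coefficients, expression values, and the median cutoff are all hypothetical assumptions, since the abstract does not state the fitted coefficients or the threshold used.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_median(scores):
    """Label patients 'high' when their score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low") for pid, score in scores.items()}


# Hypothetical two-gene signature and three made-up patients.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
cohort = {
    "patient_1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "patient_2": {"GENE_A": 4.0, "GENE_B": 0.0},
    "patient_3": {"GENE_A": 0.0, "GENE_B": 4.0},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in cohort.items()}
groups = split_by_median(scores)
print(scores)
print(groups)
```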
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TCGA COAD paired sample gene level read counts from Level 3 RNASeq-v2 data.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore the TCGA Whole Slide Image (WSI) SVS files available on Kaggle, offering detailed visual representations of tissue samples from various cancer types. These high-resolution images provide valuable insights into tumor morphology and tissue architecture, facilitating cancer diagnosis, prognosis, and treatment research. Delve into the rich landscape of cancer biology, leveraging the wealth of information contained within these SVS files to drive innovative advancements in oncology. This is a dataset of WSI images downloaded from the TCGA portal.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This collection contains 406 ROI masks in MATLAB format defining the low grade glioma (LGG) tumour region on T1-weighted (T1W), T2-weighted (T2W), T1-weighted post-contrast (T1CE) and T2-flair (T2F) MR images of 108 different patients from the TCGA-LGG collection. From this subset of 108 patients, 81 patients have ROI masks drawn for the four MRI sequences (T1W, T2W, T1CE and T2F), and 27 patients have ROI masks drawn for three or fewer of the four MRI sequences. The ROI masks were used to extract texture features in order to develop radiomic-based multivariable models for the prediction of isocitrate dehydrogenase 1 (IDH1) mutation, 1p/19q codeletion status, histological grade and tumour progression. Clinical data (188 patients in total from the TCGA-LGG collection, some incomplete depending on the clinical attribute), VASARI scores (188 patients in total from the TCGA-LGG collection, 178 complete) with feature keys, and source code used in this study are also available with this collection. Please contact Martin Vallières (mart.vallieres@gmail.com) of the Medical Physics Unit of McGill University for any scientific inquiries about this dataset.
Transcriptional profiling of pre-malignant and malignant colorectal cancer lesions provides a means for temporally monitoring key molecular events underlying neoplastic progression. Unfortunately, the most widely used central dataset for colorectal cancer samples, The Cancer Genome Atlas (TCGA), does not contain adenoma samples, so in silico analyses and pre-clinical modelling rely heavily on a handful of independent microarray experiments. Due to differences in sample acquisition, preparation, downstream analysis and other parameters, results are often incongruent, hindering consensus building. Here, we developed a microarray meta-dataset consisting of 231 normal, 132 adenoma, and 342 colon cancer tissue samples (705 samples total) sourced from 12 independent microarray studies, all using the Affymetrix HG U133 Plus 2.0 (GPL570) chip platform: GSE4183, GSE8671, GSE9348, GSE15960, GSE20916, GSE21510, GSE22598, GSE23194, GSE23878, GSE32323, GSE33113, and GSE37364. Individual datasets were pre-processed and normalized by frozen robust multiarray averaging (fRMA) before merging by matching probe sets. Batch effects were subsequently identified by Principal Component Analysis (PCA) and removed using ComBat. In addition, low-variance probes were filtered from the meta-dataset before downstream analysis. Finally, biological signatures corresponding to cancer and adenoma samples were both quantitatively and functionally validated. Quantitative validation was performed by correlation analysis of LogFC values with the TCGA-COAD or other external GEO microarray datasets, respectively. Functional validation was carried out through predictive analyses using Ingenuity Pathway Analysis (IPA) and Gene Set Enrichment Analysis (GSEA). Overall, our meta-dataset provides a powerful tool for studying transcriptome-wide changes which occur during early dysplasia and malignant transformation of adenomas as well as colorectal cancer in general.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multi-layer Complex networks are commonly used for modeling and analysing biological entities. This paper presents the advantage of using COMBO (Combining Multi Bio Omics) to suggest a new role of the chromosomal aberration as a cancer driver factor. Exploiting the heterogeneous multi-layer networks, COMBO integrates gene expression and DNA-methylation data in order to identify complex bilateral relationships between transcriptome and epigenome. We evaluated the multi-layer networks generated by COMBO on different TCGA cancer datasets (COAD, BLCA, BRCA, CESC, STAD) focusing on the effect of a specific chromosomal numerical aberration, broad gain in chromosome 20, on different cancer histotypes. In addition, the effect of chromosome 8q amplification was tested in the same TCGA cancer dataset. The results demonstrate the ability of COMBO to identify the chromosome 20 amplification cancer driver force in the different TCGA Pan Cancer project datasets.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-UVM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
Uveal (intraocular or eye) melanoma develops in the pigment cells of the uvea, which is the middle layer of the eye. The uvea consists of three main parts: the iris, ciliary body, and choroid. Compared to tumors of the iris, tumors of the ciliary body and choroid tend to be larger and more likely to spread to other parts of the body. TCGA studied tumors from all three parts of the uvea.
Please see the TCGA-UVM information page to learn more about the images and to obtain any supporting metadata for this collection.
Citation guidelines can be found on the Citing TCGA in Publications and Presentations information page.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_uvm-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_uvm-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_uvm-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each manifest includes instructions in its header on how to download the included files.
To download the files using .s5cmd manifests:
install the idc-index package: pip install --upgrade idc-index
download the files referenced by a manifest by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HISTOPANTUM is a comprehensive pan-cancer dataset of histology images categorized into Tumor and Non-Tumor classes over 4 different cancer types (domains). This dataset is designed to facilitate domain generalization analysis for tumor detection tasks, serving as a benchmark for foundation models and domain generalization algorithms.
The dataset comprises histology images sourced from The Cancer Genome Atlas (TCGA), spanning the following four cancer types:
The dataset is provided in four zipped files, each corresponding to one cancer type. Within each zip file, images are organized into two subfolders:
tumour
non-tumour
Each image filename encodes the originating slide and the patch position within the slide, following this naming convention:
If you use this dataset in your research, please cite the following publication:
@article{zamanitajeddin2024benchmarking,
title={Benchmarking Domain Generalization Algorithms in Computational Pathology},
author={Zamanitajeddin, Neda and Jahanifar, Mostafa and Xu, Kesi and Siraj, Fouzia and Rajpoot, Nasir},
journal={arXiv preprint arXiv:2409.17063},
year={2024}
}
For further details, please refer to the linked publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of images with or without tumor-infiltrating lymphocytes (TILs). The original images are from Abousamra et al. (2022) and Saltz et al. (2018), and the original whole slide images are from TCGA. This dataset is a subset of the data presented in Abousamra et al. (2022) (with new data partitions).
If you use this dataset, please cite the following papers, as well as this Zenodo page.
Abousamra, S., Gupta, M. D., Hou, L., Batiste, R., Zhao, T., Shankar, A., Rao, A., Chen, C., Samaras, D., Kurc, T., & Saltz, J. (2022). Deep Learning-Based Mapping of Tumor Infiltrating Lymphocytes in Whole Slide Images of 23 Types of Cancer. Frontiers in Oncology, 5971. https://doi.org/10.3389/fonc.2021.806603
Saltz, J., Gupta, R., Hou, L., Kurc, T., Singh, P., Nguyen, V., Samaras, D., Shroyer, K. R., Zhao, T., Batiste, R., & Danilova, L. (2018). Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports, 23(1), 181-193.
The acknowledgements from the Frontiers in Oncology and Cell Reports papers are included below:
This work was supported by the National Institutes of Health (NIH) and National Cancer Institute (NCI) grants UH3-CA22502103, U24-CA21510904, 1U24CA180924-01A1, 3U24CA215109-02, and 1UG3CA225021-01 as well as generous private support from Bob Beals and Betsy Barton. AR and AS were partially supported by NCI grant R37-CA214955 (to AR), the University of Michigan (U-M) institutional research funds and also supported by ACS grant RSG-16-005-01 (to AR). AS was supported by the Biomedical Informatics & Data Science Training Grant (T32GM141746). This work was enabled by computational resources supported by National Science Foundation grant number ACI-1548562, providing access to the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center, and also a DOE INCITE award joint with the MENNDL team at the Oak Ridge National Laboratory, providing access to Summit high performance computing system. The funders were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.
We are grateful to all the patients and families who contributed to this study. Funding from the Cancer Research Institute is gratefully acknowledged, as is support from National Cancer Institute (NCI) through U54 HG003273, U54 HG003067, U54 HG003079, U24 CA143799, U24 CA143835, U24 CA143840, U24 CA143843, U24 CA143845,U24 CA143848, U24 CA143858, U24 CA143866, U24 CA143867, U24 CA143882, U24 CA143883, U24 CA144025, P30 CA016672, U24CA180924, U24CA210950, U24CA215109, NCI Contract HHSN261201400007C, and Leidos Biomedical Contract 14X138. A.U.K.R. and P.S were supported by CCSG Bioinformatics Shared Resource P30 CA01667, ITCR U24 Supplement 1U24CA199461-01, a gift from Agilent technologies, CPRIT RP150578, and a Research Scholar Grant from the American Cancer Society (RSG-16-005-01). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation XSEDE Science Gateways program under grant ACI-1548562 allocation TG-ASC130023. The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure and the Institute for Advanced Computational Science at Stony Brook University for access to the high-performance LIred and SeaWulf computing systems, the latter of which was supported by National Science Foundation grant (#1531492).
This dataset includes 304,097 image patches. All images are 100 x 100 pixels at 0.5 micrometers per pixel. An image is TIL-positive if there are at least two TILs present.
Refer to images-tcga-tils-metadata.csv for information about each image. That spreadsheet has the following columns:
partition,study,barcode,label,path,md5
Partition specifies which partition the image is part of (train, val, test). Study is the TCGA study the image is part of (e.g., acc for TCGA-ACC). Barcode is the TCGA participant barcode. This is used during partitioning, to ensure that images from the same participant are not present in different data partitions. Label is either til-negative or til-positive. An image is til-positive if there are at least two TILs in the image. Path is the path to the PNG image. All images are stored as PNG. Md5 is the md5 hash of the image. This can be used to ensure there are no duplicate images and to verify the integrity of images.
There are study-specific directories in the directory images-tcga-tils, and there is a directory named pancancer that includes images from all the included TCGA studies. That directory uses symlinks to avoid storing duplicate data.
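The metadata columns described above support two simple sanity checks: that no participant barcode appears in more than one partition, and that no MD5 hash is shared by different image paths. The sketch below shows how such checks might look with the standard library. The rows, barcodes, and hash placeholders are made up, and the helper names are hypothetical; to check the real data, read images-tcga-tils-metadata.csv with `csv.DictReader` instead of the in-memory sample.

```python
import csv
import hashlib
import io
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Hexadecimal MD5 digest of raw file bytes, comparable to the md5 column."""
    return hashlib.md5(data).hexdigest()


def leaking_barcodes(rows):
    """Participant barcodes that occur in more than one partition."""
    partitions = defaultdict(set)
    for row in rows:
        partitions[row["barcode"]].add(row["partition"])
    return sorted(b for b, parts in partitions.items() if len(parts) > 1)


def duplicate_hashes(rows):
    """MD5 values shared by more than one distinct image path."""
    paths = defaultdict(set)
    for row in rows:
        paths[row["md5"]].add(row["path"])
    return sorted(h for h, p in paths.items() if len(p) > 1)


# Made-up rows in the documented column order.
demo_csv = (
    "partition,study,barcode,label,path,md5\n"
    "train,acc,TCGA-XX-0001,til-positive,images-tcga-tils/acc/a.png,hash-a\n"
    "test,acc,TCGA-XX-0002,til-negative,images-tcga-tils/acc/b.png,hash-b\n"
)
rows = list(csv.DictReader(io.StringIO(demo_csv)))
print(leaking_barcodes(rows))   # empty list means no participant leakage
print(duplicate_hashes(rows))   # empty list means no duplicate images
```

To verify a downloaded image, compare `md5_hex(open(path, "rb").read())` against the value in the md5 column.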
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TCGA COAD non-paired sample isoform level read counts from Level 3 RNASeq-v2 data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains mitosis detections, mitotic network structures, and social network analysis (SNA) measures derived from 11,161 diagnostic slides in The Cancer Genome Atlas (TCGA). Mitoses were automatically identified using the MDFS algorithm [1], and each detected mitosis was converted into a node within a mitotic network. The resulting graphs are provided in JSON format, with each file representing a single diagnostic slide.
Each JSON file contains four primary fields:
edge_index
Two parallel lists representing edges between nodes. The i-th element in the first list is the source node index, and the i-th element in the second list is the target node index.
coordinates
A list of [x, y] positions for each node (mitosis). The (x,y) coordinates can be used for spatial visualization or further spatial analyses.
feats
A list of feature vectors, with each row corresponding to a node. These features include:
feat_names
The names of the features in feats. The order matches the columns in each node’s feature vector.
{
"edge_index": [[1, 2, 6, 10], [2, 4, 8, 11]],
"coordinates": [[27689.0, 12005.0], [24517.0, 17809.0], ...],
"feats": [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.115], ...],
"feat_names": ["type", "Node_Degree", "Clustering_Coeff", "Harmonic_Cen"]
}
Below is a sample Python snippet to load one JSON file, extract node coordinates and the type feature, and combine them into a single NumPy array:
import json
import numpy as np
# Path to your JSON file
json_file_path = "example_graph.json"
with open(json_file_path, 'r') as f:
    data = json.load(f)
# Convert coordinates to NumPy
coordinates = np.array(data["coordinates"])
# Identify the "type" column
feat_names = data["feat_names"]
type_index = feat_names.index("type")
# Extract features and isolate the "type" column
feats = np.array(data["feats"])
node_types = feats[:, type_index].reshape(-1, 1)
# Combine x, y, and type into a single array (N x 3)
combined_data = np.hstack([coordinates, node_types])
print(combined_data)
To visualize or analyze the network structure, you can construct a NetworkX graph as follows:
import json
import networkx as nx
import matplotlib.pyplot as plt
json_file_path = "example_graph.json"
with open(json_file_path, "r") as f:
    data = json.load(f)
# Create a NetworkX Graph
G = nx.Graph()
# Add each node with position attributes
for i, (x, y) in enumerate(data["coordinates"]):
    G.add_node(i, pos=(x, y))
# Add edges using the parallel lists in edge_index
# (Adjust for 1-based indexing if necessary)
for src, dst in zip(data["edge_index"][0], data["edge_index"][1]):
    G.add_edge(src, dst)
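As a quick consistency check on a loaded file, node degrees can be recomputed from edge_index and compared with the Node_Degree column in feats. The sketch below does this without NetworkX. It assumes 0-based node indices, undirected edges, and that repeated edges count once; those are assumptions about the files rather than documented guarantees, and the small graph used here is made up.

```python
from collections import defaultdict


def node_degrees(graph):
    """Count distinct undirected neighbours per node from the parallel edge lists."""
    neighbours = defaultdict(set)
    sources, targets = graph["edge_index"]
    for src, dst in zip(sources, targets):
        if src == dst:
            continue                      # ignore self-loops
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    # One entry per node, including isolated nodes that have no edges.
    return [len(neighbours[i]) for i in range(len(graph["coordinates"]))]


def stored_degrees(graph):
    """Read the Node_Degree column from feats using feat_names for the lookup."""
    column = graph["feat_names"].index("Node_Degree")
    return [row[column] for row in graph["feats"]]


# Made-up four-node graph in the same layout as the dataset's JSON files.
demo_graph = {
    "edge_index": [[0, 1], [1, 2]],
    "coordinates": [[0.0, 0.0], [10.0, 0.0], [20.0, 0.0], [500.0, 500.0]],
    "feats": [[1.0, 1.0], [1.0, 2.0], [1.0, 1.0], [1.0, 0.0]],
    "feat_names": ["type", "Node_Degree"],
}
computed = node_degrees(demo_graph)
print(computed)
print(computed == [int(d) for d in stored_degrees(demo_graph)])
```

If the two lists disagree on a real file, the indexing convention (0-based versus 1-based) is the first thing to check.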
With TIAToolbox installed, you can visualize the mitotic networks on their respective whole slide images using the following command:
tiatoolbox visualize --slides path/to/slides --overlays path/to/overlays
Note that each slide and its overlay (the provided graph JSON file) should have the same name. For more information, please refer to Visualization Interface Usage - TIA Toolbox 1.5.1 Documentation.
If you use this dataset, please cite the following publication:
@article{jahanifar2024mitosis, title={Mitosis detection, fast and slow: robust and efficient detection of mitotic figures}, author={Jahanifar, Mostafa and Shephard, Adam and Zamanitajeddin, Neda and Graham, Simon and Raza, Shan E Ahmed and Minhas, Fayyaz and Rajpoot, Nasir}, journal={Medical Image Analysis}, volume={94}, pages={103132}, year={2024}, publisher={Elsevier} }
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Succinate dehydrogenase (SDH), one of the key enzymes in the tricarboxylic acid cycle, is mainly found in the mitochondria. SDH consists of four subunits, encoded by SDHA, SDHB, SDHC, and SDHD. The biological function of SDH is significantly related to cancer progression. Colorectal cancer (CRC) is one of the most common malignant tumors globally, and its most common histological subtype is colon adenocarcinoma (COAD). However, the correlation between SDH factors and COAD remains unclear.
Methods: Pan-cancer data were obtained from The Cancer Genome Atlas (TCGA) database. Kaplan-Meier survival analysis was used to assess the prognostic ability of the SDH genes (SDHs). The cBioPortal database was used to examine genetic variations of SDHs. Correlation analysis was conducted between SDHs and mitochondrial energy metabolism genes (MMGs), and a protein-protein interaction (PPI) network was built. Univariate and multivariate Cox regression analyses of SDHs and other clinical characteristics were then conducted, and a nomogram was established. ssGSEA was used to visualize the association between SDHs and immune infiltration. The immunophenoscore (IPS) was used to explore the correlation between SDHs and immunotherapy, and the correlation between SDHs and targeted therapy was investigated through Genomics of Drug Sensitivity in Cancer. Finally, qPCR and immunohistochemistry were used to detect the expression of SDHs.
Results: After assessing the differential expression of SDHs across cancers, we found that SDHB, SDHC, and SDHD benefit COAD patients. The cBioPortal database showed that SDHA ranked first in mutation frequency. Correlation analysis indicated a strong link between SDHs and MMGs. We formulated a nomogram and found that SDHB, SDHC, SDHD, and clinical characteristics correlated with COAD patients' survival. T helper cells, Th2 cells, and Tem were significantly enriched in the high-expression groups of SDHA, SDHB, SDHC, and SDHD. Moreover, COAD patients with high SDHA expression were more suitable for immunotherapy, and COAD patients with different expression levels of SDHs showed different sensitivity to targeted drugs. Further verification of the gene and protein expression levels of SDHs showed that the tissue results were consistent with the bioinformatics analysis.
Conclusions: Our study analyzed the expression and prognostic value of SDHs in COAD and explored the pathway mechanisms involved and the immune cell correlations, indicating that SDHs might be biomarkers for COAD patients.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Cancer Genome Atlas Ovarian Cancer for Ascites Segmentation (TCGA-OV-AS)
This dataset was curated as part of the research 'Deep Learning Segmentation of Ascites on Abdominal CT Scans for Automatic Volume Quantification' (Paper, arXiv). To replicate TCGA-OV-AS, please download TCGA-OV from TCIA using the Descriptive Directory Name download option.
Converting Images
Convert the DICOM files to NIfTI format using dcm2niix and GNU parallel.
Create the directory structure… See the full description on the dataset page: https://huggingface.co/datasets/farrell236/TCGA-OV-AS.
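The dataset page holds the authors' exact conversion commands. As a rough illustration of the conversion step only, the sketch below builds one dcm2niix call per DICOM series directory and runs the calls concurrently. It swaps GNU parallel for Python's `ThreadPoolExecutor`, so it is not the authors' procedure. The directory names, the helper names, and the choice of output file name are assumptions. The `-z y`, `-f`, and `-o` options are standard dcm2niix flags for gzip compression, output file name, and output directory.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_dcm2niix_command(series_dir, output_dir):
    """Return the dcm2niix argument list for one DICOM series directory."""
    return [
        "dcm2niix",
        "-z", "y",                     # write gzip-compressed .nii.gz
        "-f", Path(series_dir).name,   # name the output after the series folder
        "-o", str(output_dir),         # directory that receives the NIfTI files
        str(series_dir),
    ]


def convert_all(series_dirs, output_dir, workers=4):
    """Run dcm2niix on every series directory and return the exit codes."""
    if shutil.which("dcm2niix") is None:
        raise RuntimeError("dcm2niix was not found on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    commands = [build_dcm2niix_command(d, output_dir) for d in series_dirs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, commands))


# Hypothetical paths; inspect the command before running anything
cmd = build_dcm2niix_command("TCGA-OV/case_0001/series_01", "nifti")
print(" ".join(cmd))
```

Printing the command first makes it easy to check the paths on a single series before calling `convert_all` on the full download.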
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The exact mechanisms driving colorectal cancer (CRC) are yet to be fully elucidated. This study aims to confirm the reliability of a prognostic model for colon adenocarcinoma (COAD) by analyzing the varied expression levels of Glycolysis & Pyroptosis-Related Differentially Expressed Genes (G&PRDEGs) in COAD using bioinformatics tools.
Methods: We retrieved gene expression data and clinical details for COAD patients from The Cancer Genome Atlas (TCGA) database. These data were analyzed to categorize the samples into pyroptosis-positive and pyroptosis-negative groups based on their expression of G&PRDEGs. A prognostic model for COAD was then developed using LASSO Cox regression analysis, focusing on these differentially expressed genes (DEGs). Kaplan-Meier curves were plotted to assess the differences in survival between the two groups. Furthermore, we conducted multivariate Cox regression analyses to evaluate the influence of clinical parameters and model-derived risk scores. Pathway enrichment analyses were performed using R software, alongside single-sample gene-set enrichment analysis (ssGSEA) to explore the role of immune cells and functions associated with G&PRDEGs.
Results: A predictive model was developed using 53 differentially expressed G&PRDEGs. Survival analysis revealed that the high-risk group had noticeably lower overall survival (OS) than the low-risk group in the TCGA database (P