The National Cancer Institute (NCI) Genomic Data Commons (GDC) is a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI). The GDC Data Portal provides a platform for efficiently querying and downloading high-quality, complete data. The GDC also provides a GDC Data Transfer Tool and a GDC API for programmatic access.
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
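As a rough illustration of the programmatic access mentioned above, the sketch below builds a search URL for the GDC API `files` endpoint. The endpoint URL, the filter fields (`cases.project.project_id`, `data_type`) and the helper name `build_files_query` are assumptions chosen for illustration and are not taken from the description above; the GDC API documentation should be checked before relying on them. The function only constructs the URL and makes no network request.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint of the GDC API; verify against the GDC documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, size=10):
    """Return a GET URL that searches GDC files for one project and data type.

    Hypothetical helper: it only builds the URL and does not contact the GDC.
    """
    # Nested filter: both conditions must hold ("and" of two "in" clauses).
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_type", "value": [data_type]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),          # filters travel as a JSON string
        "fields": "file_id,file_name,data_type",  # columns to return
        "format": "JSON",
        "size": str(size),                        # number of hits per page
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


url = build_files_query("TCGA-BRCA", "Gene Expression Quantification")
print(url)
```

The resulting URL can be fetched with any HTTP client. For bulk downloads, the GDC Data Transfer Tool mentioned above is likely the better fit.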
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Breast Phenotype Research Group.
Cloud-based data science infrastructure that provides secure access to cancer research data from NCI programs and key external cancer programs. Serves as a coordinated resource for public data sharing of NCI-funded programs. Users can explore and use analytical and visualization tools for data analysis. Enables users to search and aggregate data across repositories, including Cancer Data Service, Clinical Trial Data Commons, Genomic Data Commons, Imaging Data Commons, Integrated Canine Data Commons, and Proteomic Data Commons.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TCGA RNA-seq V2 Level3 data were downloaded from the TCGA Genomic Data Commons Data Portal (https://gdc-portal.nci.nih.gov), consisting of 11,303 samples in 34 cancer projects (33 cancer types). Nine cancer types that do not have corresponding non-tumour samples were filtered out, and the analysis was focused on tumour versus non-tumour comparison. 24 cancer types were used in this meta-analysis: BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, THCA, THYM, UCEC (https://gdc-portal.nci.nih.gov). The nine filtered cancer types were ACC, DLBC, LAML, LGG, MESO, OV, TGCT, UCS and UVM. To extract expression values from TCGA RNA-seq data, we used genomic coordinates to retrieve UCSC Transcript IDs that correspond to the identifiers in TCGA RNA-seq V2 Level3 data (isoform level). The GAF (General Annotation Format) file was used to map the coordinates to UCSC Transcript IDs, and it was downloaded from https://tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/outputs/TCGA.hg19.June2011.gaf. This file contains genomic annotations shared by all TCGA projects. More details of the GAF file format can be found at https://tcga-data.nci.nih.gov/docs/GAF/GAF3.0/GAF_v3_file_description.docx. We filtered out any coding exons overlapping UCSC Transcript IDs to eliminate expression values of coding genes and evaluate lncRNA expression. We could find the expression values of 443 pcRNAs and 203 tapRNAs in TCGA data, as many non-coding regions are not yet fully annotated in the TCGA RNA-seq V2 Level3 data. The expression values of pcRNAs and tapRNAs were extracted and clustered by an unsupervised Pearson correlation method (Supplementary Figure 18A).
The expression values of tapRNA-associated coding genes were also extracted and used to generate the heat-map (Supplementary Figure 18B), which shows a similar pattern of expression to that of tapRNAs across the cancer types.
To show that tapRNAs and associated coding genes have similar expression profiles in cancers, we generated a Spearman's Rank-Order Correlation heatmap (Figure 6A) between tapRNAs and their associated coding genes based on the TCGA RNA-seq data. We used the MATLAB function corr to calculate Spearman's rho. This function takes two matrices X (197-by-8,850 expression profiling matrix of tapRNAs) and Y (197-by-8,850 expression profiling matrix of tapRNA-associated coding genes) and returns an 8,850-by-8,850 matrix containing the pairwise correlation coefficient between each pair of the 8,850 columns (TCGA cancer samples in Supplementary Figure 18A and B). Thus, the rank-order correlation matrix that we computed from the matrices of expression profiling data (Supplementary Figure 18A and B) allowed us to compare the correlation between two column vectors, i.e. cancer samples. This function also returns a matrix of p-values for testing the hypothesis of no correlation against the alternative that there is a nonzero correlation. Each element of the matrix of p-values is the p-value for the corresponding element of Spearman's rho. The p-values for Spearman's rho are calculated using large-sample approximations. To check the significance level of the correlation between a tapRNA and its associated coding gene, the diagonal of the p-value matrix was extracted and used. The median is 1.31 × 10^-11 and the mean is 1.03 × 10^-4, with standard deviation 0.0029.
To identify cancer-specific tapRNAs, we considered not only the global expression pattern of a given tapRNA in each cancer type, but also the expression pattern of specific sub-groups that are significantly distinct, to take into account cancer sample heterogeneity.
Thus, two conditions were applied: (1) the average expression level of a tapRNA in a given cancer type is in the top 10% or bottom 10%, and (2) a tapRNA has at least 10% of samples in a given cancer type that are significantly up-regulated (Z-score > 2) or down-regulated (Z-score < -2).
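The rank-order correlation step described above can be sketched in Python. This is a minimal illustration, not the authors' MATLAB code: the function name `columnwise_spearman` is hypothetical, the data are random toy matrices, and the p-values use a two-sided t approximation, which may differ in detail from the large-sample approximation used by MATLAB's corr. It assumes no constant columns (a constant column would give undefined ranks and NaN results).

```python
import numpy as np
from scipy import stats


def columnwise_spearman(x, y):
    """Spearman's rho and p-values between every column of x and every column of y.

    x, y: arrays of shape (n_features, n_samples), e.g. 197-by-8,850 in the text.
    Returns two (n_samples, n_samples) matrices: rho and two-sided p-values.
    """
    n = x.shape[0]
    # Rank each column separately, then standardise the ranks.
    rx = stats.rankdata(x, axis=0)
    ry = stats.rankdata(y, axis=0)
    rx = (rx - rx.mean(axis=0)) / rx.std(axis=0)
    ry = (ry - ry.mean(axis=0)) / ry.std(axis=0)
    # Pearson correlation of the ranks is Spearman's rho.
    rho = np.clip(rx.T @ ry / n, -1.0, 1.0)
    # t approximation for the null hypothesis of no correlation.
    with np.errstate(divide="ignore", invalid="ignore"):
        t = rho * np.sqrt((n - 2) / (1.0 - rho ** 2))
    pvalues = 2.0 * stats.t.sf(np.abs(t), df=n - 2)
    return rho, pvalues


# Toy data standing in for the tapRNA (x) and coding-gene (y) matrices.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))
y = x + rng.normal(size=(20, 5))

rho, pvalues = columnwise_spearman(x, y)
# The diagonal pairs sample j in x with the same sample j in y.
diagonal_p = np.diag(pvalues)
print(rho.shape, diagonal_p.shape)
```

Extracting the diagonal mirrors the text: each diagonal entry compares the same sample across the two matrices, and summary statistics such as the median and mean can then be taken with `np.median` and `np.mean`.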
The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
https://wiki.cancerimagingarchive.net/display/Public/TCGA-LUAD
Portal that makes cancer-related proteomic datasets easily accessible to the public. Facilitates multiomic integration in support of precision medicine through interoperability with other resources. Developed to advance our understanding of how proteins help to shape risk, diagnosis, development, progression, and treatment of cancer. One of several repositories within the NCI Cancer Research Data Commons, which enables researchers to link proteomic data with other data sets (e.g., genomic and imaging data) and to submit, collect, analyze, store, and share data throughout the cancer data ecosystem. PDC provides access to highly curated and standardized biospecimen, clinical, and proteomic data, and an intuitive interface to filter, query, search, visualize and download data and metadata. Provides a common data harmonization pipeline to uniformly analyze all PDC data and provides advanced visualization of quantitative information. Cloud-based (Amazon Web Services) infrastructure facilitates native interoperability with AWS-based data analysis tools and platforms. Application programming interface (API) provides cloud-agnostic data access and allows third parties to extend functionality beyond PDC. Includes a structured workspace that serves as a private user data store and as a data submission portal. Distributes controlled-access data, such as patient-specific protein FASTA sequence databases, with dbGaP authorization and eRA Commons authentication.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-PRAD. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Cancer Imaging Program (CIP) is working directly with primary investigators from institutes participating in TCGA to obtain and load images relating to the genomic, clinical, and pathological data being stored within the TCGA Data Portal. Currently this image collection of prostate adenocarcinoma (PRAD) patients can be matched by each unique case identifier with the extensive gene and expression data of the same case from The Cancer Genome Atlas Data Portal to research the link between clinical phenome and tissue genome.
Please see the TCGA-PRAD page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_prad-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_prad-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_prad-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
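The naming convention described above can be illustrated with a small parser. This is only a sketch: the helper `parse_manifest_name` and the regular expression are hypothetical, inferred from the example names in this record (collection id, `idc_v` plus release number, then `aws`, `gcs` or `dcf`), and other IDC manifests may not follow exactly this pattern.

```python
import re
from typing import NamedTuple


class ManifestName(NamedTuple):
    collection_id: str  # e.g. "tcga_prad"
    idc_version: int    # IDC data release in which this version was introduced
    storage: str        # "aws", "gcs", or "dcf" (Gen3 manifest)
    extension: str      # "s5cmd" or "dcf"


# Assumed pattern: <collection_id>-idc_v<release>-<storage>.<extension>
_MANIFEST_PATTERN = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<version>\d+)-(?P<storage>aws|gcs|dcf)"
    r"\.(?P<ext>s5cmd|dcf)$"
)


def parse_manifest_name(filename):
    """Split a manifest file name into its parts; raise ValueError if it does not match."""
    match = _MANIFEST_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognised manifest name: {filename!r}")
    return ManifestName(
        collection_id=match["collection"],
        idc_version=int(match["version"]),
        storage=match["storage"],
        extension=match["ext"],
    )


for name in ("tcga_prad-idc_v8-aws.s5cmd",
             "tcga_prad-idc_v8-gcs.s5cmd",
             "tcga_prad-idc_v8-dcf.dcf"):
    print(parse_manifest_name(name))
```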
To download the files using .s5cmd manifests:
install the idc-index package: pip install --upgrade idc-index
download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ExpressionData.csv
----------------------------
A small subset of transcriptomics data (30 genes) curated for learning Gene Regulatory Networks (GRNs) pertaining to signaling by the ALK pathway. Genes were selected by referencing the "signaling by ALK" pathway from Reactome (https://reactome.org/content/detail/R-HSA-201556). This subset of data belongs to the TARGET-NBL project (https://portal.gdc.cancer.gov/projects/TARGET-NBL), hosted via the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). Please refer to the GDC's data access policies (https://gdc.cancer.gov/about-gdc/gdc-policies) if planning to use the data.
refNetwork.csv
----------------------
Contains a reference network of known pairwise regulatory relationships among the genes for which we have transcriptomics data available in "ExpressionData.csv". These relationships were again determined by referencing the "signaling by ALK" pathway from Reactome (https://reactome.org/content/detail/R-HSA-201556).
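Since the reference network is meant to cover only genes present in the expression file, a quick consistency check can be useful before training a GRN method. The sketch below is illustrative only: the file layouts are assumptions (expression file with one gene per row and the gene name in the first column; network file with columns named `Gene1` and `Gene2`), the gene names are placeholders, and the helper names are hypothetical. The actual column names in the two files should be checked first.

```python
import csv
import io


def genes_in_expression(expression_file):
    """Collect gene names from the first column, skipping the header row."""
    reader = csv.reader(expression_file)
    next(reader)  # header: sample identifiers (assumed layout)
    return {row[0] for row in reader if row}


def edges_with_unknown_genes(network_file, known_genes):
    """Return network edges whose source or target gene has no expression data."""
    reader = csv.DictReader(network_file)
    return [
        (row["Gene1"], row["Gene2"])  # assumed column names
        for row in reader
        if row["Gene1"] not in known_genes or row["Gene2"] not in known_genes
    ]


# Placeholder data in the assumed layouts (not real values from the dataset).
expression_text = "gene,sample_1,sample_2\nGENE_A,1.0,2.0\nGENE_B,0.5,0.1\n"
network_text = "Gene1,Gene2\nGENE_A,GENE_B\nGENE_A,GENE_C\n"

genes = genes_in_expression(io.StringIO(expression_text))
unknown = edges_with_unknown_genes(io.StringIO(network_text), genes)
print(unknown)
```

With the real files, the `io.StringIO` objects would be replaced by `open("ExpressionData.csv", newline="")` and `open("refNetwork.csv", newline="")`.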
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for uc-ctds/GDC-QAG-genes-mutations
This dataset contains genes and somatic mutations observed in various cancers. It is scraped from the /ssms endpoint in the Genomic Data Commons (GDC). This data is used to run the Query Augmented Generation (QAG) tool on the GDC. GDC QAG is currently deployed on Hugging Face Spaces as a web app.
Dataset Details
Dataset Description
This dataset contains around 5.6 million somatic mutations (protein… See the full description on the dataset page: https://huggingface.co/datasets/uc-ctds/GDC-QAG-genes-mutations.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary
This metadata record provides details of the data supporting the claims of the related manuscript: "The CINSARC signature predicts the clinical outcome in patients with Luminal B breast cancer".
The related study tested the prognostic value for disease-free survival (DFS) of CINSARC, a multigene expression signature originally developed in sarcomas and shown to have prognostic impact in various cancers, in a series of 6035 early-stage invasive primary breast cancers.
Type of data: prognostic value for DFS of CINSARC
Subject of data: Homo sapiens
Sample size: 6035
Population characteristics: All cases were invasive breast carcinomas profiled using DNA microarrays or RNA-sequencing with expression and clinicopathological data available. All samples are pre-treatment samples (operative specimen or diagnostic biopsy before neo-adjuvant chemotherapy). The detailed characteristics of patients and tumours analysed in the present study are available in Supplementary Table 10.
Recruitment: publicly available transcriptomic data of invasive primary breast cancer enrolled in 36 retrospective studies published over a 10-year period between 2002 and 2012.
Data access
All data sets of primary breast cancer were downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), ArrayExpress (https://www.ebi.ac.uk/arrayexpress/), Genomic Data Commons (GDC, https://portal.gdc.cancer.gov/) and cBioPortal (https://www.cbioportal.org/) databases. All accession IDs are provided in Supplementary Table 10 (Table S10 revised.xlsx), which is included with this data record.
The data underlying the figures and tables of the related article are contained in the files "Goncalves_supporting_data.xlsx" and "Table S8.xlsx", which are included with this data record.
A detailed list of the data underlying each figure and table of the related article is available in the file "Goncalves_2021_underlying_data_list.xlsx", which is included with this data record.
Corresponding author(s) for this study
Pr François BERTUCCI, MD PhD, Département d'Oncologie Médicale, Institut Paoli-Calmettes, 232 Bd. Ste-Marguerite, 13009 Marseille, France; e-mail: bertuccif@ipc.unicancer.fr; Phone: +33 4 91 22 35 37; Fax: +33 4 91 22 36 70
Study approval
The details of Institutional Review Board and Ethical Committee approval and patients' consent for the 36 studies analysed in the related study are present in their corresponding publications, which are listed in Supplementary Table 10 of the related article.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Lung Squamous Cell Carcinoma (TCGA-LUSC) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Lung Phenotype Research Group.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bioinformatics analysis of CEP350 tumor suppression in human TCGA cutaneous melanoma. Publicly available datasets were downloaded from the Genomic Data Commons Data Portal, using data from The Cancer Genome Atlas Program, and from the International Cancer Genome Consortium Data Portal for statistical analyses using R and Shiny.
Supplementary datasets and other information accompanying manuscript: Tumor Suppressive Functions of CEP350 in Cutaneous Melanoma Cells by Aziz Aiderus, Bin Fang, John M. Koomen and Michael B. Mann.
Abstract: We previously identified Cep350 as a novel haploinsufficient melanoma tumor suppressor gene using SB transposon-mediated mutagenesis to drive melanoma progression in Braf(V600E) mutant (SB|Braf) mice and functionally demonstrated that the human CEP350 ortholog is a new melanoma tumor-suppressor gene in human cancer cell lines (Mann et al., Nature Genetics, 2015). Further dissection of the latent tumor suppressive functions of CEP350 in cutaneous melanoma cells is essential for understanding its role in melanoma initiation and progression. In this work, we investigated the novel tumor suppressive functions of CEP350 in cutaneous melanoma cells using comparative informatics, molecular oncology, and proteomics approaches to demonstrate that CEP350 acts via altered cytoskeletal dynamics to contribute to BRAF-V600E driven melanoma.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary statistics are fundamental to data science, and are the building blocks of statistical reasoning. Most of the data and statistics made available on government web sites are aggregate; however, until now, we have not had a suitable linked data representation available. We propose a way to express summary statistics across aggregate groups as linked data using Web Ontology Language (OWL) Class based sets, where members of the set contribute to the overall aggregate value. Additionally, many clinical studies in the biomedical field rely on demographic summaries of their study cohorts and the patients assigned to each arm. While most data query languages, including SPARQL, allow for computation of summary statistics, they do not provide a way to integrate those values back into the RDF graphs they were computed from. We represent this knowledge, which would otherwise be lost, through the use of OWL 2 punning semantics, the expression of aggregate grouping criteria as OWL classes with variables, and constructs from the Semanticscience Integrated Ontology (SIO) and the World Wide Web Consortium's provenance ontology, PROV-O, providing interoperable representations that are well supported across the web of Linked Data. We evaluate these semantics using a Resource Description Framework (RDF) representation of patient case information from the Genomic Data Commons, a data portal from the National Cancer Institute.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: CCDI-MCI. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Molecular Characterization Initiative (MCI) [2] is a component of the National Cancer Institute's (NCI) Childhood Cancer Data Initiative (CCDI). It offers state-of-the-art molecular testing at no cost to newly diagnosed children, adolescents, and young adults (AYAs) with central nervous system (CNS) tumors, soft tissue sarcomas (STS), certain rare childhood cancers (RAR), and certain neuroblastomas (NBL) treated at a Children's Oncology Group (COG)-affiliated hospital. The goal of MCI is to enhance the understanding of genetic factors in pediatric cancers and to provide timely, clinically relevant findings to doctors and families to aid in treatment decisions and determine eligibility for certain planned COG clinical trials.
The original images in vendor-specific format were collected on IRB-approved clinical trials or tissue banking studies from Children's Oncology Group (COG) patients enrolled in the EveryChild APEC14B1 protocol.
Those images, augmented with the metadata describing their content, were provided to the IDC team for the purposes of archival, and were converted into DICOM Whole Slide Microscopy (SM) representation [3,4] using custom open source scripts and tools as described in [5]. The resulting converted images were released in IDC in the CCDI-MCI collection with the IDC data release v19.
To learn how to access related clinical and genomic data accompanying this collection please see the CCDI-MCI page and CCDI Hub.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
ccdi_mci-idc_v19-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
ccdi_mci-idc_v19-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
ccdi_mci-idc_v19-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
install the idc-index package: pip install --upgrade idc-index
download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
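For scripted downloads, the command-line step above could be wrapped in Python. This is a sketch under stated assumptions: the only command used is the `idc download <manifest>` invocation quoted in this record, the function names are hypothetical, and `download_manifest` is defined but not called here because it needs the idc-index package installed and network access.

```python
import subprocess
from pathlib import Path


def build_idc_download_command(manifest_path):
    """Return the argument list for `idc download <manifest>`.

    Only .s5cmd manifests are accepted; for .dcf manifests the record
    points to the instructions in the manifest header instead.
    """
    path = Path(manifest_path)
    if path.suffix != ".s5cmd":
        raise ValueError(f"expected a .s5cmd manifest, got: {path.name}")
    return ["idc", "download", str(path)]


def download_manifest(manifest_path):
    """Run the download; requires `pip install --upgrade idc-index` beforehand."""
    command = build_idc_download_command(manifest_path)
    # check=True raises CalledProcessError if the CLI exits with an error.
    return subprocess.run(command, check=True)


print(build_idc_download_command("ccdi_mci-idc_v19-aws.s5cmd"))
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems if the manifest path contains spaces.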
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National cancer institute imaging data commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 43, (2023).
[3] National Electrical Manufacturers Association (NEMA). DICOM PS3.3 - Information Object Definitions: A.32.8 VL Whole Slide Microscopy Image IOD. at <https://dicom.nema.org/medical/dicom/current/output/html/part03.html#sect_A.32.8>
[4] Herrmann, M. D., Clunie, D. A., Fedorov, A., Doyle, S. W., Pieper, S., Klepeis, V., Le, L. P., Mutter, G. L., Milstone, D. S., Schultz, T. J., Kikinis, R., Kotecha, G. K., Hwang, D. H., Andriole, K. P., John Lafrate, A., Brink, J. A., Boland, G. W., Dreyer, K. J., Michalski, M., Golden, J. A., Louis, D. N. & Lennerz, J. K. Implementing the DICOM standard for digital pathology. J. Pathol. Inform. 9, 37 (2018).
[5] Clunie, D., Fedorov, A. & Herrmann, M. D. ImagingDataCommons/idc-wsi-conversion: Initial release. (Zenodo, 2023). doi:10.5281/ZENODO.8240154
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Stomach Adenocarcinoma (TCGA-STAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data reside in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Immunological-related genes (IRGs) play a critical role in the immune microenvironment of tumors. Our study aimed to develop an IRG-based survival prediction model for hepatocellular carcinoma (HCC) patients and to investigate the impact of IRGs on the immune microenvironment.
Methods: Differentially expressed IRGs were obtained from The Genomic Data Commons Data Portal (TCGA) and the immunology database and analysis portal (ImmPort). Univariate Cox regression was used to identify the IRGs linked to overall survival (OS), and a Lasso-regularized Cox proportional hazard model was constructed. The International Cancer Genome Consortium (ICGC) database was used to verify the prediction model. ESTIMATE and CIBERSORT were used to estimate immune cell infiltration in the tumor immune microenvironment (TIME). RNA sequencing was performed on HCC tissue specimens to confirm mRNA expression.
Results: A total of 401 differentially expressed IRGs were identified, and 63 IRGs among the 237 up-regulated IRGs were found to be related to OS by univariate Cox regression analysis. Finally, five IRGs were selected by the LASSO Cox model: SPP1, BIRC5, STC2, GLP1R, and RAET1E. This prognostic model demonstrated satisfactory predictive value in the ICGC dataset. The risk score was an independent predictor of OS in HCC patients. Immune-related analysis showed that the immune infiltration level was higher in the high-risk group, suggesting that the 5-IRG signature may play an important role in mediating immune escape and immune resistance in the TIME of HCC. Finally, we confirmed that the 5-IRG signature is highly expressed in 65 HCC patients, with good predictive power.
Conclusion: We established and verified a new prognosis model for HCC patients based on survival-related IRGs, and the signature could provide new insights into the prognosis of HCC.
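As a rough illustration of how a risk score from a Cox-type model is usually computed, the sketch below takes a weighted sum of the five genes' expression values and labels patients as high or low risk. The coefficient values are hypothetical placeholders (the abstract does not report them), and the median cutoff is an assumption, since the abstract does not say how the high-risk group was defined.

```python
from statistics import median

# Hypothetical weights: the abstract names the five genes but not their coefficients.
COEFFICIENTS = {
    "SPP1": 0.10,
    "BIRC5": 0.20,
    "STC2": 0.15,
    "GLP1R": 0.30,
    "RAET1E": 0.25,
}


def risk_score(expression, coefficients=COEFFICIENTS):
    """Linear predictor of a Cox model: sum of coefficient * expression per gene."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the cohort median score.

    The median cutoff is an assumption made for this sketch.
    """
    cutoff = median(scores.values())
    return {
        patient: ("high" if score > cutoff else "low")
        for patient, score in scores.items()
    }
```

A cohort would be scored by calling `risk_score` once per patient and passing the resulting dictionary of scores to `split_by_median`.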
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data accompanies the paper "Ecological network analysis reveals cancer-dependent chaperone-client interaction structure and robustness" by Geut Galai, Xie He, Barak Rotblat, and Shai Pilosof, published in Nature Communications. Please cite the paper when using the data.
All users must read the paper to understand how the data were obtained and processed, and their limitations. Data comes without warranty. Licence is CC BY-NC-SA (Attribution-NonCommercial-ShareAlike): This license lets you remix, tweak, and build upon this work non-commercially, as long as you credit the authors and license the new creations under the identical terms.
All the computational processes related to data derivation and analysis are in the GitHub repository that accompanies the paper.
Raw data (raw.zip)
Gene-level transcriptome profiling (RNA-Seq) data (in the form of HTSeq - FPKM) that was downloaded from The Cancer Genome Atlas (TCGA) using the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov).
Human protein expression data that was downloaded from the string-db.org database and from published papers, as follows.
File: 12192_2020_1080_MOESM4_ESM.xlsx. Source: Bie AS, Cömert C, Körner R, Corydon TJ, Palmfeldt J, Hipp MS, et al. An inventory of interactors of the human HSP60/HSP10 chaperonin in the mitochondrial matrix space. Cell Stress Chaperones. 2020;25: 407–416. doi:10.1007/s12192-020-01080-6
File: 41467_2013_BFncomms3139_MOESM481_ESM.xls. Source: Chae YC, Angelin A, Lisanti S, Kossenkov AV, Speicher KD, Wang H, et al. Landscape of the mitochondrial Hsp90 metabolome in tumours. Nat Commun. 2013;4: 2139. doi:10.1038/ncomms3139
File: 12915_2020_740_MOESM8_ESM.xlsx. Source: Joshi A, Dai L, Liu Y, Lee J, Ghahhari NM, Segala G, et al. The mitochondrial HSP90 paralog TRAP1 forms an OXPHOS-regulated tetramer and is involved in mitochondrial metabolic homeostasis. BMC Biol. 2020;18: 10. doi:10.1186/s12915-020-0740-7
File: mmc2.xlsx. Source: Ishizawa J, Zarabi SF, Davis RE, Halgas O, Nii T, Jitkova Y, et al. Mitochondrial ClpP-Mediated Proteolysis Induces Selective Cancer Cell Lethality. Cancer Cell. 2019;35: 721–737.e9. doi:10.1016/j.ccell.2019.03.014
Processed data (processed.zip)
The network data. Rows are chaperones, columns are clients.
Source data
File: Source Data for Figures and Tables.zip
This is the source data underlying the figures and tables, as requested by Nature Communications.
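To make the "rows are chaperones, columns are clients" layout of the processed network data concrete, the sketch below converts such a matrix into a list of chaperone-client edges and counts the clients per chaperone. The file format is an assumption: the listing does not describe the files in processed.zip, so the example parses a small inline CSV with a header row of client names and a first column of chaperone names. The client names and matrix values are invented for illustration.

```python
import csv
import io

# Invented sample in an assumed layout: header row = clients, first column = chaperones.
SAMPLE = """chaperone,clientA,clientB,clientC
HSP60,1,0,1
TRAP1,0,1,0
"""


def read_edges(text):
    """Return (chaperone, client) pairs for every non-zero matrix entry."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    clients = rows[0][1:]
    edges = []
    for row in rows[1:]:
        chaperone = row[0]
        for client, value in zip(clients, row[1:]):
            if float(value) != 0:
                edges.append((chaperone, client))
    return edges


def clients_per_chaperone(edges):
    """Count how many clients each chaperone interacts with (its degree)."""
    counts = {}
    for chaperone, _client in edges:
        counts[chaperone] = counts.get(chaperone, 0) + 1
    return counts


edges = read_edges(SAMPLE)
degrees = clients_per_chaperone(edges)
```

An edge list like this is a common starting point for bipartite network analysis; chaperones with no non-zero entries simply do not appear in the degree counts.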