The National Cancer Institute (NCI) Genomic Data Commons (GDC) is a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI). The GDC Data Portal provides a platform for efficiently querying and downloading high-quality, complete data. The GDC also provides a GDC Data Transfer Tool and a GDC API for programmatic access.
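As a rough illustration of programmatic access through the GDC API, the sketch below builds a query URL that lists cases in one project. The endpoint (https://api.gdc.cancer.gov/cases), the filter structure, and the field names follow the public GDC API as commonly documented, but they do not come from this listing and should be checked against the current GDC documentation. The helper name build_cases_query is hypothetical. No network request is made.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint of the GDC API; verify against current GDC docs.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cases_query(project_id, fields, size=10):
    """Return a GDC API URL that lists cases belonging to one project."""
    # GDC filters are JSON objects with an operator and its content.
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_CASES_ENDPOINT}?{urlencode(params)}"


# Example: the first five TCGA-STAD cases with two illustrative fields.
url = build_cases_query("TCGA-STAD", ["submitter_id", "primary_site"], size=5)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. For bulk file downloads, the GDC Data Transfer Tool mentioned above is the intended route.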
Portal for finding and analyzing cancer imaging data. Part of the Cancer Research Data Commons, it supports cancer imaging research by providing cloud-based access to medical imaging data and a library of analytical tools and workflows for sharing, analyzing, and visualizing multimodal imaging data from both clinical and basic cancer research studies.
SEER Limited-Use cancer incidence data with associated population data. Geographic areas available are county and SEER registry. The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute collects and distributes high quality, comprehensive cancer data from a number of population-based cancer registries. Data include patient demographics, primary tumor site, morphology, stage at diagnosis, first course of treatment, and follow-up for vital status. The SEER Program is the only comprehensive source of population-based information in the United States that includes stage of cancer at the time of diagnosis and survival rates within each stage.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Stomach Adenocarcinoma (TCGA-STAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
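The matching of identifiers across the two resources can be sketched as follows. This example assumes that the patient-level identifier is the first three hyphen-separated fields of a TCGA barcode (for example TCGA-XX-XXXX) and that imaging records carry that same identifier as their patient ID. All barcodes shown are hypothetical placeholders, and the function names are illustrative only.

```python
def patient_id(barcode):
    """Return the patient-level prefix (first three fields) of a TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return "-".join(parts[:3])


def match_patients(genomic_barcodes, imaging_patient_ids):
    """Return sorted patient IDs present in both the genomic and imaging sets."""
    genomic = {patient_id(b) for b in genomic_barcodes}
    return sorted(genomic & set(imaging_patient_ids))


# Hypothetical sample-level barcodes (genomic side) and patient IDs (imaging side).
genomic_barcodes = [
    "TCGA-AA-0001-01A-11D-0000-01",
    "TCGA-AA-0002-01A-11R-0000-07",
    "TCGA-BB-0003-01A",
]
imaging_patient_ids = ["TCGA-AA-0001", "TCGA-BB-0003", "TCGA-CC-0004"]

matched = match_patients(genomic_barcodes, imaging_patient_ids)
print(matched)
```

Only the matched patients can be used for genotype, radiological phenotype, and outcome correlations. Because the images are heterogeneous, acquisition metadata usually has to be inspected before pooling them.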
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
The United States Cancer Statistics (USCS) online databases in WONDER provide cancer incidence and mortality data for the United States for the years since 1999, by year, state and metropolitan area (MSA), age group, race, ethnicity, sex, childhood cancer classification and cancer site. The databases report case counts, deaths, crude and age-adjusted incidence and death rates, and 95% confidence intervals for rates. The USCS data are the official federal statistics on cancer incidence from registries having high-quality data and cancer mortality statistics for 50 states and the District of Columbia. USCS are produced by the Centers for Disease Control and Prevention (CDC) and the National Cancer Institute (NCI), in collaboration with the North American Association of Central Cancer Registries (NAACCR). Mortality data are provided by the Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS), National Vital Statistics System (NVSS).
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-TGCT. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
Testicular germ cell cancer is rare, comprising 1-2% of all tumors in males. However, it is the most common cancer in men ages 15 to 35. The incidence of testicular germ cell cancer has been rising continuously in many regions, including Europe and the U.S. In 2013, about 8,000 American men were estimated to be diagnosed with the cancer, and 370 of them were predicted to die from the disease. Men who are Caucasian or who have an undescended testicle, abnormally developed testicles, or a family history of testicular cancer have a greater risk of developing testicular cancer. Fortunately, testicular germ cell cancer is highly treatable.
Please see the TCGA-TGCT information page to learn more about the images and to obtain any supporting metadata for this collection.
Citation guidelines can be found on the Citing TCGA in Publications and Presentations information page.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_tgct-idc_v10-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_tgct-idc_v10-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_tgct-idc_v10-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
1. Install the idc-index package: pip install --upgrade idc-index
2. Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
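The manifest naming convention described above can be sketched as a small parser. The regular expression is inferred only from the file names shown in this record (collection ID, IDC release number, storage target, extension). It is an assumption, not an official IDC specification, and parse_manifest_name is a hypothetical helper name.

```python
import re

# Pattern inferred from names such as tcga_tgct-idc_v10-aws.s5cmd (assumption).
MANIFEST_NAME = re.compile(
    r"^(?P<collection_id>.+)-idc_v(?P<release>\d+)"
    r"-(?P<target>aws|gcs|dcf)\.(?P<ext>s5cmd|dcf)$"
)


def parse_manifest_name(name):
    """Split a manifest file name into collection ID, release, target, extension."""
    match = MANIFEST_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized manifest name: {name!r}")
    info = match.groupdict()
    info["release"] = int(info["release"])  # e.g. "10" -> 10
    return info


info = parse_manifest_name("tcga_tgct-idc_v10-aws.s5cmd")
print(info)
```

A parser like this makes it easy to pick the AWS or Google Cloud Storage manifest for a given release when several manifests sit in the same directory.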
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
The purpose of the National Oncology Register (hereinafter referred to as NOR) is the registration of oncological diseases and periodic monitoring of their further development, i.e. data collection, verification, storage, protection and processing. NOR provides summary data for statistical overviews at both national and international levels, as well as for epidemiological studies and health research. NOR is a nationwide population register that follows on from the monitoring of neoplasms in the population of the Czech Republic introduced in the 1950s; as a population register of records of individual neoplasms, it has been operated by the ÚZIS of the Czech Republic since 1976.
NOR data are also used to support early diagnosis and treatment of neoplasms and pre-cancerous conditions, to monitor trends in their occurrence, causative factors and social consequences. At the population level, the results of the treatment of neoplasms are also evaluated in the form of a survival analysis.
The Chemical Carcinogenesis Research Information System (CCRIS) database contains chemical records with carcinogenicity, mutagenicity, tumor promotion, and tumor inhibition test results. It was developed by the National Cancer Institute (NCI). Data are derived from studies cited in primary journals, current awareness tools, NCI reports, and other sources. Test results have been reviewed by experts in carcinogenesis and mutagenesis. CCRIS provides historical information from the years 1985 - 2011. It is no longer updated.
This blog post was posted by Sandeep Patel on June 18, 2015
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
The National Cancer Institute’s Cancer Registry is a nationwide, population-based cancer registry that covers the entire territory of Lithuania and collects information about all new cancer cases (ICD-10-AM codes: C00-C96, D00-D09, D32-D33, D39.1, D42-D43, D45-D47) among all cancer patients.
The main task of the Cancer Registry is to ensure that registration of incident cancer cases is as complete and reliable as possible.
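The code ranges listed above can be turned into a simple scope check. This is a minimal sketch: it assumes that any subcode under a three-character category inside a listed range is in scope (for example C50.9 falls under C00-C96) and that D39.1 is matched exactly. The registry's actual inclusion rules may be more detailed, and in_registry_scope is a hypothetical name.

```python
import re

# Category ranges taken from the registry description: (letter, first, last).
CATEGORY_RANGES = [
    ("C", 0, 96),
    ("D", 0, 9),
    ("D", 32, 33),
    ("D", 42, 43),
    ("D", 45, 47),
]
# Codes listed individually rather than as a category range.
EXACT_CODES = {"D39.1"}

CODE_PATTERN = re.compile(r"^([A-Z])(\d{2})(?:\.(\d+))?$")


def in_registry_scope(code):
    """Return True if an ICD-10 style code falls within the listed ranges."""
    code = code.strip().upper()
    if code in EXACT_CODES:
        return True
    match = CODE_PATTERN.match(code)
    if match is None:
        return False
    letter, category = match.group(1), int(match.group(2))
    return any(
        letter == range_letter and first <= category <= last
        for range_letter, first, last in CATEGORY_RANGES
    )


for example in ["C50.9", "D39.1", "D39.0", "D10"]:
    print(example, in_registry_scope(example))
```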
In 1984 the Lithuanian Cancer Registry was established at the National Cancer Institute by the Order of the Minister of Health. The population-based Cancer Registry was set up in 1990.
PHS does NOT host these data. This listing is information only.
The Greater Bay Area Cancer Registry (GBACR), in compliance with California state law, gathers information about all cancers diagnosed or treated in a nine-county area (Alameda, Contra Costa, Marin, Monterey, San Benito, San Francisco, San Mateo, Santa Clara and Santa Cruz). This information is obtained from medical records provided by hospitals, doctors’ offices, and other related facilities.
The information, stored under secure conditions with strict regulations that protect confidentiality, helps the GBACR understand cancer occurrence and survival in the Greater Bay Area. For each patient, the information includes basic demographic facts like age, gender, and race/ethnicity, as well as cancer type, extent of disease, treatment and survival. Combined over the diverse Bay Area population, this information gives the GBACR and all users an opportunity to learn how such characteristics may be related to cancer causes, mortality, care and prevention.
In addition to its local use, information collected by the GBACR becomes part of state and federal population-based registries whose mission is to monitor cancer occurrence at the state and national levels, respectively. Data from the GBACR have contributed to the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) program since 1973. The nine counties are also part of the statewide California Cancer Registry (CCR), which conducts essential monitoring of cancer occurrence and survival in California.
GBACR data are of the highest quality, as recognized by national and international registry standard-setting organizations, including SEER, the National Program of Cancer Registries, and the North American Association of Central Cancer Registries (NAACCR).
The CPIC has also started collecting data on environmental factors. These data are available in the California Neighborhoods Data System. This new resource for examining the impact of neighborhood characteristics on cancer incidence and outcomes in populations includes a compilation of existing geospatial and other secondary data for characterizing contextual factors.
A summary and description of social and built environment data and measures in the California Neighborhoods Data System (2010) can be found here: Social and Built Environment Data and Measures
More information about this new data source can be found here: The California Neighborhoods Data System
Patient characteristics: All reported cancer cases in the state of California.
Data overview
Data categories: Socioeconomic status; Racial/ethnic composition; Immigration/acculturation characteristics; Racial/ethnic residential segregation; Population density; Urbanicity (Rural/Urban); Housing; Businesses; Commuting; Street connectivity; Parks; Farmers Markets; Traffic density; Crime; Tapestry Segmentation
Notes: To apply for these data, see the instructions here: https://www.ccrcal.org/retrieve-data/data-for-researchers/how-to-request-ccr-data/
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Ovarian Cancer (TCGA-OV) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Ovarian Phenotype Research Group.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: HTAN-OHSU. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Human Tumor Atlas Network (HTAN) [2], part of the National Cancer Institute (NCI) Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types.
The overall goal of the HTAN OMS Atlas Center is to elucidate mechanisms by which metastatic breast cancers become resistant to current generation pathway- and immune checkpoint-targeted treatments. The OMS Atlas is motivated by the appreciation that these treatments are often effective in primary tumors but only transiently effective in the metastatic setting. Possible resistance mechanisms include tumor-intrinsic genomic instability and epigenomic plasticity, as well as events extrinsic to the cancer cells, including chemical and mechanical signals from the microenvironments, production of mechanical extracellular matrix barriers and/or changes in vasculature that reduce drug and/or immune cell access, nanoscale cancer cell-microenvironment interactions that reduce drug efficacy, and a plethora of immune resistance mechanisms, such as loss of HLA expression and antigen presentation, and immune exhaustion. These mechanisms likely vary between patients and within individual patients and change with time as tumors respond to therapeutic attack. The OMS Atlas will focus on elucidating resistance mechanisms in two specific current generation clinical trial scenarios: (a) hormone receptor-positive breast cancer (HRBC) undergoing treatment with a CDK4/6 inhibitor in combination with endocrine therapy and (b) triple negative breast cancer (TNBC) undergoing treatment with a PARP inhibitor and an immunomodulatory agent.
Please see the HTAN-OHSU information page to learn more about the images and to obtain any supporting metadata for this collection.
Citation guidelines can be found on the HTAN Publication Policy information page.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
htan_ohsu-idc_v10-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
htan_ohsu-idc_v10-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
htan_ohsu-idc_v10-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
1. Install the idc-index package: pip install --upgrade idc-index
2. Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
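The download step can also be scripted. The sketch below wraps the idc download command quoted in the instructions above. It defaults to a dry run that only returns the command, so nothing is downloaded unless dry_run=False is passed. The wrapper functions are hypothetical, and the manifest file name is the one listed in this record.

```python
import shutil
import subprocess


def idc_download_command(manifest_path):
    """Build the command line given in the download instructions."""
    return ["idc", "download", manifest_path]


def download(manifest_path, dry_run=True):
    """Run `idc download` for a manifest; with dry_run=True only return the command."""
    cmd = idc_download_command(manifest_path)
    if dry_run:
        return cmd
    # The `idc` executable is installed by: pip install --upgrade idc-index
    if shutil.which("idc") is None:
        raise RuntimeError("idc not found; run: pip install --upgrade idc-index")
    subprocess.run(cmd, check=True)
    return cmd


# Dry run: show what would be executed for the AWS manifest of this collection.
print(" ".join(download("htan_ohsu-idc_v10-aws.s5cmd")))
```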
Collection of the images that were converted by IDC was supported through the Human Tumor Atlas Network, grants 1U2CCA233280-01 "Omic and Multidimensional Spatial Atlas of Metastatic Breast and Prostate Cancers" and 1U24CA233243-01 "Human Tumor Atlas Network: Data Coordinating Center" from National Cancer Institute.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
[2] Rozenblatt-Rosen, O., Regev, A., Oberdoerffer, P., Nawy, T., Hupalowska, A., Rood, J. E., Ashenberg, O., Cerami, E., Coffey, R. J., Demir, E., Ding, L., Esplin, E. D., Ford, J. M., Goecks, J., Ghosh, S., Gray, J. W., Guinney, J., Hanlon, S. E., Hughes, S. K., Hwang, E. S., Iacobuzio-Donahue, C. A., Jané-Valbuena, J., Johnson, B. E., Lau, K. S., Lively, T., Mazzilli, S. A., Pe’er, D., Santagata, S., Shalek, A. K., Schapiro, D., Snyder, M. P., Sorger, P. K., Spira, A. E., Srivastava, S., Tan, K., West, R. B., Williams, E. H. & Human Tumor Atlas Network. The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution. Cell 181, 236–249 (2020). http://dx.doi.org/10.1016/j.cell.2020.03.053
The State Cancer Profiles (SCP) web site provides statistics to help guide and prioritize cancer control activities at the state and local levels. SCP is a collaborative effort using local and national level cancer data from the Centers for Disease Control and Prevention's National Program of Cancer Registries (NPCR) and National Cancer Institute's Surveillance, Epidemiology and End Results Registries (SEER). SCP addresses select types of cancer and select behavioral risk factors for which there are evidence-based control interventions. The site provides incidence, mortality and prevalence comparison tables as well as interactive graphs and maps and support data. The graphs and maps provide visual support for deciding where to focus cancer control efforts.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
At the time of our study, 108 cases with breast MRI data were available in The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) collection. To minimize variations in image quality across the multi-institutional cases, we included only breast MRI studies acquired on GE 1.5 Tesla scanners (GE Medical Systems, Milwaukee, Wisconsin, USA), yielding a total of 93 cases. We then excluded cases that had missing images in the dynamic sequence (1 patient) or that did not, at the time, have gene expression analysis available in the TCGA Data Portal (8 patients). Applying these criteria resulted in a dataset of 84 breast cancer patients, with MRIs from four institutions: Memorial Sloan Kettering Cancer Center, the Mayo Clinic, the University of Pittsburgh Medical Center, and the Roswell Park Cancer Institute. The cases contributed by each institution numbered 9 (date range 1999-2002), 5 (1999-2003), 46 (1999-2004), and 24 (1999-2002), respectively. The dataset of biopsy-proven invasive breast cancers included 74 (88%) ductal, 8 (10%) lobular, and 2 (2%) mixed. Of these, 73 (87%) were ER+, 67 (80%) were PR+, and 19 (23%) were HER2+. Various types of analyses were conducted using the combined imaging, genomic, and clinical data. Those analyses are described within several manuscripts created by the group (cited below). Additional information about the methodology for how the Radiologist Annotations file was created can be found on the TCGA Breast Image Feature Scoring Project page.
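The cohort arithmetic above can be checked with a few lines of code. All counts are taken directly from the description. The variable names and the percent helper are illustrative, and percentages are rounded to the nearest whole number, as in the text.

```python
# Counts quoted in the collection description.
available_cases = 108
ge_15t_cases = 93                # after restricting to GE 1.5 Tesla scanners
missing_dynamic_images = 1       # excluded: incomplete dynamic sequence
no_gene_expression = 8           # excluded: no gene expression analysis

final_cohort = ge_15t_cases - missing_dynamic_images - no_gene_expression

per_institution = {
    "Memorial Sloan Kettering Cancer Center": 9,
    "Mayo Clinic": 5,
    "University of Pittsburgh Medical Center": 46,
    "Roswell Park Cancer Institute": 24,
}
histology = {"ductal": 74, "lobular": 8, "mixed": 2}
receptors = {"ER+": 73, "PR+": 67, "HER2+": 19}


def percent(count, total):
    """Percentage of total, rounded to the nearest whole number."""
    return round(100 * count / total)


print("final cohort:", final_cohort)
print("institution total:", sum(per_institution.values()))
for label, count in {**histology, **receptors}.items():
    print(f"{label}: {count} ({percent(count, final_cohort)}%)")
```

The exclusions, the per-institution counts, and the histology counts all agree with the stated cohort of 84 patients.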
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary statistics are fundamental to data science and are the building blocks of statistical reasoning. Most of the data and statistics made available on government web sites are aggregate; however, until now, we have not had a suitable linked data representation available. We propose a way to express summary statistics across aggregate groups as linked data using Web Ontology Language (OWL) class-based sets, where members of the set contribute to the overall aggregate value. Additionally, many clinical studies in the biomedical field rely on demographic summaries of their study cohorts and the patients assigned to each arm. While most data query languages, including SPARQL, allow for computation of summary statistics, they do not provide a way to integrate those values back into the RDF graphs they were computed from. We represent this knowledge, which would otherwise be lost, through the use of OWL 2 punning semantics, the expression of aggregate grouping criteria as OWL classes with variables, and constructs from the Semanticscience Integrated Ontology (SIO) and the World Wide Web Consortium's provenance ontology, PROV-O, providing interoperable representations that are well supported across the web of Linked Data. We evaluate these semantics using a Resource Description Framework (RDF) representation of patient case information from the Genomic Data Commons, a data portal from the National Cancer Institute.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
CMB program is organized into multiple cancer-specific collections. Digital pathology images for each of those collections were converted into DICOM representation by the IDC team and are shared via IDC. This entry corresponds to the CMB-OV collection (Ovarian cancer).
Digital pathology images, augmented with the metadata describing their content, were converted into DICOM Whole Slide Microscopy (SM) representation [2,3] using custom open source scripts and tools as described in [4].
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
For each of the collections, the following manifest files are provided:
a manifest of files available for download from public IDC Amazon Web Services buckets
a manifest of files available for download from public IDC Google Cloud Storage buckets
a Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
1. Install the idc-index package: pip install --upgrade idc-index
2. Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
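Since each manifest carries its download instructions in a header, it can help to separate that header from the file entries before downloading. The sketch below assumes that header lines start with "#" and that every other non-empty line is one download entry. That layout is an assumption about the .s5cmd format, not something stated in this record, and the sample manifest content is entirely hypothetical.

```python
def split_manifest(text):
    """Split manifest text into header comment lines and download entries."""
    header, entries = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("#"):
            header.append(stripped.lstrip("#").strip())
        else:
            entries.append(stripped)
    return header, entries


# Hypothetical manifest content used only to exercise the function.
sample = """\
# Hypothetical header: download instructions would appear here.
# idc download manifest.s5cmd
cp s3://example-bucket/series-0001/* .
cp s3://example-bucket/series-0002/* .
"""

header, entries = split_manifest(sample)
print(f"{len(header)} header lines, {len(entries)} download entries")
```

Printing the header before running a download is a quick way to confirm which tool and command a given manifest expects.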
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National cancer institute imaging data commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 43, (2023).
[2] National Electrical Manufacturers Association (NEMA). DICOM PS3.3 - Information Object Definitions: A.32.8 VL Whole Slide Microscopy Image IOD. at <https://dicom.nema.org/medical/dicom/current/output/html/part03.html#sect_A.32.8>
[3] Herrmann, M. D., Clunie, D. A., Fedorov, A., Doyle, S. W., Pieper, S., Klepeis, V., Le, L. P., Mutter, G. L., Milstone, D. S., Schultz, T. J., Kikinis, R., Kotecha, G. K., Hwang, D. H., Andriole, K. P., John Lafrate, A., Brink, J. A., Boland, G. W., Dreyer, K. J., Michalski, M., Golden, J. A., Louis, D. N. & Lennerz, J. K. Implementing the DICOM standard for digital pathology. J. Pathol. Inform. 9, 37 (2018).
[4] Clunie, D., Fedorov, A. & Herrmann, M. D. ImagingDataCommons/idc-wsi-conversion: Initial release. (Zenodo, 2023). doi:10.5281/ZENODO.8240154