https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Open access or shared research data must comply with Health Insurance Portability and Accountability Act (HIPAA) patient privacy regulations. These regulations require the de-identification of datasets before they can be placed in the public domain. The process of image de-identification is time-consuming, requires significant human resources, and is prone to human error. Automated image de-identification algorithms have been developed, but the research community requires some method of evaluation before such tools can be widely accepted. This evaluation requires a robust dataset that can be used as part of an evaluation process for de-identification algorithms.
We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM image information objects were selected from datasets published in TCIA. Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM data elements to mimic typical clinical imaging exams. The evaluation dataset was de-identified by a TCIA curation team using standard TCIA tools and procedures. We are publishing the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (result of TCIA curation) in advance of a potential competition, sponsored by the National Cancer Institute (NCI), for de-identification algorithm evaluation, and de-identification of medical image datasets. The evaluation dataset published here is a subset of a larger evaluation dataset that was created under contract for the National Cancer Institute. This subset is being published to allow researchers to test their de-identification algorithms and promote standardized procedures for validating automated de-identification.
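Because the synthetic PHI inserted into the evaluation dataset is known in advance, one simple way to score a de-identification run is to check whether any of those known strings survive in the output. The sketch below is only an illustration of that idea, not the evaluation procedure used by TCIA or NCI; the function name `find_phi_leaks`, the dictionary representation of a DICOM header, and all sample values are hypothetical.

```python
from typing import Dict, List


def find_phi_leaks(deidentified: Dict[str, str],
                   synthetic_phi: List[str]) -> Dict[str, List[str]]:
    """Return, per data element, the known synthetic PHI strings still present."""
    leaks: Dict[str, List[str]] = {}
    for element, value in deidentified.items():
        # Case-insensitive substring search for each known PHI string.
        found = [phi for phi in synthetic_phi if phi.lower() in value.lower()]
        if found:
            leaks[element] = found
    return leaks


# Hypothetical header values produced by an algorithm under test.
deidentified_header = {
    "PatientName": "ANON-0001",
    "StudyDescription": "CT CHEST for Jane Roe",
    "InstitutionName": "",
}
# Hypothetical synthetic PHI that was inserted before de-identification.
known_phi = ["Jane Roe", "555-0100"]

leaks = find_phi_leaks(deidentified_header, known_phi)
print(leaks)
```

A real evaluation would also need to confirm that non-PHI elements required for research were left intact, which this sketch does not attempt.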
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying various TCIA breast cancer collections. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data were imported into I2B2 and cross-mapped to industry standard concepts for names and values, including those derived from BRIDG, CDISC SDTM, and DICOM Structured Reporting models, using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 to CSV and thence converted to DICOM SR according to the DICOM Breast Imaging Report template [1], which supports description of patient characteristics, histopathology, receptor status and clinical findings including measurements. The purpose was not to advocate DICOM SR as an appropriate format for interchange or storage of such information for query purposes, but rather to demonstrate that standard concepts harmonized across multiple collections could be transformed into an existing standard report representation. The DICOM SR can be stored and used together with the images in repositories such as TCIA and in image viewers that support rendering of DICOM SR content. During the project, various deficiencies in the DICOM Breast Imaging Report template were identified with respect to describing breast MR studies, laterality of findings versus procedures, more recently developed receptor types, and patient characteristics and status. These were addressed via DICOM CP 1838, finalized in Jan 2019, and this subset reflects those changes.
DICOM Breast Imaging Report Templates available from: http://dicom.nema.org/medical/dicom/current/output/chtml/part16/sect_BreastImagingReportTemplates.html
DICOM files are given for 11 CT scans which were used in a research article. Each scan contains about 1250 slices with 512x512 gray scale images, each in its own directory. The low number slices contain the diapers in the order D5 ... D1, then the vials of powder are contained in the order V11 ... V2. The symbols correspond to the injected masses which are given in the paper and which are repeated here in a file called "mass.txt". There is also a file "README.txt" which describes the directory structure.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains DICOM versions of the 24 anthropomorphic pulmonary CT phantoms accompanying the manuscript "Automatic Synthesis of Anthropomorphic Pulmonary CT Phantoms" submitted to PLoS ONE.
NRRD versions can be found in http://dx.doi.org/10.5281/zenodo.20766 (doi:10.5281/zenodo.20766).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following are pre-radiotherapy T2W and DWI MRI sequences in Digital Imaging and Communications in Medicine (DICOM) format for 20 patients curated from the MD Anderson Databases (NCT03145077).
For each image set (T2W image and DWI image), ground truth segmentations for the left and right submandibular glands, left and right parotid glands, cervical spinal cord, brainstem, and primary gross tumor volume were manually generated by a trained physician expert (radiologist with > 5 years of experience in HNC). In a subset of five cases, segmentations for all structures in both sequences were also manually generated by three additional separate observers (two physicians and one medical student). All segmentations were generated in Velocity AI (v.3.0.1; Varian Medical Systems; Palo Alto, CA, USA) in DICOM RT structure format.
DICOM data was anonymized using an in-house Python script that implements the RSNA CRP DICOM Anonymizer software. All files have had any DICOM header info and metadata containing PHI removed or replaced with dummy entries.
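The "removed or replaced with dummy entries" step can be pictured as a rule table applied to each header. The following is a minimal sketch under stated assumptions: it is not the authors' in-house script and not the RSNA software, the header is modeled as a plain dictionary rather than a real DICOM object, and the rule table `RULES` and function `scrub_header` are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical rule table: None removes the element, a string replaces its value.
RULES: Dict[str, Optional[str]] = {
    "PatientName": "ANONYMOUS",
    "PatientID": "SUBJ-001",
    "PatientBirthDate": None,
    "ReferringPhysicianName": None,
}


def scrub_header(header: Dict[str, str],
                 rules: Dict[str, Optional[str]] = RULES) -> Dict[str, str]:
    """Return a copy of the header with PHI elements removed or replaced."""
    cleaned: Dict[str, str] = {}
    for key, value in header.items():
        if key not in rules:
            cleaned[key] = value          # not listed as PHI: keep unchanged
        elif rules[key] is not None:
            cleaned[key] = rules[key]     # replace with a dummy entry
        # rules[key] is None: drop the element entirely
    return cleaned


# Hypothetical input header.
original = {
    "PatientName": "Doe^John",
    "PatientID": "12345",
    "PatientBirthDate": "19700101",
    "Modality": "MR",
}
print(scrub_header(original))
```

The input dictionary is left unmodified, so the original and scrubbed headers can be compared afterwards.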
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Brain MRI Dataset, Normal Brain Dataset, Anomaly Classification & Detection
The dataset consists of .dcm files containing MRI scans of the brain of a person with a normal brain. The images are labeled by doctors and accompanied by a report in PDF format. The dataset includes 7 studies, made from different angles, which provide a comprehensive understanding of normal brain structure and are useful in training brain anomaly classification algorithms.
MRI study angles… See the full description on the dataset page: https://huggingface.co/datasets/TrainingDataPro/dicom-brain-dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains the same data as published in our previous zenodo dataset upload. Unlike our previous upload, this version contains data after transferring the DICOMs directly from the Siemens Skyra 3T to our Linux machine (as done in real-time experiments). The purpose of this separate upload is to serve as sample data for our real-time cloud software, for a specific sample project. The brain data are contributed by author S.A.N. and are authorized for non-anonymized distribution.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data associated with manuscript:
The Australasian dingo archetype: De novo chromosome-length genome assembly, DNA methylome, and cranial morphology
Raw DICOM data, Alpine Dingo brain (zip) and domestic dog brain (zip). Brains were scanned using high-resolution magnetic resonance imaging (MRI). A Bruker Biospec 94/20 9.4T high field pre-clinical MRI system located at the Biological Resources Imaging Laboratory, University of New South Wales (UNSW), was used to acquire MRI data of a fixed dingo brain and a fixed domestic dog brain. The system was equipped with microimaging gradients with a maximum gradient strength of 660mT/m and a 72mm Quadrature volume coil. Images were acquired in transverse and coronal orientation using optimized 2D and 3D Fast Spin Echo (FSE) and Gradient Echo (MGE) methods. Image resolution was 200x200x500 and 300x300 microns isotropic for type 3D and 2D pulse sequences, respectively.
Archive of medical images of cancer accessible for public download. All images are stored in DICOM file format and organized as Collections, typically patients related by common disease (e.g. lung cancer), image modality (MRI, CT, etc) or research focus. Neuroimaging data sets include clinical outcomes, pathology, and genomics in addition to DICOM images. Submitting Data Proposals are welcomed.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Magnetic Resonance - Computed Tomography (MR-CT) Jordan University Hospital (JUH) dataset was collected after receiving Institutional Review Board (IRB) approval from the hospital, and consent forms were obtained from all patients. All procedures have been carried out in accordance with The Code of Ethics of the World Medical Association (Declaration of Helsinki).
The dataset consists of 2D image slices extracted using the RadiAnt DICOM viewer software. The extracted images are transformed to DICOM image data format with a resolution of 256x256 pixels. There are a total of 179 2D axial image slices referring to 20 patient volumes (90 MR and 89 CT 2D axial image slices). The dataset contains MR and CT brain tumour images with corresponding segmentation masks. The MR images of each patient were acquired with a 5.00mm T Siemens Verio 3T using a T2-weighted without contrast agent, 3 Fat sat pulses (FS), 2500-4000 TR, 20-30 TE, and 90/180 flip angle. The CT images were acquired with Siemens Somatom scanner with 2.46mGY.cm dose length, 130KV voltage, 113-327 mAs tube current, topogram acquisition protocol, 64 dual source, one projection, and slice thickness of 7.0mm. Smooth and sharp filters have been applied to the CT images. The MR scans have a resolution of 0.7x0.6x5 mm^3, while the CT scans have a resolution of 0.6x0.6x7 mm^3.
More information and the application of the dataset can be found in the following research paper:
Alaa Abu-Srhan; Israa Almallahi; Mohammad Abushariah; Waleed Mahafza; Omar S. Al-Kadi. Paired-Unpaired Unsupervised Attention Guided GAN with Transfer Learning for Bidirectional Brain MR-CT Synthesis. Comput. Biol. Med. 136, 2021. doi: https://doi.org/10.1016/j.compbiomed.2021.104763.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor A DICOM dataset for evaluation of medical image de-identification. Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The MIMIC Chest X-ray (MIMIC-CXR) Database v2.0.0 is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. The dataset contains 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.
https://www.nlm.nih.gov/databases/download/terms_and_conditions.html
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: NLM-Visible-Human-Project. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The NLM Visible Human Project [2] has created publicly-available complete, anatomically detailed, three-dimensional representations of a human male body and a human female body. Specifically, the VHP provides a public-domain library of cross-sectional cryosection, CT, and MRI images obtained from one male cadaver and one female cadaver. The Visible Man data set was publicly released in 1994 and the Visible Woman in 1995.
The data sets were designed to serve as (1) a reference for the study of human anatomy, (2) public-domain data for testing medical imaging algorithms, and (3) a test bed and model for the construction of network-accessible image libraries. The VHP data sets have been applied to a wide range of educational, diagnostic, treatment planning, virtual reality, artistic, mathematical, and industrial uses. About 4,000 licensees from 66 countries were authorized to access the datasets. As of 2019, a license is no longer required to access the VHP datasets.
Courtesy of the U.S. National Library of Medicine. Release of this collection by IDC does not indicate or imply that NLM has endorsed its products/services/applications. Please see the Visible Human Project information page to learn more about the images and to obtain any supporting metadata for this collection. Note that this collection may not reflect the most current/accurate data available from NLM.
Citation guidelines can be found on the National Library of Medicine Terms and Conditions information page.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
nlm_visible_human_project-idc_v15-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
nlm_visible_human_project-idc_v15-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
nlm_visible_human_project-idc_v15-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
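The manifest naming convention described above (collection identifier, IDC release, storage target) can be decoded mechanically. The snippet below is a small sketch that assumes every manifest name follows the pattern shown in the examples on this page; the names `ManifestName` and `parse_manifest_name` are hypothetical and are not part of any IDC tool.

```python
import re
from typing import NamedTuple


class ManifestName(NamedTuple):
    collection_id: str
    idc_version: int
    target: str  # "aws", "gcs" or "dcf"


# Assumed pattern: <collection_id>-idc_v<release>-<target>.<extension>
_PATTERN = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<version>\d+)-(?P<target>aws|gcs|dcf)\.(?:s5cmd|dcf)$"
)


def parse_manifest_name(filename: str) -> ManifestName:
    """Split a manifest file name into collection, IDC release and storage target."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized manifest name: {filename}")
    return ManifestName(match["collection"], int(match["version"]), match["target"])


info = parse_manifest_name("nlm_visible_human_project-idc_v15-aws.s5cmd")
print(info.collection_id, info.idc_version, info.target)
```

Names that do not match the assumed pattern raise a `ValueError` rather than being guessed at.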
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
1. Install the idc-index package: pip install --upgrade idc-index
2. Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
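Before starting a large download it can be useful to inspect what a manifest references. The sketch below assumes that a .s5cmd manifest is a text file with `#` comment lines in the header and one `cp <source> <destination>` command per remaining line; that layout is an assumption here and should be checked against the header of the actual manifest. The function name `list_manifest_sources` and the sample bucket paths are hypothetical.

```python
from typing import List


def list_manifest_sources(manifest_text: str) -> List[str]:
    """Collect the source URLs of all 'cp' commands in a manifest."""
    sources: List[str] = []
    for raw_line in manifest_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "cp":
            sources.append(parts[1])
    return sources


# Hypothetical manifest content, for illustration only.
sample = """\
# hypothetical header comment with download instructions
cp s3://example-bucket/aaaa-1111/* .
cp s3://example-bucket/bbbb-2222/* .
"""

sources = list_manifest_sources(sample)
print(len(sources), "entries")
```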
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
[2] Spitzer, V., Ackerman, M. J., Scherzinger, A. L. & Whitlock, D. The visible human male: a technical report. J. Am. Med. Inform. Assoc. 3, 118–130 (1996). https://doi.org/10.1136/jamia.1996.96236280
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. It is a web-accessible international resource for development, training, and evaluation of computer-assisted diagnostic (CAD) methods for lung cancer detection and diagnosis. Initiated by the National Cancer Institute (NCI), further advanced by the Foundation for the National Institutes of Health (FNIH), and accompanied by the Food and Drug Administration (FDA) through active participation, this public-private partnership demonstrates the success of a consortium founded on a consensus-based process.
Seven academic centers and eight medical imaging companies collaborated to create this data set which contains 1018 cases. Each subject includes images from a clinical thoracic CT scan and an associated XML file that records the results of a two-phase image annotation process performed by four experienced thoracic radiologists. In the initial blinded-read phase, each radiologist independently reviewed each CT scan and marked lesions belonging to one of three categories ("nodule > or =3 mm," "nodule <3 mm," and "non-nodule > or =3 mm"). In the subsequent unblinded-read phase, each radiologist independently reviewed their own marks along with the anonymized marks of the three other radiologists to render a final opinion. The goal of this process was to identify as completely as possible all lung nodules in each CT scan without requiring forced consensus.
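The three marking categories used in the blinded-read phase amount to a simple rule on lesion type and size. The sketch below only illustrates that rule; it is not code from the LIDC-IDRI project or from pylidc. The function name `lidc_category` is hypothetical, and returning `None` for non-nodule findings under 3 mm is an assumption based on the fact that no such category is listed.

```python
from typing import Optional


def lidc_category(is_nodule: bool, diameter_mm: float) -> Optional[str]:
    """Map a lesion to one of the three marking categories, or None if it fits none."""
    if is_nodule:
        return "nodule >=3 mm" if diameter_mm >= 3.0 else "nodule <3 mm"
    # Assumption: non-nodule findings smaller than 3 mm fall outside the three categories.
    return "non-nodule >=3 mm" if diameter_mm >= 3.0 else None


print(lidc_category(True, 4.2))
print(lidc_category(True, 1.5))
print(lidc_category(False, 6.0))
```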
Note: The TCIA team strongly encourages users to review pylidc and the Standardized representation of the TCIA LIDC-IDRI annotations using DICOM (DICOM-LIDC-IDRI-Nodules) of the annotations/segmentations included in this dataset before developing custom tools to analyze the XML version.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contributes DICOM-converted annotations to the publicly available National Cancer Institute Imaging Data Commons [1] Prostate-MRI-US-Biopsy collection (https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=Community&collection_id=prostate_mri_us_biopsy). The Prostate-MRI-US-Biopsy collection was initially released by The Cancer Imaging Archive (TCIA) [2,3,4]. While the images in this collection are stored in the standard DICOM format, the collection is also accompanied by 1017 semi-automatic segmentations of the prostate and 1317 manual segmentations of target lesions in the STL format. Although STL is a common and practical format for 3D printing, it is not interoperable with many visualization and analysis tools commonly used in medical imaging research and does not provide any standard means to communicate metadata, among other limitations. This dataset contains segmentations of the prostate and target lesions harmonized into DICOM representation. Specifically, we created DICOM Encapsulated 3D Manufacturing Model objects (M3D modality) that include the original STL content enriched with the DICOM metadata. Furthermore, we created an alternative encoding of the surface segmentations by rasterizing them and saving the result as a DICOM Segmentation object (SEG modality). As a result, the contributed DICOM objects can be stored in any DICOM server that supports those objects (including Google Healthcare DICOM stores), and the DICOM Segmentations can be visualized using off-the-shelf tools, such as OHIF Viewer. Conversion from STL to DICOM M3D modality was performed using PixelMed toolkit (https://www.pixelmed.com/dicomtoolkit.html). Conversion from STL to DICOM SEG was done in 2 steps. We used Slicer (https://www.slicer.org/) to rasterize each surface segmentation to the matrix of the segmented image, and the results were then converted to DICOM SEGs using dcmqi (https://github.com/QIICR/dcmqi) [5].
Resulting objects were validated using dicom3tools dciodvfy (https://www.dclunie.com/dicom3tools.html). Details describing the conversion process, as well as details on how to access the encapsulated STL content from the DICOM M3D files, are provided in this GitHub repository: https://github.com/ImagingDataCommons/prostate_mri_us_biopsy_dcm_conversion. Specific files included in the record are:
Prostate-MRI-US-Biopsy-DICOM-Annotations.zip: DICOM M3D and SEG files, organized into the folder hierarchy following this pattern: Prostate-MRI-US-Biopsy/%PatientID/%StudyInstanceUID/%SeriesNumber-%Modality-%SeriesDescription.dcm
referenced_images_sorted-idc_file_manifest.s5cmd: IDC manifest for downloading the T2W MRI images corresponding to the annotations. To download the files in this manifest, first install s5cmd (https://github.com/peak/s5cmd), and run the following command: s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run referenced_images_sorted-idc_file_manifest.s5cmd. Files will be organized in the Prostate-MRI-US-Biopsy/%PatientID/%StudyInstanceUID/ folder hierarchy upon download.
References
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S., Aerts, H. J. W. L., Homeyer, A., Lewis, R., Akbarzadeh, A., Bontempi, D., Clifford, W., Herrmann, M. D., Höfener, H., Octaviano, I., Osborne, C., Paquette, S., Petts, J., Punzo, D., Reyes, M., Schacherer, D. P., Tian, M., White, G., Ziegler, E., Shmulevich, I., Pihl, T., Wagner, U., Farahani, K. & Kikinis, R. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021). doi: 10.1158/0008-5472.CAN-21-0950.
[2] Natarajan, S., Priester, A., Margolis, D., Huang, J., & Marks, L. (2020). Prostate MRI and Ultrasound With Pathology and Coordinates of Tracked Biopsy (Prostate-MRI-US-Biopsy) (version 2) [Data set]. The Cancer Imaging Archive. doi: 10.7937/TCIA.2020.A61IOC1A
[3] Sonn GA, Natarajan S, Margolis DJ, MacAiran M, Lieu P, Huang J, Dorey FJ, Marks LS. Targeted biopsy in the detection of prostate cancer using an office based magnetic resonance ultrasound fusion device. Journal of Urology 189, no. 1 (2013): 86-91. doi: 10.1016/j.juro.2012.08.095
[4] Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. doi: 10.1007/s10278-013-9622-7
[5] Herz, C., Fillion-Robin, J.-C., Onken, M., Riesmeier, J., Lasso, A., Pinter, C., Fichtinger, G., Pieper, S., Clunie, D., Kikinis, R. & Fedorov, A. dcmqi: An Open Source Library for Standardized Communication of Quantitative Image Analysis Results Using DICOM. Cancer Res. 77, e87–e90 (2017). doi: 10.1158/0008-5472.CAN-17-0336.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) Database v2.0.0 is a large publicly available dataset of chest radiographs in JPG format with structured labels derived from free-text radiology reports. The MIMIC-CXR-JPG dataset is wholly derived from MIMIC-CXR, providing JPG format files derived from the DICOM images and structured labels derived from the free-text reports. The aim of MIMIC-CXR-JPG is to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. The dataset contains 377,110 JPG format images and structured labels derived from the 227,827 free-text radiology reports associated with these images. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.
The 'Use of Deep Learning for structural analysis of CT-images of soil samples' used a set of soil sample data (CT-images). All the data and programs used here are open source and were created with the help of open source software. All steps are made by Python programs which are included in the data set.
Imaging Data Commons (IDC) is a repository within the Cancer Research Data Commons (CRDC) that manages imaging data and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections that are contributed either by funded US National Cancer Institute (NCI) data collection activities or by individual researchers. Image data hosted by IDC is stored in DICOM format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public imaging datasets are critical for the development and evaluation of automated tools in cancer imaging. Unfortunately, many of the available datasets do not provide annotations of tumors or organs-at-risk, crucial for the assessment of these tools. This is due to the fact that annotation of medical images is time consuming and requires domain expertise. It has been demonstrated that artificial intelligence (AI) based annotation tools can achieve acceptable performance and thus can be used to automate the annotation of large datasets. As part of the effort to enrich the public data available within NCI Imaging Data Commons (IDC) (https://imaging.datacommons.cancer.gov/) [1], we introduce this dataset that consists of such AI-generated annotations for two publicly available medical imaging collections of Computed Tomography (CT) images of the chest. For detailed information concerning this dataset, please refer to our publication here [2].
We use publicly available pre-trained AI tools to enhance CT lung cancer collections that are unlabeled or partially labeled. The first tool is the nnU-Net deep learning framework [3] for volumetric segmentation of organs, where we use a pretrained model (Task D18 using the SegTHOR dataset) for labeling volumetric regions in the image corresponding to the heart, trachea, aorta and esophagus. These are the major organs-at-risk for radiation therapy for lung cancer. We further enhance these annotations by computing 3D shape radiomics features using the pyradiomics package [4]. The second tool is a pretrained model for per-slice automatic labeling of anatomic landmarks and imaged body part regions in axial CT volumes [5].
We focus on enhancing two publicly available collections, the Non-small Cell Lung Cancer Radiomics (NSCLC-Radiomics collection) [6,7], and the National Lung Screening Trial (NLST collection) [8,9]. The CT data for these collections are available both in The Cancer Imaging Archive (TCIA) [10] and in NCI Imaging Data Commons (IDC). Further, the NSCLC-Radiomics collection includes expert-generated manual annotations of several chest organs, allowing us to quantify performance of the AI tools in that subset of data.
IDC is relying on the DICOM standard to achieve FAIR [10] sharing of data and interoperability. Generated annotations are saved as DICOM Segmentation objects (volumetric segmentations of regions of interest) created using dcmqi [12], and DICOM Structured Report (SR) objects (per-slice annotations of the body part imaged, anatomical landmarks and radiomics features) created using dcmqi and highdicom [13]. 3D shape radiomics features and corresponding DICOM SR objects are also provided for the manual segmentations available in the NSCLC-Radiomics collection.
The dataset is available in IDC, and is accompanied by our publication here [2]. This pre-print details how the data were generated, and how the resulting DICOM objects can be interpreted and used in tools. Additionally, for further information about how to interact with and explore the dataset, please refer to our repository and accompanying Google Colaboratory notebook.
The annotations are organized as follows. For NSCLC-Radiomics, three nnU-Net models were evaluated ('2d-tta', '3d_lowres-tta' and '3d_fullres-tta'), each with its own folder. Within each folder, the PatientID and the StudyInstanceUID are nested subdirectories, and within the StudyInstanceUID subdirectory the DICOM Segmentation object and the DICOM SR for the 3D shape features are stored. Separate directories for the DICOM SR body part regression regions ('sr_regions') and landmarks ('sr_landmarks') are also provided with the same folder structure as above. Lastly, the DICOM SR objects for the existing manual annotations are provided in the 'sr_gt' directory. For NSCLC-Radiomics, each patient has a single StudyInstanceUID. The DICOM Segmentation and SR objects are named according to the SeriesInstanceUID of the original CT files.
nsclc
    2d-tta
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_SEG.dcm
                ReferencedSeriesInstanceUID_features_SR.dcm
    3d_lowres-tta
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_SEG.dcm
                ReferencedSeriesInstanceUID_features_SR.dcm
    3d_fullres-tta
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_SEG.dcm
                ReferencedSeriesInstanceUID_features_SR.dcm
    sr_regions
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_regions_SR.dcm
    sr_landmarks
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_landmarks_SR.dcm
    sr_gt
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_features_SR.dcm
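Given the folder layout described for NSCLC-Radiomics, the expected location of each annotation file for one CT series can be derived from its identifiers. The following sketch assumes the layout exactly as listed above and uses forward-slash paths relative to the top-level 'nsclc' folder; the function name `nsclc_annotation_paths`, the dictionary keys, and the sample identifiers are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


def nsclc_annotation_paths(model: str, patient_id: str,
                           study_uid: str, series_uid: str) -> Dict[str, PurePosixPath]:
    """Build the expected relative paths of the annotation files for one CT series."""
    base = PurePosixPath("nsclc")
    tail = PurePosixPath(patient_id) / study_uid
    return {
        "seg": base / model / tail / f"{series_uid}_SEG.dcm",
        "features_sr": base / model / tail / f"{series_uid}_features_SR.dcm",
        "regions_sr": base / "sr_regions" / tail / f"{series_uid}_regions_SR.dcm",
        "landmarks_sr": base / "sr_landmarks" / tail / f"{series_uid}_landmarks_SR.dcm",
        "gt_features_sr": base / "sr_gt" / tail / f"{series_uid}_features_SR.dcm",
    }


# Hypothetical identifiers, for illustration only.
paths = nsclc_annotation_paths("3d_fullres-tta", "PATIENT-001", "1.2.3", "4.5.6")
for name, path in paths.items():
    print(name, path)
```

The paths are built without touching the file system, so the same function can be used to check which expected files are missing after extraction.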
For NLST, the '3d_fullres-tta' model was evaluated. The data is organized the same as above, where within each folder the PatientID and the StudyInstanceUID are subdirectories. For the NLST collection, it is possible that some patients have more than one StudyInstanceUID subdirectory. Separate directories for the DICOM SR body part regression regions ('sr_regions') and landmarks ('sr_landmarks') are also provided. The DICOM Segmentation and SR objects are named according to the SeriesInstanceUID of the original CT files.
nlst
    3d_fullres-tta
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_SEG.dcm
                ReferencedSeriesInstanceUID_features_SR.dcm
    sr_regions
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_regions_SR.dcm
    sr_landmarks
        PatientID
            StudyInstanceUID
                ReferencedSeriesInstanceUID_landmarks_SR.dcm
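Going the other way, a relative file path in this layout can be split back into its identifiers, which helps when a patient has several StudyInstanceUID subdirectories. This is a sketch under the assumption that every path has exactly five components (collection, folder, PatientID, StudyInstanceUID, file name) and one of the four file-name suffixes listed above; `AnnotationFile` and `parse_annotation_path` are hypothetical names, and the sample identifiers are made up.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class AnnotationFile(NamedTuple):
    collection: str
    folder: str
    patient_id: str
    study_uid: str
    series_uid: str
    kind: str


# File-name suffixes taken from the directory listing, mapped to a short label.
_SUFFIXES = {
    "_SEG.dcm": "SEG",
    "_features_SR.dcm": "features_SR",
    "_regions_SR.dcm": "regions_SR",
    "_landmarks_SR.dcm": "landmarks_SR",
}


def parse_annotation_path(path: str) -> AnnotationFile:
    """Split a relative annotation path into its identifiers and object kind."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path}")
    collection, folder, patient_id, study_uid, filename = parts
    for suffix, kind in _SUFFIXES.items():
        if filename.endswith(suffix):
            series_uid = filename[: -len(suffix)]
            return AnnotationFile(collection, folder, patient_id,
                                  study_uid, series_uid, kind)
    raise ValueError(f"unrecognized file name: {filename}")


# Hypothetical path, for illustration only.
parsed = parse_annotation_path("nlst/sr_landmarks/100002/1.2.840/1.3.6_landmarks_SR.dcm")
print(parsed)
```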
The query used for NSCLC-Radiomics is here, and a list of corresponding SeriesInstanceUIDs (along with PatientIDs and StudyInstanceUIDs) is here. The query used for NLST is here, and a list of corresponding SeriesInstanceUIDs (along with PatientIDs and StudyInstanceUIDs) is here. The two csv files that describe the series analyzed, nsclc_series_analyzed.csv and nlst_series_analyzed.csv, are also available as uploads to this repository.
Version updates:
Version 2: For the regions SR and landmarks SR, changed to use a distinct TrackingUniqueIdentifier for each MeasurementGroup. Also instead of using TargetRegion, changed to use FindingSite. Additionally for the landmarks SR, the TopographicalModifier was made a child of FindingSite instead of a sibling.
Version 3: Added the two csv files that describe which series were analyzed
Version 4: Modified the landmarks SR as the TopographicalModifier for the Kidney landmark (bottom) does not describe the landmark correctly. The Kidney landmark is the "first slice where both kidneys can be seen well." Instead, removed the use of the TopographicalModifier for that landmark. For the features SR, modified the units code for the Flatness and Elongation, as we incorrectly used mm units instead of no units.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-BRCA. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
Please see the TCGA-BRCA page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_brca-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_brca-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_brca-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those that end in -gcs.s5cmd reference files in Google Cloud Storage (GCS). The actual files are identical and are mirrored between AWS and GCS.
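The naming convention described above can be checked programmatically. The following is a minimal sketch, assuming every manifest name follows the pattern collection_id-idc_vN-target.extension as in the three files listed here; the names ManifestName and parse_manifest_name are illustrative and are not part of any IDC tool.

```python
import re
from typing import NamedTuple, Optional


class ManifestName(NamedTuple):
    collection_id: str  # e.g. "tcga_brca"
    idc_version: int    # IDC data release in which this version first appeared
    target: str         # "aws", "gcs", or "dcf"
    extension: str      # "s5cmd" or "dcf"


# Assumed pattern, inferred from the file names listed above.
_MANIFEST_RE = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<version>\d+)-(?P<target>aws|gcs|dcf)\.(?P<ext>s5cmd|dcf)$"
)


def parse_manifest_name(filename: str) -> Optional[ManifestName]:
    """Split a manifest file name into its parts, or return None if it does not match."""
    match = _MANIFEST_RE.match(filename)
    if match is None:
        return None
    return ManifestName(
        match["collection"], int(match["version"]), match["target"], match["ext"]
    )


print(parse_manifest_name("tcga_brca-idc_v8-aws.s5cmd"))
```

Returning None for names that do not match keeps the helper safe to run over a directory that also contains unrelated files.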
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
Install the idc-index package: pip install --upgrade idc-index
Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using the .dcf manifest, see the manifest header.
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Open access or shared research data must comply with (HIPAA) patient privacy regulations. These regulations require the de-identification of datasets before they can be placed in the public domain. The process of image de-identification is time consuming, requires significant human resources, and is prone to human error. Automated image de-identification algorithms have been developed but the research community requires some method of evaluation before such tools can be widely accepted. This evaluation requires a robust dataset that can be used as part of an evaluation process for de-identification algorithms.
We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM image information objects were selected from datasets published in TCIA. Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM data elements to mimic typical clinical imaging exams. The evaluation dataset was de-identified by a TCIA curation team using standard TCIA tools and procedures. We are publishing the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (result of TCIA curation) in advance of a potential competition, sponsored by the National Cancer Institute (NCI), for de-identification algorithm evaluation, and de-identification of medical image datasets. The evaluation dataset published here is a subset of a larger evaluation dataset that was created under contract for the National Cancer Institute. This subset is being published to allow researchers to test their de-identification algorithms and promote standardized procedures for validating automated de-identification.