4 datasets found
  1. Preliminary Mitosis Detection Results for TCGA-BRCA Dataset

    • data.niaid.nih.gov
    Updated Feb 21, 2024
    Cite
    Jahanifar, Mostafa (2024). Preliminary Mitosis Detection Results for TCGA-BRCA Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10245706
    Dataset authored and provided by
    Jahanifar, Mostafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides mitosis detection results obtained by applying the "Mitosis Detection, Fast and Slow" (MDFS) algorithm (arXiv:2208.12587, "Mitosis Detection, Fast and Slow: Robust and Efficient Detection of Mitotic Figures") to the TCGA-BRCA dataset.

    The MDFS algorithm is a robust and efficient two-stage process for mitosis detection: candidate mitotic figures are first identified and then refined. The candidate detection model, EUNet, is fast and accurate largely because of its structural design: it outlines candidate areas at a lower resolution, which significantly speeds up detection. In the second stage, the candidates are refined using a more intricate classifier network, EfficientNet-B7. The MDFS algorithm was originally developed for the MIDOG challenges.

    Viewing in QuPath

    The dataset comprises GeoJSON files in two categories: mitosis and proxy (mimickers, i.e., candidates that are unlikely to be mitoses according to our algorithm). Users can open and visualize each category overlaid on the Whole Slide Image (WSI) using QuPath by dragging and dropping the annotation file onto the opened image in the program. Additionally, users can employ the provided Python snippet to read the annotations into a Python dictionary or a NumPy array.

    Loading in Python

    To load the GeoJSON files in Python, users can use the following code:

    import json

    import numpy as np
    import pandas as pd


    def load_geojson(filename):
        """Return slide properties plus candidate points as an array and a DataFrame."""
        # Load the GeoJSON file
        with open(filename, 'r') as f:
            data = json.load(f)

        # Extract the slide-level properties dictionary
        slide_properties = data["properties"]

        # Convert the points to an (N, 3) numpy array of x, y, score;
        # the reshape keeps the shape valid when a file has no features
        points_np = np.array(
            [(feat['geometry']['coordinates'][0],
              feat['geometry']['coordinates'][1],
              feat['properties']['score'])
             for feat in data['features']],
            dtype=float,
        ).reshape(-1, 3)

        # Convert the points to a pandas DataFrame
        points_df = pd.DataFrame(points_np, columns=['x', 'y', 'score'])

        return slide_properties, points_np, points_df


    # Use the function to load mitosis data
    mitosis_properties, mitosis_points_np, mitosis_points_df = load_geojson('mitosis.geojson')

    # Use the function to load mimickers data
    mimickers_properties, mimickers_points_np, mimickers_points_df = load_geojson('mimickers.geojson')
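The loader above implies a particular file layout: a top-level "properties" dictionary plus a "features" list whose entries carry point coordinates and a "score". The sketch below builds a tiny in-memory example of that layout and parses it with the standard library only. It is an illustration under assumptions: the "FeatureCollection" and "Point" type fields, the helper name parse_candidates, and every value in the sample are hypothetical and are not taken from the dataset.

```python
import json

# Hypothetical file content: every value below is made up for illustration.
sample_text = json.dumps({
    "type": "FeatureCollection",
    "properties": {"slide_id": "example-slide", "mitosis_threshold": 0.5},
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [1200.0, 3400.0]},
            "properties": {"score": 0.91},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [560.0, 780.0]},
            "properties": {"score": 0.64},
        },
    ],
})


def parse_candidates(text):
    """Return (slide_properties, [(x, y, score), ...]) from GeoJSON text."""
    data = json.loads(text)
    rows = [
        (feat["geometry"]["coordinates"][0],
         feat["geometry"]["coordinates"][1],
         feat["properties"]["score"])
        for feat in data["features"]
    ]
    return data["properties"], rows


slide_properties, rows = parse_candidates(sample_text)
print(slide_properties["slide_id"], rows)
```

Checking a downloaded file against this shape before running the NumPy/pandas loader may help catch layout differences early.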

    Properties

    Each WSI in the dataset includes the candidate's centroid, bounding box, hotspot location, hotspot mitotic count, and hotspot mitotic score. The structures of the mitosis and mimicker property dictionaries are as follows:

    Mitosis property dictionary structure:

    mitosis_properties = {
        'slide_id': slide_id,
        'slide_height': img_h,
        'slide_width': img_w,
        'wsi_mitosis_count': num_mitosis,
        'mitosis_threshold': 0.5,
        'hotspot_rect': {'x1': hotspot[0], 'y1': hotspot[1], 'x2': hotspot[2], 'y2': hotspot[3]},
        'hotspot_mitosis_count': mitosis_count,
        'hotspot_mitosis_score': mitosis_score,
    }
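As a rough illustration of how 'hotspot_rect' might be used, the sketch below counts candidate points that fall inside the rectangle. It assumes that the rectangle and the point coordinates share one coordinate system, that x1 <= x2 and y1 <= y2, and that the bounds are inclusive; none of this is stated in the description. The function name and the sample values are hypothetical.

```python
def count_in_hotspot(points, rect):
    """Count (x, y, score) points lying inside the hotspot rectangle.

    Assumes inclusive bounds and a shared coordinate system (not confirmed
    by the dataset description).
    """
    return sum(
        1
        for x, y, _score in points
        if rect["x1"] <= x <= rect["x2"] and rect["y1"] <= y <= rect["y2"]
    )


# Made-up values for illustration only.
example_rect = {"x1": 1000, "y1": 1000, "x2": 2000, "y2": 2000}
example_points = [(1500, 1500, 0.9), (2500, 1500, 0.8), (1000, 2000, 0.7)]
inside = count_in_hotspot(example_points, example_rect)
print(inside)
```

Whether such a count matches 'hotspot_mitosis_count' for a given slide would need to be checked against the data.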

    Proxy figure (mimicker) property dictionary structure:

    mimicker_properties = {
        'slide_id': slide_id,
        'slide_height': img_h,
        'slide_width': img_w,
        'wsi_mimicker_count': num_mimicker,
        'mitosis_threshold': 0.5,
    }
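Both dictionaries record a 'mitosis_threshold' of 0.5, which suggests that candidates are separated into the mitosis and mimicker files by their score. The sketch below shows one way such a split could work. It assumes that a score at or above the threshold counts as a mitosis and anything below it as a mimicker; the description does not say which side the boundary falls on, and the function name and sample values are hypothetical.

```python
def split_by_threshold(candidates, threshold=0.5):
    """Split (x, y, score) candidates into (mitoses, mimickers).

    Assumption: score >= threshold means mitosis; the dataset description
    does not state how a score exactly at the threshold is handled.
    """
    mitoses = [c for c in candidates if c[2] >= threshold]
    mimickers = [c for c in candidates if c[2] < threshold]
    return mitoses, mimickers


# Made-up candidates for illustration only.
example_candidates = [(10, 20, 0.95), (30, 40, 0.10), (50, 60, 0.50)]
mitoses, mimickers = split_by_threshold(example_candidates)
print(len(mitoses), len(mimickers))
```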

    Disclaimer:

    We did not conduct a comprehensive review of all mitotic figures within each WSI, and we do not claim that the detections are free of errors. Nonetheless, a pathologist examined the resulting hotspot regions of interest from 757 WSIs within the TCGA-BRCA Mitosis Dataset, where we found strong correlations between pathologist and MDFS mitotic counts (r=0.8, p<0.001). Furthermore, MDFS-derived mitosis scores are shown to be as prognostic as pathologist-assigned mitosis scores [1]. This examination also aimed to verify the quality of the hotspot selections, ensuring that they were not primarily driven by excessive false detections or artifacts and that they were in plausible locations in the tumor landscape.

    [1] Ibrahim, Asmaa, et al. "Artificial Intelligence-Based Mitosis Scoring in Breast Cancer: Clinical Application." Modern Pathology 37.3 (2024): 100416.

  2. Data from: Generation of synthetic whole-slide image tiles of tumours from...

    • search-dev.test.dataone.org
    • data.niaid.nih.gov
    Updated Jul 30, 2025
    Cite
    Francisco Carrillo-Perez; Marija Pizurica; Yuanning Zheng; Tarak Nath Nandi; Ravi Madduri; Jeanne Shen; Olivier Gevaert (2025). Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models [Dataset]. http://doi.org/10.5061/dryad.6djh9w174
    Dataset provided by
    Dryad Digital Repository
    Authors
    Francisco Carrillo-Perez; Marija Pizurica; Yuanning Zheng; Tarak Nath Nandi; Ravi Madduri; Jeanne Shen; Olivier Gevaert
    Time period covered
    Jan 1, 2023
    Description

    Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a novel approach, RNA-Cascaded-Diffusion-Model or RNA-CDM, for performing RNA-to-image synthesis in a multi-cancer context, drawing inspiration from successful text-to-image synthesis models used in natural images. In our approach, we employ a variational auto-encoder to reduce the dimensionality of a patient’s gene expression profile, effectively distinguishing between different types of cancer. Subsequently, we employ a cascad...

    RNA-CDM Generated One Million Synthetic Images

    https://doi.org/10.5061/dryad.6djh9w174

    One million synthetic digital pathology images were generated using the RNA-CDM model presented in the paper "RNA-to-image multi-cancer synthesis using cascaded diffusion models".

    Description of the data and file structure

    There are ten different h5 files per cancer type (TCGA-CESC, TCGA-COAD, TCGA-KIRP, TCGA-GBM, TCGA-LUAD). Each h5 file contains 20,000 images. The key is the tile number, ranging from 0-20,000 in the first file to 180,000-200,000 in the last file. The tiles are saved as numpy arrays.
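The file layout above can be turned into a small lookup that says which of the ten files for a cancer type should hold a given tile key. This is a sketch under assumptions: it treats the stated ranges as half-open (keys 0 to 19,999 in the first file, 180,000 to 199,999 in the last) and numbers the files from 0, neither of which the description confirms. The function and constant names are hypothetical.

```python
CANCER_TYPES = ["TCGA-CESC", "TCGA-COAD", "TCGA-KIRP", "TCGA-GBM", "TCGA-LUAD"]
FILES_PER_CANCER_TYPE = 10
TILES_PER_FILE = 20_000


def file_index_for_tile(tile_number):
    """Return the 0-based index of the file expected to hold this tile key.

    Assumes half-open key ranges per file, e.g. 0..19,999 in file 0.
    """
    tiles_per_type = TILES_PER_FILE * FILES_PER_CANCER_TYPE
    if not 0 <= tile_number < tiles_per_type:
        raise ValueError(f"tile number out of range: {tile_number}")
    return tile_number // TILES_PER_FILE


# Five cancer types x ten files x 20,000 tiles gives the one million images.
total_images = len(CANCER_TYPES) * FILES_PER_CANCER_TYPE * TILES_PER_FILE
print(total_images, file_index_for_tile(185_000))
```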

    Code/Software

    The code used to generate this data is available under an academic license at https://rna-cdm.stanford.edu.

    Manuscript citation

    Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models...

  3. Association of Unique SAMHD1 mutations with selected solid tumors

    • scidb.cn
    Updated Nov 4, 2024
    Cite
    Association of Unique SAMHD1 mutations with selected solid tumors (2024) [Dataset]. http://doi.org/10.57760/sciencedb.j00217.00320
    Dataset provided by
    Science Data Bank
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    OBJECTIVE To use TCGA database information to explore the correlation between SAMHD1 mutations and solid tumors. METHODS The mutation rates of SAMHD1 in 12 solid tumors were compared using TCGA and COSMIC database information. Sixteen mammalian SAMHD1 sequences were downloaded from NCBI and UCSC for comparison, the conserved amino acid sites were screened, and the mutation sites were analyzed. Three different computational tools (PROVEAN, SIFT, PolyPhen2) were used to assess the properties of the mutations. The X-ray structure of SAMHD1 was downloaded from the PDB database to explore the effect of the mutation sites on the maintenance of the crystal tetramer structure. The expression pattern of SAMHD1 in gastric cancer was explored by qRT-PCR, and the correlation between SAMHD1 expression and clinicopathological features was analyzed. RESULTS SAMHD1 has a high mutation rate in most solid tumors. Most of the mutations were non-synonymous, and the majority were located in the dNTPase-containing HD domain of SAMHD1. The majority of the tumor-associated mutations were found in conserved amino acids in mammalian SAMHD1, and many are expected to affect the protein’s function. These non-synonymous mutations of SAMHD1 also affect the formation and maintenance of its crystal tetramer. The expression level of SAMHD1 in gastric cancer was statistically significantly lower, and the down-regulation of SAMHD1 is associated with adverse histological classification. CONCLUSION These observations suggest that SAMHD1 has a suppressive effect on the development and/or maintenance of certain solid tumors and may represent a potential target for cancer therapeutics.

  4. Data from: Image segmentations produced by BAMF under the AIMI Annotations...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 27, 2024
    Cite
    Soni, Rahul (2024). Image segmentations produced by BAMF under the AIMI Annotations initiative [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8345959
    Dataset provided by
    Soni, Rahul
    Murugesan, Gowtham Krishnan
    Van Oss, Jeff
    McCrumb, Diana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Imaging Data Commons (IDC, https://imaging.datacommons.cancer.gov/) [1] connects researchers with publicly available cancer imaging data, often linked with other types of cancer data. Many of the collections have limited annotations due to the expense and effort required to create these manually. The increased capabilities of AI analysis of radiology images provide an opportunity to augment existing IDC collections with new annotation data. To further this goal, we trained several nnUNet-based [2] models for a variety of radiology segmentation tasks from public datasets and used them to generate segmentations for IDC collections.

    To validate the model's performance, roughly 10% of the AI predictions were assigned to a validation set. For this set, a board-certified radiologist graded the quality of AI predictions on a Likert scale. If they did not 'strongly agree' with the AI output, the reviewer corrected the segmentation.

    This record provides the AI segmentations, manually corrected segmentations, and manual scores for the inspected IDC collection images.

    Only 10% of the AI-derived annotations provided in this dataset are verified by expert radiologists. More details on model training and annotations are provided in the associated manuscript to ensure transparency and reproducibility.

    This work was done in two stages. Versions 1.x of this record were from the first stage. Versions 2.x added additional records. In the Version 1.x collections, a medical student (non-expert) reviewed all the AI predictions and rated them on a 5-point Likert scale. For any AI predictions in the validation set that they did not 'strongly agree' with, the non-expert provided corrected segmentations. This non-expert was not utilized for the Version 2.x additional records.

    Likert Score Definition:

    Guidelines for reviewers to grade the quality of AI segmentations.

    5 Strongly Agree - Use-as-is (i.e., clinically acceptable, and could be used for treatment without change)

    4 Agree - Minor edits that are not necessary. Stylistic differences, but not clinically important. The current segmentation is acceptable

    3 Neither agree nor disagree - Minor edits that are necessary. Minor edits are those that the reviewer judges can be made in less time than starting from scratch or are expected to have minimal effect on treatment outcome

    2 Disagree - Major edits. This category indicates that the necessary edit is required to ensure correctness, and is sufficiently significant that the user would prefer to start from scratch

    1 Strongly disagree - Unusable. This category indicates that the quality of the automatic annotations is so bad that they are unusable.
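The scale above, together with the rule that a reviewer corrects any segmentation they do not 'strongly agree' with, can be sketched as a small enum and a predicate. This is only an illustration: the class and function names are hypothetical, and the dataset itself stores the ratings in qa-results.csv rather than in any such structure.

```python
from enum import IntEnum


class Likert(IntEnum):
    """Reviewer rating of an AI segmentation, following the definitions above."""
    STRONGLY_DISAGREE = 1   # unusable
    DISAGREE = 2            # major edits
    NEUTRAL = 3             # minor edits that are necessary
    AGREE = 4               # minor edits that are not necessary
    STRONGLY_AGREE = 5      # use as-is


def needs_correction(score):
    """Return True when the reviewer is expected to supply a corrected file.

    Per the description, a correction is provided whenever the reviewer
    does not 'strongly agree' with the AI output.
    """
    return Likert(score) < Likert.STRONGLY_AGREE


print([needs_correction(s) for s in (5, 4, 1)])
```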

    Zip File Folder Structure

    Each zip file in the collection corresponds to a specific segmentation task. The common folder structure is:

    ai-segmentations-dcm - This directory contains the AI model predictions in DICOM-SEG format for all analyzed IDC collection files

    qa-segmentations-dcm - This directory contains manually corrected segmentation files, based on the AI prediction, in DICOM-SEG format. Only a fraction (~10%) of the AI predictions were corrected. Corrections were performed by radiologists (rad*) and non-experts (ne*)

    qa-results.csv - CSV file linking the study/series UIDs with the AI segmentation file, the radiologist-corrected segmentation file, and the radiologist ratings of AI performance

    qa-results.csv Columns

    The qa-results.csv file contains metadata about the segmentations, their related IDC case image, as well as the Likert ratings and comments by the reviewers.

    Column - Description

    Collection - The name of the IDC collection for this case

    PatientID - PatientID in the DICOM metadata of the scan. Also called Case ID in the IDC

    StudyInstanceUID - StudyInstanceUID in the DICOM metadata of the scan

    SeriesInstanceUID - SeriesInstanceUID in the DICOM metadata of the scan

    Validation - true/false, indicating whether this scan was manually reviewed

    Reviewer - Coded ID of the reviewer. Radiologist IDs start with ‘rad’; non-expert IDs start with ‘ne’

    AimiProjectYear - 2023 or 2024. This work was split over two years. The main methodology difference between the two is that in 2023 a non-expert also reviewed the AI output, but a non-expert was not utilized in 2024.

    AISegmentation - The filename of the AI prediction file in DICOM-SEG format. This file is in the ai-segmentations-dcm folder.

    CorrectedSegmentation - The filename of the reviewer-corrected prediction file in DICOM-SEG format. This file is in the qa-segmentations-dcm folder. If the reviewer strongly agreed with the AI for all segments, they did not provide any correction file.

    Was the AI predicted ROIs accurate? - This column appears for images from AimiProjectYear 2023. The reviewer rates segmentation quality on a Likert scale. In tasks that have multiple labels in the output, there is only one rating to cover them all.

    Was the AI predicted {SEGMENT_NAME} label accurate? - This column appears once for each segment in the task for images from AimiProjectYear 2024. The reviewer rates each segment for its quality on a Likert scale.

    Do you have any comments about the AI predicted ROIs? - Open-ended question for the reviewer

    Do you have any comments about the findings from the study scans? - Open-ended question for the reviewer
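As a rough illustration of working with these columns, the sketch below reads a few rows with Python's csv module and keeps the cases that were manually reviewed by a radiologist. The column names come from the list above, but the sample rows, the subset of columns shown, and the helper name are hypothetical; the real file may format values such as true/false differently.

```python
import csv
import io

# Made-up rows for illustration; only a subset of the documented columns.
sample_csv = """Collection,PatientID,Validation,Reviewer,AimiProjectYear
TCGA-LIHC,case-001,true,rad1,2024
TCGA-LIHC,case-002,false,,2024
TCGA-LIHC,case-003,true,ne1,2023
"""


def reviewed_by_radiologist(rows):
    """Keep rows flagged as validated whose reviewer ID starts with 'rad'."""
    return [
        row for row in rows
        if row["Validation"].strip().lower() == "true"
        and row["Reviewer"].startswith("rad")
    ]


rows = list(csv.DictReader(io.StringIO(sample_csv)))
radiologist_rows = reviewed_by_radiologist(rows)
print([row["PatientID"] for row in radiologist_rows])
```

To read the real file, the io.StringIO object could be replaced by an open file handle for qa-results.csv.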

    File Overview

    brain-mr.zip

    Segment Description: brain tumor regions: necrosis, edema, enhancing

    IDC Collection: UPENN-GBM

    Links: model weights, github

    breast-fdg-pet-ct.zip

    Segment Description: FDG-avid lesions in breast from FDG PET/CT scans

    IDC Collection: QIN-Breast

    Links: model weights, github

    breast-mr.zip

    Segment Description: Breast, Fibroglandular tissue, structural tumor

    IDC Collection: duke-breast-cancer-mri

    Links: model weights, github

    kidney-ct.zip

    Segment Description: Kidney, Tumor, and Cysts from contrast enhanced CT scans

    IDC Collection: TCGA-KIRC, TCGA-KIRP, TCGA-KICH, CPTAC-CCRCC

    Links: model weights, github

    liver-ct.zip

    Segment Description: Liver from CT scans

    IDC Collection: TCGA-LIHC

    Links: model weights, github

    liver2-ct.zip

    Segment Description: Liver and Lesions from CT scans

    IDC Collection: HCC-TACE-SEG, COLORECTAL-LIVER-METASTASES

    Links: model weights, github

    liver-mr.zip

    Segment Description: Liver from T1 MRI scans

    IDC Collection: TCGA-LIHC

    Links: model weights, github

    lung-ct.zip

    Segment Description: Lung and Nodules (3mm-30mm) from CT scans

    IDC Collections:

    Anti-PD-1-Lung

    LUNG-PET-CT-Dx

    NSCLC Radiogenomics

    RIDER Lung PET-CT

    TCGA-LUAD

    TCGA-LUSC

    Links: model weights 1, model weights 2, github

    lung2-ct.zip

    Improved model version

    Segment Description: Lung and Nodules (3mm-30mm) from CT scans

    IDC Collections:

    QIN-LUNG-CT, SPIE-AAPM Lung CT Challenge

    Links: model weights, github

    lung-fdg-pet-ct.zip

    Segment Description: Lungs and FDG-avid lesions in the lung from FDG PET/CT scans

    IDC Collections:

    ACRIN-NSCLC-FDG-PET

    Anti-PD-1-Lung

    LUNG-PET-CT-Dx

    NSCLC Radiogenomics

    RIDER Lung PET-CT

    TCGA-LUAD

    TCGA-LUSC

    Links: model weights, github

    prostate-mr.zip

    Segment Description: Prostate from T2 MRI scans

    IDC Collection: ProstateX, Prostate-MRI-US-Biopsy

    Links: model weights, github

    Changelog

    2.0.2 - Fix the brain-mr segmentations to be transformed correctly

    2.0.1 - added AIMI 2024 radiologist comments to qa-results.csv

    2.0.0 - added AIMI 2024 segmentations

    1.X - AIMI 2023 segmentations and reviewer scores
