100+ datasets found
  1. Data from: ACROBAT - a multi-stain breast cancer histological...

    • researchdata.se
    Updated Oct 20, 2023
    Cite
    Mattias Rantalainen; Johan Hartman (2023). ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology [Dataset]. http://doi.org/10.48723/w728-p041
    Dataset provided by
    Karolinska Institutet
    Authors
    Mattias Rantalainen; Johan Hartman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2012 - 2018
    Area covered
    Stockholm County
    Description

    The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs in the data set are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E stained tissue and at least one, and up to four, WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available. The data set was acquired as part of the CHIME study (chimestudy.se) and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org). The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology image digitisation workflow, using three different Hamamatsu WSI scanners, specifically one NanoZoomer S360 and two NanoZoomer XR. The WSIs in this data set are accompanied by a data table with one row for each WSI, specifying an anonymised patient ID, the stain or IHC antibody type of each WSI, as well as the magnification and microns per pixel at each available resolution level. Automated registration algorithm performance evaluation is possible through the ACROBAT challenge website based on over 37,000 landmark pair annotations from 13 annotators. While the primary purpose of this data set was the development and evaluation of WSI registration methods, this data set has the potential to facilitate further research in the context of computational pathology, for example in the areas of stain-guided learning, virtual staining, unsupervised learning and stain-independent models.

    The data set consists of three subsets, the training, validation and test set, based on the ACROBAT WSI registration challenge. There are 750 cases in the training set, for each of which one H&E WSI and one to four IHC WSIs are available, with 3406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total, and the test set of 303 cases with 606 WSIs in total. For both the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available per case.

    WSIs were anonymised by deleting the associated macro images, by generating filenames with random case IDs and by overwriting metadata fields containing potentially personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org/). WSIs are available as generic tiled TIFF WSIs (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.

    The data set is available for download in seven separate ZIP archives, five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip 21.79 GB) and one for the test data (test.zip 68.11 GB).

    File listings and SHA-1 checksums are available for checking archive and data integrity after download.
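    The published SHA-1 checksums can be checked with nothing beyond the Python standard library; a minimal sketch (the archive path and expected digest are placeholders, not values from this data set):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_checksum(path, expected_hex):
    """True if the file's SHA-1 digest equals the published checksum."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

    Because the file streams through in chunks, memory use stays constant even for the 70+ GB training archives.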

    While it would be helpful to notify SND of any publications using this data set by sending an email to request@snd.gu.se, please note that this is not required to use the data.

  2. TCGA-WSI-Dataset

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Cite
    Mahmood Yousaf 2018 (2024). TCGA-WSI-Dataset [Dataset]. https://www.kaggle.com/datasets/mahmoodyousaf2018/tcga-wsi-svs
    Authors
    Mahmood Yousaf 2018
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore the TCGA Whole Slide Image (WSI) SVS files available on Kaggle, offering detailed visual representations of tissue samples from various cancer types. These high-resolution images provide valuable insights into tumor morphology and tissue architecture, facilitating cancer diagnosis, prognosis, and treatment research. Delve into the rich landscape of cancer biology, leveraging the wealth of information contained within these SVS files to drive innovative advancements in oncology. This is a dataset of WSI images downloaded from the TCGA portal.

  3. Artefact segmentation in digital pathology whole-slide images

    • data.niaid.nih.gov
    Updated Dec 9, 2020
    Cite
    Foucart, Adrien (2020). Artefact segmentation in digital pathology whole-slide images [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3773096
    Dataset provided by
    LISA, Université Libre de Bruxelles
    Authors
    Foucart, Adrien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset with examples of Artefacts in Digital Pathology.

    The dataset contains 22 whole-slide images, with H&E or IHC staining, showing various types and severities of slide defects. Annotations were made by a biomedical engineer based on examples given by an expert.

    The dataset is split into different folders:

    train

    18 whole-slide images (extracted at 1.25x & 2.5x magnification)

    All from the same Block (colorectal cancer tissue)

    1/2 with H&E & 1/2 with anti-pan-cytokeratin IHC staining.

    validation

    3 whole-slide images (1.25x + 2.5x mag)

    2 from the same Block as the training set (1 IHC, 1 H&E)

    1 from another Block (IHC anti-pan-cytokeratin, gastroesophageal junction lesion)

    validation_tiles

    patches of varying sizes taken from the 3 validation whole-slide images @1.25x magnification.

    7 patches from each slide.

    test

    1 whole-slide image (1.25x + 2.5x mag)

    From another block: IHC staining (anti-NR2F2), mouth cancer

    For the train, validation and test whole-slide images, each slide has:

    - The RGB images @1.25x & 2.5x mag
    - The corresponding background/tissue masks
    - The corresponding annotation masks containing examples of artefacts (note that a majority of artefacts are not annotated; in total, 918 artefacts are in the train set)

    For the validation tiles, the following table gives the "patch-level" supervision:

    tile#  Artefact(s)
    00     None/Few
    01     Tear&Fold
    02     Ink
    03     None/Few
    04     None/Few
    05     Tear&Fold
    06     Tear&Fold + Blur
    07     Knife damage
    08     Knife damage
    09     Ink
    10     None/Few
    11     Tear&Fold
    12     Tear&Fold
    13     None/Few
    14     None/Few
    15     Knife damage
    16     Tear&Fold
    17     None/Few
    18     None/Few
    19     Blur
    20     Knife damage

  4. BACH: Breast Cancer Histology images

    • kaggle.com
    zip
    Updated Feb 22, 2023
    Cite
    Rabia Eda Yılmaz (2023). BACH: Breast Cancer Histology images [Dataset]. https://www.kaggle.com/datasets/truthisneverlinear/bach-breast-cancer-histology-images
    Available download formats: zip (13420324262 bytes)
    Authors
    Rabia Eda Yılmaz
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A large annotated dataset, composed of both microscopy (classification task) and whole-slide images (segmentation task), was specifically compiled and made publicly available for the BACH challenge. Following a positive response from the scientific community, a total of 64 submissions, out of 677 registrations, effectively entered the competition. From the submitted algorithms it was possible to push forward the state-of-the-art in terms of accuracy (87%) in automatic classification of breast cancer with histopathological images.

    There are two main folders for the classification task: train and test. The Photos folder contains four classes: benign, in situ, invasive, and normal. A ground-truth CSV file provides the labels. Images are in TIFF format.

    Paper: https://arxiv.org/abs/1808.04277

    Citation: Aresta, G., Araújo, T., Kwok, S., Chennamsetty, S. S., Safwan, M., Alex, V., ... & Aguiar, P. (2019). Bach: Grand challenge on breast cancer histology images. Medical image analysis, 56, 122-139.

    Dataset: https://zenodo.org/record/3632035

  5. Histopathology WSI

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Md.Nazmus Sakib2025 (2025). Histopathology WSI [Dataset]. https://www.kaggle.com/datasets/mdnazmussakib2025/histopathology-wsi
    Available download formats: zip (7197418474 bytes)
    Authors
    Md.Nazmus Sakib2025
    Description

    Computational histopathology has made significant strides in the past few years, slowly getting closer to clinical adoption. One area of benefit would be the automatic generation of diagnostic reports from H&E-stained whole slide images, which would further increase the efficiency of the pathologists' routine diagnostic workflows.

    In this study, we compiled a dataset (PatchGastricADC22) of histopathological captions of stomach adenocarcinoma endoscopic biopsy specimens, which we extracted from diagnostic reports and paired with patches extracted from the associated whole slide images. The dataset contains a variety of gastric adenocarcinoma subtypes.

    We trained a baseline attention-based model to predict the captions from features extracted from the patches and obtained promising results. We make the captioned dataset of 262K patches publicly available.

    Purpose

    The dataset was created to support research in medical image captioning — specifically, to automatically generate diagnostic text descriptions from histopathological image patches. It helps train and evaluate models that can interpret tissue morphology and produce human-like pathology reports.

    Domain & Source

    • Medical domain: Histopathology
    • Disease focus: Gastric adenocarcinoma (a common type of stomach cancer)
    • Image type: H&E-stained tissue sections
    • Source images: Whole Slide Images (WSIs) split into small patches
    • Magnification: 20×

    Dataset Structure (PatchGastricADC22)

    📁 Folder: patches_captions/patches_captions/ Contains all patch-level histopathology image files (in .jpg format). Each patch represents a cropped region (300×300 pixels) from a Whole Slide Image (WSI).

    🧾 File: captions.csv Provides the mapping between image IDs and their corresponding diagnostic captions. Each row represents one unique image patch and its textual description.

    🧩 CSV Columns:

    • id – Base ID identifying the parent WSI or case from which the patch was extracted.
    • subtype – Indicates the histological subtype (e.g., tubular adenocarcinoma, poorly differentiated).
    • text – Expert-written caption describing the morphological and diagnostic features visible in the patch.
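    Loading the id-to-caption mapping described above needs only the standard csv module; a sketch assuming the header row is exactly id,subtype,text (the sample row in the usage below is invented):

```python
import csv
import io

def load_captions(csv_text):
    """Map each patch/case id to its (subtype, caption) pair.

    Assumes the columns described above: id, subtype, text.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: (row["subtype"], row["text"]) for row in reader}
```

    In practice one would pass the contents of captions.csv (or adapt the function to take an open file handle) instead of an in-memory string.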

    Dataset Statistics

    🧩 Total images (patches): ~262,777
    🧪 Total WSIs (slides): 1,305
    🖼️ Patch size: 300 × 300 pixels
    🔬 Magnification: 20×
    ✍️ Captions: One per patch
    🔠 Vocabulary size: 344 unique words
    📏 Max caption length: 47 words
    ⚖️ Split: 70% train / 10% validation / 20% test

    Creation Process

    1. Whole Slide Images (WSIs) were collected from gastric cancer pathology archives.
    2. Each slide was divided into non-overlapping 300×300 patches.
    3. Expert pathologists annotated each patch with a short caption describing diagnostic features (cellular and structural morphology).
    4. Data were consolidated into image files plus a master captions.csv.

  6. Pathology Images of Scanners and Mobilephones (PLISM) - Whole Slide Images...

    • plus.figshare.com
    application/x-gzip
    Updated Mar 1, 2024
    Cite
    Mieko Ochi; Daisuke Komura; Takumi Onoyama; Shumpei Ishikawa (2024). Pathology Images of Scanners and Mobilephones (PLISM) - Whole Slide Images Dataset [Dataset]. http://doi.org/10.25452/figshare.plus.23614422.v1
    Available download formats: application/x-gzip
    Dataset provided by
    Figshare+
    Authors
    Mieko Ochi; Daisuke Komura; Takumi Onoyama; Shumpei Ishikawa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Pathology Images of Scanners and Mobilephones (PLISM) dataset was created for the evaluation of AI models' robustness to domain shifts. PLISM is the first group-wise pathological image dataset that encompasses diverse tissue types stained under 13 H&E conditions, with multiple imaging media, including smartphones (7 scanners and 6 smartphones). The PLISM-wsi subset consists of image groups for all staining conditions between WSIs for each tile image, and contains a total of 310,947 images. Color and texture in digital pathology images are affected by H&E stain conditions (e.g. Harris or Carrazi) and digitalization devices (e.g. slide scanners or smartphones), which cause inter-institutional domain shifts. Please see the files 'stain_condition.png' and 'counterpart.png' for the H&E staining conditions and devices used.

    Each tar.gz file in this dataset contains a collection of files labeled via the following file naming convention: (stain_name)_(device_name)/(stain_name)_(device_name)_(top_left_x)_(top_left_y).png

    The csv file included with this dataset contains the following information:

    • Tissue Type: The specific type of human tissue represented in the image, chosen from among 46 possible tissue types.
    • Stain Type: The specific staining condition applied to the image, chosen from among 13 possible conditions.
    • Device Type: The specific type of imaging device used to capture the image, chosen from among 13 possible device types.
    • Coordinate: The xy coordinates of the top left and bottom right corners of each image (e.g., 1000_500_0_0).
    • Image Path: The relative path to each image.

    See the smartphones subset of the PLISM dataset in the Collection at https://doi.org/10.25452/figshare.plus.c.6773925
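    The naming convention above can be decoded programmatically. Because the directory name repeats the (stain_name)_(device_name) prefix of the file name, stripping that prefix leaves only the tile coordinates, even if stain or device names themselves contain underscores. A sketch (the example path in the usage below is invented, not taken from the dataset):

```python
def parse_plism_path(relpath):
    """Split '(stain)_(device)/(stain)_(device)_(x)_(y).png' into parts."""
    group, fname = relpath.split("/", 1)
    stem = fname[: -len(".png")]
    if not stem.startswith(group + "_"):
        raise ValueError("file name does not repeat its directory prefix")
    # Whatever remains after the directory prefix is the tile coordinate pair.
    x_str, y_str = stem[len(group) + 1 :].split("_")
    return {"group": group, "x": int(x_str), "y": int(y_str)}
```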

  7. Testing whole slide image for OpenPhi - Open Pathology Interface

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Jun 29, 2021
    Cite
    Kartasalo, Kimmo (2021). Testing whole slide image for OpenPhi - Open Pathology Interface [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_5037045
    Dataset provided by
    Ruusuvuori, Pekka
    Kartasalo, Kimmo
    Tolonen, Teemu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    An anonymous whole slide image in Philips iSyntax format for running software tests on OpenPhi - Open Pathology Interface (https://zenodo.org/record/4680748#.YNnBxDqxXJU). See the repository (https://gitlab.com/BioimageInformaticsGroup/openphi/) for up to date information.

  8. Data from: MCO study whole slide image collection

    • researchdata.edu.au
    Updated 2015
    Cite
    Robyn Ward; Nicholas Hawkins; University of New South Wales (2015). MCO study whole slide image collection [Dataset]. http://doi.org/10.4225/53/555921D09F76B
    Dataset provided by
    UNSW, Sydney
    University of New South Wales
    Authors
    Robyn Ward; Nicholas Hawkins; University of New South Wales
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Time period covered
    1994 - 2010
    Description

    The MCO study whole slide image collection consists of 1500 digitised tissue slides of colorectal cancers. From 1994 to 2010, the Molecular and Cellular Oncology (MCO) Study group conducted a study of individuals undergoing treatment for colorectal cancer. For the study, they systematically collected tissue samples and clinical and pathological information from more than 1500 people who had tumours surgically removed from their large bowel. This collection represents one typical section from each tumour case, stained with hematoxylin and eosin, and scanned using a ×40 objective. The resolution of the digitised images approaches that visible under an optical microscope - more than 100,000 dpi. At this resolution, each image is around 2 gigabytes, bringing the size of the 1500 images in the MCO Whole Slide Image Collection to 3 terabytes. The MCO whole slide image collection is now available on the Intersect Australia Research Data Storage Infrastructure (RDSI) Node. Originating source(s): MCO research group, UNSW (1993-2011)

  9. Representative Sample Dataset for Resolution-Agnostic Tissue Segmentation in...

    • zenodo.org
    tiff, xml
    Updated Jul 22, 2024
    Cite
    Péter Bándi (2024). Representative Sample Dataset for Resolution-Agnostic Tissue Segmentation in Whole-Slide Histopathology Images [Dataset]. http://doi.org/10.5281/zenodo.3375528
    Available download formats: tiff, xml
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Péter Bándi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a representative sample from the dataset that was used to develop resolution-agnostic convolutional neural networks for tissue segmentation in whole-slide histopathology images.

    The dataset is composed of two parts: development set and dissimilar set.

    Sample images from the development set:

    • breast_hne_00.tif
    • breast_lymph_node_hne_00.tif
    • tongue_ae1ae3_00.tif
    • tongue_hne_00.tif
    • tongue_ki67_00.tif

    Sample images from the dissimilar set:

    • brain_alcianblue_00.tif
    • cornea_grocott_00.tif
    • kidney_cab_00.tif
    • skin_perls_00.tif
    • uterus_vonkossa_00.tif
  10. Digital Pathology Dataset for Prostate Cancer Diagnosis

    • zenodo.org
    • nde-dev.biothings.io
    zip
    Updated Dec 5, 2022
    Cite
    Mustafa Umit Oner; Mei Ying Ng; Danilo Medina Giron; Cecilia Ee Chen Xi; Louis Ang Yuan Xiang; Malay Singh; Weimiao Yu; Wing-Kin Sung; Chin Fong Wong; Hwee Kuan Lee (2022). Digital Pathology Dataset for Prostate Cancer Diagnosis [Dataset]. http://doi.org/10.5281/zenodo.5971764
    Available download formats: zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mustafa Umit Oner; Mei Ying Ng; Danilo Medina Giron; Cecilia Ee Chen Xi; Louis Ang Yuan Xiang; Malay Singh; Weimiao Yu; Wing-Kin Sung; Chin Fong Wong; Hwee Kuan Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Links to code and bioRxiv pre-print:

    1. Multi-lens Neural Machine (MLNM) Code

    2. An AI-assisted Tool For Efficient Prostate Cancer Diagnosis (bioRxiv Pre-print)

    Digitized hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) of 40 prostatectomy and 59 core needle biopsy specimens were collected from 99 prostate cancer patients at Tan Tock Seng Hospital, Singapore. There were 99 WSIs in total, one per specimen. H&E-stained slides were scanned at 40× magnification (specimen-level pixel size 0.25 μm × 0.25 μm) using an Aperio AT2 Slide Scanner (Leica Biosystems). Institutional review board approval was obtained from the hospital for this study, and all the data were de-identified.

    Prostate glandular structures in core needle biopsy slides were manually annotated and classified using the ASAP annotation tool (ASAP). A senior pathologist reviewed 10% of the annotations in each slide, ensuring that some reference annotations were provided to the researcher at different regions of the core. It is to be noted that partial glands appearing at the edges of the biopsy cores were not annotated.

    Patches of size 512 × 512 pixels were cropped from whole slide images at resolutions 5×, 10×, 20×, and 40× with an annotated gland centered at each patch. This dataset contains these cropped images.
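    The centring arithmetic implied above can be sketched as follows. The authors do not spell out these details, so the coordinate convention (gland centres given in 40× base pixel coordinates) is an assumption for illustration only:

```python
def crop_box(cx, cy, magnification, patch=512, base_mag=40):
    """Corners of a patch centred on a gland at a target magnification.

    (cx, cy): gland centre in base (40x) pixel coordinates (assumed).
    Returns (x0, y0, x1, y1) in the target magnification's pixel space.
    """
    scale = magnification / base_mag      # e.g. 10x -> 0.25 of base size
    cx_s, cy_s = cx * scale, cy * scale   # centre at the target level
    half = patch // 2
    x0, y0 = int(cx_s) - half, int(cy_s) - half
    return x0, y0, x0 + patch, y0 + patch
```

    Note that at lower magnifications the same 512 × 512 window covers a proportionally larger tissue area around the gland.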

    This dataset is used to train two AI models for Gland Segmentation (99 patients) and Gland Classification (46 patients). Tables 1 and 2 illustrate both gland segmentation and gland classification datasets. We have put the two corresponding sub-datasets as two zip files as follows:

    1. gland_segmentation_dataset.zip
    2. gland_classification_dataset.zip

    Table 1: The number of slides and patches in training, validation, and test sets for gland segmentation task. There is one H&E stained WSI for each prostatectomy or core needle biopsy specimen.

    #Slides        Train  Valid  Test   Total
    Prostatectomy  17     8      15     40
    Biopsy         26     13     20     59
    Total          43     21     35     99

    #Patches       Train  Valid  Test   Total
    Prostatectomy  7795   3753   7224   18772
    Biopsy         5559   4028   5981   15568
    Total          13354  7781   13205  34340

    Table 2: The number of slides and patches in training, validation, and test sets for gland classification task. There is one H&E stained WSI for each prostatectomy or core needle biopsy specimen. The gland classification datasets are the subsets of the gland segmentation datasets. GS: Gleason Score. B: Benign. M: Malignant.

    #Slides (GS 3+3:3+4:4+3)  Train   Valid  Test    Total
    Biopsy                    10:9:1  3:7:0  6:10:0  19:26:1

    #Patches (B:M)  Train      Valid      Test       Total
    Biopsy          1557:2277  1216:1341  1543:2718  4316:6336

    NB: Gland classification folder (gland_classification_dataset.zip) may contain extra patches, labels of which could not be identified from H&E slides. They were not used in the machine learning study.

  11. Whole Slide Images of H&E Sections Digitised on Multiple Scanners.

    • ebi.ac.uk
    Updated Aug 28, 2024
    Cite
    In Hwa Um; Haseeb Nazki; Ognjen Arandjelovic; Grant D. Stewart; Clare Orange; David J. Harrison (2024). Whole Slide Images of H&E Sections Digitised on Multiple Scanners. [Dataset]. http://doi.org/10.6019/S-BIAD1343
    Authors
    In Hwa Um; Haseeb Nazki; Ognjen Arandjelovic; Grant D. Stewart; Clare Orange; David J. Harrison
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is composed of 100 cases of whole slide images (WSIs) of renal cell carcinoma, a type of kidney cancer. These images are stained using Haematoxylin and Eosin (H&E), a common staining technique in histopathology that highlights different tissue components, allowing for detailed examination of the cellular structure and morphology. Each of these WSIs has been captured using various high-resolution scanners, including but not limited to the Zeiss Axio Z1, Hamamatsu, Philips, and Leica scanners. These scanners represent different data acquisition domains, meaning that the images might have subtle differences in color, contrast, resolution, and other imaging characteristics depending on the scanner used. These variations can introduce inconsistencies, which can make it challenging to directly compare or analyze images across different scanning devices. Due to these variations, it becomes crucial to develop algorithms that can effectively translate and normalize these WSIs. The goal of such an algorithm would be to standardize the images, making them more consistent regardless of the scanner used. This normalization process is essential for ensuring that any subsequent analysis, such as cancer diagnosis or grading using automated tools, is accurate and reliable across different datasets. The dataset will be instrumental in the development of these normalization algorithms. Researchers can use these images to train, test, and validate models designed to minimize scanner-related variations, leading to more robust and generalized image analysis systems. This work is particularly significant in the context of digital pathology, where reliable and consistent image processing is key to supporting clinical decision-making.

  12. Histology images from uniform tumor regions in TCGA Whole Slide Images...

    • zenodo.org
    bin, zip
    Updated Feb 7, 2025
    Cite
    Daisuke Komura; Shumpei Ishikawa (2025). Histology images from uniform tumor regions in TCGA Whole Slide Images (TCGA-UT) [Dataset]. http://doi.org/10.5281/zenodo.5889558
    Available download formats: zip, bin
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daisuke Komura; Shumpei Ishikawa
    Description

    TCGA-UT Dataset Documentation

    Dataset Overview

    The TCGA-UT dataset is a large-scale collection of histopathological image patches from human cancer tissues. It contains 1,608,060 image patches extracted from hematoxylin & eosin (H&E) stained histological samples across 32 different types of solid cancers.

    Key Features

    • Size: Over 1.6 million image patches
    • Resolution: All patches are standardized to 256 x 256 pixels
    • Source: Derived from The Cancer Genome Atlas (TCGA) dataset
    • Quality: Curated by trained pathologists
    • Coverage: 32 different cancer types
    • Patient Base: 7,175 patients from 8,736 diagnostic slides

    Data Collection Process

    1. Image Source: Whole Slide Images (WSI) were downloaded from the GDC legacy database between December 2016 and June 2017
    2. Expert Annotation: Two trained pathologists selected at least three representative tumor regions per slide
    3. Quality Control: 926 slides were removed due to various quality issues (poor staining, low resolution, focus problems, etc.)
    4. Patch Extraction: 10 patches were randomly cropped at 6 different magnification levels from each annotated region

    File Structure

    Files are organized using the following format:

    [cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg

    Resolution Key

    • 0: 0.5 μm/pixel
    • 1: 0.6 μm/pixel
    • 2: 0.7 μm/pixel
    • 3: 0.8 μm/pixel
    • 4: 0.9 μm/pixel
    • 5: 1.0 μm/pixel
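    The file-path format and the resolution key above combine into a simple parser; a sketch (the example path in the usage below is invented, and it assumes the final file-name fields contain no hyphens beyond the two separators):

```python
# Resolution key from the dataset documentation: level -> micrometres/pixel.
UM_PER_PIXEL = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8, 4: 0.9, 5: 1.0}

def parse_tcga_ut_path(path):
    """Decode '[cancer_type]/[resolution]/[barcode]/[region]-[n]-[px res].jpg'."""
    cancer_type, resolution, barcode, fname = path.split("/")
    # rsplit keeps any hyphens inside the region field intact.
    region, number, pixel_res = fname[: -len(".jpg")].rsplit("-", 2)
    return {
        "cancer_type": cancer_type,
        "um_per_pixel": UM_PER_PIXEL[int(resolution)],
        "barcode": barcode,
        "region": region,
        "number": int(number),
        "pixel_resolution": pixel_res,
    }
```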

    Citation

    If you use this dataset in your research, please cite:

    Komura, D., et al. (2022). Universal encoding of pan-cancer histology by deep texture representations. Cell Reports 38, 110424. https://doi.org/10.1016/j.celrep.2022.110424

    For Model Benchmarking

    If you're interested in using this dataset for benchmarking foundation models or feature extractors, we recommend accessing the dataset through the Hugging Face Hub at dakomura/tcga-ut. The Hugging Face version provides:

    • Predefined train/validation/test splits (both internal and external facility-based splits)
    • Ready-to-use benchmarking framework for foundation models
    • WebDataset format support for efficient data loading
    • Example implementations for state-of-the-art model evaluation

  13. Inference Dataset for Paper: Deep learning-based approach to the...

    • plus.figshare.com
    • datasetcatalog.nlm.nih.gov
    bin
    Updated Jun 3, 2023
    Cite
    Soma Kobayashi; Jason Shieh; Ainara Ruiz de Sabando; Julie Kim; Yang Liu; Sui Y. Zee; Prateek Prasanna; Agnieszka B. Bialkowska; Joel H. Saltz; Vincent W. Yang (2023). Inference Dataset for Paper: Deep learning-based approach to the characterization and quantification of histopathology in mouse models of colitis, PLoS One [Dataset]. http://doi.org/10.25452/figshare.plus.20425416.v1
    Available download formats: bin
    Dataset provided by
    Figshare+
    Authors
    Soma Kobayashi; Jason Shieh; Ainara Ruiz de Sabando; Julie Kim; Yang Liu; Sui Y. Zee; Prateek Prasanna; Agnieszka B. Bialkowska; Joel H. Saltz; Vincent W. Yang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Colitis mouse models have been heavily studied to elucidate the pathophysiology of human inflammatory bowel disease. As with patients, colitis mouse models exhibit the simultaneous presence of colonic regions that are histologically involved or uninvolved with disease. We have trained a ResNet-34 classifier to detect these regions from hematoxylin and eosin-stained whole murine colons. ‘Involved’ and ‘uninvolved’ image patches were gathered to cluster and identify histological patch classes. The per mouse proportions of these patch classes were then used to train machine learning classifiers to infer mouse model and clinical score bins. This dataset contains the whole slide images (WSIs) from our prospective mouse cohort. This allows others to run our code from WSI scaling and patch extraction to 1) patch-level ‘Involved’ and ‘Uninvolved’ predictions, 2) ‘Involved’ versus ‘Uninvolved’ prediction WSI overlay generations, 3) histological patch class detection, and 4) mouse model and clinical score bin inference.

  14. Patch-level UNI feature embeddings from colorectal cancer whole slide images (WSIs) in the SurGen dataset

    • data-staging.niaid.nih.gov
    Updated Dec 24, 2024
    Cite
    Myles, Craig; Um, In Hwa; Marshall, Craig; Harris-Birtill, David; Harrison, David J (2024). Patch-level UNI feature embeddings from colorectal cancer whole slide images (WSIs) in the SurGen dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14047722
    Explore at:
    Dataset updated
    Dec 24, 2024
    Dataset provided by
    NHS Lothian Biorepository
    University of St Andrews
    Authors
    Myles, Craig; Um, In Hwa; Marshall, Craig; Harris-Birtill, David; Harrison, David J
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains UNI patch embeddings derived from whole slide images (WSIs) of colorectal cancer cases in the SurGen cohort. Each WSI was processed into 224x224-pixel tissue patches, extracted at a scale of 1.0 microns per pixel (MPP), and a 1024-dimensional embedding was computed for each patch using the UNI foundation model [1]. This dataset enables rapid downstream analysis for tasks such as biomarker prediction, survival analysis, tumour grading, and prognostic modelling. The SurGen dataset, comprising both primary colorectal and metastatic cases, offers a valuable resource for computational pathology research.

    Each Zarr file within the dataset contains an array of patch-level features and a corresponding array of coordinates, enabling the retrieval of specific feature locations as needed.
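The feature/coordinate layout described above can be queried as sketched below. NumPy arrays stand in for the Zarr arrays (which expose the same slicing interface), and the array shapes follow the description: 1024-dimensional embeddings with (x, y) patch coordinates. The array names and the lookup helper are illustrative assumptions:

```python
import numpy as np

# Stand-ins for the per-slide Zarr arrays: N patches, each with a
# 1024-d UNI embedding and an (x, y) patch coordinate in slide pixels.
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 1024)).astype(np.float32)
coords = rng.integers(0, 10_000, size=(50, 2))

def embedding_at(features, coords, xy):
    """Return the embedding of the patch whose stored coordinate is closest to xy."""
    d2 = ((coords - np.asarray(xy)) ** 2).sum(axis=1)
    return features[int(np.argmin(d2))]

emb = embedding_at(features, coords, (500, 500))  # a single 1024-d vector
```

With real data, `features` and `coords` would come from `zarr.open(...)` on one of the provided files rather than random arrays.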

    Embeddings are provided in a zip archive and are intended for reuse in research focused on digital pathology, tumour genomics, and oncology. For cohort ground-truth labels, please see the link below.

    Access the original dataset: https://doi.org/10.6019/S-BIAD1285

    GitHub for more info: https://github.com/CraigMyles/SurGen-Dataset

    [1] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3

  15. Data from: Orbit Image Analysis: an open-source whole slide image analysis tool

    • data-staging.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    zip
    Updated Feb 5, 2020
    Cite
    Manuel Stritt; Anna Stalder; Enrico Vezzali (2020). Orbit Image Analysis: an open-source whole slide image analysis tool [Dataset]. http://doi.org/10.5061/dryad.fqz612jpc
    Explore at:
    zip
    Dataset updated
    Feb 5, 2020
    Authors
    Manuel Stritt; Anna Stalder; Enrico Vezzali
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    We describe Orbit Image Analysis, an open-source whole slide image analysis tool. The tool consists of a generic tile-processing engine which allows the execution of various image analysis algorithms, provided either by Orbit itself or by other open-source platforms, using a tile-based map-reduce execution framework. Orbit Image Analysis is capable of sophisticated whole slide imaging analyses due to several key features. First, Orbit has machine-learning capabilities, including deep learning segmentation, which can be integrated with complex object detection for the analysis of intricate tissues. In addition, Orbit can run locally as a standalone application or connect to the open-source image server OMERO. Another important characteristic is its scale-out functionality, which uses the Apache Spark framework for distributed computing. In this paper, we describe the use of Orbit in three different real-world applications: quantification of idiopathic lung fibrosis, nerve fibre density quantification, and glomeruli detection in the kidney.
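The tile-based map-reduce idea can be illustrated with a minimal sketch: a map step scores every tile independently (and could therefore be distributed, e.g. via Spark), and a reduce step aggregates the per-tile results into a slide-level quantity. The tile size and scoring function here are illustrative only, not Orbit's actual API:

```python
def iter_tiles(width, height, tile=512):
    """Yield (x, y, w, h) tiles covering a width x height slide, clipped at the edges."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)

def map_reduce(width, height, score_tile, tile=512):
    """Map: score each tile independently. Reduce: sum the per-tile scores."""
    return sum(score_tile(t) for t in iter_tiles(width, height, tile))

# Toy score: each tile reports its pixel count, so the reduction
# recovers the total slide area.
area = map_reduce(1000, 600, lambda t: t[2] * t[3], tile=512)
# area == 1000 * 600
```

In a real pipeline the per-tile function would run a segmentation or classification model on the tile's pixels, and the reduce step might concatenate detections instead of summing.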

  16. Data from: High-throughput adaptive sampling for whole-slide histopathology image analysis (HASHI) via convolutional neural networks: application to invasive breast cancer detection

    • datasetcatalog.nlm.nih.gov
    • datadryad.org
    Updated Jun 1, 2018
    + more versions
    Cite
    Basavanhally, Ajay; Shih, Natalie; Tomaszewski, John; Madabhushi, Anant; Ganesan, Shridar; Cruz-Roa, Angel; Gilmore, Hannah; Feldman, Michael; González, Fabio (2018). High-throughput adaptive sampling for whole-slide histopathology image analysis (HASHI) via convolutional neural networks: application to invasive breast cancer detection [Dataset]. http://doi.org/10.5061/dryad.1g2nt41
    Explore at:
    Dataset updated
    Jun 1, 2018
    Authors
    Basavanhally, Ajay; Shih, Natalie; Tomaszewski, John; Madabhushi, Anant; Ganesan, Shridar; Cruz-Roa, Angel; Gilmore, Hannah; Feldman, Michael; González, Fabio
    Description

    Precise detection of invasive cancer on whole-slide images (WSI) is a critical first step in the digital pathology tasks of diagnosis and grading. Convolutional neural networks (CNNs) are the most popular representation learning method for computer vision tasks and have been successfully applied in digital pathology, including tumor and mitosis detection. However, CNNs are typically only tenable with relatively small image sizes (200x200 pixels), and only recently have fully convolutional networks (FCNs) become able to deal with larger image sizes (500x500 pixels) for semantic segmentation. Hence, the direct application of CNNs to WSI is not computationally feasible, because for a WSI a CNN would require billions or trillions of parameters. To alleviate this issue, this paper presents a novel method, High-throughput Adaptive Sampling for whole-slide Histopathology Image analysis (HASHI), which involves: i) a new efficient adaptive sampling method based on probability gradient and quasi-Monte Carlo sampling, and ii) a powerful representation learning classifier based on CNNs. We applied HASHI to automated detection of invasive breast cancer on WSI. HASHI was trained and validated using three different data cohorts involving nearly 500 cases and then independently tested on 195 studies from The Cancer Genome Atlas. The results show that (1) the adaptive sampling method is an effective strategy for dealing with WSI without compromising prediction accuracy, matching the results of dense sampling (~6 million samples in 24 hours) with far fewer samples (~2,000 samples in 1 minute), and (2) on an independent test dataset, HASHI is effective and robust to data from multiple sites, scanners, and platforms, achieving an average Dice coefficient of 76%.
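The quasi-Monte Carlo component of the sampling strategy can be sketched with a Halton sequence, a standard low-discrepancy construction that spreads sample points evenly over the slide (shown for illustration; the paper's exact sampler may differ):

```python
def halton(i, base):
    """i-th element of the van der Corput sequence in the given base, in [0, 1)."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def sample_points(n, width, height):
    """n quasi-random (x, y) sampling locations on a width x height slide.

    Bases 2 and 3 (coprime) give a 2-D Halton sequence: points fill the
    plane far more evenly than uniform random samples would.
    """
    return [(int(halton(i, 2) * width), int(halton(i, 3) * height))
            for i in range(1, n + 1)]

pts = sample_points(100, 40_000, 30_000)
```

In HASHI, patches cut at locations like these are classified by the CNN, and subsequent sampling rounds concentrate where the predicted probability changes fastest.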

  17. WSI-Babel-Shark: Empty Whole-Slide Images for Slide-Label Metadata Extraction

    • heidata.uni-heidelberg.de
    tsv, txt, zip
    Updated Dec 17, 2025
    Cite
    Shahram Aliyari (2025). WSI-Babel-Shark: Empty Whole-Slide Images for Slide-Label Metadata Extraction [Dataset]. http://doi.org/10.11588/DATA/ZBS9RS
    Explore at:
    txt (2419), tsv (1893), zip (2697660195)
    Dataset updated
    Dec 17, 2025
    Dataset provided by
    heiDATA
    Authors
    Shahram Aliyari
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Heidelberg, Germany
    Description

    This dataset contains 22 whole-slide image (WSI) files in SVS format, digitized using a Leica GT450 scanner. All WSIs were intentionally scanned without tissue; only the physical slide labels are present. The purpose of this dataset is to support the evaluation and benchmarking of the WSI-Babel-Shark metadata-extraction pipeline. Empty slides allow reduced file sizes, preservation of SVS metadata, and controlled conditions for benchmarking label-processing components, including OCR, DataMatrix decoding, stain parsing, SlideID reconstruction, and metadata harmonization. All WSIs retain full TIFF tiling, SVS headers, and Leica metadata. Files were manually inspected to ensure complete de-identification, and all CaseIDs and SlideIDs represent synthetic test cases. A ground-truth CSV file containing validated metadata fields is included for benchmarking. No patient-identifying information is contained in any file.
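Benchmarking a label-extraction pipeline against the included ground-truth CSV amounts to a field-by-field comparison of extracted metadata with the validated values; a minimal sketch, with hypothetical column names rather than the dataset's actual schema:

```python
import csv, io

def field_accuracy(truth_rows, predicted, key="SlideID"):
    """Per-field exact-match accuracy of predicted metadata against ground truth."""
    fields = [f for f in truth_rows[0] if f != key]
    hits = {f: 0 for f in fields}
    for row in truth_rows:
        pred = predicted.get(row[key], {})
        for f in fields:
            hits[f] += int(pred.get(f) == row[f])
    return {f: hits[f] / len(truth_rows) for f in fields}

# Hypothetical ground truth and pipeline output for two label-only slides.
truth = list(csv.DictReader(io.StringIO(
    "SlideID,Stain\nS001,HE\nS002,CD3\n")))
pred = {"S001": {"Stain": "HE"}, "S002": {"Stain": "CD8"}}
acc = field_accuracy(truth, pred)
# acc["Stain"] == 0.5  (one of two stains parsed correctly)
```

The same comparison extends naturally to the other benchmarked fields (OCR text, DataMatrix codes, reconstructed SlideIDs).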

  18. Multimodal Head and Neck cancer dataset

    • cancerimagingarchive.net
    n/a, svs and png
    Updated Nov 21, 2025
    Cite
    The Cancer Imaging Archive (2025). Multimodal Head and Neck cancer dataset [Dataset]. http://doi.org/10.7937/rcty-5h16
    Explore at:
    svs and png, n/a
    Dataset updated
    Nov 21, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Dec 12, 2025
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    Abstract

    HANCOCK is a comprehensive, monocentric dataset of 763 head and neck cancer patients, including diverse data modalities. It contains histopathology imaging (whole-slide images of H&E-stained primary tumors and tissue microarrays with immunohistochemical staining) alongside structured clinical data (demographics, tumor pathology characteristics, laboratory blood measurements) and textual data (de-identified surgery reports and medical histories). All patients were treated curatively, and data span diagnoses from 2005–2019. This multimodal collection enables research into integrative analyses – for example, combining histologic features with clinical parameters for outcome prediction. Early analyses have demonstrated that fusing these modalities improves prognostic modeling compared to single-source data, and that leveraging histology with foundation models can enhance endpoint prediction. HANCOCK aims to facilitate precision oncology studies by providing a large public resource for developing and benchmarking multimodal machine learning methods in head and neck cancer.

    Introduction

    Head and neck cancer (HNC) is a prevalent malignancy with poor outcomes – it is the 7th most common cancer globally and carries a 5-year survival of only ~25–60% despite modern treatments. Improving patient prognosis may require personalized, multimodal therapy decisions, using information from pathology, clinical, and other data sources. However, progress in multimodal prediction has been limited by the lack of large public datasets that integrate these diverse data types. To our knowledge, existing HNC datasets are either small or incomplete; for example, a radiomics study included 288 oropharyngeal cases, and a proteomics-focused set with imaging had only 122 cases. The Cancer Genome Atlas (TCGA) provides multi-omics for >500 HNC cases, but lacks crucial data like pathology reports, blood tests, or comprehensive imaging for each patient. These limitations hinder robust multimodal research.

    HANCOCK was created to address this gap. It aggregates 763 patients’ data from a single academic center, capturing a real-world, uniformly treated cohort. The dataset uniquely combines whole slide histopathology images, tissue microarray images, detailed clinical parameters, pathology reports, and lab values in one resource. By curating and harmonizing these modalities, HANCOCK enables researchers to explore complex data interdependencies and develop multimodal predictive models. The patient population reflects typical HNC demographics – 80% male, median age 61, with 72% being former or current smokers – aligning with expected epidemiology and supporting generalizability. In summary, HANCOCK is an unprecedented multimodal HNC dataset that can fuel research in machine learning, prognostic biomarker discovery, and integrative oncology, ultimately advancing personalized head and neck cancer care.

    Methods

    The following sections describe how the HANCOCK data were collected, processed, and prepared for public sharing.

    Subject Inclusion and Exclusion Criteria

    Patients included in HANCOCK were those diagnosed with head and neck cancer between 2005 and 2019 at University Hospital Erlangen (Germany) who underwent a curative-intent initial treatment (surgery and/or definitive therapy). This encompasses cancers of the oral cavity, oropharynx, hypopharynx, and larynx. Patients treated palliatively or with recurrent/metastatic disease at presentation were excluded to focus on first-course, curative treatments. The cohort consists of 763 patients (approximately 80% male, 20% female) with a median age of 61 years. Notably, ~72% have a history of tobacco use, which is consistent with real-world HNC risk factors. The distribution of tumor subsites and stages reflects typical HNC presentation, and thus the dataset is broadly representative of the general HNC patient population. Being a single-center dataset, there is limited geographic diversity; however, the homogeneous data acquisition and treatment context reduce variability in data quality. No significant selection biases were introduced aside from the exclusion of non-curative cases – all major HNC subsite cases over the inclusion period were captured, providing a comprehensive real-world sample. Ethical approval was obtained for this retrospective data collection and sharing (Ethics Committee vote #23-22-Br), and all data were fully de-identified prior to release.

    Data Acquisition

    Histopathology: Tissue specimens from the primary tumors (and involved lymph nodes, if present) were obtained from the pathology archives. All samples were formalin-fixed and paraffin-embedded (FFPE) and stained with hematoxylin and eosin (H&E) following routine protocols. Digital whole-slide imaging was performed on these histology slides. A total of 709 H&E slides of primary tumor tissue (701 patients had one slide, 8 patients had two slides) were scanned at high resolution using a 3DHISTECH P1000 scanner at an effective 82.44× magnification (0.1213 µm/pixel). Additionally, 396 H&E slides of lymph node metastases were scanned, using two systems: an Aperio Leica GT450 at 40× (0.2634 µm/pixel) and the 3DHISTECH P1000 at ~51× (0.1945 µm/pixel). (Multiple scanners were utilized over the course of the project; all resulting images were cross-verified for quality.) The digital whole slide images (WSIs) are provided in the pyramidal Aperio SVS format, a TIFF-based format compatible with standard viewers.

    In addition to full slides, tissue microarrays (TMAs) were constructed from each patient’s tumor block to sample important regions. For each case, two cylindrical core biopsies (diameter 1.5 mm) were taken – one from the tumor center and one from the invasive tumor front. These cores were assembled into TMA blocks and stained on separate slides with a panel of eight stains: H&E plus immunohistochemical (IHC) markers targeting various immune cells and tumor biomarkers. The IHC markers include CD3, CD8, CD56, CD68, CD163, PD-L1, and MHC-1, which label T cells (CD3, CD8), natural killer cells (CD56), monocytes/macrophages (CD68, CD163), and a tumor immune checkpoint ligand (PD-L1), as well as MHC class I expression. Each core appears on up to 8 stained TMA slides (one per stain), yielding up to 16 TMA images per patient (two cores × eight stains). In the dataset, TMA images are provided for both the tumor-center and tumor-front cores; these too are digitized high-resolution images (consistent microscope settings, ~40×). The combination of WSIs and TMAs yields a rich imaging dataset: 701 patients have at least one primary tumor WSI (62 patients lack WSIs due to unavailable tissue), and all patients have TMA core images unless the tumor block was exhausted. This imaging data offers both broad tissue context from WSIs and targeted cellular detail from TMAs. Manual tumor region annotations are also included for the primary tumor WSIs (see Data Analysis below).
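The microns-per-pixel (MPP) figures quoted above determine how patches map between scanner and analysis resolutions; a small sketch of the arithmetic, using the 0.1213 µm/px value from the description (the 0.5 µm/px target and 224 px patch size are illustrative choices, not part of the dataset):

```python
def downsample_factor(native_mpp, target_mpp):
    """How much to shrink an image scanned at native_mpp to reach target_mpp."""
    return target_mpp / native_mpp

def patch_pixels_at_native(patch_px, native_mpp, target_mpp):
    """Native-resolution pixel width of a patch that is patch_px wide at target_mpp."""
    return round(patch_px * downsample_factor(native_mpp, target_mpp))

# A 224 px patch at an analysis resolution of 0.5 um/px, cut from the
# 0.1213 um/px primary-tumor scans, spans ~923 native pixels.
native_px = patch_pixels_at_native(224, native_mpp=0.1213, target_mpp=0.5)
```

The same conversion applies to the 0.2634 and 0.1945 µm/px lymph-node scans, which is why pipelines usually normalise to a common MPP before patch extraction.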

    Clinical and Pathology Data: A wide array of non-imaging data was extracted from hospital information systems and pathology reports for each patient. Key demographic variables (age, sex, etc.) and tumor pathology details were collected, including primary tumor site, histologic subtype, grade, TNM stage, resection margin status, depth of invasion, perineural and lymphovascular invasion, and nodal metastasis status. These pathology parameters were recorded in a structured format for each case. Standard clinical coding systems were used where applicable: e.g., diagnoses are coded with ICD-10 codes and procedures with OPS codes (the German procedure classification system). The dataset includes these codes for each patient’s conditions and treatments. Comprehensive laboratory blood test results at diagnosis or pre-treatment were also compiled, covering complete blood counts, coagulation measures, electrolytes, kidney function, C-reactive protein, and other relevant analytes. Reference ranges for each lab parameter are provided alongside the values to indicate whether a result was normal or abnormal. Most patients have a full panel of these lab results, though some values are missing if a test was not clinically indicated; the dataset notes availability per patient. All structured data have been cleaned and validated – for example, harmonizing category values and checking consistency (e.g. TNM stages align with recorded tumor sites).
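Since each lab value is accompanied by its reference range, flagging abnormal results is a direct comparison; the parameter names and ranges below are illustrative stand-ins, not taken from the dataset:

```python
def flag_labs(values, reference):
    """Mark each lab value as 'low', 'normal', or 'high' against its reference range."""
    flags = {}
    for name, value in values.items():
        low, high = reference[name]
        flags[name] = "low" if value < low else "high" if value > high else "normal"
    return flags

# Illustrative pre-treatment panel for one patient (hypothetical values/ranges).
labs = {"CRP_mg_L": 42.0, "Hb_g_dL": 13.8}
ranges = {"CRP_mg_L": (0.0, 5.0), "Hb_g_dL": (13.0, 17.0)}
flags = flag_labs(labs, ranges)
# flags == {"CRP_mg_L": "high", "Hb_g_dL": "normal"}
```

Derived flags like these are a common way to turn raw lab panels into features for the multimodal models the dataset is designed to support.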

    Textual Data (Surgical Reports and Histories): Unstructured clinical text was also included to add rich context on treatment details. Surgery reports (operative notes) from the primary tumor resection and associated medical history summaries were retrieved from the hospital’s electronic records. For each patient, the operative report from their first definitive surgery and the corresponding

  19. Digital Pathology Dataset for Breast Cancer Diagnosis

    • zenodo.org
    • data-staging.niaid.nih.gov
    zip
    Updated Dec 12, 2024
    Cite
    Sepideh Naghshineh Kani; Burak Can Soyak; Melih Gokce; Zeynep Duyar; Hasan Alicikus; Özlem Yapıcıer; Mustafa Umit Oner (2024). Digital Pathology Dataset for Breast Cancer Diagnosis [Dataset]. http://doi.org/10.5281/zenodo.14131968
    Explore at:
    zip
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sepideh Naghshineh Kani; Burak Can Soyak; Melih Gokce; Zeynep Duyar; Hasan Alicikus; Özlem Yapıcıer; Mustafa Umit Oner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Links to code:
    Tissue Region Segmentation Code
    This dataset comprises high-quality immunohistochemistry (IHC) and Haematoxylin and Eosin (H&E) whole slide images (WSIs) of breast tissues, provided in .svs format.

    • The IHC dataset (labeled as BAU_IHC) consists of 55 zip files, each containing 2–3 WSIs, for a total of 163 slides.
    • The H&E dataset (labeled as BAU_HE) consists of 36 zip files, each containing 2 WSIs, for a total of 72 slides.

    The data were collected from Bahçeşehir University Medical School and are intended for research in histopathology and computational pathology.

    This study was approved by the Bahçeşehir University Clinical Research Institutional Review Board (Approval No: 2022-10/03).

  20. Whole-Slide Imaging Market Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Jan 11, 2026
    Cite
    Market Report Analytics (2026). Whole-Slide Imaging Market Report [Dataset]. https://www.marketreportanalytics.com/reports/whole-slide-imaging-market-96940
    Explore at:
    doc, pdf, ppt
    Dataset updated
    Jan 11, 2026
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Whole-Slide Imaging (WSI) market is booming, projected to reach [estimated 2033 market size in millions] by 2033, fueled by digital pathology adoption and technological advancements. Learn about market drivers, trends, restraints, and key players in this comprehensive analysis. Recent developments include: March 2023: Pramana, Inc., an AI-enabled health tech company modernizing the pathology sector, collaborated with PathPresenter to accelerate the enterprise adoption of digital pathology workflows. The goal of this collaboration is to ensure a seamless user experience for labs adopting Pramana's Digital Pathology as a Service solution for whole slide image generation, as well as PathPresenter's image management and viewing platform. March 2023: Hamamatsu, a manufacturer of photonics devices, including whole slide scanners for digital pathology, entered into a multi-year distribution agreement with Siemens Healthineers, under which Hamamatsu will provide NanoZoomer whole slide scanners to support Siemens Healthineers' expansion into digital pathology in the Americas and Europe. Key drivers for this market are: Growing Popularity of Virtual Slides as Compared to Physical Slides, Technological Advancements in Whole Slide Imaging; Increasing Research in Drug Discovery. Potential restraints include: Growing Popularity of Virtual Slides as Compared to Physical Slides, Technological Advancements in Whole Slide Imaging; Increasing Research in Drug Discovery. Notable trends are: Telepathology Segment is Expected to Grow Significantly Over the Forecast Period.


Data from: ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology


Related Article
7 scholarly articles cite this dataset
Description


The data set consists of three subsets (training, validation and test), based on the ACROBAT WSI registration challenge. The training set contains 750 cases, for each of which one H&E WSI and one to four IHC WSIs are available, 3,406 WSIs in total. The validation set consists of 100 cases (200 WSIs in total) and the test set of 303 cases (606 WSIs in total). For both the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available per case.
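Given the accompanying data table (one row per WSI, with an anonymised patient ID and the stain type), assembling the H&E–IHC pairs needed for registration is a simple grouping step. The column names in this sketch are assumptions about the table layout, not the actual schema:

```python
import csv, io
from collections import defaultdict

def registration_pairs(rows):
    """Pair each patient's H&E WSI(s) with each of their IHC WSIs."""
    by_patient = defaultdict(lambda: {"HE": [], "IHC": []})
    for row in rows:
        kind = "HE" if row["stain"] == "HE" else "IHC"
        by_patient[row["patient_id"]][kind].append(row["filename"])
    return [(he, ihc)
            for slides in by_patient.values()
            for he in slides["HE"] for ihc in slides["IHC"]]

# Hypothetical excerpt of the per-WSI metadata table.
table = ("patient_id,stain,filename\n"
         "P1,HE,p1_he.tiff\nP1,KI67,p1_ki67.tiff\nP1,ER,p1_er.tiff\n")
pairs = registration_pairs(list(csv.DictReader(io.StringIO(table))))
# pairs: the H&E slide paired with each of the KI67 and ER slides
```

Each resulting (H&E, IHC) pair corresponds to one registration task in the challenge setting.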

WSIs were anonymised by deleting the associated macro images, generating filenames with random case IDs, and overwriting metadata fields that could contain personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org). WSIs are available as generic tiled TIFFs (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.

The data set is available for download in seven separate ZIP archives: five for the training data (train_part1.zip, 71.47 GB; train_part2.zip, 70.59 GB; train_part3.zip, 75.91 GB; train_part4.zip, 71.63 GB; train_part5.zip, 69.09 GB), one for the validation data (valid.zip, 21.79 GB) and one for the test data (test.zip, 68.11 GB).

File listings and SHA1 checksums are available for checking archive and data integrity after download.
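A downloaded archive can be checked against its published SHA1 checksum with Python's standard hashlib, streaming the file in chunks so the multi-gigabyte archives need not fit in memory:

```python
import hashlib

def sha1_of_file(path, chunk=1 << 20):
    """SHA-1 hex digest of a file, streamed in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(path, expected_hex):
    """True if the file's SHA-1 digest matches the published checksum."""
    return sha1_of_file(path) == expected_hex
```

Usage would be, e.g., `verify("valid.zip", "<checksum from the file listing>")` before unpacking.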

Notifying SND of publications that use this data set, by emailing request@snd.gu.se, is appreciated but not required.
