License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs in the data set are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E-stained tissue and at least one, and up to four, WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available. The data set was acquired as part of the CHIME study (chimestudy.se) and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org). The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology digitisation workflow, using three different Hamamatsu WSI scanners, specifically one NanoZoomer S360 and two NanoZoomer XR. The WSIs in this data set are accompanied by a data table with one row for each WSI, specifying an anonymised patient ID, the stain or IHC antibody type of each WSI, and the magnification and microns per pixel at each available resolution level. Automated registration algorithm performance can be evaluated through the ACROBAT challenge website based on over 37,000 landmark pair annotations from 13 annotators. While the primary purpose of this data set was the development and evaluation of WSI registration methods, it has the potential to facilitate further research in computational pathology, for example in stain-guided learning, virtual staining, unsupervised learning and stain-independent models.
The data set consists of three subsets, the training, validation and test sets, based on the ACROBAT WSI registration challenge. There are 750 cases in the training set, for each of which one H&E WSI and one to four IHC WSIs are available, with 3,406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total, and the test set of 303 cases with 606 WSIs in total. For both the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available per case.
WSIs were anonymised by deleting the associated macro images, generating filenames with random case IDs, and overwriting metadata fields containing potentially personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org/). WSIs are available as generic tiled TIFF files (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.
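Since the converted files are OpenSlide-compatible generic tiled TIFFs, they can be inspected with the OpenSlide Python bindings. A minimal sketch; the filename below is a placeholder, as actual files use random case IDs:

```python
# Sketch: open an ACROBAT generic tiled TIFF and inspect its pyramid.
import openslide

slide = openslide.OpenSlide("15_HE.tiff")            # placeholder filename
print(slide.level_count, slide.level_dimensions)      # 10X base plus lower levels
mpp_x = slide.properties.get(openslide.PROPERTY_NAME_MPP_X)  # microns/pixel, if present
region = slide.read_region((0, 0), 0, (1024, 1024)).convert("RGB")  # RGBA -> RGB crop
slide.close()
```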
The data set is available for download in seven separate ZIP archives, five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip 21.79 GB) and one for the test data (test.zip 68.11 GB).
File listings and SHA-1 checksums are provided for checking archive and data integrity after download.
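A minimal sketch of checking one archive against its published SHA-1 digest (the digest shown is a placeholder; use the value from the provided listings):

```python
# Sketch: verify a downloaded archive against its SHA-1 checksum.
import hashlib

def sha1_of(path, chunk_size=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "0123456789abcdef0123456789abcdef01234567"  # placeholder digest
assert sha1_of("train_part1.zip") == expected, "archive corrupted or incomplete"
```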
Notifying SND of publications that use this data set, by emailing request@snd.gu.se, would be appreciated but is not required.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Explore the TCGA whole slide image (WSI) SVS files available on Kaggle, offering detailed visual representations of tissue samples from various cancer types. These high-resolution images capture tumor morphology and tissue architecture, supporting research in cancer diagnosis, prognosis, and treatment. This is a dataset of WSIs downloaded from the TCGA portal.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
A dataset of examples of artefacts in digital pathology.
The dataset contains 22 whole-slide images, with H&E or IHC staining, showing various types and severities of slide defects. Annotations were made by a biomedical engineer based on examples given by an expert.
The dataset is split into different folders:

train
- 18 whole-slide images (extracted at 1.25x and 2.5x magnification)
- All from the same block (colorectal cancer tissue)
- Half with H&E and half with anti-pan-cytokeratin IHC staining

validation
- 3 whole-slide images (1.25x and 2.5x magnification)
- 2 from the same block as the training set (1 IHC, 1 H&E)
- 1 from another block (anti-pan-cytokeratin IHC, gastroesophageal junction lesion)

validation_tiles
- Patches of varying sizes taken from the 3 validation whole-slide images at 1.25x magnification
- 7 patches from each slide

test
- 1 whole-slide image (1.25x and 2.5x magnification)
- From another block: IHC staining (anti-NR2F2), mouth cancer
For the train, validation and test whole-slide images, each slide has:
- The RGB images at 1.25x and 2.5x magnification
- The corresponding background/tissue masks
- The corresponding annotation masks containing examples of artefacts (note that a majority of artefacts are not annotated; in total, 918 artefacts are annotated in the train set)
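As a small usage sketch, the tissue and artefact masks can be combined to estimate annotated artefact coverage per slide; since most artefacts are unannotated, this gives only a lower bound. The file names here are hypothetical, so adapt them to the dataset's actual layout:

```python
# Sketch: fraction of tissue area covered by annotated artefacts for one slide.
import numpy as np
from PIL import Image

tissue = np.array(Image.open("slide01_tissue_mask.png")) > 0      # background/tissue mask
artefact = np.array(Image.open("slide01_artefact_mask.png")) > 0  # annotation mask
coverage = artefact[tissue].mean() if tissue.any() else 0.0       # lower bound only
print(f"annotated artefact coverage over tissue: {coverage:.2%}")
```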
For the validation tiles, the following table gives the "patch-level" supervision:
| Tile | Artefact(s) |
|------|-------------|
| 00 | None/Few |
| 01 | Tear&Fold |
| 02 | Ink |
| 03 | None/Few |
| 04 | None/Few |
| 05 | Tear&Fold |
| 06 | Tear&Fold + Blur |
| 07 | Knife damage |
| 08 | Knife damage |
| 09 | Ink |
| 10 | None/Few |
| 11 | Tear&Fold |
| 12 | Tear&Fold |
| 13 | None/Few |
| 14 | None/Few |
| 15 | Knife damage |
| 16 | Tear&Fold |
| 17 | None/Few |
| 18 | None/Few |
| 19 | Blur |
| 20 | Knife damage |
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
A large annotated dataset, composed of both microscopy (classification task) and whole-slide images (segmentation task), was specifically compiled and made publicly available for the BACH challenge. Following a positive response from the scientific community, a total of 64 submissions, out of 677 registrations, effectively entered the competition. From the submitted algorithms it was possible to push forward the state-of-the-art in terms of accuracy (87%) in automatic classification of breast cancer with histopathological images.
There are two main folders for the classification task: train and test. The Photos folder contains four classes: benign, in situ carcinoma, invasive carcinoma, and normal. A ground-truth CSV file provides the labels. Images are in TIFF format.
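A minimal loading sketch, assuming the ground-truth CSV maps image filenames to class labels and that the Photos folder has one subfolder per class; the CSV filename and column layout are assumptions to verify against the actual download:

```python
# Sketch: pair BACH classification images with their ground-truth labels.
import pandas as pd
from PIL import Image

labels = pd.read_csv("train/microscopy_ground_truth.csv",
                     header=None, names=["image", "label"])  # assumed layout
example = labels.iloc[0]
img = Image.open(f"train/Photos/{example['label']}/{example['image']}")  # .tif patch
print(example["image"], example["label"], img.size)
```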
Paper: https://arxiv.org/abs/1808.04277
Citation: Aresta, G., Araújo, T., Kwok, S., Chennamsetty, S. S., Safwan, M., Alex, V., ... & Aguiar, P. (2019). Bach: Grand challenge on breast cancer histology images. Medical image analysis, 56, 122-139.
Dataset: https://zenodo.org/record/3632035
Computational histopathology has made significant strides in the past few years, slowly getting closer to clinical adoption. One area of benefit would be the automatic generation of diagnostic reports from H&E-stained whole slide images, which would further increase the efficiency of pathologists' routine diagnostic workflows.
In this study, we compiled a dataset (PatchGastricADC22) of histopathological captions of stomach adenocarcinoma endoscopic biopsy specimens, which we extracted from diagnostic reports and paired with patches extracted from the associated whole slide images. The dataset contains a variety of gastric adenocarcinoma subtypes.
We trained a baseline attention-based model to predict the captions from features extracted from the patches and obtained promising results. We make the captioned dataset of 262K patches publicly available.
Purpose
The dataset was created to support research in medical image captioning — specifically, to automatically generate diagnostic text descriptions from histopathological image patches. It helps train and evaluate models that can interpret tissue morphology and produce human-like pathology reports.
Dataset Structure (PatchGastricADC22)
📁 Folder: patches_captions/patches_captions/ Contains all patch-level histopathology image files (in .jpg format). Each patch represents a cropped region (300×300 pixels) from a Whole Slide Image (WSI).
🧾 File: captions.csv Provides the mapping between image IDs and their corresponding diagnostic captions. Each row represents one unique image patch and its textual description.
🧩 CSV Columns:
- id – Base ID identifying the parent WSI or case from which the patch was extracted.
- subtype – Histological subtype (e.g., tubular adenocarcinoma, poorly differentiated).
- text – Expert-written caption describing the morphological and diagnostic features visible in the patch.
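A minimal sketch of joining captions.csv with the patch files it describes. The column names follow the description above; the patch-naming pattern (patch filenames prefixed by their parent WSI id) is an assumption to check against the download:

```python
# Sketch: find the patches that share one WSI-level caption.
import glob
import pandas as pd

captions = pd.read_csv("captions.csv")          # columns: id, subtype, text
row = captions.iloc[0]
patches = glob.glob(f"patches_captions/patches_captions/{row['id']}*.jpg")  # assumed pattern
print(row["subtype"], len(patches), "patches share the caption:", row["text"][:80])
```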
Dataset Statistics
- 🧩 Total images (patches): ~262,777
- 🧪 Total WSIs (slides): 1,305
- 🖼️ Patch size: 300 × 300 pixels
- 🔬 Magnification: 20×
- ✍️ Captions: one per patch
- 🔠 Vocabulary size: 344 unique words
- 📏 Max caption length: 47 words
- ⚖️ Split: 70% train / 10% validation / 20% test
Creation Process
1. Whole slide images (WSIs) were collected from gastric cancer pathology archives.
2. Each slide was divided into non-overlapping 300×300 patches.
3. Expert pathologists annotated each patch with a short caption describing diagnostic features (cellular and structural morphology).
4. Data were consolidated into image files plus a master captions.csv.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
The Pathology Images of Scanners and Mobilephones (PLISM) dataset was created for evaluating the robustness of AI models to domain shifts. PLISM is the first group-wise pathological image dataset that encompasses diverse tissue types stained under 13 H&E conditions and captured with multiple imaging devices, including smartphones (7 scanners and 6 smartphones). The PLISM-wsi subset consists of image groups across all staining conditions between WSIs for each tile image, and contains a total of 310,947 images. Color and texture in digital pathology images are affected by H&E staining conditions (e.g. Harris or Carrazi) and digitisation devices (e.g. slide scanners or smartphones), which cause inter-institutional domain shifts. Please see the files 'stain_condition.png' and 'counterpart.png' for the H&E staining conditions and devices used.

Each tar.gz file in this dataset contains a collection of files named via the following convention: (stain_name)_(device_name)/(stain_name)_(device_name)_(top_left_x)_(top_left_y).png

The CSV file included with this dataset contains the following information:
- Tissue Type: the specific type of human tissue represented in the image, chosen from among 46 possible tissue types.
- Stain Type: the specific staining condition applied to the image, chosen from among 13 possible conditions.
- Device Type: the specific type of imaging device used to capture the image, chosen from among 13 possible device types.
- Coordinate: the xy coordinates of the top-left and bottom-right corners of each image (e.g., 1000_500_0_0).
- Image Path: the relative path to each image.

See the smartphones subset of the PLISM dataset in the Collection at https://doi.org/10.25452/figshare.plus.c.6773925
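A small sketch of parsing the file naming convention stated above; the stain and device names in the example are illustrative, not actual PLISM identifiers:

```python
# Sketch: recover stain, device, and tile origin from a PLISM file path.
from pathlib import Path

def parse_plism(path):
    stem = Path(path).stem                  # e.g. "HarrisA_scanner1_1000_500"
    *head, x, y = stem.split("_")
    # the convention gives "(stain_name)_(device_name)"; if either part can
    # itself contain underscores, this simple split needs adjusting
    stain, device = "_".join(head[:-1]), head[-1]
    return stain, device, int(x), int(y)

print(parse_plism("HarrisA_scanner1/HarrisA_scanner1_1000_500.png"))
```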
License: MIT (https://opensource.org/licenses/MIT)
An anonymous whole slide image in Philips iSyntax format for running software tests on OpenPhi - Open Pathology Interface (https://zenodo.org/record/4680748#.YNnBxDqxXJU). See the repository (https://gitlab.com/BioimageInformaticsGroup/openphi/) for up-to-date information.
License: CC BY-NC-ND 3.0 (https://creativecommons.org/licenses/by-nc-nd/3.0/)
The MCO study whole slide image collection consists of 1500 digitised tissue slides of colorectal cancers. From 1994 to 2010, the Molecular and Cellular Oncology (MCO) Study group conducted a study of individuals undergoing treatment for colorectal cancer, systematically collecting tissue samples and clinical and pathological information from more than 1500 people who had tumours surgically removed from their large bowel. This collection represents one typical section from each tumour case, stained with hematoxylin and eosin and scanned using a 40× objective. The resolution of the digitised images approaches that visible under an optical microscope: more than 100,000 dpi. At this resolution each image is around 2 gigabytes, bringing the total size of the 1500 images in the MCO Whole Slide Image Collection to 3 terabytes. The collection is now available on the Intersect Australia Research Data Storage Infrastructure (RDSI) node. Originating source(s): MCO research group, UNSW (1993-2011)
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
This is a representative sample from the dataset that was used to develop resolution-agnostic convolutional neural networks for tissue segmentation in whole-slide histopathology images.
The dataset is composed of two parts: development set and dissimilar set.
Sample images from both the development set and the dissimilar set are included with the dataset.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Links to code and bioRxiv pre-print:
1. Multi-lens Neural Machine (MLNM) Code
2. An AI-assisted Tool For Efficient Prostate Cancer Diagnosis (bioRxiv Pre-print)
Digitized hematoxylin and eosin (H&E)-stained whole slide images (WSIs) of 40 prostatectomy and 59 core needle biopsy specimens were collected from 99 prostate cancer patients at Tan Tock Seng Hospital, Singapore. There were 99 WSIs in total, one per specimen. H&E-stained slides were scanned at 40× magnification (specimen-level pixel size 0.25 μm × 0.25 μm) using an Aperio AT2 slide scanner (Leica Biosystems). Institutional review board approval was obtained from the hospital for this study, and all data were de-identified.
Prostate glandular structures in the core needle biopsy slides were manually annotated and classified using the ASAP annotation tool. A senior pathologist reviewed 10% of the annotations in each slide, ensuring that reference annotations were available to the researcher across different regions of each core. Note that partial glands appearing at the edges of the biopsy cores were not annotated.
Patches of size 512 × 512 pixels were cropped from the whole slide images at 5×, 10×, 20×, and 40× magnification, with an annotated gland centered in each patch. This dataset contains these cropped images.
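A sketch of this patch-extraction step, assuming an OpenSlide-readable SVS with a 40× base level and 2× downsampling per pyramid level; the WSI path, gland centroid, and level mapping are placeholders, not the authors' code:

```python
# Sketch: crop a 512x512 patch centred on an annotated gland.
import openslide

LEVEL_FOR_MAG = {40: 0, 20: 1, 10: 2, 5: 3}  # assumed pyramid layout

def gland_patch(wsi_path, cx, cy, mag=20, size=512):
    slide = openslide.OpenSlide(wsi_path)
    level = LEVEL_FOR_MAG[mag]
    ds = slide.level_downsamples[level]
    # read_region expects level-0 coordinates of the patch's top-left corner
    top_left = (int(cx - size / 2 * ds), int(cy - size / 2 * ds))
    patch = slide.read_region(top_left, level, (size, size)).convert("RGB")
    slide.close()
    return patch
```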
This dataset was used to train two AI models, for gland segmentation (99 patients) and gland classification (46 patients). Tables 1 and 2 summarize the gland segmentation and gland classification datasets; the two corresponding sub-datasets are provided as two ZIP files.
Table 1: The number of slides and patches in training, validation, and test sets for gland segmentation task. There is one H&E stained WSI for each prostatectomy or core needle biopsy specimen.
| #Slides | Train | Valid | Test | Total |
|---------|-------|-------|------|-------|
| Prostatectomy | 17 | 8 | 15 | 40 |
| Biopsy | 26 | 13 | 20 | 59 |
| Total | 43 | 21 | 35 | 99 |

| #Patches | Train | Valid | Test | Total |
|----------|-------|-------|------|-------|
| Prostatectomy | 7795 | 3753 | 7224 | 18772 |
| Biopsy | 5559 | 4028 | 5981 | 15568 |
| Total | 13354 | 7781 | 13205 | 34340 |
Table 2: The number of slides and patches in the training, validation, and test sets for the gland classification task. There is one H&E-stained WSI for each prostatectomy or core needle biopsy specimen. The gland classification datasets are subsets of the gland segmentation datasets. GS: Gleason Score. B: Benign. M: Malignant.
| #Slides (GS 3+3 : 3+4 : 4+3) | Train | Valid | Test | Total |
|------------------------------|-------|-------|------|-------|
| Biopsy | 10:9:1 | 3:7:0 | 6:10:0 | 19:26:1 |

| #Patches (B:M) | Train | Valid | Test | Total |
|----------------|-------|-------|------|-------|
| Biopsy | 1557:2277 | 1216:1341 | 1543:2718 | 4316:6336 |
NB: the gland classification folder (gland_classification_dataset.zip) may contain extra patches whose labels could not be identified from the H&E slides; these were not used in the machine learning study.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
The dataset is composed of 100 cases of whole slide images (WSIs) of renal cell carcinoma, a type of kidney cancer. These images are stained using haematoxylin and eosin (H&E), a common staining technique in histopathology that highlights different tissue components, allowing for detailed examination of cellular structure and morphology. Each of these WSIs has been captured using various high-resolution scanners, including but not limited to the Zeiss Axio Z1, Hamamatsu, Philips, and Leica scanners. These scanners represent different data acquisition domains, meaning that the images might have subtle differences in color, contrast, resolution, and other imaging characteristics depending on the scanner used. These variations can introduce inconsistencies, which can make it challenging to directly compare or analyze images across different scanning devices.

Due to these variations, it becomes crucial to develop algorithms that can effectively translate and normalize these WSIs. The goal of such an algorithm would be to standardize the images, making them more consistent regardless of the scanner used. This normalization process is essential for ensuring that any subsequent analysis, such as cancer diagnosis or grading using automated tools, is accurate and reliable across different datasets.

The dataset will be instrumental in the development of these normalization algorithms. Researchers can use these images to train, test, and validate models designed to minimize scanner-related variations, leading to more robust and generalized image analysis systems. This work is particularly significant in the context of digital pathology, where reliable and consistent image processing is key to supporting clinical decision-making.
The TCGA-UT dataset is a large-scale collection of histopathological image patches from human cancer tissues. It contains 1,608,060 image patches extracted from hematoxylin & eosin (H&E) stained histological samples across 32 different types of solid cancers.
Files are organized using the following format:
[cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg

If you use this dataset in your research, please cite:
Komura, D., et al. (2022). Universal encoding of pan-cancer histology by deep texture representations.
Cell Reports 38, 110424. https://doi.org/10.1016/j.celrep.2022.110424

If you're interested in using this dataset for benchmarking foundation models or feature extractors, we recommend accessing the dataset through the Hugging Face Hub at dakomura/tcga-ut. The Hugging Face version provides:
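Separately, the per-file path convention above can be unpacked mechanically. A small sketch, with a fabricated example path:

```python
# Sketch: split a TCGA-UT relative path into the fields named in the convention.
from pathlib import Path

def parse_tcga_ut(relpath):
    cancer_type, resolution, barcode, fname = Path(relpath).parts
    region, number, pixel_resolution = Path(fname).stem.split("-")
    return dict(cancer_type=cancer_type, resolution=resolution,
                barcode=barcode, region=region, number=number,
                pixel_resolution=pixel_resolution)

print(parse_tcga_ut("BRCA/large/TCGA-XX-0000/1-0001-0.5.jpg"))  # fabricated path
```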
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Colitis mouse models have been heavily studied to elucidate the pathophysiology of human inflammatory bowel disease. As with patients, colitis mouse models exhibit the simultaneous presence of colonic regions that are histologically involved or uninvolved with disease. We trained a ResNet-34 classifier to detect these regions in hematoxylin and eosin-stained whole murine colons. 'Involved' and 'uninvolved' image patches were gathered to cluster and identify histological patch classes, and the per-mouse proportions of these patch classes were then used to train machine learning classifiers to infer mouse model and clinical score bins. This dataset contains the whole slide images (WSIs) from our prospective mouse cohort, allowing others to run our code from WSI scaling and patch extraction through to 1) patch-level 'Involved' and 'Uninvolved' predictions, 2) generation of WSI overlays of 'Involved' versus 'Uninvolved' predictions, 3) histological patch class detection, and 4) mouse model and clinical score bin inference.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
This dataset contains UNI patch embeddings derived from the SurGen cohort's whole slide images (WSIs), focused on colorectal cancer cases. Each WSI was processed into 224x224 pixel tissue patches, extracted at a scale of 1.0 microns per pixel (MPP). A 1024-dimensional embedding was computed for each patch using the UNI foundation model[1]. This dataset allows for rapid downstream analysis of tasks such as biomarker prediction, survival analysis, tumour grading, and prognostic modelling. The SurGen dataset, comprising both primary colorectal and metastatic cases, offers a valuable resource for computational pathology research.
Each Zarr file within the dataset contains an array of patch-level features and a corresponding array of coordinates, enabling the retrieval of specific feature locations as needed.
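A minimal sketch of reading one slide's embeddings and coordinates. The store path and array keys ("features", "coordinates") are assumptions based on the description above; inspect the archive for the actual names:

```python
# Sketch: load patch-level UNI embeddings and their locations from a Zarr store.
import zarr

store = zarr.open("SurGen_slide_0001.zarr", mode="r")  # placeholder path
features = store["features"][:]       # assumed key; shape (n_patches, 1024)
coords = store["coordinates"][:]      # assumed key; per-patch (x, y) locations
print(features.shape, coords.shape)
```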
Embeddings are provided in a zip archive and intended for reuse in research focused on digital pathology, tumour genomics, and oncology. For cohort ground truth labels please see the link below.
Access the original dataset: https://doi.org/10.6019/S-BIAD1285
GitHub for more info: https://github.com/CraigMyles/SurGen-Dataset
[1] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
We describe Orbit Image Analysis, an open-source whole slide image analysis tool. The tool consists of a generic tile-processing engine that allows the execution of various image analysis algorithms, provided either by Orbit itself or by other open-source platforms, using a tile-based map-reduce execution framework. Orbit Image Analysis is capable of sophisticated whole slide imaging analyses due to several key features. First, Orbit has machine learning capabilities, including deep learning segmentation that can be integrated with complex object detection for the analysis of intricate tissues. In addition, Orbit can run locally as a standalone application or connect to the open-source image server OMERO. Another important characteristic is its scale-out functionality, which uses the Apache Spark framework for distributed computing. In this paper, we describe the use of Orbit in three different real-world applications: quantification of idiopathic lung fibrosis, nerve fibre density quantification, and glomeruli detection in the kidney.
Precise detection of invasive cancer on whole-slide images (WSI) is a critical first step in the digital pathology tasks of diagnosis and grading. Convolutional neural networks (CNNs) are the most popular representation learning method for computer vision tasks and have been successfully applied in digital pathology, including tumor and mitosis detection. However, CNNs are typically only tenable with relatively small image sizes (200x200 pixels). Only recently have fully convolutional networks (FCNs) become able to deal with larger image sizes (500x500 pixels) for semantic segmentation. Hence, the direct application of CNNs to WSIs is not computationally feasible, because a CNN would require billions or trillions of parameters for a WSI. To alleviate this issue, this paper presents a novel method, High-throughput Adaptive Sampling for whole-slide Histopathology Image analysis (HASHI), which involves: i) a new efficient adaptive sampling method based on probability gradient and quasi-Monte Carlo sampling, and ii) a powerful representation learning classifier based on CNNs. We applied HASHI to automated detection of invasive breast cancer on WSIs. HASHI was trained and validated using three different data cohorts involving nearly 500 cases and then independently tested on 195 studies from The Cancer Genome Atlas. The results show that (1) the adaptive sampling method is an effective strategy for dealing with WSIs without compromising prediction accuracy, matching the results of dense sampling (~6 million samples in 24 hours) with far fewer samples (~2,000 samples in 1 minute), and (2) on an independent test dataset, HASHI is effective and robust to data from multiple sites, scanners, and platforms, achieving an average Dice coefficient of 76%.
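The two-stage sampling idea lends itself to a compact illustration. The sketch below is an interpretation under stated assumptions, not the authors' implementation: `predict_prob` stands in for the patch-level CNN, and the grid size, sample counts, and seed are arbitrary choices:

```python
# Sketch of gradient-guided adaptive sampling in the spirit of HASHI.
import numpy as np
from scipy.stats import qmc
from scipy.interpolate import griddata

def adaptive_sample(predict_prob, extent, n_init=256, n_refine=256, grid=64, seed=0):
    # stage 1: quasi-Monte Carlo seed points spread evenly over the slide
    pts = qmc.Halton(d=2, seed=seed).random(n_init) * np.asarray(extent)
    probs = np.array([predict_prob(p) for p in pts])

    # interpolate a coarse probability field; score it by gradient magnitude
    gx, gy = np.meshgrid(np.linspace(0, extent[0], grid),
                         np.linspace(0, extent[1], grid))
    field = griddata(pts, probs, (gx, gy), method="linear", fill_value=0.0)
    dy, dx = np.gradient(field)
    score = np.hypot(dx, dy).ravel()

    # stage 2: draw new sample locations where the field changes fastest
    weights = score / score.sum() if score.sum() > 0 else None
    idx = np.random.default_rng(seed).choice(grid * grid, size=n_refine, p=weights)
    refine = np.column_stack([gx.ravel()[idx], gy.ravel()[idx]])
    return np.vstack([pts, refine])
```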
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains 22 whole-slide image (WSI) files in SVS format, digitized using a Leica GT450 scanner. All WSIs were intentionally scanned without tissue; only the physical slide labels are present. The purpose of this dataset is to support the evaluation and benchmarking of the WSI-Babel-Shark metadata-extraction pipeline. Empty slides allow reduced file sizes, preservation of SVS metadata, and controlled conditions for benchmarking label-processing components, including OCR, DataMatrix decoding, stain parsing, SlideID reconstruction, and metadata harmonization. All WSIs retain full TIFF tiling, SVS headers, and Leica metadata. Files were manually inspected to ensure complete de-identification, and all CaseIDs and SlideIDs represent synthetic test cases. A ground-truth CSV file containing validated metadata fields is included for benchmarking. No patient-identifying information is contained in any file.
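For illustration, one way to pull the SVS header metadata that such a pipeline consumes is via tifffile; the filenames and CSV column below are assumptions about this dataset's layout, not documented names:

```python
# Sketch: read the SVS main-page description (where Aperio/Leica scanner
# metadata lives) and load the bundled ground truth for comparison.
import csv
import tifffile

with tifffile.TiffFile("slide_001.svs") as tf:      # placeholder filename
    description = tf.pages[0].description
print(description[:200])

with open("ground_truth.csv", newline="") as f:     # placeholder filename
    truth = {row["SlideID"]: row for row in csv.DictReader(f)}  # assumed column
```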
Data usage policy: https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
HANCOCK is a comprehensive, monocentric dataset of 763 head and neck cancer patients, including diverse data modalities. It contains histopathology imaging (whole-slide images of H&E-stained primary tumors and tissue microarrays with immunohistochemical staining) alongside structured clinical data (demographics, tumor pathology characteristics, laboratory blood measurements) and textual data (de-identified surgery reports and medical histories). All patients were treated curatively, and data span diagnoses from 2005–2019. This multimodal collection enables research into integrative analyses – for example, combining histologic features with clinical parameters for outcome prediction. Early analyses have demonstrated that fusing these modalities improves prognostic modeling compared to single-source data, and that leveraging histology with foundation models can enhance endpoint prediction. HANCOCK aims to facilitate precision oncology studies by providing a large public resource for developing and benchmarking multimodal machine learning methods in head and neck cancer.
Head and neck cancer (HNC) is a prevalent malignancy with poor outcomes – it is the 7th most common cancer globally and carries a 5-year survival of only ~25–60% despite modern treatments. Improving patient prognosis may require personalized, multimodal therapy decisions, using information from pathology, clinical, and other data sources. However, progress in multimodal prediction has been limited by the lack of large public datasets that integrate these diverse data types. To our knowledge, existing HNC datasets are either small or incomplete; for example, a radiomics study included 288 oropharyngeal cases, and a proteomics-focused set with imaging had only 122 cases. The Cancer Genome Atlas (TCGA) provides multi-omics for >500 HNC cases, but lacks crucial data like pathology reports, blood tests, or comprehensive imaging for each patient. These limitations hinder robust multimodal research.
HANCOCK was created to address this gap. It aggregates 763 patients’ data from a single academic center, capturing a real-world, uniformly treated cohort. The dataset uniquely combines whole slide histopathology images, tissue microarray images, detailed clinical parameters, pathology reports, and lab values in one resource. By curating and harmonizing these modalities, HANCOCK enables researchers to explore complex data interdependencies and develop multimodal predictive models. The patient population reflects typical HNC demographics – 80% male, median age 61, with 72% being former or current smokers – aligning with expected epidemiology and supporting generalizability. In summary, HANCOCK is an unprecedented multimodal HNC dataset that can fuel research in machine learning, prognostic biomarker discovery, and integrative oncology, ultimately advancing personalized head and neck cancer care.
The following sections describe how the HANCOCK data were collected, processed, and prepared for public sharing.
Patients included in HANCOCK were those diagnosed with head and neck cancer between 2005 and 2019 at University Hospital Erlangen (Germany) who underwent a curative-intent initial treatment (surgery and/or definitive therapy). This encompasses cancers of the oral cavity, oropharynx, hypopharynx, and larynx. Patients treated palliatively or with recurrent/metastatic disease at presentation were excluded to focus on first-course, curative treatments. The cohort consists of 763 patients (approximately 80% male, 20% female) with a median age of 61 years. Notably, ~72% have a history of tobacco use, which is consistent with real-world HNC risk factors. The distribution of tumor subsites and stages reflects typical HNC presentation, and thus the dataset is broadly representative of the general HNC patient population. Being a single-center dataset, there is limited geographic diversity; however, the homogeneous data acquisition and treatment context reduce variability in data quality. No significant selection biases were introduced aside from the exclusion of non-curative cases – all major HNC subsite cases over the inclusion period were captured, providing a comprehensive real-world sample. Ethical approval was obtained for this retrospective data collection and sharing (Ethics Committee vote #23-22-Br), and all data were fully de-identified prior to release.
Histopathology: Tissue specimens from the primary tumors (and involved lymph nodes, if present) were obtained from the pathology archives. All samples were formalin-fixed and paraffin-embedded (FFPE) and stained with hematoxylin and eosin (H&E) following routine protocols. Digital whole-slide imaging was performed on these histology slides. A total of 709 H&E slides of primary tumor tissue (701 patients had one slide, 8 patients had two slides) were scanned at high resolution using a 3DHISTECH P1000 scanner at an effective 82.44× magnification (0.1213 µm/pixel). Additionally, 396 H&E slides of lymph node metastases were scanned, using two systems: an Aperio Leica GT450 at 40× (0.2634 µm/pixel) and the 3DHISTECH P1000 at ~51× (0.1945 µm/pixel). (Multiple scanners were utilized over the course of the project; all resulting images were cross-verified for quality.) The digital whole slide images (WSIs) are provided in the pyramidal Aperio SVS format, a TIFF-based format compatible with standard viewers.
In addition to full slides, tissue microarrays (TMAs) were constructed from each patient’s tumor block to sample important regions. For each case, two cylindrical core biopsies (diameter 1.5 mm) were taken – one from the tumor center and one from the invasive tumor front. These cores were assembled into TMA blocks and stained on separate slides with a panel of eight stains: H&E plus immunohistochemical (IHC) markers targeting various immune cells and tumor biomarkers. The IHC markers include CD3, CD8, CD56, CD68, CD163, PD-L1, and MHC-1, which label T cells (CD3, CD8), natural killer cells (CD56), monocytes/macrophages (CD68, CD163), and a tumor immune checkpoint ligand (PD-L1), as well as MHC class I expression. Each core appears on up to 8 stained TMA slides (one per stain), yielding up to 16 TMA images per patient (two cores × eight stains). In the dataset, TMA images are provided for both the tumor-center and tumor-front cores; these too are digitized high-resolution images (consistent microscope settings, ~40×). The combination of WSIs and TMAs yields a rich imaging dataset: 701 patients have at least one primary tumor WSI (62 patients lack WSIs due to unavailable tissue), and all patients have TMA core images unless the tumor block was exhausted. This imaging data offers both broad tissue context from WSIs and targeted cellular detail from TMAs. Manual tumor region annotations are also included for the primary tumor WSIs (see Data Analysis below).
Clinical and Pathology Data: A wide array of non-imaging data was extracted from hospital information systems and pathology reports for each patient. Key demographic variables (age, sex, etc.) and tumor pathology details were collected, including primary tumor site, histologic subtype, grade, TNM stage, resection margin status, depth of invasion, perineural and lymphovascular invasion, and nodal metastasis status. These pathology parameters were recorded in a structured format for each case. Standard clinical coding systems were used where applicable: e.g., diagnoses are coded with ICD-10 codes and procedures with OPS codes (the German procedure classification system). The dataset includes these codes for each patient’s conditions and treatments. Comprehensive laboratory blood test results at diagnosis or pre-treatment were also compiled, covering complete blood counts, coagulation measures, electrolytes, kidney function, C-reactive protein, and other relevant analytes. Reference ranges for each lab parameter are provided alongside the values to indicate whether a result was normal or abnormal. Most patients have a full panel of these lab results, though some values are missing if a test was not clinically indicated; the dataset notes availability per patient. All structured data have been cleaned and validated – for example, harmonizing category values and checking consistency (e.g. TNM stages align with recorded tumor sites).
Textual Data (Surgical Reports and Histories): Unstructured clinical text was also included to add rich context on treatment details. Surgery reports (operative notes) from the primary tumor resection and associated medical history summaries were retrieved from the hospital’s electronic records. For each patient, the operative report from their first definitive surgery and the corresponding
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Links to code:
Tissue Region Segmentation Code
This dataset comprises high-quality immunohistochemistry (IHC) and Haematoxylin and Eosin (H&E) whole slide images (WSIs) of breast tissues, provided in .svs format.
The data were collected from Bahçeşehir University Medical School and are intended for research in histopathology and computational pathology.
This study was approved by the Bahçeşehir University Clinical Research Institutional Review Board (Approval No: 2022-10/03).
The whole-slide imaging (WSI) market is booming, projected to reach [estimated 2033 market size in millions] by 2033, fueled by digital pathology adoption and technological advancements. Learn about market drivers, trends, restraints, and key players in this comprehensive analysis. Recent developments include: March 2023: Pramana, Inc., an AI-enabled health tech company modernizing the pathology sector, collaborated with PathPresenter to accelerate enterprise adoption of digital pathology workflows; the goal of this collaboration is to ensure a seamless user experience for labs adopting Pramana's Digital Pathology as a Service solution for whole slide image generation, together with PathPresenter's image management and viewing platform. March 2023: Hamamatsu, a manufacturer of photonics devices, including whole slide scanners for digital pathology, entered into a multi-year distribution agreement with Siemens Healthineers, under which Hamamatsu will provide NanoZoomer whole slide scanners to support Siemens Healthineers' expansion into digital pathology in the Americas and Europe. Key drivers for this market are: growing popularity of virtual slides as compared to physical slides, technological advancements in whole slide imaging, and increasing research in drug discovery. Notable trends are: the telepathology segment is expected to grow significantly over the forecast period.