License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
The source dataset and its full description may be accessed through the Harvard Dataverse, and should be cited as
Tschandl, Philipp, 2018, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions", https://doi.org/10.7910/DVN/DBW86T, Harvard Dataverse, V4, UNF:6:KCZFcBLiFE5ObWcTc2ZBOA== [fileUNF]
Note that the dataset uploaded here does not contain all of the source material: it omits the file ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.tab, which contains data on a study involving human-computer collaboration, and the folder HAM10000_segmentations_lesion_tschandl, which contains binary segmentation masks of the training images. Still, in contrast to most of the HAM10000 datasets published on Kaggle, this one includes the test dataset that was curated for the ISIC 2018 challenge (Task 3).
The uploaded dataset comprises 3 folders and 2 files, described in the table below.
| Content | Type | Description |
|---|---|---|
| HAM10000_images_part_1 | folder | Part 1 of a set of training pictures |
| HAM10000_images_part_2 | folder | Part 2 of a set of training pictures |
| ISIC2018_Task3_Test_Images | folder | Set of test pictures |
| HAM10000_metadata.csv | file | Metadata associated with the training data |
| ISIC2018_Task3_Test_GroundTruth.csv | file | Metadata associated with the test data |
The training dataset (HAM10000_images_part_1 and HAM10000_images_part_2) is called "HAM10000", meaning "Human Against Machine with 10000 training images" (actually 10015 images), and it is a large collection of multi-source dermatoscopic RGB images (JPG) of common pigmented skin lesions. The test dataset (ISIC2018_Task3_Test_Images) contains 1511 images. The files HAM10000_metadata.csv and ISIC2018_Task3_Test_GroundTruth.csv contain the respective metadata (data about the data), which include further features as well as the labels.
The structure of the metadata files follows the template presented in the table below.
| Column | Type | Description |
|---|---|---|
| lesion_id | String | ID of the lesion case |
| image_id | String | ID of an image (also the name of the respective JPG file) associated with that case |
| dx | String | Label of that case |
| dx_type | String | Method used for diagnosing that case |
| age | Float | Age of the person associated with that case |
| sex | String | Sex of the person associated with that case |
| localization | String | Location of the lesion on the person's body |
| dataset | String | Reference from which the data was taken |
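Since lesion_id and image_id are separate columns, one lesion case may be associated with more than one image. Anyone who holds out part of the training data may therefore want to split by lesion rather than by image. The following is a minimal sketch of that idea using only the Python standard library; the sample rows, the helper names group_images_by_lesion and split_by_lesion, and the 25% hold-out fraction are all assumptions made for illustration, not part of the dataset.

```python
import csv
import io
import random
from collections import defaultdict

# Hypothetical rows following the column template above (not real records).
SAMPLE_CSV = """lesion_id,image_id,dx,dx_type,age,sex,localization,dataset
HAM_A,IMG_1,nv,histo,45.0,male,back,example
HAM_A,IMG_2,nv,histo,45.0,male,back,example
HAM_B,IMG_3,mel,histo,60.0,female,face,example
HAM_C,IMG_4,bkl,consensus,,male,trunk,example
HAM_D,IMG_5,nv,follow_up,30.0,female,abdomen,example
"""


def group_images_by_lesion(csv_text):
    """Map each lesion_id to the list of image_ids that belong to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["lesion_id"]].append(row["image_id"])
    return dict(groups)


def split_by_lesion(groups, holdout_fraction=0.25, seed=0):
    """Split image_ids so that all images of a lesion land on the same side."""
    lesion_ids = sorted(groups)
    random.Random(seed).shuffle(lesion_ids)
    n_holdout = max(1, round(len(lesion_ids) * holdout_fraction))
    holdout_ids = set(lesion_ids[:n_holdout])
    train = [img for lid in lesion_ids if lid not in holdout_ids
             for img in groups[lid]]
    holdout = [img for lid in lesion_ids if lid in holdout_ids
               for img in groups[lid]]
    return train, holdout


groups = group_images_by_lesion(SAMPLE_CSV)
train_images, holdout_images = split_by_lesion(groups)
print(len(train_images), len(holdout_images))
```

With the real data, the inline sample would be replaced by the contents of HAM10000_metadata.csv; the grouping logic itself does not change.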
dx column (the classes)
The values that the column dx may take are tabulated below.
| Value | Description |
|---|---|
| akiec | Actinic keratoses and intraepithelial carcinoma (also called "Bowen's disease") - an early form of skin cancer |
| bcc | Basal cell carcinoma - the most common type of skin cancer |
| bkl | Benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses) - common and benign |
| df | Dermatofibroma - common and benign |
| mel | Melanoma - a type of skin cancer involving the melanin cells |
| nv | Melanocytic nevus - the medical term for a mole (benign) |
| vasc | Vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage) (benign) |
dx_type column (the diagnosis methods)
The table below presents the values of the column dx_type.
| Value | Description |
|---|---|
| histo | Histopathology |
| follow_up | Follow-up examination |
| consensus | Expert consensus |
| confocal | In-vivo confocal microscopy |
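The two value tables above can double as a simple sanity check when loading the metadata. The sketch below flags rows whose dx or dx_type falls outside the tabulated values. The function name find_invalid_rows and the example rows are hypothetical, and the check assumes each row has already been parsed into a dictionary keyed by column name.

```python
# Allowed values, copied from the dx and dx_type tables above.
ALLOWED_DX = {"akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"}
ALLOWED_DX_TYPE = {"histo", "follow_up", "consensus", "confocal"}


def find_invalid_rows(rows):
    """Return (row_index, column, value) for every value outside the tables."""
    problems = []
    for index, row in enumerate(rows):
        if row.get("dx") not in ALLOWED_DX:
            problems.append((index, "dx", row.get("dx")))
        if row.get("dx_type") not in ALLOWED_DX_TYPE:
            problems.append((index, "dx_type", row.get("dx_type")))
    return problems


# Hypothetical rows: the second and third each contain one unexpected value.
rows = [
    {"image_id": "IMG_1", "dx": "nv", "dx_type": "histo"},
    {"image_id": "IMG_2", "dx": "melanoma", "dx_type": "histo"},
    {"image_id": "IMG_3", "dx": "bcc", "dx_type": "biopsy"},
]
problems = find_invalid_rows(rows)
for index, column, value in problems:
    print(f"row {index}: unexpected {column} value {value!r}")
```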
License: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.7910/DVN/DBW86T
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes.

Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file.

Due to upload size limitations, images are stored in two files:
- HAM10000_images_part1.zip (5000 JPEG files)
- HAM10000_images_part2.zip (5015 JPEG files)

Additional data for evaluation purposes
The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3), with the same sources contributing the majority of the validation and test sets as well. The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the ground truth, in the same format as the HAM10000 data (public since 2023), is available as ISIC2018_Task3_Test_GroundTruth.csv. The ISIC-Archive also provides the challenge images and metadata (training, validation, test) at their "ISIC Challenge Datasets" page.

Comparison to physicians
Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019.

Human-computer collaboration
The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020. The following corresponding metadata is available herein:

ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. This data was collected for and analyzed in Tschandl P. et al., Nature Medicine 2020; therefore, please refer to this publication when using the data. Some details on the abbreviated column headings:
- image_id: This is the ISIC image_id of an image at the time of the study. There should be no duplications in the combination image_id & interaction_modality. As not every image was shown with every interaction modality, not every combination is present.
- prob_m_dx_akiec, ...: m is "machine probabilities". Values are values after softmax, and "_mal" is all malignant classes summed.
- prob_h_dx_akiec, ...: h is "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note there is no "prob_h_mal", as this was none of the tested interaction modalities.
- user_dx_without_interaction_akiec, ...: Number of participants choosing this diagnosis without interaction.
- user_dx_with_interaction_akiec, ...: Number of participants choosing this diagnosis with interaction.

HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network as described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected or replaced via the free-hand selection tool in FIJI.
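The description of the machine-probability columns (values after softmax, with "_mal" as the sum over all malignant classes) can be illustrated with a small worked example. This is only a sketch: the class order, the set of classes treated as malignant (akiec, bcc, mel) and the function names are assumptions made here, not details confirmed by the dataset description.

```python
import math

# Assumed class order; the description does not state the column order.
CLASSES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]
# Assumption: these three classes are the ones counted as malignant.
MALIGNANT = {"akiec", "bcc", "mel"}


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def malignancy_probability(logits):
    """Sum the softmax probabilities of the classes assumed to be malignant."""
    probs = softmax(logits)
    return sum(p for name, p in zip(CLASSES, probs) if name in MALIGNANT)


# With identical scores every class gets probability 1/7,
# so the three assumed malignant classes sum to 3/7.
p_mal = malignancy_probability([0.0] * 7)
print(round(p_mal, 4))
```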
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
This dataset is derived from the ISIC Archive with the following changes:
If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable to the ISIC competition. Use at your own risk.
import os
import glob
import multiprocessing as mp
from functools import partial

from PIL import Image, ImageOps


def list_jpg_files(folder_path):
    # Ensure the folder path ends with a slash
    if not folder_path.endswith('/'):
        folder_path += '/'
    # Use glob to find all .jpg files in the specified folder (non-recursive)
    jpg_files = glob.glob(folder_path + '*.jpg')
    return jpg_files


def resize_image(image_path, destination_folder):
    # Open the image file
    with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
            # Width is larger, so we will crop the width
            new_width = int(256 * aspect_ratio)
            new_height = 256
        else:
            # Height is larger, so we will crop the height
            new_width = 256
            new_height = int(256 / aspect_ratio)
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
            img = img.crop((left, top, right, bottom))
        else:
            # Add black edges if it results in scaling up
            img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
        img.save(os.path.join(destination_folder, os.path.basename(image_path)))


# The guard keeps worker processes from re-running the script body
# on platforms that start workers by importing this module.
if __name__ == "__main__":
    source_folder = ""
    destination_folder = ""
    images = list_jpg_files(source_folder)
    with mp.Pool(processes=12) as pool:
        pool.map(partial(resize_image, destination_folder=destination_folder), images)
    print("All images resized")
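This code scales each image so that its shorter side becomes 256 pixels and then center-crops the longer side to 256, so larger images are shrunk (down-sampled) and smaller images are scaled up. The branch that adds black edges only runs when the resized image is already exactly 256x256, in which case the border width is zero, so no padding is actually added. In both scenarios, the center of the input image stays in the center of the output image.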
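The dimension arithmetic of the resizing script can be checked by hand without opening any image. The sketch below reproduces only that arithmetic; the helper names resized_dimensions and center_crop_box are introduced here for illustration, and the crop box uses integer division, whereas the script passes fractional coordinates to Pillow, so results may differ by one pixel for odd margins.

```python
def resized_dimensions(width, height, target=256):
    """Scale so the shorter side equals `target`, as the script does."""
    aspect_ratio = width / height
    if aspect_ratio > 1:
        return int(target * aspect_ratio), target
    return target, int(target / aspect_ratio)


def center_crop_box(width, height, target=256):
    """Return (left, top, right, bottom) of a centered target x target box."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


# A 600x450 landscape image is first resized to 341x256 ...
new_width, new_height = resized_dimensions(600, 450)
# ... and then cropped horizontally around the center.
box = center_crop_box(new_width, new_height)
print((new_width, new_height), box)
```

A 100x80 image goes through the same path: it is scaled up to 320x256 and then cropped, which is why no black border appears for small inputs either.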
The HDF5 file is created using the following code:
import os
import io

import h5py
import numpy as np
import pandas as pd
from PIL import Image

# File paths
base_folder = "./isic-2018-task-12-256x256"
csv_file_path = 'train-metadata.csv'
image_folder_path = 'train-image/image'
hdf5_file_path = 'train-image.hdf5'

# Read the CSV file
df = pd.read_csv(os.path.join(base_folder, csv_file_path))

# Open an HDF5 file
with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
    for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        if os.path.exists(image_file_path):
            # Open the image file
            with Image.open(image_file_path) as img:
                # Convert the image to a byte buffer
                img_byte_arr = io.BytesIO()
                img.save(img_byte_arr, format=img.format)
                img_byte_arr = img_byte_arr.getvalue()
                # Store the encoded bytes as an opaque dataset named after the image ID
                hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
            print(f"Image file for {isic_id} not found.")

print("HDF5 file created successfully.")
To read the hdf5 file, use the following code:
import h5py
from PIL import Image...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Abstract
The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
Citation
If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.
Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.
The corresponding BibTeX entry is:
@article{abhishek2024investigating,
  title={Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
  author={Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
  journal={arXiv preprint arXiv:2401.14497},
  doi = {10.48550/ARXIV.2401.14497},
  url = {https://arxiv.org/abs/2401.14497},
  year={2024}
}
Project Website
The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.
Code
The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.
License
The metadata files (DermaMNIST-C.csv, DermaMNIST-E.csv, Fitzpatrick17k_DiagnosisMapping.xlsx, Fitzpatrick17k-C.csv) contained in this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The NPZ files associated with DermaMNIST-C (dermamnist_corrected_28.npz, dermamnist_corrected_224.npz) and DermaMNIST-E (dermamnist_extended_28.npz, dermamnist_extended_224.npz) contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.
The code hosted on GitHub is licensed under the Apache License 2.0.