4 datasets found

Skin Cancer - The HAM10000 dataset

kaggle.com

Updated Jul 1, 2024

Élio Cordeiro Pereira (2024). Skin Cancer - The HAM10000 dataset [Dataset]. https://www.kaggle.com/datasets/eliocordeiropereira/skin-cancer-the-ham10000-dataset

Dataset updated

Jul 1, 2024

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Élio Cordeiro Pereira

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The Original Dataset

The source dataset and its full description may be accessed through the Harvard Dataverse, and should be cited as

Tschandl, Philipp, 2018, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions", https://doi.org/10.7910/DVN/DBW86T, Harvard Dataverse, V4, UNF:6:KCZFcBLiFE5ObWcTc2ZBOA== [fileUNF]

The Current Dataset

Note that the dataset uploaded here does not contain all of the source material: it omits the file ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.tab, which contains data from a study on human-computer collaboration, and the folder HAM10000_segmentations_lesion_tschandl, which contains binary segmentation masks of the training images. Still, in contrast to most of the HAM10000 datasets published on Kaggle, the current one includes the test dataset that was curated for the ISIC 2018 challenge (Task 3).

Description

Files and folders

The uploaded dataset comprises 3 folders and 2 files, described in the table below.

Content	Type	Description
`HAM10000_images_part_1`	folder	Part 1 of a set of training pictures
`HAM10000_images_part_2`	folder	Part 2 of a set of training pictures
`ISIC2018_Task3_Test_Images`	folder	Set of test pictures
`HAM10000_metadata.csv`	file	Metadata associated with the training data
`ISIC2018_Task3_Test_GroundTruth.csv`	file	Metadata associated with the test data

The training dataset (HAM10000_images_part_1 and HAM10000_images_part_2) is called "HAM10000", short for "Human Against Machine with 10000 training images" (actually 10015 images), and is a large collection of multi-source dermatoscopic RGB images (JPG) of common pigmented skin lesions. The test dataset (ISIC2018_Task3_Test_Images) contains 1511 images. The files HAM10000_metadata.csv and ISIC2018_Task3_Test_GroundTruth.csv contain the respective metadata (data about the data), which includes the labels along with other features.

Columns of the metadata files

The structure of the metadata files follows the template presented in the table below.

Column	Type	Description
`lesion_id`	String	ID of the lesion case
`image_id`	String	ID of an image (also the name of the respective JPG file) associated with that case
`dx`	String	Label of that case
`dx_type`	String	Method used for diagnosing that case
`age`	Float	Age of the person associated with that case
`sex`	String	Sex of the person associated with that case
`localization`	String	Location of the lesion on the person's body
`dataset`	String	Reference from which the data was taken
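
Because `image_id` doubles as the JPG file name and the training pictures are split over two folders, a lookup has to try both folders. The following is a minimal sketch of that idea, assuming the folder names from the files table above; the helper name `find_image` and the `root` argument (wherever the archive was extracted) are hypothetical and not part of the dataset.

```python
from pathlib import Path
from typing import Optional

# Folder names are taken from the files table; the helper itself is hypothetical.
TRAIN_FOLDERS = ("HAM10000_images_part_1", "HAM10000_images_part_2")


def find_image(root: Path, image_id: str, folders=TRAIN_FOLDERS) -> Optional[Path]:
    """Return the path of <image_id>.jpg in the first folder that contains it."""
    for folder in folders:
        candidate = root / folder / f"{image_id}.jpg"
        if candidate.is_file():
            return candidate
    return None  # the image is not present in any of the listed folders
```

For the test images, the same helper could be called with `folders=("ISIC2018_Task3_Test_Images",)`, again assuming the extracted layout matches the table.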

Values of the metadata `dx` column (the classes)

The values that the column dx may take are tabulated below.

Value	Description
`akiec`	Actinic keratoses and intraepithelial carcinoma (also called "Bowen's disease") - an early form of skin cancer
`bcc`	Basal cell carcinoma - the most common type of skin cancer
`bkl`	Benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses) - common and benign
`df`	Dermatofibroma - common and benign
`mel`	Melanoma - a type of skin cancer involving the melanin cells
`nv`	Melanocytic nevus - the medical term for a mole (benign)
`vasc`	Vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage) (benign)
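
For model training, the seven `dx` codes usually need to be turned into integer class indices. The sketch below shows one way to do that; the alphabetical index order is an arbitrary choice made for this example, and the cancer/benign grouping simply follows the wording of the table above (it is not a column of the dataset).

```python
# The seven dx codes from the table above, in alphabetical order.
DX_CLASSES = ("akiec", "bcc", "bkl", "df", "mel", "nv", "vasc")

# Codes the table describes as a form of skin cancer; the rest are described as benign.
CANCER_CODES = frozenset({"akiec", "bcc", "mel"})

# ASSUMPTION: the integer assignment is arbitrary and not defined by the dataset.
DX_TO_INDEX = {code: index for index, code in enumerate(DX_CLASSES)}


def encode_dx(dx: str) -> int:
    """Map a dx code to its class index; raises KeyError for unknown codes."""
    return DX_TO_INDEX[dx]


def is_cancer(dx: str) -> bool:
    """True when the table above describes the class as a form of skin cancer."""
    if dx not in DX_TO_INDEX:
        raise KeyError(dx)
    return dx in CANCER_CODES
```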

Values of the metadata `dx_type` column (the diagnosis methods)

The table below presents the values of the column dx_type.

Value	Description
`histo`	Histopathology
`follow_up`	Follow-up examination
`consensus`	Expert consensus
`confocal`	In-vivo confocal microscopy

Data from: The HAM10000 dataset, a large collection of multi-source...
dataverse.harvard.edu
opendatalab.com
Updated Feb 7, 2023
Philipp Tschandl (2023). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions [Dataset]. http://doi.org/10.7910/DVN/DBW86T
Unique identifier
https://doi.org/10.7910/DVN/DBW86T
Dataset updated
Feb 7, 2023
Dataset provided by
Harvard Dataverse
Authors
Philipp Tschandl
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.7910/DVN/DBW86T
Description
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes.

Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file.

Due to upload size limitations, images are stored in two files:

HAM10000_images_part1.zip (5000 JPEG files)
HAM10000_images_part2.zip (5015 JPEG files)

Additional data for evaluation purposes

The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3), with the same sources contributing the majority of the validation and test sets as well. The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the ground truth, in the same format as the HAM10000 data (public since 2023), is available as ISIC2018_Task3_Test_GroundTruth.csv. The ISIC-Archive also provides the challenge images and metadata (training, validation, test) at their "ISIC Challenge Datasets" page.

Comparison to physicians

Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019

Human-computer collaboration

The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020

The following corresponding metadata is available herein:

ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. This data was collected for and analyzed in Tschandl P. et al., Nature Medicine 2020; please refer to this publication when using the data. Some details on the abbreviated column headings:

image_id: The ISIC image_id of an image at the time of the study. There should be no duplications in the combination image_id & interaction_modality. As not every image was shown with every interaction modality, not every combination is present.
prob_m_dx_akiec, ...: m is "machine probabilities". Values are taken after softmax, and "_mal" is the sum over all malignant classes.
prob_h_dx_akiec, ...: h is "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note there is no "prob_h_mal", as this was not one of the tested interaction modalities.
user_dx_without_interaction_akiec, ...: Number of participants choosing this diagnosis without interaction.
user_dx_with_interaction_akiec, ...: Number of participants choosing this diagnosis with interaction.

HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network as described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected or replaced via the free-hand selection tool in FIJI.
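
Since one lesion can appear in several images, a random split at the image level may place pictures of the same lesion on both sides of a train/validation split. The sketch below groups metadata rows by `lesion_id` before splitting. It is an illustration only: the function name, the 20% validation fraction and the fixed seed are assumptions, and the rows are expected to be dictionaries with a `lesion_id` key (for example as produced by `csv.DictReader` on the metadata file).

```python
import random
from collections import defaultdict


def split_by_lesion(rows, val_fraction=0.2, seed=0):
    """Split metadata rows so all images of one lesion land on the same side."""
    by_lesion = defaultdict(list)
    for row in rows:
        by_lesion[row["lesion_id"]].append(row)

    # Sort before shuffling so the result depends only on the seed.
    lesion_ids = sorted(by_lesion)
    random.Random(seed).shuffle(lesion_ids)

    n_val = round(len(lesion_ids) * val_fraction)
    val = [row for lid in lesion_ids[:n_val] for row in by_lesion[lid]]
    train = [row for lid in lesion_ids[n_val:] for row in by_lesion[lid]]
    return train, val
```

Note that the fraction applies to lesions, not images, so the image-level proportions can deviate slightly when lesions have different numbers of pictures.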

HAM10000 - 256x256

kaggle.com

zip

Updated Aug 7, 2024

Mehran Ziadloo (2024). HAM10000 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/ham10000-256x256/discussion

Available download format: zip (164309307 bytes)

Dataset updated

Aug 7, 2024

Authors

Mehran Ziadloo

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

This dataset is derived from the ISIC Archive with the following changes:

A new integer column named "target" is added, with values 0, 1, or null. This column is populated from two other columns: "benign_malignant" and "diagnosis". If "benign_malignant" explicitly states that the record is "benign" or "malignant", the target is set to 0 or 1, respectively. If "benign_malignant" is null, the value of the "diagnosis" column determines the target. The following diagnosis values are considered cancerous, so "target" is set to 1:

squamous cell carcinoma
basal cell carcinoma
melanoma

If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
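
The rule set above can be sketched as a small function. This is an illustration, not the uploader's actual code: the function name is hypothetical, and the description does not say what "target" becomes for a null "benign_malignant" combined with a diagnosis other than the ones listed, so this sketch returns None in that case and marks it as an assumption.

```python
from typing import Optional

# Diagnoses the description lists as cancerous.
CANCEROUS_DIAGNOSES = frozenset({
    "squamous cell carcinoma",
    "basal cell carcinoma",
    "melanoma",
})


def derive_target(benign_malignant: Optional[str],
                  diagnosis: Optional[str]) -> Optional[int]:
    """Return 0, 1 or None following the rules described above."""
    if benign_malignant == "benign":
        return 0
    if benign_malignant == "malignant":
        return 1
    if benign_malignant is None:
        if diagnosis in CANCEROUS_DIAGNOSES:
            return 1
        # "vascular lesion" is explicitly mapped to null by the description.
        # ASSUMPTION: every other diagnosis is also left as None here, because
        # the description does not state which value it receives.
        return None
    # ASSUMPTION: any other benign_malignant value is treated as undetermined.
    return None
```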

DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable to the ISIC competition. Use at your own risk.

All the images are resized to 256x256 using the following Python code:

import os
import multiprocessing as mp
from PIL import Image, ImageOps
import glob
from functools import partial


def list_jpg_files(folder_path):
  # Ensure the folder path ends with a slash
  if not folder_path.endswith('/'):
    folder_path += '/'

  # Use glob to find all .jpg files in the specified folder (non-recursive)
  jpg_files = glob.glob(folder_path + '*.jpg')

  return jpg_files



def resize_image(image_path, destination_folder):
  # Open the image file
  with Image.open(image_path) as img:
    # Get the original dimensions
    original_width, original_height = img.size

    # Calculate the aspect ratio
    aspect_ratio = original_width / original_height

    # Determine the new dimensions based on the aspect ratio
    if aspect_ratio > 1:
      # Width is larger, so we will crop the width
      new_width = int(256 * aspect_ratio)
      new_height = 256
    else:
      # Height is larger, so we will crop the height
      new_width = 256
      new_height = int(256 / aspect_ratio)

    # Resize the image while maintaining the aspect ratio
    img = img.resize((new_width, new_height))

    # Calculate an integer crop box that keeps the image centered
    left = (new_width - 256) // 2
    top = (new_height - 256) // 2
    right = left + 256
    bottom = top + 256

    # The shorter side is now exactly 256, so crop whenever the other side is longer
    if new_width > 256 or new_height > 256:
      img = img.crop((left, top, right, bottom))
    else:
      # Only reached when the image is already 256x256; left and top are 0 here,
      # so this call adds no border
      img = ImageOps.expand(img, border=(left, top, left, top), fill='black')

    # Make sure the output is exactly 256x256
    img = img.resize((256, 256))

  img.save(os.path.join(destination_folder, os.path.basename(image_path)))


source_folder = ""
destination_folder = ""

# The __main__ guard is needed for multiprocessing on platforms that spawn worker processes
if __name__ == "__main__":
  images = list_jpg_files(source_folder)

  with mp.Pool(processes=12) as pool:
    pool.map(partial(resize_image, destination_folder=destination_folder), images)
  print("All images resized")

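This code rescales every image, shrinking (down-sampling) or enlarging it as needed, so that its shorter side becomes 256 pixels, and then crops the longer side to 256 pixels around the center. The branch that adds black edges only runs when the rescaled image is already exactly 256x256, where the computed border width is zero, so as written no black edges are added. In both scenarios, the center of the input image stays in the center of the output image.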
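
To make the arithmetic of the resize step easier to check, the sketch below reproduces only the dimension and crop-box calculation, without touching any image. The function name is hypothetical, the crop box uses integer division (a choice of this sketch), and the 600x450 input in the usage lines is just an illustrative size.

```python
def resize_geometry(width: int, height: int, size: int = 256):
    """Return ((new_width, new_height), crop_box) for a center crop to size x size."""
    aspect_ratio = width / height
    if aspect_ratio > 1:
        # Landscape: the height becomes `size`, the width is scaled up proportionally
        new_width, new_height = int(size * aspect_ratio), size
    else:
        # Portrait or square: the width becomes `size`
        new_width, new_height = size, int(size / aspect_ratio)
    left = (new_width - size) // 2
    top = (new_height - size) // 2
    return (new_width, new_height), (left, top, left + size, top + size)


# Illustrative landscape input: rescaled to 341x256, then cropped horizontally.
print(resize_geometry(600, 450))  # ((341, 256), (42, 0, 298, 256))
```

Because the shorter side always ends up at exactly `size`, the crop box never extends beyond the rescaled image.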

The HDF5 file is created using the following code:

import os
import pandas as pd
from PIL import Image
import h5py
import io
import numpy as np

# File paths
base_folder = "./isic-2018-task-12-256x256"
csv_file_path = 'train-metadata.csv'
image_folder_path = 'train-image/image'
hdf5_file_path = 'train-image.hdf5'

# Read the CSV file
df = pd.read_csv(os.path.join(base_folder, csv_file_path))

# Open an HDF5 file
with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
  for index, row in df.iterrows():
    isic_id = row['isic_id']
    image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
    
    if os.path.exists(image_file_path):
      # Open the image file
      with Image.open(image_file_path) as img:
        # Convert the image to a byte buffer
        img_byte_arr = io.BytesIO()
        img.save(img_byte_arr, format=img.format)
        img_byte_arr = img_byte_arr.getvalue()
        hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
    else:
      print(f"Image file for {isic_id} not found.")

print("HDF5 file created successfully.")

To read the hdf5 file, use the following code:

import h5py
from PIL import Image...

Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological...
data.niaid.nih.gov
Updated Jul 14, 2024
Abhishek, Kumar; Jain, Aditi; Hamarneh, Ghassan (2024). Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11101337
Dataset updated
Jul 14, 2024
Dataset provided by
Simon Fraser University
Indian Institute of Technology Delhi
Authors
Abhishek, Kumar; Jain, Aditi; Hamarneh, Ghassan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.

Citation

If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.

Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.

The corresponding BibTeX entry is:

@article{abhishek2024investigating,
  title={Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
  author={Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
  journal={arXiv preprint arXiv:2401.14497},
  doi={10.48550/ARXIV.2401.14497},
  url={https://arxiv.org/abs/2401.14497},
  year={2024}
}

Project Website

The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.

Code

The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.

License

The metadata files (DermaMNIST-C.csv, DermaMNIST-E.csv, Fitzpatrick17k_DiagnosisMapping.xlsx, Fitzpatrick17k-C.csv) contained in this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

The NPZ files associated with DermaMNIST-C (dermamnist_corrected_28.npz, dermamnist_corrected_224.npz) and DermaMNIST-E (dermamnist_extended_28.npz, dermamnist_extended_224.npz) contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

The code hosted on GitHub is licensed under the Apache License 2.0.
Skin Cancer - The HAM10000 dataset

Multi-source dermatoscopic images of common pigmented skin lesions

9 scholarly articles cite this dataset (View in Google Scholar)
