4 datasets found
  1. Skin Cancer - The HAM10000 dataset

    • kaggle.com
    Updated Jul 1, 2024
    Élio Cordeiro Pereira (2024). Skin Cancer - The HAM10000 dataset [Dataset]. https://www.kaggle.com/datasets/eliocordeiropereira/skin-cancer-the-ham10000-dataset
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Élio Cordeiro Pereira
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The Original Dataset

    The source dataset and its full description may be accessed through the Harvard Dataverse, and should be cited as

    Tschandl, Philipp, 2018, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions", https://doi.org/10.7910/DVN/DBW86T, Harvard Dataverse, V4, UNF:6:KCZFcBLiFE5ObWcTc2ZBOA== [fileUNF]

    The Current Dataset

    Note that the dataset uploaded here does not contain all of the source material. It omits the file ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.tab (which contains data on a study involving human-computer collaboration) and the folder HAM10000_segmentations_lesion_tschandl (which contains binary segmentation masks of the training images). Still, in contrast to most of the HAM10000 datasets published on Kaggle, the current one includes the test dataset that was curated for the ISIC 2018 challenge (Task 3).

    Description

    Files and folders

    The uploaded dataset comprises 3 folders and 2 files, described in the table below.

    Content                              Type    Description
    HAM10000_images_part_1               folder  Part 1 of a set of training pictures
    HAM10000_images_part_2               folder  Part 2 of a set of training pictures
    ISIC2018_Task3_Test_Images           folder  Set of test pictures
    HAM10000_metadata.csv                file    Metadata associated with the training data
    ISIC2018_Task3_Test_GroundTruth.csv  file    Metadata associated with the test data



    The training dataset (HAM10000_images_part_1 and HAM10000_images_part_2) is called "HAM10000", meaning "Human Against Machine with 10000 training images" (actually 10015 images), and corresponds to a large collection of multi-source dermatoscopic RGB images (JPG) of common pigmented skin lesions. The test dataset (ISIC2018_Task3_Test_Images) contains 1511 images. The files HAM10000_metadata.csv and ISIC2018_Task3_Test_GroundTruth.csv contain the respective metadata (data about the data), which include further features as well as the labels.
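Because the training images are split across two folders, a lookup from image ID to file path can be convenient. The following is a minimal sketch, assuming the JPG files sit directly inside the two folders named above; the function name `index_training_images` and the `root` argument are hypothetical and not part of the dataset.

```python
from pathlib import Path

# Folder names as listed in the files-and-folders table.
TRAIN_FOLDERS = ("HAM10000_images_part_1", "HAM10000_images_part_2")


def index_training_images(root):
    """Map each image ID (file name without .jpg) to its path on disk.

    Assumption: `root` is the directory the dataset was extracted into,
    and the images are not nested in further subfolders.
    """
    index = {}
    for folder in TRAIN_FOLDERS:
        for path in sorted(Path(root, folder).glob("*.jpg")):
            index[path.stem] = path
    return index
```

If the archive was extracted with an extra level of nesting, the glob pattern would need to be adjusted accordingly.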

    Columns of the metadata files

    The structure of the metadata files follows the template presented in the table below.

    Column        Type    Description
    lesion_id     String  ID of the lesion case
    image_id      String  ID of an image (also the name of the respective JPG file) associated with that case
    dx            String  Label of that case
    dx_type       String  Method used for diagnosing that case
    age           Float   Age of the person associated with that case
    sex           String  Sex of the person associated with that case
    localization  String  Location of the lesion on the person's body
    dataset       String  Reference from which the data was taken
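The column template above can be turned into a small typed reader. This is a sketch using only the standard library; the `MetadataRow` class and `read_metadata` function are hypothetical names, the two sample rows are made up for illustration, and treating a blank `age` as missing is an assumption about how unknown ages are encoded.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetadataRow:
    lesion_id: str
    image_id: str
    dx: str
    dx_type: str
    age: Optional[float]  # assumption: a blank cell means the age is unknown
    sex: str
    localization: str
    dataset: str


def read_metadata(handle):
    """Parse rows from an open CSV file that follows the column template."""
    rows = []
    for record in csv.DictReader(handle):
        age_text = (record.get("age") or "").strip()
        rows.append(MetadataRow(
            lesion_id=record["lesion_id"],
            image_id=record["image_id"],
            dx=record["dx"],
            dx_type=record["dx_type"],
            age=float(age_text) if age_text else None,
            sex=record["sex"],
            localization=record["localization"],
            dataset=record["dataset"],
        ))
    return rows


# Two made-up rows, for illustration only (not taken from the real file).
sample = io.StringIO(
    "lesion_id,image_id,dx,dx_type,age,sex,localization,dataset\n"
    "HAM_X1,ISIC_X1,nv,histo,45.0,male,back,example\n"
    "HAM_X2,ISIC_X2,mel,consensus,,female,face,example\n"
)
rows = read_metadata(sample)
print(rows[0].age, rows[1].age)
```

For the real files, the same function could be called on `open(path, newline="")` for either CSV file.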



    Values of the metadata dx column (the classes)

    The values that the column dx may take are tabulated below.

    Value  Description
    akiec  Actinic keratoses and intraepithelial carcinoma (also called "Bowen's disease") - an early form of skin cancer
    bcc    Basal cell carcinoma - the most common type of skin cancer
    bkl    Benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses) - common and benign
    df     Dermatofibroma - common and benign
    mel    Melanoma - a type of skin cancer involving the melanin cells
    nv     Melanocytic nevus - the medical term for a mole (benign)
    vasc   Vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage) (benign)
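The class codes can be kept in a lookup table, with a coarse benign/cancer grouping derived from the wording of the descriptions. This is only a sketch: the grouping below is an assumption read off the table (classes described as a form of skin cancer versus classes described as benign), not a clinical definition, and the names `DX_NAMES`, `SKIN_CANCER_CLASSES` and `is_skin_cancer` are hypothetical.

```python
# Short names for the seven dx codes, transcribed from the table.
DX_NAMES = {
    "akiec": "Actinic keratoses and intraepithelial carcinoma",
    "bcc": "Basal cell carcinoma",
    "bkl": "Benign keratosis-like lesions",
    "df": "Dermatofibroma",
    "mel": "Melanoma",
    "nv": "Melanocytic nevus",
    "vasc": "Vascular lesions",
}

# Assumption: grouping follows the table's wording ("skin cancer" vs "benign").
SKIN_CANCER_CLASSES = frozenset({"akiec", "bcc", "mel"})


def is_skin_cancer(dx):
    """Return True if the dx code is described in the table as skin cancer."""
    if dx not in DX_NAMES:
        raise ValueError(f"unknown dx code: {dx!r}")
    return dx in SKIN_CANCER_CLASSES
```

Rejecting unknown codes explicitly helps catch typos or unexpected labels early when reading the metadata files.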



    Values of the metadata dx_type column (the diagnosis methods)

    The table below presents the values of the column dx_type.

    Value      Description
    histo      Histopathology
    follow_up  Follow-up examination
    consensus  Expert consensus
    confocal   In-vivo confocal microscopy
  2. Data from: The HAM10000 dataset, a large collection of multi-source...

    • dataverse.harvard.edu
    • opendatalab.com
    Updated Feb 7, 2023
    Philipp Tschandl (2023). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions [Dataset]. http://doi.org/10.7910/DVN/DBW86T
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Philipp Tschandl
    License

    Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.7910/DVN/DBW86T

    Description

    Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes.

    Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal).

    The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file.

    Due to upload size limitations, images are stored in two files:
    • HAM10000_images_part1.zip (5000 JPEG files)
    • HAM10000_images_part2.zip (5015 JPEG files)

    Additional data for evaluation purposes

    The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3), with the same sources contributing the majority of the validation and test sets as well. The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the ground truth, in the same format as the HAM10000 data (public since 2023), is available as ISIC2018_Task3_Test_GroundTruth.csv.

    The ISIC-Archive also provides the challenge images and metadata (training, validation, test) at their "ISIC Challenge Datasets" page.

    Comparison to physicians

    Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019

    Human-computer collaboration

    The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020

    The following corresponding metadata is available herein:

    ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. This data was collected for and analyzed in Tschandl P. et al., Nature Medicine 2020, therefore please refer to this publication when using the data. Some details on the abbreviated column headings:
    • image_id: This is the ISIC image_id of an image at the time of the study. There should be no duplications in the combination image_id & interaction_modality. As not every image was shown with every interaction modality, not every combination is present.
    • prob_m_dx_akiec, ...: m is "machine probabilities". Values are values after softmax, and "_mal" is all malignant classes summed.
    • prob_h_dx_akiec, ...: h is "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note there is no "prob_h_mal" as this was none of the tested interaction modalities.
    • user_dx_without_interaction_akiec, ...: Number of participants choosing this diagnosis without interaction.
    • user_dx_with_interaction_akiec, ...: Number of participants choosing this diagnosis with interaction.

    HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network as described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected or replaced via the free-hand selection tool in FIJI.
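Since the description notes that some lesions have several images, tracked by lesion_id, a random split at the image level could place pictures of the same lesion in both training and validation subsets. One possible way to avoid this is to split at the lesion level, sketched below. The function `split_by_lesion` and its parameters are hypothetical, and this is not described in the source as the authors' own procedure.

```python
import random
from collections import defaultdict


def split_by_lesion(pairs, val_fraction=0.2, seed=0):
    """Split (lesion_id, image_id) pairs so that no lesion spans both subsets.

    `val_fraction` is applied to the number of lesions, not images, so the
    resulting image counts are only approximately in that proportion.
    """
    by_lesion = defaultdict(list)
    for lesion_id, image_id in pairs:
        by_lesion[lesion_id].append(image_id)

    # Sort before shuffling so the result depends only on the seed.
    lesion_ids = sorted(by_lesion)
    random.Random(seed).shuffle(lesion_ids)

    n_val = int(round(len(lesion_ids) * val_fraction))
    val = [img for lid in lesion_ids[:n_val] for img in by_lesion[lid]]
    train = [img for lid in lesion_ids[n_val:] for img in by_lesion[lid]]
    return train, val
```

The pairs would typically come from the lesion_id and image_id columns of the HAM10000_metadata file.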

  3. HAM10000 - 256x256

    • kaggle.com
    Updated Aug 7, 2024
    Mehran Ziadloo (2024). HAM10000 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/ham10000-256x256/discussion
    Available download formats: zip (164309307 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    Mehran Ziadloo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the ISIC Archive with the following changes:

    1. A new integer column named "target" is added, with values 0, 1, or null. This column is populated using two other columns: "benign_malignant" and "diagnosis". If the first column explicitly states that the record is "benign" or "malignant", the target is set to "0" or "1" respectively. If the "benign_malignant" column is null, the value of the "diagnosis" column is used to determine the value for "target". The following diagnosis values are considered cancerous and, as a result, "target" is set to "1":
    • squamous cell carcinoma
    • basal cell carcinoma
    • melanoma

    If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
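The rule for the "target" column can be written as a small function. This is a sketch of the rule as described, not the uploader's actual code; the name `derive_target` is hypothetical, and returning None for cases the description does not cover (for example a null "benign_malignant" with some other diagnosis) is an assumption.

```python
# Diagnosis values the description treats as cancerous.
CANCEROUS_DIAGNOSES = frozenset({
    "squamous cell carcinoma",
    "basal cell carcinoma",
    "melanoma",
})


def derive_target(benign_malignant, diagnosis):
    """Return 0, 1, or None following the described rule."""
    if benign_malignant == "benign":
        return 0
    if benign_malignant == "malignant":
        return 1
    if benign_malignant is None and diagnosis in CANCEROUS_DIAGNOSES:
        return 1
    # A null benign_malignant with diagnosis "vascular lesion" maps to null.
    # Assumption: every other case is also left as None, because the
    # description does not say how it is handled.
    return None
```

For example, `derive_target(None, "melanoma")` yields 1, while `derive_target(None, "vascular lesion")` yields None.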

    DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable to the ISIC competition. Use at your own risk.

    2. All the images are resized to 256x256 using the following Python code:
    import os
    import multiprocessing as mp
    from PIL import Image, ImageOps
    import glob
    from functools import partial
    
    
    def list_jpg_files(folder_path):
      # Ensure the folder path ends with a slash
      if not folder_path.endswith('/'):
        folder_path += '/'
    
      # Use glob to find all .jpg files in the specified folder (non-recursive)
      jpg_files = glob.glob(folder_path + '*.jpg')
    
      return jpg_files
    
    
    
    def resize_image(image_path, destination_folder):
      # Open the image file
      with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
    
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
    
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
          # Width is larger, so we will crop the width
          new_width = int(256 * aspect_ratio)
          new_height = 256
        else:
          # Height is larger, so we will crop the height
          new_width = 256
          new_height = int(256 / aspect_ratio)
    
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
    
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
    
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
          img = img.crop((left, top, right, bottom))
        else:
          # Add black edges if it results in scaling up
          img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
    
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
    
      img.save(os.path.join(destination_folder, os.path.basename(image_path)))
    
    
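    if __name__ == "__main__":
      # Placeholder paths: set these before running
      source_folder = ""
      destination_folder = ""
    
      images = list_jpg_files(source_folder)
    
      # The __main__ guard lets worker processes import this module safely
      # on platforms that start them with "spawn" (e.g. Windows, macOS)
      with mp.Pool(processes=12) as pool:
        pool.map(partial(resize_image, destination_folder=destination_folder), images)
      print("All images resized")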
    

    This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.
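To make the geometry of the resize step concrete, the arithmetic from the listing above can be isolated into a pure function that needs no image library. This is a sketch that mirrors the listing's formulas; the function name `resize_and_crop_box` is hypothetical, and the 600x450 input is just an example size.

```python
def resize_and_crop_box(width, height, size=256):
    """Return the intermediate resize dimensions and the centered crop box.

    Mirrors the arithmetic of the resize listing: the shorter side is scaled
    to `size`, then a `size` x `size` window is taken around the center.
    """
    aspect_ratio = width / height
    if aspect_ratio > 1:
        # Landscape: height becomes `size`, width keeps the aspect ratio.
        new_width, new_height = int(size * aspect_ratio), size
    else:
        # Portrait or square: width becomes `size`.
        new_width, new_height = size, int(size / aspect_ratio)

    left = (new_width - size) / 2
    top = (new_height - size) / 2
    return (new_width, new_height), (left, top, left + size, top + size)


# Example with a 600x450 landscape input.
dims, box = resize_and_crop_box(600, 450)
print(dims, box)
```

With these formulas, the intermediate image is never smaller than 256 pixels on either side, so the centered crop is the step that determines the final framing.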

    The HDF5 file is created using the following code:

    import os
    import pandas as pd
    from PIL import Image
    import h5py
    import io
    import numpy as np
    
    # File paths
    base_folder = "./isic-2018-task-12-256x256"
    csv_file_path = 'train-metadata.csv'
    image_folder_path = 'train-image/image'
    hdf5_file_path = 'train-image.hdf5'
    
    # Read the CSV file
    df = pd.read_csv(os.path.join(base_folder, csv_file_path))
    
    # Open an HDF5 file
    with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
      for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        
        if os.path.exists(image_file_path):
          # Open the image file
          with Image.open(image_file_path) as img:
            # Convert the image to a byte buffer
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format=img.format)
            img_byte_arr = img_byte_arr.getvalue()
            hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
          print(f"Image file for {isic_id} not found.")
    
    print("HDF5 file created successfully.")
    

    To read the hdf5 file, use the following code:

    import h5py
    from PIL import Image...
    
  4. Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological...

    • data.niaid.nih.gov
    Updated Jul 14, 2024
    Abhishek, Kumar; Jain, Aditi; Hamarneh, Ghassan (2024). Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11101337
    Dataset updated
    Jul 14, 2024
    Dataset provided by
    Simon Fraser University
    Indian Institute of Technology Delhi
    Authors
    Abhishek, Kumar; Jain, Aditi; Hamarneh, Ghassan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.

    Citation

    If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.

    Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.

    The corresponding BibTeX entry is:

    @article{abhishek2024investigating,
      title={Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
      author={Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
      journal={arXiv preprint arXiv:2401.14497},
      doi={10.48550/ARXIV.2401.14497},
      url={https://arxiv.org/abs/2401.14497},
      year={2024}
    }

    Project Website

    The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.

    Code

    The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.

    License

    The metadata files (DermaMNIST-C.csv, DermaMNIST-E.csv, Fitzpatrick17k_DiagnosisMapping.xlsx, Fitzpatrick17k-C.csv) contained in this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The NPZ files associated with DermaMNIST-C (dermamnist_corrected_28.npz, dermamnist_corrected_224.npz) and DermaMNIST-E (dermamnist_extended_28.npz, dermamnist_extended_224.npz) contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

    The code hosted on GitHub is licensed under the Apache License 2.0.


Skin Cancer - The HAM10000 dataset

Multi-source dermatoscopic images of common pigmented skin lesions

9 scholarly articles cite this dataset (View in Google Scholar)