Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.
This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
Abstract: The data collected at baseline include breast ultrasound images among women between 25 and 75 years old. The data were organized in 2018 and cover 600 female patients. The dataset consists of 780 images with an average image size of 500×500 pixels. The images are in PNG format. The ground truth images are presented with the original images. The images are categorized into three classes: normal, benign, and malignant.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset originates from a deep learning model trained on the "Coimbra Breast Cancer" dataset, with feature distributions closely resembling the original. The original data includes clinical observations from 64 patients with breast cancer and 52 healthy controls, encompassing 10 quantitative predictors and a binary dependent variable indicating the presence or absence of breast cancer.
Quantitative Attributes:
Age (years): Represents the age of individuals in the dataset.
BMI (kg/m²): Body Mass Index, a measure of body fat based on weight and height.
Glucose (mg/dL): Reflects blood glucose levels, a vital metabolic indicator.
Insulin (µU/mL): Indicates insulin levels, a hormone associated with glucose regulation.
HOMA: Homeostatic Model Assessment, a method assessing insulin resistance and beta-cell function.
Leptin (ng/mL): Represents leptin levels, a hormone involved in appetite and energy balance regulation.
Adiponectin (µg/mL): Reflects adiponectin levels, a protein associated with metabolic regulation.
Resistin (ng/mL): Indicates resistin levels, a protein implicated in insulin resistance.
MCP-1 (pg/dL): Reflects Monocyte Chemoattractant Protein-1 levels, a cytokine involved in inflammation.
Labels:
1: Healthy controls
2: Patients with breast cancer
These quantitative attributes, including anthropometric data and parameters gathered from routine blood analysis, serve as the foundation for potential biomarkers of breast cancer. The dataset presents an opportunity for developing accurate prediction models, aiding in the identification and understanding of factors associated with breast cancer.
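The attribute list names HOMA next to glucose and insulin but does not say how the index was computed. As a rough illustration only, the sketch below assumes the commonly used HOMA-IR approximation (fasting glucose in mg/dL times fasting insulin in µU/mL, divided by 405). The function name `homa_ir` and the sample values are hypothetical and are not taken from the dataset.

```python
def homa_ir(glucose_mg_dl: float, insulin_uU_ml: float) -> float:
    """Approximate HOMA-IR from fasting glucose (mg/dL) and insulin (µU/mL).

    Assumption: the standard approximation with the constant 405, which
    corresponds to 22.5 when glucose is expressed in mmol/L instead.
    """
    return glucose_mg_dl * insulin_uU_ml / 405.0


# Hypothetical example values, not rows from the dataset.
example = homa_ir(glucose_mg_dl=90.0, insulin_uU_ml=9.0)
print(round(example, 2))  # 2.0
```

Whether the dataset's HOMA column follows exactly this formula should be checked against the values in the file before relying on it.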
https://choosealicense.com/licenses/cc/
Data Source
https://www.kaggle.com/datasets/andrewmvd/breast-cancer-cell-segmentation
Dataset Card Authors
Mahadi Hassan
Dataset Card Contact
mahadise01@gmail.com
LinkedIn: https://www.linkedin.com/in/mahadise01
GitHub: https://github.com/Mahadih534
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Breast Cancer Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yasserh/breast-cancer-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.
The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.
This dataset was sourced from Kaggle.
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
The multispectral breast cancer image datasets span three complementary imaging modalities: Ultrasound, Histopathological, and Chest X-ray. Each dataset includes balanced classes of benign and malignant cases, and the images are enhanced through spectral conversion (RGB, HSV, Jet) to support robust multispectral analysis for classification and fusion tasks.
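The description mentions spectral conversion between RGB and HSV but gives no implementation details. The following is a minimal sketch of a per-pixel RGB-to-HSV conversion using only Python's standard `colorsys` module. The helper name `rgb_image_to_hsv` and the tiny sample image are made up for illustration, and the Jet colormap step is not covered here.

```python
import colorsys


def rgb_image_to_hsv(pixels):
    """Convert rows of 8-bit (R, G, B) tuples to (H, S, V) floats in [0, 1]."""
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in pixels
    ]


# A made-up 1x3 image: pure red, pure green, mid gray.
tiny_image = [[(255, 0, 0), (0, 255, 0), (128, 128, 128)]]
hsv = rgb_image_to_hsv(tiny_image)
print(hsv[0][0])  # (0.0, 1.0, 1.0)
```

For real images an array library would normally be used instead of nested lists. The nested-list form is chosen here only to keep the example dependency-free.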
MSI Ultrasound Breast Images for Breast Cancer
This dataset contains ultrasound images of breast tissue labeled as either benign or malignant.
Total Images: 806
Benign: 406 images
Malignant: 400 images
Augmentation: Data augmentation techniques such as rotation and sharpening were applied to enhance the diversity and volume of the dataset, enabling robust training of machine learning models.
MSI BreastHis – Breast Cancer Histopathological Images
This dataset comprises high-resolution microscopic images of breast tumor tissue collected for histopathological analysis. These images provide cellular-level detail and are essential for determining cancer grade and type.
Total Images Used: 1,246 (Subset of the full BreakHis dataset)
Benign: 623 images
Malignant: 623 images
MSI Chest X-Ray for Breast Cancer
This dataset consists of colorized chest X-ray images used for identifying breast cancer-related anomalies. While traditionally not the primary modality for breast cancer detection, chest X-rays can provide useful structural insights when used in conjunction with other imaging types.
Total Images: 1,000
Benign: 500 images
Malignant: 500 images
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Breast Cancer Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jainilcoder/breast-cancer-dataset on 13 February 2022.
--- Dataset description provided by original source is as follows ---
The dataset contains 32 columns and 570 rows covering all the parameters used for detecting breast cancer.
The task is to predict whether the cancer is benign or malignant. You can also perform exploratory data analysis and visualize the data for practice.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Breast Cancer Diagnostic Dataset (BCD)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/devraikwar/breast-cancer-diagnostic on 14 February 2022.
--- Dataset description provided by original source is as follows ---
The resources for this dataset can be found at https://www.openml.org/d/13 and https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
Class: no-recurrence-events, recurrence-events
age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
menopause: lt40, ge40, premeno
tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
node-caps: yes, no
deg-malig: 1, 2, 3
breast: left, right
breast-quad: left-up, left-low, right-up, right-low, central
irradiat: yes, no
Missing Attribute Values (denoted by “?”):
Attribute 6: 8 instances with missing values
Attribute 9: 1 instance with missing values
Class Distribution:
no-recurrence-events: 201 instances
recurrence-events: 85 instances
Original data https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
With the attributes described above, can you predict whether a patient has a recurrence event?
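Since missing values in this dataset are marked with “?”, a first step before modelling is usually to locate them. The sketch below counts “?” entries per attribute. The comma-separated layout, the column order, and the three sample rows are assumptions made for illustration and are not actual records from the dataset.

```python
import csv
import io

# Assumed column order: the class followed by the nine attributes listed above.
COLUMNS = ["class", "age", "menopause", "tumor-size", "inv-nodes",
           "node-caps", "deg-malig", "breast", "breast-quad", "irradiat"]

# Made-up rows in an assumed comma-separated layout.
SAMPLE = """\
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left-low,no
recurrence-events,50-59,ge40,30-34,3-5,?,3,right,left-up,yes
no-recurrence-events,60-69,ge40,10-14,0-2,no,1,left,?,no
"""


def count_missing(text, columns=COLUMNS, marker="?"):
    """Return {column: count} for columns with at least one missing marker."""
    counts = {name: 0 for name in columns}
    for row in csv.reader(io.StringIO(text)):
        for name, value in zip(columns, row):
            if value.strip() == marker:
                counts[name] += 1
    return {name: n for name, n in counts.items() if n > 0}


print(count_missing(SAMPLE))  # {'node-caps': 1, 'breast-quad': 1}
```

Rows flagged this way can then be dropped or imputed, depending on the model being trained.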
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Breast Cancer Wisconsin (Diagnostic) Data Set’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
All feature values are recoded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
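The field numbering above (field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius) implies that the ten means come first, followed by the ten standard errors and then the ten "worst" values. A small sketch of that mapping follows. The helper `field_name` and the exact label strings are illustrative and are not column names from the original file.

```python
FEATURES = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave points", "symmetry",
            "fractal dimension"]
STATS = ["mean", "SE", "worst"]  # order implied by fields 3, 13 and 23


def field_name(field: int) -> str:
    """Map a 1-based field number (1..32) to a descriptive label."""
    if field == 1:
        return "ID number"
    if field == 2:
        return "diagnosis"
    if not 3 <= field <= 32:
        raise ValueError("field must be between 1 and 32")
    # Fields 3-12 are means, 13-22 standard errors, 23-32 worst values.
    stat_index, feature_index = divmod(field - 3, len(FEATURES))
    return f"{FEATURES[feature_index]} {STATS[stat_index]}"


for f in (3, 13, 23):
    print(f, field_name(f))
# 3 radius mean
# 13 radius SE
# 23 radius worst
```

The same arithmetic gives 10 features times 3 statistics, which is 30 feature fields, plus the ID and diagnosis for 32 fields in total.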
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The datasets used in this study were collected from the Kaggle platform. Their links are listed below:
1. BreakHis: https://www.kaggle.com/datasets/ambarish/breakhis
2. Breast Ultrasound Images Dataset (BUSI): https://www.kaggle.com/datasets/sabahesaraki/breast-ultrasound-images-dataset
3. CBIS-DDSM: https://www.kaggle.com/datasets/seanbaek19/cbis-ddsm-4096
4. INbreast: https://www.kaggle.com/datasets/eoussama/breast-cancer-mammograms/data
5. Combined Dataset: https://www.kaggle.com/datasets/rezaullhaque/combined-dataset/data
The total data size is 26 GB.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. 
Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
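The preprocessing steps described above (label encoding, z-score standardization, and a stratified 80/20 split) can be sketched as follows. This is not the accompanying script, which is not reproduced here. It is a dependency-free approximation that uses only the standard library in place of the scikit-learn utilities, and the function names and toy data are hypothetical.

```python
import random
import statistics


def encode_labels(labels):
    """Label encoding as described: malignant ('M') -> 1, benign ('B') -> 0."""
    return [1 if label == "M" else 0 for label in labels]


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) variance.

    For brevity the statistics use all rows; to avoid leakage they would
    normally be estimated on the training split only.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


# Toy data standing in for the 569 x 30 feature matrix.
raw_labels = ["M"] * 5 + ["B"] * 10
features = [[float(i), float(i % 3)] for i in range(15)]

y = encode_labels(raw_labels)
X = zscore_columns(features)
train, test = stratified_split(y)
print(len(train), len(test))  # 12 3
```

With the split in hand, any of the four classifiers named above could be fitted on the training rows and scored on the test rows.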
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Breast Cancer Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shubamsumbria/breast-cancer-prediction on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Data Set Information: Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei. The separating plane described above was obtained using Multi-surface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97–101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1–4 features and 1–3 separating planes. The actual linear program used to get the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23–34].
Attribute Information:
ID number
Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus (3–32):
a) radius (mean of distances from the center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter² / area − 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension (“coastline approximation” − 1)
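As a worked example of the compactness definition in the list above (perimeter squared divided by area, minus 1.0), the sketch below evaluates it for two simple shapes. The function name and the shapes are illustrative and are not part of the dataset's own tooling.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# A circle of radius r gives 4*pi - 1 regardless of r.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
print(round(circle, 3))  # 11.566

# A square of side 2 (perimeter 8, area 4) is less compact than a circle.
square = compactness(8.0, 4.0)
print(square)  # 15.0
```

Because the circle minimizes perimeter for a given area, 4π − 1 is the smallest value this measure can take, and more irregular outlines score higher.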
Cite at: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
--- Original source retains full ownership of the source dataset ---
This dataset was created by Cady Wang
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
This dataset was created by Shivam_ay
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset contains mammogram images and can be used for research and education purposes only. It includes DCM images, TIFF images, a radiology report, a segmented mask, pixel-level annotations of abnormal regions, and a CSV file containing other metadata.
This dataset was created by Nitesh Sahu☑️
This dataset was created by song
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Anticancer peptides Data Set’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/anuragupadhyaya/anticancer-peptides-data-set on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Membranolytic anticancer peptides (ACPs) are drawing increasing attention as potential future therapeutics against cancer, due to their ability to hinder the development of cellular resistance and their potential to overcome common hurdles of chemotherapy, e.g., side effects and cytotoxicity. This dataset contains information on peptides (annotated for their one-letter amino acid code) and their anticancer activity on breast and lung cancer cell lines.
Two peptide datasets targeting breast and lung cancer cells were assembled and curated manually from CancerPPD. EC50, IC50, LD50 and LC50 annotations on breast and lung cancer cells were retained (breast cell lines: MCF7 = 57%, MDA-MB-361 = 11%, MT-1 = 9%; lung cell lines: H-1299 = 45%, A-549 = 17.7%); mg ml−1 values were converted to μM units. Linear and l-chiral peptides were retained, while cyclic, mixed or d-chiral peptides were discarded. In the presence of both amidated and non-amidated data for the same sequence, only the value referred to the amidated peptide was retained.
Peptides were split into three classes for model training:
(1) very active (EC/IC/LD/LC50 ≤ 5 μM)
(2) moderately active (EC/IC/LD/LC50 values up to 50 μM)
(3) inactive (EC/IC/LD/LC50 > 50 μM)
Duplicates with conflicting class annotations were compared manually to the original sources, and, if necessary, corrected. If multiple class annotations were present for the same sequence, the most frequently represented class was chosen; in case of ties, the less active class was chosen. Since the CancerPPD is biased towards the annotation of active peptides, we built a set of presumably inactive peptides by randomly extracting 750 alpha-helical sequences from crystal structures deposited in the Protein Data Bank (7–30 amino acids). The final training sets contained 949 peptides for Breast cancer and 901 peptides for Lung cancer.
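The curation rules described above (unit conversion to μM, the three activity thresholds, and the majority vote with ties going to the less active class) can be sketched as follows. This is an illustration under assumptions and not the authors' code: the function names are hypothetical, and the conversion assumes the peptide's molecular weight in g/mol is known.

```python
from collections import Counter

# Ordered from most to least active, so a higher index means less active.
CLASSES = ["very active", "moderately active", "inactive"]


def mg_per_ml_to_um(conc_mg_ml: float, mol_weight_g_mol: float) -> float:
    """Convert mg/mL to μM: (g/L) / (g/mol) gives mol/L, times 1e6 gives μM."""
    return conc_mg_ml / mol_weight_g_mol * 1e6


def activity_class(value_um: float) -> str:
    """Apply the thresholds: <= 5 μM, up to 50 μM, and > 50 μM."""
    if value_um <= 5:
        return CLASSES[0]
    if value_um <= 50:
        return CLASSES[1]
    return CLASSES[2]


def resolve(annotations):
    """Pick the most frequent class; on a tie, pick the less active one."""
    counts = Counter(annotations)
    top = max(counts.values())
    tied = [cls for cls in counts if counts[cls] == top]
    return max(tied, key=CLASSES.index)


print(activity_class(5.0))    # very active
print(activity_class(50.0))   # moderately active
print(activity_class(50.1))   # inactive
print(resolve(["very active", "inactive"]))  # inactive
```

The manual comparison against original sources mentioned above is a separate step that code like this cannot replace.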
--- Original source retains full ownership of the source dataset ---
This dataset was created by Himanshu Madan