Explore the field of breast cancer diagnosis with the Wisconsin Breast Cancer dataset (Original). This dataset provides detailed attributes representing tumor characteristics observed in breast tissue samples. By analyzing these attributes, researchers and medical professionals can gain insights into tumor behavior and develop predictive models for cancer detection and prognosis.
| Features | |
|---|---|
| 1. Sample code number: Unique identifier for each tissue sample. | |
| 2. Clump Thickness: Assessment of the thickness of tumor cell clusters (1 - 10). | |
| 3. Uniformity of Cell Size: Uniformity in the size of tumor cells (1 - 10). | |
| 4. Uniformity of Cell Shape: Uniformity in the shape of tumor cells (1 - 10). | |
| 5. Marginal Adhesion: Degree of adhesion of tumor cells to surrounding tissue (1 - 10). | |
| 6. Single Epithelial Cell Size: Size of individual tumor cells (1 - 10). | |
| 7. Bare Nuclei: Presence of nuclei without surrounding cytoplasm (1 - 10). | |
| 8. Bland Chromatin: Assessment of chromatin structure in tumor cells (1 - 10). | |
| 9. Normal Nucleoli: Presence of normal-looking nucleoli in tumor cells (1 - 10). | |
| 10. Mitoses: Frequency of mitotic cell divisions (1 - 10). | |
| 11. Class: Classification of tumor type (2 for benign, 4 for malignant). | |
The Breast Cancer Wisconsin dataset is sourced from tissue samples collected for diagnostic purposes, with attributes derived from microscopic examination. The dataset is anonymized and made available for research purposes, contributing to advancements in cancer diagnosis and treatment.
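The eleven attributes listed above map naturally onto one comma-separated record per sample. The following is a minimal sketch of a parser for such a record; the column identifiers, the use of `?` as a missing-value marker, and the two example records are assumptions made for illustration and are not taken from the data file.

```python
from typing import Dict, Optional, Union

# Column order follows the feature table above (11 comma-separated fields).
# The snake_case names are hypothetical; the file itself has no header row.
COLUMNS = [
    "sample_code_number", "clump_thickness", "uniformity_of_cell_size",
    "uniformity_of_cell_shape", "marginal_adhesion",
    "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]
CLASS_LABELS = {2: "benign", 4: "malignant"}


def parse_row(line: str) -> Dict[str, Union[int, str, None]]:
    """Parse one comma-separated record into a dict keyed by COLUMNS."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row: Dict[str, Union[int, str, None]] = {}
    for name, raw in zip(COLUMNS, fields):
        # Assumption: a missing value is written as "?"; it becomes None.
        value: Optional[int] = None if raw == "?" else int(raw)
        row[name] = value
    # Translate the numeric class code (2 or 4) into a readable label.
    row["label"] = CLASS_LABELS[row["class"]]
    return row


# Illustrative records, made up for this sketch.
example = parse_row("1000001,5,1,1,1,2,1,3,1,1,2")
with_missing = parse_row("1000002,8,4,5,1,2,?,7,3,1,4")
print(example["label"], with_missing["bare_nuclei"], with_missing["label"])
```

Keeping the original 2/4 class code alongside the readable label makes it easy to cross-check parsed rows against the raw file.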
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is taken from the UCI Machine Learning Repository (link: https://data.world/health/breast-cancer-wisconsin). Donor: Nick Street.
The main idea and inspiration behind the upload was to provide datasets for Machine Learning as practice and reference for my peers at college. The main purpose is to analyze data and experiment with different machine learning ideas and techniques for this binary classification task. As such, this dataset is a very useful resource to practice on.
Breast cancer occurs when breast cells mutate into cancerous cells that multiply and form tumors. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. Breast cancer typically affects women and people assigned female at birth (AFAB) aged 50 and older, but it can also affect men and people assigned male at birth (AMAB), as well as younger women. Healthcare providers may treat breast cancer with surgery to remove tumors or with treatment to kill cancerous cells.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/
The task: To classify whether the tumor is benign (B) or malignant (M).
Relevant information
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
Number of instances: 569
Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
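The 32-attribute layout (ID, diagnosis, 30 real-valued features) can be sketched as a list of column names. This sketch assumes, following the UCI documentation for the diagnostic file, that the 30 features are the mean, standard error and "worst" value of ten nuclear measurements, stored in that order; the snake_case names and the sample line are hypothetical.

```python
from typing import Dict, List, Union

# Ten per-nucleus measurements (assumed order).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Each measurement is summarised three ways (assumed order of the groups).
STATS = ["mean", "se", "worst"]


def wdbc_columns() -> List[str]:
    """Return 32 column names: id, diagnosis, then 3 groups of 10 features."""
    names = ["id", "diagnosis"]
    for stat in STATS:
        names.extend(f"{feature}_{stat}" for feature in BASE_FEATURES)
    return names


def parse_wdbc_line(line: str) -> Dict[str, Union[str, float]]:
    """Parse one comma-separated record; features are converted to float."""
    fields = line.strip().split(",")
    columns = wdbc_columns()
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    record: Dict[str, Union[str, float]] = {
        "id": fields[0],
        "diagnosis": fields[1],  # "M" = malignant, "B" = benign
    }
    for name, raw in zip(columns[2:], fields[2:]):
        record[name] = float(raw)
    return record


# Synthetic line, made up for this sketch: 30 identical feature values.
sample = parse_wdbc_line("1,M," + ",".join(["0.5"] * 30))
print(len(wdbc_columns()), sample["diagnosis"], sample["area_worst"])
```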
Original Creators:
Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu
Donor: Nick Street
Date: November 1995
Past Usage:
first usage:
W.N. Street, W.H. Wolberg and O.L. Mangasarian
Nuclear feature extraction for breast tumor diagnosis.
IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
OR literature:
O.L. Mangasarian, W.N. Street and W.H. Wolberg.
Breast cancer diagnosis and prognosis via linear programming.
Operations Research, 43(4), pages 570-577, July-August 1995.
Medical literature:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
Machine learning techniques to diagnose breast cancer from
fine-needle aspirates.
Cancer Letters 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
Image analysis and machine learning applied to breast cancer
diagnosis and prognosis.
Analytical and Quantitative Cytology and Histology, Vol. 17
No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
Computerized breast cancer diagnosis and prognosis from fine
needle aspirates.
Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
Computer-derived nuclear features distinguish malignant from
benign breast cytology.
Human Pathology, 26:792--796, 1995.
See also: http://www.cs.wisc.edu/~olvi/uwmp/mpml.html http://www.cs.wisc.edu/~olvi/uwmp/cancer.html
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was created by Anik Chand
Released under CC BY-SA 4.0
License: https://creativecommons.org/publicdomain/zero/1.0/
Additional Information
Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:
- Group 1: 367 instances (January 1989)
- Group 2: 70 instances (October 1989)
- Group 3: 31 instances (February 1990)
- Group 4: 17 instances (April 1990)
- Group 5: 48 instances (August 1990)
- Group 6: 49 instances (Updated January 1991)
- Group 7: 31 instances (June 1991)
- Group 8: 86 instances (November 1991)

Total: 699 points (as of the donated database on 15 July 1992)
Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has only 367 instances. This is because it originally contained 369 instances; 2 were removed. The following statements summarize changes to the original Group 1's set of data:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Breast Cancer Wisconsin (Diagnostic)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/faroukbenarous/breast-cancer-wisconsin-diagnostic on 30 September 2021.
--- No further description of dataset provided by original source ---
--- Original source retains full ownership of the source dataset ---
This dataset was created by Sony Augustine@123
Context
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
Attributes 2 through 10 have been used to represent instances (attribute 1 is the sample ID). Each instance has one of 2 possible classes: benign or malignant.
Content
| Attribute | Domain |
|---|---|
| 1. Sample code number | id number |
| 2. Clump Thickness | 1 - 10 |
| 3. Uniformity of Cell Size | 1 - 10 |
| 4. Uniformity of Cell Shape | 1 - 10 |
| 5. Marginal Adhesion | 1 - 10 |
| 6. Single Epithelial Cell Size | 1 - 10 |
| 7. Bare Nuclei | 1 - 10 |
| 8. Bland Chromatin | 1 - 10 |
| 9. Normal Nucleoli | 1 - 10 |
| 10. Mitoses | 1 - 10 |
| 11. Class | 2 for benign, 4 for malignant |
Class distribution:
Benign: 458 (65.5%) Malignant: 241 (34.5%)
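The class percentages follow directly from the counts. As a quick arithmetic check, the sketch below recomputes them and also derives the accuracy of a trivial "always predict the majority class" baseline, which is a useful floor when judging any classifier on this data. The variable names are illustrative.

```python
# Class counts as stated in the class distribution above.
counts = {"benign": 458, "malignant": 241}

total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# A classifier that always predicts the majority class is right this often.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total, shares, majority_label, round(baseline_accuracy, 4))
```

Because roughly two thirds of the samples are benign, a model needs to beat about 65.5% accuracy before it can be said to have learned anything from the features.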
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
This dataset was created by Shubham Biswas
Attribute Information:
This dataset was created by sagar bhaskar
By UCI [source]
The Breast Cancer Wisconsin (Prognostic) dataset brings together follow-up data from hundreds of breast cancer cases for prognostic prediction. It includes 30 features, such as radius, texture, area, compactness and concavity, computed from a digitized image of a fine needle aspirate (FNA) of the breast mass; they describe the cell nuclei present in each case. It also records the outcome (recurrence or nonrecurrence) and time-to-recurrence information for the cases that relapse.
The dataset was created by Dr. William H. Wolberg of the University of Wisconsin Clinical Sciences Center, together with W. Nick Street and Olvi L. Mangasarian of the university's Computer Sciences Dept. Their work used linear programming models, including decision tree construction, to predict disease recurrence.
The data is freely available through the UW CS ftp server and on Kaggle, giving researchers a set of FNA-derived features paired with outcomes for both recurrent and nonrecurrent cases. Papers such as 'An inductive learning approach to prognostic prediction' by W. N. Street et al. have used this database to study how artificial neural networks can be applied to prognostic prediction. The dataset is therefore a practical starting point for understanding how predictive models of breast cancer outcome are built and evaluated.
This dataset is intended for building machine learning models of breast cancer prognosis. Each record holds medical parameters, such as nuclear features computed from an FNA image and tumor size, together with the recurrence outcome and its associated time, which algorithms can use to predict prognosis. Here are some steps on how to use this dataset:
Pre-process and clean the data: The dataset may contain incomplete or missing values, so clean and pre-process it before applying any machine learning algorithm (MLA). This includes deciding which values need imputation, standardizing or normalizing numerical features, and encoding categorical variables.
Choose an appropriate MLA: Depending on your goal, for example reliable classification or weighted predictions, there are many MLAs to choose from. Examples include logistic regression, least squares support vector machines (LS-SVM), neural networks, and rule sets (such as M-of-N rules) extracted from trained neural networks. Read up on each algorithm to determine which one best meets your needs before experimenting with the dataset.
Train the model using your selected MLA: Once you have chosen an MLA, or decided to compare several, train models on the data and estimate their performance with cross-validation, such as k-fold splitting. Then test the trained models on validation subsets held out from the original data to measure performance across the parameter combinations relevant to diagnostic or prognostic prediction. Trends revealed by these experiments help inform later model selection, especially where a particular MLA is not tailored to the task ...
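The k-fold splitting mentioned in the training step can be sketched with the standard library alone. This is an illustrative index generator, not the procedure used by any particular notebook; the function name, the fixed seed and the sample count of 10 are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical usage: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each sample lands in exactly one test fold, so averaging a metric over the folds uses every record for evaluation once while never testing on data the model was trained on.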
It is quite common to find ML-based applications embedded with real-time patient data available from different healthcare systems in multiple countries, thereby increasing the efficacy of new treatment options which were unavailable before. This data set is all about predicting whether the cancer cells are benign or malignant.
Information about attributes:
There are 10 integer attributes in total, plus the predicted class:
- Sample code number: id number
- Clump Thickness: 1 - 10
- Uniformity of Cell Size: 1 - 10
- Uniformity of Cell Shape: 1 - 10
- Marginal Adhesion: 1 - 10
- Single Epithelial Cell Size: 1 - 10
- Bare Nuclei: 1 - 10
- Bland Chromatin: 1 - 10
- Normal Nucleoli: 1 - 10
- Mitoses: 1 - 10
- Predicted class: 2 for benign and 4 for malignant

This data set (Original Wisconsin Breast Cancer Database) is taken from the UCI Machine Learning Repository.
This is the first data set I am sharing on Kaggle. It would be a great pleasure if you find this data set useful for developing your own model. I hope this simple data set will help beginners develop their own classification models and learn how to improve them.
The resources for this dataset can be found at https://www.openml.org/d/13 and https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
- Class: no-recurrence-events, recurrence-events
- age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
- menopause: lt40, ge40, premeno
- tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
- inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
- node-caps: yes, no
- deg-malig: 1, 2, 3
- breast: left, right
- breast-quad: left-up, left-low, right-up, right-low, central
- irradiat: yes, no

Missing Attribute Values (denoted by “?”):

| Attribute # | Number of instances with missing values |
|---|---|
| 6 | 8 |
| 9 | 1 |

Class Distribution:
- no-recurrence-events: 201 instances
- recurrence-events: 85 instances
Original data https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
With the attributes described above, can you predict whether a patient has a recurrence event?
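Before a model can use these attributes, the range-valued and yes/no fields need a numeric encoding. The sketch below shows one possible encoding: ranges become their midpoints, yes/no becomes 1/0, and `?` becomes `None`. The field order is assumed to match the attribute list above, and the field names, helper names and example record are made up for illustration.

```python
from typing import Dict, Optional, Union

# Assumed field order: class first, then the nine attributes as listed.
FIELDS = [
    "class", "age", "menopause", "tumor_size", "inv_nodes",
    "node_caps", "deg_malig", "breast", "breast_quad", "irradiat",
]


def range_midpoint(value: str) -> float:
    """Convert a range such as '40-49' to its midpoint (44.5)."""
    low, high = value.split("-")
    return (int(low) + int(high)) / 2


def yes_no(value: str) -> Optional[int]:
    """Map 'yes'/'no' to 1/0; a missing value ('?') becomes None."""
    return None if value == "?" else int(value == "yes")


def encode_record(line: str) -> Dict[str, Union[int, float, str, None]]:
    """Encode one comma-separated record into model-friendly values."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    raw = dict(zip(FIELDS, values))
    return {
        "recurrence": int(raw["class"] == "recurrence-events"),
        "age_mid": range_midpoint(raw["age"]),
        "menopause": raw["menopause"],  # nominal; left as a string here
        "tumor_size_mid": range_midpoint(raw["tumor_size"]),
        "inv_nodes_mid": range_midpoint(raw["inv_nodes"]),
        "node_caps": yes_no(raw["node_caps"]),
        "deg_malig": int(raw["deg_malig"]),
        "breast": raw["breast"],
        "breast_quad": None if raw["breast_quad"] == "?" else raw["breast_quad"],
        "irradiat": yes_no(raw["irradiat"]),
    }


# Illustrative record with a missing node-caps value.
encoded = encode_record(
    "recurrence-events,40-49,premeno,30-34,3-5,?,2,left,left-low,yes"
)
print(encoded)
```

Midpoint encoding keeps the ordering of the ranges, which a one-hot encoding would discard; the remaining nominal fields (menopause, breast, breast-quad) would still need one-hot or similar treatment before most models can use them.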
Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:
- Group 1: 367 instances (January 1989)
- Group 2: 70 instances (October 1989)
- Group 3: 31 instances (February 1990)
- Group 4: 17 instances (April 1990)
- Group 5: 48 instances (August 1990)
- Group 6: 49 instances (Updated January 1991)
- Group 7: 31 instances (June 1991)
- Group 8: 86 instances (November 1991)

Total: 699 points (as of the donated database on 15 July 1992)
Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has only 367 instances. This is because it originally contained 369 instances; 2 were removed. The following statements summarize changes to the original Group 1's set of data:
Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, 87, 9193--9196.
Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470--479). Aberdeen, Scotland: Morgan Kaufmann.
Predict from the dataset whether a tumor is benign or malignant.
License: https://creativecommons.org/publicdomain/zero/1.0/
Cytology features of breast cancer biopsy. It can be used to predict breast cancer from cytology features.
The data was obtained from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
Data description can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
Data contains cytology features of breast cancer biopsies: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses. The class variable denotes whether it was cancer or not: cancer = 1 and not cancer = 0.
Attribute Information:
Data obtained from : UCI machine learning repository Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Picture courtesy: Photo by Pablo Heimplatz on Unsplash
License: http://opendatacommons.org/licenses/dbcl/1.0/
Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names
"Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.
The first 30 features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/
The separation described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
The Recurrence Surface Approximation (RSA) method is a linear
programming model which predicts Time To Recur using both
recurrent and nonrecurrent cases. See references (i) and (ii)
above for details of the RSA method.
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WPBC/
1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)"
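Several of the definitions above (radius, perimeter, area, compactness) are simple geometric quantities. The sketch below computes them for a nucleus outline given as a closed polygon. It is only an illustration of the formulas as stated: the source does not describe how the authors traced contours or chose the center, so the vertex-average center, the function name and the test shapes are assumptions.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def contour_features(points: List[Point]) -> Dict[str, float]:
    """Compute radius, perimeter, area and compactness of a closed polygon."""
    n = len(points)
    if n < 3:
        raise ValueError("a contour needs at least 3 points")
    # Assumption: the center is the average of the contour points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # a) radius: mean distance from the center to the perimeter points.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term
    area = abs(twice_area) / 2
    if area == 0:
        raise ValueError("degenerate contour with zero area")
    # f) compactness: perimeter^2 / area - 1.0
    compactness = perimeter ** 2 / area - 1.0
    return {
        "radius": radius,
        "perimeter": perimeter,
        "area": area,
        "compactness": compactness,
    }


# A 2x2 square: perimeter 8, area 4, compactness 8^2 / 4 - 1 = 15.
square = contour_features([(0, 0), (2, 0), (2, 2), (0, 2)])

# A 360-sided regular polygon approximates a circle, whose compactness
# under this definition is 4*pi - 1.
circle_like = contour_features(
    [(math.cos(2 * math.pi * i / 360), math.sin(2 * math.pi * i / 360))
     for i in range(360)]
)
print(square["compactness"], round(circle_like["compactness"], 3))
```

A circle gives the smallest possible value of perimeter^2 / area, so under this definition larger compactness values indicate outlines that are less round.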
Creators:
Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu
I'm really interested in trying out various machine learning algorithms on some real life science data.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains the characteristics of patients diagnosed with cancer. The dataset contains a unique ID for each patient, the type of cancer (diagnosis), the visual characteristics of the cancer and the average values of these characteristics.
There are also several categorical features where patients in the dataset are labeled with numerical values. You can examine them in the Chart area.
Other features contain specific ranges of average values of the features of the cancer image:
Each of these features is mapped to a table containing the number of values in a given range. You can examine the Chart Tables
Each sample contains the patient's unique ID, the cancer diagnosis and the average values of the cancer's visual characteristics.
Such a dataset can be used to train or test models and algorithms used to make cancer diagnoses. Understanding and analyzing the dataset can contribute to the improvement of cancer-related visual features and diagnosis.
Logistic Regression: This algorithm can be used effectively for binary classification problems. In this dataset, logistic regression may be an appropriate choice since there are two classes, "Malignant" and "Benign". It can be used to predict cancer type from the visual features in the dataset.
K-Nearest Neighbors (KNN): KNN classifies an example by looking at the k closest examples around it. This algorithm assumes that patients with similar characteristics tend to have similar types of cancer. KNN can be used for cancer diagnosis by taking into account neighborhood relationships in the data set.
Support Vector Machines (SVM): SVM is effective for classification tasks, especially for two-class problems. Focusing on the clear separation of classes in the dataset, SVM is a powerful algorithm that can be used for cancer diagnosis.
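To make the K-Nearest Neighbors description concrete, here is a minimal sketch of the voting rule using only the standard library. It is not the code from the linked notebooks; the function name, the two-feature toy points and the "B"/"M" labels are invented for illustration, and real use would need feature scaling and a proper train/test split.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training points nearest to query."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query point.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data, made up for this sketch: two well-separated clusters.
train_set: List[Sample] = [
    ((1.0, 1.0), "B"), ((1.2, 0.9), "B"), ((0.8, 1.1), "B"),
    ((3.0, 3.2), "M"), ((3.1, 2.9), "M"), ((2.9, 3.0), "M"),
]
print(knn_predict(train_set, (1.1, 1.0)), knn_predict(train_set, (3.0, 3.0)))
```

Choosing an odd k for a two-class problem avoids tied votes.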
K-NN Project: https://www.kaggle.com/code/erdemtaha/prediction-cancer-data-with-k-nn-95
Logistic Regression: https://www.kaggle.com/code/erdemtaha/cancer-prediction-96-5-with-logistic-regression
This is a copy of content that has been elaborated for educational purposes and published to reach more people. You can access the original source from the link below; please do not forget to support that data.
🔗 https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
This database can also be accessed via the UW CS ftp server: 🔗 ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
It can also be found at the UCI Machine Learning Repository: 🔗 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
If you have questions about the data or these studies, you can contact me through the links below 😊
LinkedIn: https://www.linkedin.com/in/erdem-taha-sokullu/
Mail: erdemtahasokullu@gmail.com
Github: https://github.com/Prometheussx
Kaggle: https://www.kaggle.com/erdemtaha
This data has a CC BY-NC-SA 4.0 license. You can review the license terms at the link below.
License Link: https://creativecommons.org/licenses/by-nc-sa/4.0/
By UCI [source]
This dataset contains data on breast cancer diagnosis, a serious medical condition that affects people around the world each year. Each record comprises a patient ID, a diagnosis (malignant or benign), and 30 computed features extracted from a digitized image of a fine needle aspirate (FNA) of a breast mass. Features include radius, texture, perimeter, area, smoothness, compactness, concavity and concave points, as well as symmetry and fractal dimension.
The dataset was created by researchers in General Surgery and Computer Science at the University of Wisconsin-Madison, led by Dr. William H. Wolberg with contributions from Professor W. Nick Street and Olvi L. Mangasarian, and was used in research on predicting breast cancer prognosis with linear programming methods. More recently, methods such as support vector machines have been used to classify tumour types from this dataset, and pattern recognition techniques such as artificial neural networks (ANN) have been applied to identify hidden patterns.
It has also been used in studies exploring unsupervised tools such as Ant Colony Optimization for discovering relationships among variables, which may help physicians better understand how certain types of tumors progress over time. For example, type cardinality analysis allowed researchers to assess tumor heterogeneity before deciding on appropriate treatments, potentially improving prognosis. The Wisconsin Breast Cancer Diagnostic dataset remains a valuable resource for researchers working on preventing or curing this disease.
- Developing a classifier that can accurately predict breast cancer diagnoses based on the provided features.
- Clustering patient data with similar diagnosis to discover trends or connections between certain symptoms and diagnoses.
- Optimizing feature selection algorithms to identify the most relevant predictors of breast cancer diagnosis from a set of given cell nuclei features
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: unformatted-data.csv
File: wpbc.data.csv

| Column name | Description |
|:--------------|:--------------------------------|
| 119513 | ID number (Integer) |
| N | Diagnosis (Binary) |
| 31 | Radius (Real-valued) |
| 18.02 | Texture (Real-valued) |
| 27.6 | Perimeter (Real-valued) |
| 117.5 | Area (Real-valued) |
| 1013 | Smoothness (Real-valued) |
| 0.09489 | Compactness (Real-valued) |
| 0.1036 | Concavity (Real-valued) |
| 0.1086 | Symmetry (Real-valued) |
| 0.07055 | Fractal Dimension (Real-valued) |
| 0.1865 | Mean Intensity (Real-valued) |
| 0.06333 | Standard Error (Real-valued) |
| 0.6249 | Worst Radius (Real-valued) |
| 1.89 | Worst Texture (Real-valued) |
| 3.972 | Worst Perimeter (Real-valued) |
| 71.55 | Worst Area (Real-valued) |
| 0.004433 | Worst Smoothness (Real-valued) |
| 0.01421 | Worst Compactness (Real-valued) |
| 0.03233 | Worst Concavity (Real-valued) |

File: breast-cancer-wisconsin.data.csv

| Column name | Description |
|:--------------|:--------------------------------------|
| 119513 | ID number (Integer) |
| 1000025 | ID number (Integer) |
| 1.1 | Uniformity of Cell Size (Integer) |
| 1.2 | Uniformity of Cell Shape (Integer) |
| 1.3 | Single Epithelial Cell Size (Integer) |
| 1.4 | Bland Chromatin (Integer) |
| 1.5 | Normal Nucleoli (Integer) |
| 2.1 | Mitoses (Integer) |

File: wdbc.data.csv

| Column name | Description |
|:--------------|:----------------------------------------|
| 842302 | Patient ID number (Integer Type) |
| M | Diagnosis (Binary Type) |
| **...
This collection consists of curated datasets optimized for machine learning tasks such as binary classification, regression, and survival modeling. The datasets are derived from publicly available sources, cleaned, and preprocessed to support a variety of applications in healthcare and genomics research. Each dataset focuses on a specific domain and task, making it easier for practitioners to build and evaluate models.
File Name: breast_cancer_data.xlsx
Source Dataset: Breast Cancer Wisconsin Data by uciml
Description:
This dataset is designed for binary classification tasks aimed at diagnosing breast cancer. It contains measurements from fine-needle aspirates of breast masses, categorized into benign or malignant tumors. The target variable is the diagnosis outcome (benign = 0, malignant = 1).
Applications:
- Cancer diagnosis and prediction.
- Feature selection to identify critical predictors of malignancy.
File Name: combined_data_Azithromycin.csv
Source Dataset: Gonorrhea Unitigs by nwheeler443
Description:
This dataset is optimized for regression and survival modeling, focusing on genomic markers linked to gonorrhea's resistance to Azithromycin. It includes features derived from unitigs (genomic segments) and metadata related to antibiotic susceptibility.
Applications:
- Prediction of drug resistance for Azithromycin.
File Name: combined_data_Ciprofloxacin.csv
Source Dataset: Gonorrhea Unitigs by nwheeler443
Description:
This dataset is also tailored for regression and survival modeling, with a focus on Ciprofloxacin resistance. Similar to the Azithromycin dataset, it includes genomic unitig data and antibiotic susceptibility features, curated to predict resistance.
Applications:
- Prediction of drug resistance for Ciprofloxacin.
File Name: dementia_dataset.csv
Source Dataset: Dementia Prediction Dataset by shashwatwork
Description:
This dataset is optimized for binary classification tasks, focusing on predicting the presence of dementia based on clinical and demographic data. Features include cognitive test results, demographic information, and assessments of patient functionality. The target variable is a binary indicator of dementia diagnosis.
Applications:
- Dementia diagnosis prediction.
- Exploratory data analysis for identifying high-impact predictors.
These datasets are based on publicly available resources and are credited to their respective original creators:
1. Shashwatwork (Dementia Dataset).
2. Nwheeler443 (Gonorrhea Unitigs).
3. UCI ML Repository (Breast Cancer Wisconsin Dataset).
The curated versions have been optimized for streamlined machine learning workflows.
These datasets are used in Pascal Yim's Machine Learning course at Centrale Lille (image generated with ideogram.ai)
Simple examples for regression. For example, "datareg_cos_300.csv" is a set of 300 points following a noisy cosine, with two columns 'x' and 'y'
Estimating the median home value (MEDV) per neighborhood from several variables:
- RM: number of rooms
- LSTAT: measure of the poverty rate
- PTRATIO: measure of the pupil-teacher ratio in schools

Simplified version of the original UCI dataset
Source : https://www.kaggle.com/datasets/schirmerchad/bostonhoustingmlnd
House price prediction around Seattle (King County)
Source : https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
House price prediction - Kaggle competition
The "Old Faithful" geyser is a cone geyser in Yellowstone National Park in the United States
The measured variables are:
- duration: the duration of the eruption
- waiting: the time interval since the previous eruption
- kind: a 'short' or 'long' label for the type of eruption

Dataset for classifying Iris species
The following information is available:
- sepal_length: sepal length (in cm)
- sepal_width: sepal width
- petal_length: petal length
- petal_width: petal width
- species: 3 iris species: 'setosa', 'versicolor' or 'virginica'

Source : UCI (http://archive.ics.uci.edu/)
A simplified version of the iris dataset, with only the petal measurements and 2 species: versicolor (0) and virginica (1)
Heart attack prediction (output) from various parameters such as age, cholesterol level, ...
Source : https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset
The goal is to predict whether a tumor is malignant or not, from measurements on a biopsy of the tumor
Source : https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Dataset comparable to the Iris one. The goal is to predict the penguin species
Source : https://www.kaggle.com/ashkhagan/palmer-penguins-datasetalternative-iris-dataset
Star classification
Source : https://www.kaggle.com/datasets/deepu1109/star-dataset
Predict whether a mushroom is edible or not
Source : https://www.kaggle.com/uciml/mushroom-classification
Very classic dataset on the Titanic survivors
Source : https://www.kaggle.com/c/titanic
Dataset "PIMA Indian diabetes"
Diabetes prediction for a population of women of the Pima tribe
Source : https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
The goal is to predict which Orange telecom customers will leave for a competitor (the 'churn' or 'attrition' problem)
Version "churn-big.csv" with more data
Source : https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets
Stroke prediction
Source : https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset
Machine failure prediction (UCI)
Source : https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification/code