The dataset used in the paper is the University of California Irvine (UCI) iris data set.
http://opendatacommons.org/licenses/dbcl/1.0/
Data Set Information:
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.
The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it):
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
In a classification context, this is a well posed problem with "well behaved" class structures. A good data set for first testing of a new classifier, but not very challenging.
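As a rough illustration of how the 13 attributes listed above might be attached to a raw record, the sketch below parses one comma-separated row into a label and a name-to-value mapping. The helper name `parse_wine_row`, the assumption that the class label is the first field, and the sample row are all illustrative and are not taken from the description above.

```python
import csv
import io

# Attribute names in the order given in the description (1-13).
WINE_ATTRIBUTES = [
    "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
]


def parse_wine_row(row):
    """Split one CSV row into (cultivar_label, {attribute: value})."""
    if len(row) != len(WINE_ATTRIBUTES) + 1:
        raise ValueError(f"expected 14 fields, got {len(row)}")
    label = int(row[0])  # assumption: the class label is the first field
    values = {name: float(v) for name, v in zip(WINE_ATTRIBUTES, row[1:])}
    return label, values


# Hypothetical sample row, for illustration only.
sample = "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
label, values = next(parse_wine_row(r) for r in csv.reader(io.StringIO(sample)))
```

Keeping the names in one list makes it easy to check that a row really has 13 measurements before it reaches a classifier.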
This is an augmented dataset. The original dataset link is given below.
Link: https://archive.ics.uci.edu/ml/datasets/Rice+Leaf+Diseases
A data set of digits in which each digit is encoded as a 16-dimensional vector. The data set is borrowed from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits) and can be used for multi-class classification.
https://archive.ics.uci.edu/ml/datasets/Poker+Hand
Each record is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes (suit and rank), for a total of 10 predictive attributes. There is one Class attribute that describes the "Poker Hand". The order of cards is important, which is why there are 480 possible Royal Flush hands as compared to 4.
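The 480 versus 4 figure follows from simple counting: there is one royal flush per suit, and each one can be dealt in 5! different orders. A short check of that arithmetic (the variable names are illustrative):

```python
from math import factorial

SUITS = 4
CARDS_IN_HAND = 5

# Unordered: one royal flush {Ace, King, Queen, Jack, Ten} per suit.
unordered_royal_flushes = SUITS
# Ordered: each of those hands can be dealt in 5! = 120 different orders.
ordered_royal_flushes = SUITS * factorial(CARDS_IN_HAND)

print(unordered_royal_flushes, ordered_royal_flushes)  # prints: 4 480
```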
1) S1 "Suit of card #1" Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}
2) C1 "Rank of card #1" Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)
3) S2 "Suit of card #2" Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}
4) C2 "Rank of card #2" Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)
5) S3 "Suit of card #3" Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}
6) C3 "Rank of card #3" Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)
7) S4 "Suit of card #4" Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}
8) C4 "Rank of card #4" Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)
9) S5 "Suit of card #5" Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}
10) C5 "Rank of card #5" Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)
CLASS "Poker Hand" Ordinal (0-9)
0: Nothing in hand; not a recognized poker hand
1: One pair; one pair of equal ranks within five cards
2: Two pairs; two pairs of equal ranks within five cards
3: Three of a kind; three equal ranks within five cards
4: Straight; five cards, sequentially ranked with no gaps
5: Flush; five cards with the same suit
6: Full house; pair + different rank three of a kind
7: Four of a kind; four equal ranks within five cards
8: Straight flush; straight + flush
9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush
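The class definitions above can be turned into a small rule-based labeller. The following is a minimal sketch, not the procedure used to build the dataset: the function name `classify_hand` is hypothetical, and it assumes that the Ace (rank 1) may sit either below the 2 or above the King when forming a straight.

```python
from collections import Counter


def classify_hand(cards):
    """Return the class (0-9) of a hand given as five (suit, rank) pairs.

    Suits are 1-4 and ranks are 1-13 with Ace = 1, as in the attribute list.
    """
    suits = [s for s, _ in cards]
    ranks = sorted(r for _, r in cards)
    counts = sorted(Counter(ranks).values(), reverse=True)

    flush = len(set(suits)) == 1
    royal_ranks = ranks == [1, 10, 11, 12, 13]  # Ace, Ten, Jack, Queen, King
    # Five distinct consecutive ranks, or the ace-high run 10-J-Q-K-A
    # (treating the Ace as high here is an assumption).
    straight = (len(set(ranks)) == 5 and ranks[4] - ranks[0] == 4) or royal_ranks

    if flush and royal_ranks:
        return 9
    if flush and straight:
        return 8
    if counts == [4, 1]:
        return 7
    if counts == [3, 2]:
        return 6
    if flush:
        return 5
    if straight:
        return 4
    if counts == [3, 1, 1]:
        return 3
    if counts == [2, 2, 1]:
        return 2
    if counts == [2, 1, 1, 1]:
        return 1
    return 0
```

Because the function sorts the ranks first, it ignores card order, so all 120 orderings of a given hand receive the same class.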
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Datasets available at the UCI Machine Learning Repository and other repositories. List of datasets used in the experiment with their sources:
- ForestCover dataset: https://archive.ics.uci.edu/ml/datasets/Covertype
- KDD Cup99 dataset: https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data
- PAMAP dataset: https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring
- Powersupply: http://www.cse.fau.edu/~xqzhu/stream.html
- SEA: http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift
- Syn002 & Syn003 (generated): http://moa.cms.waikato.ac.nz/details/classification/streams/
- MNIST: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
- News20: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Datasets for training and testing algorithms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification of typical algorithms for imbalanced sampling and representative literature.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The three Multi-Label datasets used in the article "Adapting Transformers for Multi-Label Text Classification".
- AAPD Dataset (ArXiv Academic Paper Dataset) [Yang et al. 2018] [1]
- Reuters-21578 Dataset: https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection
- MFHAD (Multilabel French HAL Abstracts Dataset)
[1] Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence Generation Model for Multi-label Classification. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 3915–3926.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Source and more info: https://archive.ics.uci.edu/datasets
The **UCI Machine Learning Repository** is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
The datasets collected in this project represent a diverse and comprehensive set of cancer-related data sourced from the UCI Machine Learning Repository. They cover a wide spectrum of cancer types and research perspectives, including breast cancer datasets such as the original, diagnostic, prognostic, and Coimbra variants, which focus on tumor features, recurrence, and biochemical markers. Cervical cancer is represented through datasets focusing on behavioral risks and general risk factors. The lung cancer dataset provides categorical diagnostic attributes, while the primary tumor dataset offers insights into tumor locations based on metastasis data. Additionally, specialized datasets like differentiated thyroid cancer recurrence, glioma grading with clinical and mutation features, and gene expression RNA-Seq data expand the scope into genetic and molecular-level cancer analysis. Together, these datasets support a wide range of machine learning applications including classification, prediction, survival analysis, and feature correlation across various types of cancer.
Although mixed-membership models have achieved great success in unsupervised learning, they have not been widely applied to classification problems. In this paper, we propose a family of discriminative mixed-membership models for classification by combining unsupervised mixed membership models with multi-class logistic regression. In particular, we propose two variants respectively applicable to text classification based on latent Dirichlet allocation and usual feature vector classification based on mixed membership naive Bayes models. The proposed models allow the number of components in the mixed membership to be different from the number of classes. We propose two variational inference based algorithms for learning the models, including a fast variational inference which is substantially more efficient than mean-field variational approximation. Through extensive experiments on UCI and text classification benchmark datasets, we show that the models are competitive with the state of the art, and can discover components not explicitly captured by the class labels.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of composite indicators of different algorithms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paired t-test for statistical evaluation of the classification results on UCI datasets.
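For readers unfamiliar with the test named here, a paired t-test compares two classifiers through their per-dataset score differences: t = mean(d) / (s_d / sqrt(n)) with n - 1 degrees of freedom. The sketch below computes that statistic; the accuracy values are made-up placeholders and are not results from the study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical per-dataset accuracies of two classifiers (made-up numbers).
acc_a = [0.91, 0.85, 0.78, 0.88, 0.93]
acc_b = [0.89, 0.80, 0.79, 0.84, 0.90]
t, dof = paired_t_statistic(acc_a, acc_b)
```

The resulting t value would then be compared against the t distribution with `dof` degrees of freedom to obtain a p-value.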
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single index evaluation of different algorithms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this tutorial we will see how to detect patterns using the naive Bayes classifier, a machine learning technique that is effective for detecting certain patterns and predicting dependencies within your dataset. In the first part of the tutorial we will revisit the Iris dataset used in the previous tutorial to learn how to use the naive Bayes classifier. We will then apply your new knowledge to detect spam among your text messages (SMS), so as to identify the messages you do not want to read. The dataset we will use is an open-source dataset from the UCI Machine Learning Repository. Next, we will examine multi-label classification using the CMU dataset that we used previously for the nearest-neighbours classifier. Finally, we will give an example of an unsuccessful use of the Bayesian classifier and explain why it did not work.
The tutorial revisits the Iris flower dataset to introduce the basic steps of working with the Naive Bayes Classifier. It then applies the classifier to detect spam in SMS messages using the SMS Spam collection dataset from the UCI Machine Learning Repository, and performs multi-label classification using the CMU book dataset. The tutorial also presents a scenario where the Naive Bayes Classifier fails, providing an explanation for the failure. By the end of this tutorial, participants will have a solid understanding of the Naive Bayes classifier, be able to split data into training and testing sets, make predictions, evaluate classifier performance, identify spam, classify books, train a Gaussian Naive Bayes classifier for single or multiple labels, and utilize imputation techniques for handling missing data.
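The basic workflow the tutorial describes (split the data, fit a Gaussian naive Bayes model, predict, evaluate) can be sketched in plain Python. This is not the tutorial's own code: the function names are hypothetical, and synthetic two-class data stands in for Iris so that the example stays self-contained.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # The small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        prior, means, variances = params
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))


# Synthetic two-feature data: two well-separated Gaussian blobs (not Iris).
rng = random.Random(0)
X = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)] for c in (0.0, 3.0) for _ in range(50)]
y = [label for label in ("a", "b") for _ in range(50)]

# Shuffle, then hold out 30% of the samples for testing.
indices = list(range(len(X)))
rng.shuffle(indices)
split = int(0.7 * len(indices))
train, test = indices[:split], indices[split:]

model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
correct = sum(predict(model, X[i]) == y[i] for i in test)
accuracy = correct / len(test)
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.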
https://choosealicense.com/licenses/cc/
Covertype
Classification of pixels into 7 forest cover types based on attributes such as elevation, aspect, slope, hillshade, soil-type, and more. The Covertype dataset from the UCI ML repository.
| Configuration | Task | Description |
|---|---|---|
| covertype | Multiclass classification | Classify the area as one of 7 cover classes. |
Usage
from datasets import load_dataset

# Load the "mstz/covertype" dataset and keep its training split.
dataset = load_dataset("mstz/covertype")["train"]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MCKML achieves the best performance on 6 out of 9 datasets.
This dataset was taken from the UCI repository. It has been cleaned using the following techniques: z-score normalization, one-hot encoding, outlier removal, min-max scaling, and feature selection.
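Three of the cleaning steps named above have compact definitions, sketched here in plain Python. The helper names and sample values are illustrative only; the description does not say how the dataset's authors implemented these steps.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted distinct labels."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


feature = [2.0, 4.0, 6.0, 8.0]  # made-up numeric column
scaled = min_max(feature)
standardized = z_score(feature)
encoded = one_hot(["red", "blue", "red"])  # [[0, 1], [1, 0], [0, 1]]
```

Z-score normalization and min-max scaling are usually alternatives for a given column, so a pipeline would typically apply one or the other rather than both.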
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cervical Cancer Risk Factors for Biopsy: this dataset was obtained from the UCI Repository, which is kindly acknowledged. The file contains a list of risk factors for cervical cancer leading to a biopsy examination.
About 11,000 new cases of invasive cervical cancer are diagnosed each year in the U.S. However, the number of new cervical cancer cases has been declining steadily over the past decades. Although it is the most preventable type of cancer, each year cervical cancer kills about 4,000 women in the U.S. and about 300,000 women worldwide. In the United States, cervical cancer mortality rates plunged by 74% from 1955 - 1992 thanks to increased screening and early detection with the Pap test.
AGE
Fifty percent of cervical cancer diagnoses occur in women ages 35 - 54, and about 20% occur in women over 65 years of age. The median age of diagnosis is 48 years. About 15% of women develop cervical cancer between the ages of 20 - 30. Cervical cancer is extremely rare in women younger than age 20. However, many young women become infected with multiple types of human papilloma virus, which can then increase their risk of getting cervical cancer in the future. Young women with early abnormal changes who do not have regular examinations are at high risk for localized cancer by the time they are age 40, and for invasive cancer by age 50.
SOCIOECONOMIC AND ETHNIC FACTORS
Although the rate of cervical cancer has declined among both Caucasian and African-American women over the past decades, it remains much more prevalent in African-Americans, whose death rates are twice as high as those of Caucasian women. Hispanic American women have more than twice the risk of invasive cervical cancer of Caucasian women, also due to a lower rate of screening. These differences, however, are almost certainly due to social and economic differences. Numerous studies report that high poverty levels are linked with low screening rates. In addition, lack of health insurance, limited transportation, and language difficulties hinder a poor woman’s access to screening services.
HIGH SEXUAL ACTIVITY
Human papilloma virus (HPV) is the main risk factor for cervical cancer. In adults, the most important risk factor for HPV is sexual activity with an infected person. Women most at risk for cervical cancer are those with a history of multiple sexual partners, sexual intercourse at age 17 years or younger, or both. A woman who has never been sexually active has a very low risk for developing cervical cancer. Sexual activity with multiple partners increases the likelihood of many other sexually transmitted infections (chlamydia, gonorrhea, syphilis). Studies have found an association between chlamydia and cervical cancer risk, including the possibility that chlamydia may prolong HPV infection.
FAMILY HISTORY
Women have a higher risk of cervical cancer if they have a first-degree relative (mother, sister) who has had cervical cancer.
USE OF ORAL CONTRACEPTIVES
Studies have reported a strong association between cervical cancer and long-term use of oral contraception (OC). Women who take birth control pills for more than 5 - 10 years appear to have a much higher risk of HPV infection (up to four times higher) than those who do not use OCs. (Women taking OCs for fewer than 5 years do not have a significantly higher risk.) The reasons for this risk from OC use are not entirely clear. Women who use OCs may be less likely to use a diaphragm, condoms, or other methods that offer some protection against sexually transmitted diseases, including HPV. Some research also suggests that the hormones in OCs might help the virus enter the genetic material of cervical cells.
HAVING MANY CHILDREN
Studies indicate that having many children increases the risk for developing cervical cancer, particularly in women infected with HPV.
SMOKING
Smoking is associated with a higher risk for precancerous changes (dysplasia) in the cervix and for progression to invasive cervical cancer, especially for women infected with HPV.
IMMUNOSUPPRESSION
Women with weak immune systems (such as those with HIV/AIDS) are more susceptible to acquiring HPV. Immunocompromised patients are also at higher risk for having cervical precancer develop rapidly into invasive cancer.
DIETHYLSTILBESTROL (DES)
From 1938 - 1971, diethylstilbestrol (DES), an estrogen-related drug, was widely prescribed to pregnant women to help prevent miscarriages. The daughters of these women face a higher risk for cervical cancer. DES is no longer prescribed.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics and experiment settings of the UCI datasets.