89 datasets found

Cancer Multiple Dataset UCI MLR
kaggle.com
zip
Updated Aug 5, 2025
Cite
Medi Hunter - 4004 (2025). Cancer Multiple Dataset UCI MLR [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasakbd/cancer-multiple-dataset-uci-mlr/suggestions
Explore at:
Available download formats: zip (74213598 bytes)
Dataset updated
Aug 5, 2025
Authors
Medi Hunter - 4004
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Source and more info: https://archive.ics.uci.edu/datasets

The **UCI Machine Learning Repository** is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

The datasets collected in this project represent a diverse and comprehensive set of cancer-related data sourced from the UCI Machine Learning Repository. They cover a wide spectrum of cancer types and research perspectives, including breast cancer datasets such as the original, diagnostic, prognostic, and Coimbra variants, which focus on tumor features, recurrence, and biochemical markers. Cervical cancer is represented through datasets focusing on behavioral risks and general risk factors. The lung cancer dataset provides categorical diagnostic attributes, while the primary tumor dataset offers insights into tumor locations based on metastasis data. Additionally, specialized datasets like differentiated thyroid cancer recurrence, glioma grading with clinical and mutation features, and gene expression RNA-Seq data expand the scope into genetic and molecular-level cancer analysis. Together, these datasets support a wide range of machine learning applications including classification, prediction, survival analysis, and feature correlation across various types of cancer.

RRA_Think Differently, Create history’s next line.

Hello Data Hunters! Hope you're doing well. https://www.kaggle.com/shuvokumarbasak4004 (More Dataset) https://www.kaggle.com/shuvokumarbasak2030
kr-vs-kp
openml.org
Updated Apr 6, 2014
Cite
Alen Shapiro (2014). kr-vs-kp [Dataset]. https://www.openml.org/d/3
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 6, 2014
Authors
Alen Shapiro
Description
Author: Alen Shapiro
Source: UCI
Please cite: UCI citation policy

Title: Chess End-Game -- King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7). The pawn on a7 means it is one square away from queening. It is the King+Rook's side (white) to move.

Sources:
(a) Database originally generated and described by Alen Shapiro.
(b) Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk).
(c) Date: 1 August 1989

Past Usage:

Alen D. Shapiro (1983,1987), "Structured Induction in Expert Systems", Addison-Wesley. This book is based on Shapiro's Ph.D. thesis (1983) at the University of Edinburgh entitled "The Role of Structured Induction in Expert Systems".

Stephen Muggleton (1987), "Structuring Knowledge by Asking Questions", pp.218-229 in "Progress in Machine Learning", edited by I. Bratko and Nada Lavrac, Sigma Press, Wilmslow, England SK9 5BB.

Robert C. Holte, Liane Acker, and Bruce W. Porter (1989), "Concept Learning and the Problem of Small Disjuncts", Proceedings of IJCAI. Also available as technical report AI89-106, Computer Sciences Department, University of Texas at Austin, Austin, Texas 78712.

Relevant Information: The dataset format is described below. Note: the format of this database was modified on 2/26/90 to conform with the format of all the other databases in the UCI repository of machine learning databases.

Number of Instances: 3196 total

Number of Attributes: 36

Attribute Summaries: Classes (2): -- White-can-win ("won") and White-cannot-win ("nowin"). I believe that White is deemed to be unable to win if the Black pawn can safely advance. Attributes: see Shapiro's book.

Missing Attributes: -- none

Class Distribution: In 1669 of the positions (52%), White can win. In 1527 of the positions (48%), White cannot win.

The format for instances in this database is a sequence of 37 attribute values. Each instance is a board description for this chess endgame. The first 36 attributes describe the board. The last (37th) attribute is the classification: "won" or "nowin". There are 0 missing values. A typical board description is

f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won

The names of the features do not appear in the board descriptions. Instead, each feature corresponds to a particular position in the feature-value list. For example, the head of this list is the value for the feature "bkblk". The following is the list of features, in the order in which their values appear in the feature-value list:

[bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd, hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr, skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg]

In the file, there is one instance (board position) per line.
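
As a minimal sketch of the positional format described above, the following Python pairs each value of a board description with its feature name. The helper name `parse_board` is hypothetical and not part of the dataset; the feature order and the sample record are copied from the description.

```python
# Feature names in the order their values appear in each record
# (copied from the feature list in the description).
FEATURES = (
    "bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd,"
    "hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr,"
    "skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg"
).split(",")


def parse_board(line):
    """Split one comma-separated record into (feature dict, class label).

    Hypothetical helper: 36 board attributes followed by the class value.
    """
    values = line.strip().split(",")
    if len(values) != len(FEATURES) + 1:
        raise ValueError(f"expected {len(FEATURES) + 1} values, got {len(values)}")
    *board_values, label = values
    return dict(zip(FEATURES, board_values)), label


# The typical board description quoted in the text.
sample = "f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won"
board, label = parse_board(sample)
print(len(board), board["bkblk"], label)
```

Because the file has no header row, keeping the feature list in one place like this avoids off-by-one mistakes when looking up an attribute by name.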

Num Instances: 3196 Num Attributes: 37 Num Continuous: 0 (Int 0 / Real 0) Num Discrete: 37 Missing values: 0 / 0.0%
Integrated Heart Disease Dataset
kaggle.com
zip
Updated Apr 2, 2019
Cite
Rahul Gyawali (2019). Integrated Heart Disease Dataset [Dataset]. https://www.kaggle.com/unikpoet/heartdisease
Explore at:
Available download formats: zip (37479 bytes)
Dataset updated
Apr 2, 2019
Authors
Rahul Gyawali
Description
Context

This dataset integrates all the databases in the Heart Disease dataset available at the UCI Machine Learning Repository. The original contains 4 databases: Cleveland, Hungarian, Long Beach, and Switzerland. Most prior work has used only the Cleveland database.

Content

The original dataset has 76 attributes; which ones to select depends on one's needs. Here I've taken 10 attributes for the prediction.

UCI Machine Learning Repository
dknet.org
rrid.site
Cite
UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_026571
Description
A collection of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms. Datasets approved for the repository are assigned a Digital Object Identifier (DOI) if they do not already have one. Datasets are licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.

Breast Cancer Wisconsin (Prognostic) Data Set

kaggle.com

zip

Updated Mar 31, 2017

Cite

Sarah VCH (2017). Breast Cancer Wisconsin (Prognostic) Data Set [Dataset]. https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set

Explore at:

Available download formats: zip (49800 bytes)

Dataset updated

Mar 31, 2017

Authors

Sarah VCH

License

http://opendatacommons.org/licenses/dbcl/1.0/

Description

Context

Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names

Content

"Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

The first 30 features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/

The separation described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

The Recurrence Surface Approximation (RSA) method is a linear
programming model which predicts Time To Recur using both
recurrent and nonrecurrent cases. See references (i) and (ii)
above for details of the RSA method. 

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WPBC/

1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)"
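
The field layout above can be turned into column names with a short sketch like the one below. It assumes, based on the original wpbc.names documentation rather than on this excerpt, that fields 4-33 hold the mean, standard error, and "worst" value of each of the ten features, in that order. The column names themselves are hypothetical labels, and any fields beyond 33 in the actual file are not covered here.

```python
# Ten per-nucleus features, in the order listed (a) to (j) above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Assumption (from wpbc.names, not stated in this excerpt): each feature is
# summarised three ways, giving 10 x 3 = 30 columns for fields 4-33.
STATISTICS = ["mean", "se", "worst"]


def wpbc_columns():
    """Return hypothetical column names for fields 1-33."""
    columns = ["id", "outcome", "time"]
    for stat in STATISTICS:
        for feature in BASE_FEATURES:
            columns.append(f"{feature}_{stat}")
    return columns


columns = wpbc_columns()
print(len(columns))
print(columns[3], columns[13], columns[23])
```

A list like this could be passed as the header when reading the headerless data file, for example through the `names` argument of a CSV reader.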

Acknowledgements

Creators:

Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu

W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619

Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu

Inspiration

I'm really interested in trying out various machine learning algorithms on some real life science data.

UCI Datasets for Metaheuristic Feature Selection
kaggle.com
zip
Updated Nov 7, 2025
Cite
Deku (2025). UCI Datasets for Metaheuristic Feature Selection [Dataset]. https://www.kaggle.com/datasets/piyushsharma5654/uci-datasets-for-metaheuristic-feature-selection
Explore at:
Available download formats: zip (671346 bytes)
Dataset updated
Nov 7, 2025
Authors
Deku
Description
This dataset is a curated collection of 5 classic, publicly available datasets from the UCI Machine Learning Repository.

Context This collection was compiled for the purpose of benchmarking and evaluating machine learning algorithms, particularly metaheuristic-based feature selection (FS) algorithms. The datasets were used in the research paper: "Hybrid-FS: A Novel Feature Selection Algorithm Integrating Sine-Cosine Optimization and Genetic Operators for High-Dimensional Data Classification."

Content The dataset contains 5 separate files, all originating from the UCI ML Repository:

Ionosphere: 351 instances, 34 features, 2 classes

Sonar: 208 instances, 60 features, 2 classes

Waveform (v2): 5000 instances, 40 features, 3 classes

Wine: 178 instances, 13 features, 3 classes

Zoo: 101 instances, 16 features, 7 classes

Original Sources & Citation All datasets are provided as-is from the UCI Machine Learning Repository. Please cite the original creators of each dataset as specified on their respective UCI pages.

Ionosphere: https://archive.ics.uci.edu/dataset/52/ionosphere

Sonar: https://archive.ics.uci.edu/dataset/151/connectionist-bench-sonar-mines-vs-rocks

Waveform: https://archive.ics.uci.edu/dataset/108/waveform+database+generator+version+2

Wine: https://archive.ics.uci.edu/dataset/109/wine

Zoo: https://archive.ics.uci.edu/dataset/111/zoo

arrhythmia

openml.org

Updated Apr 6, 2014

Cite

H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu (2014). arrhythmia [Dataset]. https://www.openml.org/d/5

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Apr 6, 2014

Authors

H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu

Description

Author: H. Altay Guvenir, Burak Acar, Haldun Muderrisoglu
Source: UCI
Please cite: UCI

Cardiac Arrhythmia Database
The aim is to determine the type of arrhythmia from the ECG recordings. This database contains 279 attributes, 206 of which are linear valued and the rest are nominal.

Concerning the study of H. Altay Guvenir: "The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01 refers to 'normal' ECG classes, 02 to 15 refers to different classes of arrhythmia and class 16 refers to the rest of unclassified ones. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard we aim to minimize this difference by means of machine learning tools."

The names and id numbers of the patients were recently removed from the database.

Attribute Information

  1 Age: Age in years , linear
  2 Sex: Sex (0 = male; 1 = female) , nominal
  3 Height: Height in centimeters , linear
  4 Weight: Weight in kilograms , linear
  5 QRS duration: Average of QRS duration in msec., linear
  6 P-R interval: Average duration between onset of P and Q waves
   in msec., linear
  7 Q-T interval: Average duration between onset of Q and offset
   of T waves in msec., linear
  8 T interval: Average duration of T wave in msec., linear
  9 P interval: Average duration of P wave in msec., linear
 Vector angles in degrees on front plane of:, linear
 10 QRS
 11 T
 12 P
 13 QRST
 14 J
 15 Heart rate: Number of heart beats per minute ,linear
 Of channel DI:
  Average width, in msec., of: linear
  16 Q wave
  17 R wave
  18 S wave
  19 R' wave, small peak just after R
  20 S' wave
  21 Number of intrinsic deflections, linear
  22 Existence of ragged R wave, nominal
  23 Existence of diphasic derivation of R wave, nominal
  24 Existence of ragged P wave, nominal
  25 Existence of diphasic derivation of P wave, nominal
  26 Existence of ragged T wave, nominal
  27 Existence of diphasic derivation of T wave, nominal
 Of channel DII: 
  28 .. 39 (similar to 16 .. 27 of channel DI)
 Of channel DIII:
  40 .. 51
 Of channel AVR:
  52 .. 63
 Of channel AVL:
  64 .. 75
 Of channel AVF:
  76 .. 87
 Of channel V1:
  88 .. 99
 Of channel V2:
  100 .. 111
 Of channel V3:
  112 .. 123
 Of channel V4:
  124 .. 135
 Of channel V5:
  136 .. 147
 Of channel V6:
  148 .. 159
 Of channel DI:
  Amplitude, * 0.1 millivolt, of
  160 JJ wave, linear
  161 Q wave, linear
  162 R wave, linear
  163 S wave, linear
  164 R' wave, linear
  165 S' wave, linear
  166 P wave, linear
  167 T wave, linear
  168 QRSA , Sum of areas of all segments divided by 10,
    ( Area= width * height / 2 ), linear
  169 QRSTA = QRSA + 0.5 * width of T wave * 0.1 * height of T
    wave. (If T is diphasic then the bigger segment is
    considered), linear
 Of channel DII:
  170 .. 179
 Of channel DIII:
  180 .. 189
 Of channel AVR:
  190 .. 199
 Of channel AVL:
  200 .. 209
 Of channel AVF:
  210 .. 219
 Of channel V1:
  220 .. 229
 Of channel V2:
  230 .. 239
 Of channel V3:
  240 .. 249
 Of channel V4:
  250 .. 259
 Of channel V5:
  260 .. 269
 Of channel V6:
  270 .. 279
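
The per-channel numbering above follows a regular pattern, which the following sketch reproduces. The function name `channel_ranges` and the variable names are hypothetical; the channel order, starting attribute numbers (16 and 160), and block sizes (12 and 10) are read off the listing.

```python
# Channel order as it appears in the attribute listing.
CHANNELS = ["DI", "DII", "DIII", "AVR", "AVL", "AVF",
            "V1", "V2", "V3", "V4", "V5", "V6"]


def channel_ranges(first_attribute, block_size):
    """Map each channel to its inclusive (start, end) attribute numbers."""
    ranges = {}
    for k, channel in enumerate(CHANNELS):
        start = first_attribute + k * block_size
        ranges[channel] = (start, start + block_size - 1)
    return ranges


# First block: wave widths, intrinsic deflections, and ragged/diphasic flags
# (12 attributes per channel, starting at attribute 16).
wave_block = channel_ranges(16, 12)

# Second block: amplitudes plus QRSA and QRSTA
# (10 attributes per channel, starting at attribute 160).
amplitude_block = channel_ranges(160, 10)

print(wave_block["V6"], amplitude_block["V6"])
```

Note that the attribute numbers are 1-based, so subtract 1 when indexing into a 0-based array of columns.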

Class code - class - number of instances:

  01       Normal        245
  02       Ischemic changes (Coronary Artery Disease)  44
  03       Old Anterior Myocardial Infarction      15
  04       Old Inferior Myocardial Infarction      15
  05       Sinus tachycardy    13
  06       Sinus bradycardy    25
  07       Ventricular Premature Contraction (PVC)    3
  08       Supraventricular Premature Contraction    2
  09       Left bundle branch block     9 
  10       Right bundle branch block    50
  11       1. degree AtrioVentricular block    0 
  12       2. degree AV block        0
  13       3. degree AV block        0
  14       Left ventricule hypertrophy        4
  15       Atrial Fibrillation or Flutter        5
  16       Others         22
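
To make the imbalance in this table concrete, here is a small sketch that totals the class counts and computes a one-vs-rest ratio for a chosen class. The counts are copied from the table; the helper name `one_vs_rest_ratio` is hypothetical.

```python
# Class code -> number of instances, copied from the table above.
CLASS_COUNTS = {
    1: 245, 2: 44, 3: 15, 4: 15, 5: 13, 6: 25, 7: 3, 8: 2,
    9: 9, 10: 50, 11: 0, 12: 0, 13: 0, 14: 4, 15: 5, 16: 22,
}

total = sum(CLASS_COUNTS.values())

# Classes that are defined but have no instances in the data.
empty_classes = [code for code, n in CLASS_COUNTS.items() if n == 0]


def one_vs_rest_ratio(code):
    """Ratio of all other instances to the instances of one class."""
    n = CLASS_COUNTS[code]
    if n == 0:
        raise ValueError(f"class {code:02d} has no instances")
    return (total - n) / n


print(total, empty_classes, round(one_vs_rest_ratio(6), 1))
```

Three of the 16 class codes have no instances, so a classifier trained on this data can only ever see 13 distinct labels.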

tic-tac-toe
openml.org
Updated Apr 6, 2014
Cite
David W. Aha (2014). tic-tac-toe [Dataset]. https://www.openml.org/d/50
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 6, 2014
Authors
David W. Aha
Description
Author: David W. Aha
Source: UCI - 1991
Please cite: UCI

Tic-Tac-Toe Endgame database
This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").

Attribute Information

(x=player x has taken, o=player o has taken, b=blank)
1. top-left-square: {x,o,b}
2. top-middle-square: {x,o,b}
3. top-right-square: {x,o,b}
4. middle-left-square: {x,o,b}
5. middle-middle-square: {x,o,b}
6. middle-right-square: {x,o,b}
7. bottom-left-square: {x,o,b}
8. bottom-middle-square: {x,o,b}
9. bottom-right-square: {x,o,b}
10. Class: {positive,negative}
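
The target concept "win for x" can be written down directly from the eight three-in-a-row lines. The sketch below is illustrative only: the function names are hypothetical, and the two example boards are made up rather than taken from the data file.

```python
# Squares are indexed in attribute order: 0-2 top row, 3-5 middle row,
# 6-8 bottom row. These are the 8 possible three-in-a-row lines.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(squares):
    """True when player x occupies all three squares of some line."""
    return any(all(squares[i] == "x" for i in line) for line in WIN_LINES)


def class_label(squares):
    """Class value in the dataset's vocabulary: positive or negative."""
    return "positive" if x_wins(squares) else "negative"


# Two made-up boards in the same comma-separated layout as the attributes.
print(class_label("x,x,x,o,o,b,b,b,b".split(",")))
print(class_label("x,o,x,o,x,o,o,x,o".split(",")))
```

Since the label is a deterministic function of the nine squares, this dataset is often useful for checking whether a learner can recover a known rule.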
PhishingWebsites
openml.org
Updated Feb 16, 2016
Cite
Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae) (2016). PhishingWebsites [Dataset]. https://www.openml.org/d/4534
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 16, 2016
Authors
Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae)
Description
Author: Rami Mustafa A Mohammad (University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae)
Source: UCI
Please cite: Please refer to the Machine Learning Repository's citation policy

Source:

Rami Mustafa A Mohammad ( University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield,t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai,fadi '@' cud.ac.ae)

Data Set Information:

One of the challenges faced by our research was the unavailability of reliable training datasets. In fact this challenge faces any researcher in the field. However, although plenty of articles about predicting phishing websites have been disseminated these days, no reliable training dataset has been published publicly, maybe because there is no agreement in the literature on the definitive features that characterize phishing webpages, hence it is difficult to shape a dataset that covers all possible features. In this dataset, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features.

Attribute Information:

For Further information about the features see the features file in the data folder of UCI.

Relevant Papers:

Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions. ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0

Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643

Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709

Citation Request:

Please refer to the Machine Learning Repository's citation policy
Description of UCI HAR and UniMiB SHAR.
plos.figshare.com
xls
Updated Aug 13, 2024
Cite
Sarmela Raja Sekaran; Ying Han Pang; Lim Zheng You; Ooi Shih Yin (2024). Description of UCI HAR and UniMiB SHAR. [Dataset]. http://doi.org/10.1371/journal.pone.0304655.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0304655.t001
Dataset updated
Aug 13, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Sarmela Raja Sekaran; Ying Han Pang; Lim Zheng You; Ooi Shih Yin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Recognising human activities using smart devices has led to countless inventions in various domains like healthcare, security, sports, etc. Sensor-based human activity recognition (HAR), especially smartphone-based HAR, has become popular among the research community due to lightweight computation and user privacy protection. Deep learning models are the most preferred solutions in developing smartphone-based HAR as they can automatically capture salient and distinctive features from input signals and classify them into respective activity classes. However, in most cases, the architecture of these models needs to be deep and complex for better classification performance. Furthermore, training these models requires extensive computational resources. Hence, this research proposes a hybrid lightweight model that integrates an enhanced Temporal Convolutional Network (TCN) with Gated Recurrent Unit (GRU) layers for salient spatiotemporal feature extraction without tedious manual feature extraction. Essentially, dilations are incorporated into each convolutional kernel in the TCN-GRU model to extend the kernel’s field of view without imposing additional model parameters. Moreover, fewer short filters are applied for each convolutional layer to alleviate excess parameters. Despite reducing computational cost, the proposed model utilises dilations, residual connections, and GRU layers for longer-term time dependency modelling by retaining longer implicit features of the input inertial sequences throughout training to provide sufficient information for future prediction. The performance of the TCN-GRU model is verified on two benchmark smartphone-based HAR databases, i.e., UCI HAR and UniMiB SHAR. The model attains promising accuracy in recognising human activities with 97.25% on UCI HAR and 93.51% on UniMiB SHAR. 
Since the current study exclusively works on the inertial signals captured by smartphones, future studies will explore the generalisation of the proposed TCN-GRU across diverse datasets, including various sensor types, to ensure its adaptability across different applications.
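
The claim that dilation widens a kernel's field of view without adding parameters can be illustrated with the standard receptive-field formula for stacked, stride-1 1D convolutions: 1 plus the sum of (kernel size - 1) times each layer's dilation. The sketch below is not the paper's architecture; the kernel size, dilation rates, and channel counts are hypothetical values chosen only for illustration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 dilated 1D convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def conv1d_parameters(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 1D conv layer; dilation does not appear."""
    return kernel_size * in_channels * out_channels + out_channels


# Hypothetical settings, not taken from the paper.
kernel_size = 3
plain = receptive_field(kernel_size, [1, 1, 1, 1])    # no dilation
dilated = receptive_field(kernel_size, [1, 2, 4, 8])  # doubling dilation

print(plain, dilated)
print(conv1d_parameters(kernel_size, 32, 32))
```

With the same four layers and the same parameter count per layer, the dilated stack covers more than three times as many input time steps as the undilated one.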
Classifying wine varieties
kaggle.com
zip
Updated Jun 20, 2017
Cite
brynja (2017). Classifying wine varieties [Dataset]. https://www.kaggle.com/brynja/wineuci
Explore at:
Available download formats: zip (4305 bytes)
Dataset updated
Jun 20, 2017
Authors
brynja
Description
Context

Wine recognition dataset from UC Irvine. Great for testing out different classifiers.

Labels: "name" - Number denoting a specific wine class

Number of instances of each wine class

Class 1 - 59

Class 2 - 71

Class 3 - 48

Features:

Alcohol

Malic acid

Ash

Alcalinity of ash

Magnesium

Total phenols

Flavanoids

Nonflavanoid phenols

Proanthocyanins

Color intensity

Hue

OD280/OD315 of diluted wines

Proline

Content

"This data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines"

Acknowledgements

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }

UC Irvine data base: "https://archive.ics.uci.edu/ml/machine-learning-databases/wine"

Sources:
(a) Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
(b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
(c) July 1991

Past Usage:

(1) S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Technometrics).

The data was used with many others for comparing various classifiers. The classes are separable, though only RDA has achieved 100% correct classification. (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) (All results using the leave-one-out technique)

(2) S. Aeberhard, D. Coomans and O. de Vel, "THE CLASSIFICATION PERFORMANCE OF RDA" Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Journal of Chemometrics).
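
The results quoted above use the leave-one-out technique, with 1NN applied to z-transformed data. As a rough sketch of what that evaluation involves, the pure-Python code below standardises each feature and then classifies every sample by its nearest neighbour among the remaining samples. The function names are hypothetical, the z-transform is applied once to the full data (an assumption about the original protocol), and the toy rows are made up; they are not values from the wine data.

```python
import math


def z_transform(rows):
    """Standardise each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def loo_1nn_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    data = z_transform(rows)
    correct = 0
    for i, query in enumerate(data):
        best_label, best_dist = None, math.inf
        for j, other in enumerate(data):
            if j == i:  # the held-out sample may not vote for itself
                continue
            dist = math.dist(query, other)
            if dist < best_dist:
                best_label, best_dist = labels[j], dist
        correct += best_label == labels[i]
    return correct / len(data)


# Made-up two-feature samples for three classes (illustration only).
toy_rows = [[13.2, 1.8], [13.5, 1.7], [12.3, 2.5],
            [12.1, 2.6], [13.0, 3.9], [13.1, 4.1]]
toy_labels = [1, 1, 2, 2, 3, 3]
print(loo_1nn_accuracy(toy_rows, toy_labels))
```

Standardising matters for distance-based methods on this kind of data because the features are on very different scales, so an unscaled distance would be dominated by the largest-valued feature.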

Inspiration

This data set is great for drawing comparisons between algorithms and testing out classifications models when learning new techniques
census-income
huggingface.co
Updated Jul 21, 2025
Cite
WC (2025). census-income [Dataset]. https://huggingface.co/datasets/cestwc/census-income
Explore at:
Dataset updated
Jul 21, 2025
Authors
WC
Description
Dataset Card for Census Income (Adult)

This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset.

import pandas as pd
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel

# URLs
url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
url2 =… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
mfeat-factors
openml.org
Updated Apr 6, 2014
Cite
Robert P.W. Duin (2014). mfeat-factors [Dataset]. https://openml.org/search?type=data&sort=runs&id=12
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 6, 2014
Authors
Robert P.W. Duin
Description
Author: Robert P.W. Duin, Department of Applied Physics, Delft University of Technology
Source: UCI - 1998
Please cite: UCI

Multiple Features Dataset: Factors
One of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.

Attribute Information

The attributes represent 216 profile correlations. No more information is known.

Relevant Papers

A slightly different version of the database is used in
M. van Breukelen, R.P.W. Duin, D.M.J. Tax, and J.E. den Hartog, Handwritten digit recognition by combined classifiers, Kybernetika, vol. 34, no. 4, 1998, 381-386.

The database as is is used in:
A.K. Jain, R.P.W. Duin, J. Mao, Statistical Pattern Recognition: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence archive, Volume 22 Issue 1, January 2000
Daily Demand Forecasting Orders from UCI ML
kaggle.com
zip
Updated Jan 7, 2025
Cite
Pham Huyen (2025). Daily Demand Forecasting Orders from UCI ML [Dataset]. https://www.kaggle.com/datasets/phamhuyen286/daily-demand-forecasting-orders-from-uci-ml
Explore at:
Available download formats: zip (2870 bytes)
Dataset updated
Jan 7, 2025
Authors
Pham Huyen
Description
The dataset was collected over 60 days and is a real database from a Brazilian logistics company. It has twelve predictive attributes and a target, which is the total number of orders for daily treatment. The database was used in academic research at the Universidade Nove de Julho.
Annual Income - PCS5031 Project
datadryad.org
search.dataone.org
zip
Updated Sep 28, 2016
Cite
Emerson Cruz; Mauro Ohara (2016). Annual Income - PCS5031 Project [Dataset]. http://doi.org/10.15146/R3T88S
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.15146/R3T88S
Dataset updated
Sep 28, 2016
Dataset provided by
Dryad
Authors
Emerson Cruz; Mauro Ohara
Time period covered
Sep 28, 2016
Description
The data used in this project is a sample of 1994 census data from the US census database. The generated data will contain income prediction models for the selected sample. When new data on a specific person is entered, the model will indicate whether that person will reach a desired income level. A computational learning process will be applied to the data to perform inference through Bayesian networks.

Data from: Imbalanced dataset for benchmarking

data.niaid.nih.gov

Updated Jan 24, 2020

Cite

Lemaitre, Guillaume; Nogueira, Fernando; Aridas, Christos K.; Oliveira, Dayvid V. R. (2020). Imbalanced dataset for benchmarking [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_61452

Explore at:

Dataset updated

Jan 24, 2020

Dataset provided by

Universite de Bourgogne, Universitat de Girona
University of Patras
Universidade Federal de Pernambuco
ShoppeAI

Authors

Lemaitre, Guillaume; Nogueira, Fernando; Aridas, Christos K.; Oliveira, Dayvid V. R.

License

Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically

Description

Imbalanced dataset for benchmarking

The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in [1]. The following section presents the main characteristics of this benchmark.

Characteristics

ID	Name	Repository & Target	Ratio	# samples	# features
1	Ecoli	UCI, target: imU	8.6:1	336	7
2	Optical Digits	UCI, target: 8	9.1:1	5,620	64
3	SatImage	UCI, target: 4	9.3:1	6,435	36
4	Pen Digits	UCI, target: 5	9.4:1	10,992	16
5	Abalone	UCI, target: 7	9.7:1	4,177	8
6	Sick Euthyroid	UCI, target: sick euthyroid	9.8:1	3,163	25
7	Spectrometer	UCI, target: >=44	11:1	531	93
8	Car_Eval_34	UCI, target: good, v good	12:1	1,728	6
9	ISOLET	UCI, target: A, B	12:1	7,797	617
10	US Crime	UCI, target: >0.65	12:1	1,994	122
11	Yeast_ML8	LIBSVM, target: 8	13:1	2,417	103
12	Scene	LIBSVM, target: >one label	13:1	2,407	294
13	Libras Move	UCI, target: 1	14:1	360	90
14	Thyroid Sick	UCI, target: sick	15:1	3,772	28
15	Coil_2000	KDD, CoIL, target: minority	16:1	9,822	85
16	Arrhythmia	UCI, target: 06	17:1	452	279
17	Solar Flare M0	UCI, target: M->0	19:1	1,389	10
18	OIL	UCI, target: minority	22:1	937	49
19	Car_Eval_4	UCI, target: vgood	26:1	1,728	6
20	Wine Quality	UCI, wine, target: <=4	26:1	4,898	11
21	Letter Img	UCI, target: Z	26:1	20,000	16
22	Yeast_ME2	UCI, target: ME2	28:1	1,484	8
23	Webpage	LIBSVM, w7a, target: minority	33:1	49,749	300
24	Ozone Level	UCI, ozone, data	34:1	2,536	72
25	Mammography	UCI, target: minority	42:1	11,183	6
26	Protein homo.	KDD CUP 2004, minority	111:1	145,751	74
27	Abalone_19	UCI, target: 19	130:1	4,177	8
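
The Ratio column can be read as the number of non-target samples per target sample. The sketch below shows that calculation on made-up labels; the function name `imbalance_ratio` and the toy data are illustrative assumptions, and the exact convention used to produce the table is not stated in this description.

```python
from collections import Counter


def imbalance_ratio(labels, target_label):
    """Return the number of non-target samples per target sample."""
    counts = Counter(labels)
    minority = counts[target_label]
    if minority == 0:
        raise ValueError("target label does not occur in labels")
    majority = len(labels) - minority
    return majority / minority


# Toy labels, made up for illustration: 26 negatives and 3 positives.
toy_labels = ["neg"] * 26 + ["pos"] * 3
print(f"{imbalance_ratio(toy_labels, 'pos'):.1f}:1")  # prints 8.7:1
```

All labels other than the target are treated as a single majority class here, which mirrors the "target:" notation in the table under that assumption.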

References

[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

Annotated Benchmark of Real-World Data for Approximate Functional Dependency...
zenodo.org
data.niaid.nih.gov
csv
Updated Jul 1, 2023
Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren (2023). Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery [Dataset]. http://doi.org/10.5281/zenodo.8098909
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.8098909
Dataset updated
Jul 1, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.

The file ground_truth.csv is a comma-separated file containing approximate functional dependencies. The table column names the relation we refer to; lhs and rhs reference two columns of that relation for which we found that, semantically, lhs implies rhs.

The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.
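
To make the idea of an approximate functional dependency concrete, the following sketch computes the classic g3 error for a candidate lhs -> rhs: the fraction of tuples that would have to be removed for the dependency to hold exactly. This is an illustration only. The description does not define g3_prime, so the code uses the standard g3 measure rather than the authors' variant, and the function name and toy rows are assumptions.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Fraction of rows to remove so that lhs -> rhs holds exactly."""
    groups = defaultdict(Counter)
    n = 0
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        # Only tuples where both attributes have a value are considered.
        if left in (None, "") or right in (None, ""):
            continue
        groups[left][right] += 1
        n += 1
    if n == 0:
        raise ValueError("no row has a value for both columns")
    # For each lhs value, keep the rows that carry its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1 - kept / n


# Toy rows, made up for illustration; one row carries an inconsistent name.
toy_rows = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy"},
    {"AirportCode": "JFK", "AirportName": "JFK Intl"},
    {"AirportCode": "LAX", "AirportName": "Los Angeles"},
]
print(g3_error(toy_rows, "AirportCode", "AirportName"))  # prints 0.25
```

A value of 0 means the dependency holds exactly; small positive values indicate a dependency that holds apart from a few violating tuples.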

Dataset References

adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.

dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.

hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.

t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.

tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
solar-flare
openml.org
Updated Apr 6, 2017
(2017). solar-flare [Dataset]. https://www.openml.org/d/40686
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 6, 2017
Description
Author: Gary Bradshaw
Source: UCI
Please cite:

Solar Flare database

Relevant Information:
-- The database contains 3 potential classes, each counting the number of times a certain type of solar flare occurred in a 24-hour period.
-- Each instance represents captured features for 1 active region on the sun.
-- The data are divided into two sections. The second section (flare.data2) has had much more error correction applied to it, and has consequently been treated as more reliable.

Number of Instances: flare.data1: 323, flare.data2: 1066

Number of attributes: 13 (includes 3 class attributes)

Attribute Information

1. Code for class (modified Zurich class) (A,B,C,D,E,F,H)
2. Code for largest spot size (X,R,S,A,H,K)
3. Code for spot distribution (X,O,I,C)
4. Activity (1 = reduced, 2 = unchanged)
5. Evolution (1 = decay, 2 = no growth, 3 = growth)
6. Previous 24 hour flare activity code (1 = nothing as big as an M1, 2 = one M1, 3 = more activity than one M1)
7. Historically-complex (1 = Yes, 2 = No)
8. Did region become historically complex on this pass across the sun's disk (1 = yes, 2 = no)
9. Area (1 = small, 2 = large)
10. Area of the largest spot (1 = <=5, 2 = >5)

From all these predictors three classes of flares are predicted, which are represented in the last three columns.

Number of C-class flares produced by this region in the following 24 hours (common flares)

Number of M-class flares produced by this region in the following 24 hours (moderate flares)

Number of X-class flares produced by this region in the following 24 hours (severe flares)

CLASSTYPE: nominal CLASSINDEX: first
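
The attribute list above can be turned into a small decoder. The sketch below is only an illustration: the field names are invented here, the sample record is hypothetical, and it assumes one whitespace-separated record per line with the 13 attributes in the listed order, which this description does not confirm.

```python
# Field names are illustrative; the order follows the attribute list above.
FIELDS = [
    "zurich_class", "largest_spot_size", "spot_distribution",
    "activity", "evolution", "previous_24h_activity",
    "historically_complex", "became_historically_complex",
    "area", "largest_spot_area",
    "c_class_flares", "m_class_flares", "x_class_flares",
]
LETTER_FIELDS = set(FIELDS[:3])  # letter codes are kept as they are
CODES = {
    "activity": {1: "reduced", 2: "unchanged"},
    "evolution": {1: "decay", 2: "no growth", 3: "growth"},
    "previous_24h_activity": {
        1: "nothing as big as an M1",
        2: "one M1",
        3: "more activity than one M1",
    },
    "historically_complex": {1: "yes", 2: "no"},
    "became_historically_complex": {1: "yes", 2: "no"},
    "area": {1: "small", 2: "large"},
    "largest_spot_area": {1: "<=5", 2: ">5"},
}


def decode_record(line):
    """Decode one whitespace-separated record into a dict of readable values."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(tokens)}")
    record = {}
    for name, token in zip(FIELDS, tokens):
        if name in LETTER_FIELDS:
            record[name] = token
        elif name in CODES:
            record[name] = CODES[name][int(token)]
        else:
            record[name] = int(token)  # flare counts stay numeric
    return record


# Hypothetical record, not taken from the actual data files.
sample = decode_record("C S O 1 2 1 1 2 1 2 0 0 0")
print(sample["evolution"], sample["c_class_flares"])
```

Keeping the three flare counts as integers leaves them usable as regression or classification targets.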
DAGHAR: A Benchmark for Domain Adaptation and Generalization in...
zenodo.org
zip
Updated Sep 7, 2024
Otávio Oliveira Napoli; Dami Duarte; Patrick Alves; Darlinne Hubert Palo Soto; Henrique Evangelista de Oliveira; Anderson Rocha; Levy Boccato; Edson Borin (2024). DAGHAR: A Benchmark for Domain Adaptation and Generalization in Smartphone-Based Human Activity Recognition [Dataset]. http://doi.org/10.5281/zenodo.11992126
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.11992126
Dataset updated
Sep 7, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Otávio Oliveira Napoli; Dami Duarte; Patrick Alves; Darlinne Hubert Palo Soto; Henrique Evangelista de Oliveira; Anderson Rocha; Levy Boccato; Edson Borin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The DAGHAR benchmark is a curated dataset collection designed for domain adaptation and domain generalization studies in human activity recognition (HAR) tasks using inertial sensors such as accelerometers and gyroscopes. It was introduced in the work "A benchmark for domain adaptation and generalization in smartphone-based human activity recognition". It features raw inertial sensor data sourced exclusively from smartphones. Six public datasets were selected and standardized in terms of accelerometer units of measurement, sampling rate, gravity component, activity labels, user partitioning, and time window size. This standardization makes it possible to build a comprehensive benchmark for evaluating the generalization capabilities of HAR models in cross-dataset scenarios.

The benchmark is based on the following datasets:

Ku-HAR, from "Sikder, N. and Nahid, A.A., 2021. KU-HAR: An open dataset for heterogeneous human activity recognition. Pattern Recognition Letters, 146, pp.46-54", available at Mendeley. Distributed under CC BY 4.0.

MotionSense, from "Malekzadeh, M., Clegg, R.G., Cavallaro, A. and Haddadi, H., 2019, April. Mobile sensor data anonymization. In Proceedings of the international conference on internet of things design and implementation (pp. 49-58)", available at Kaggle. Distributed under Open Data Commons Open Database License (ODbL) v1.0.

RealWorld, from "Sztyler, T. and Stuckenschmidt, H., 2016, March. On-body localization of wearable devices: An investigation of position-aware activity recognition. In 2016 IEEE international conference on pervasive computing and communications (PerCom) (pp. 1-9). IEEE", available at this link. We obtained explicit permission from the original authors to distribute a copy of the preprocessed data.

UCI-HAR, from "Reyes-Ortiz, J.L., Oneto, L., Samà, A., Parra, X. and Anguita, D., 2016. Transition-aware human activity recognition using smartphones. Neurocomputing, 171, pp.754-767", available at UCI Repository. Distributed under CC BY 4.0.

WISDM, from "Weiss, G.M., Yoneda, K. and Hayajneh, T., 2019. Smartphone and smartwatch-based biometrics using activities of daily living. Ieee Access, 7, pp.133190-133202", available at UCI repository. Distributed under CC BY 4.0.
UCI drug name dataset
kaggle.com
zip
Updated Jan 23, 2024
Ahmed Alghali (2024). UCI drug name dataset [Dataset]. https://www.kaggle.com/datasets/ahmedalghali/uci-drug-name-dataset
Explore at:
Available download formats: zip (76366968 bytes)
Dataset updated
Jan 23, 2024
Authors
Ahmed Alghali
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Dataset

This dataset was created by Ahmed Alghali

Released under Database: Open Database, Contents: Database Contents

Contents

