100+ datasets found

UCI datasets
ieee-dataport.org
Updated May 14, 2025
Cite
Yuan Sun (2025). UCI datasets [Dataset]. https://ieee-dataport.org/documents/uci-datasets
Explore at:
Dataset updated
May 14, 2025
Authors
Yuan Sun
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
biology
UCI Machine Learning Repository Dataset
paperswithcode.com
Cite
Jan N. van Rijn; Jonathan K. Vis, UCI Machine Learning Repository Dataset [Dataset]. https://paperswithcode.com/dataset/uci-machine-learning-repository
Explore at:
Authors
Jan N. van Rijn; Jonathan K. Vis
Description
UCI Machine Learning Repository is a collection of over 550 datasets.
UCI dataset
springernature.figshare.com
bin
Updated Mar 13, 2023
Cite
Wan-Ting Hsieh; Sergio González Vázquez; Trista Chen (2023). UCI dataset [Dataset]. http://doi.org/10.6084/m9.figshare.20496258.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.20496258.v1
Dataset updated
Mar 13, 2023
Dataset provided by
Figshare http://figshare.com/
Authors
Wan-Ting Hsieh; Sergio González Vázquez; Trista Chen
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).

[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
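
As a rough consistency check on the figures quoted in this description, the average segment length can be recovered from the totals. The sketch below assumes that all segments have the same length, which the description does not state; the names used are illustrative only.

```python
# Figures quoted in the description above.
SAMPLING_RATE_HZ = 125
TOTAL_SEGMENTS = 291_078
TOTAL_HOURS = 404  # "around 404 hours", so the result is approximate


def average_segment_seconds(total_hours, total_segments):
    """Average duration of one segment, assuming equal-length segments."""
    return total_hours * 3600 / total_segments


seconds = average_segment_seconds(TOTAL_HOURS, TOTAL_SEGMENTS)
samples = seconds * SAMPLING_RATE_HZ
print(f"{seconds:.2f} s per segment, about {samples:.0f} samples")
```

Under that assumption the totals work out to roughly 5 seconds per segment, or about 625 samples at 125 Hz.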
UCI Machine Learning Repository
scicrunch.org
Cite
UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_026571
Description
Collection of databases, domain theories, and data generators that are used by the machine learning community for empirical analysis of machine learning algorithms. Datasets approved for the repository will be assigned a Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.
UCI and OpenML Data Sets for Ordinal Quantification
data.niaid.nih.gov
zenodo.org
Updated Jul 25, 2023
Cite
Bunse, Mirko (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
Explore at:
Dataset updated
Jul 25, 2023
Dataset provided by
Moreo, Alejandro
Bunse, Mirko
Sebastiani, Fabrizio
Senz, Martin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
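
The two protocols can be illustrated with a minimal Python sketch. It is not the authors' implementation (theirs is in Julia, in the repository linked under Further Reading). Uniform sampling from the probability simplex is done here with sorted uniform cut points. The smoothness measure, a sum of squared second differences, is an assumption made for illustration, since this description does not define how smoothness is scored. All function names are hypothetical.

```python
import random


def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex."""
    cuts = sorted(rng.random() for _ in range(n_classes - 1))
    bounds = [0.0] + cuts + [1.0]
    return [bounds[i + 1] - bounds[i] for i in range(n_classes)]


def roughness(p):
    """Sum of squared second differences (assumed smoothness measure)."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))


def sample_app_oq(n_samples, n_classes, keep=0.2, seed=0):
    """APP sampling followed by keeping the smoothest `keep` fraction."""
    rng = random.Random(seed)
    samples = [sample_prevalences(n_classes, rng) for _ in range(n_samples)]
    samples.sort(key=roughness)  # smoothest distributions first
    return samples[: max(1, round(n_samples * keep))]


smooth_samples = sample_app_oq(n_samples=100, n_classes=5)
print(len(smooth_samples), "of 100 APP samples kept")
```

Sorting by roughness and truncating mirrors the "smoothest 20%" wording; any other ordinal smoothness score could be substituted for `roughness` without changing the rest of the sketch.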

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
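
Given that layout, a quick way to inspect an extracted file is to compute the distribution of the "class_label" column, which is the quantity that quantification methods estimate. The following is a minimal sketch using only the Python standard library; the in-memory sample and the feature column name `feature_1` are made up for illustration and do not come from the actual files.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file):
    """Return the relative frequency of each value in the "class_label" column."""
    reader = csv.DictReader(csv_file)  # the first row is the header
    counts = Counter(row["class_label"] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


# Made-up stand-in for an extracted CSV file.
example = io.StringIO("class_label,feature_1\n1,0.5\n2,0.1\n2,0.7\n3,0.9\n")
print(label_distribution(example))
```

For a real file, pass an open file object instead, for example `open(path, newline="")` with `path` pointing at one of the extracted CSV files.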

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
UCI Heart Disease Data
gts.ai
json
Updated Jan 26, 2025
Cite
GTS (2025). UCI Heart Disease Data [Dataset]. https://gts.ai/dataset-download/uci-heart-disease-data/
Explore at:
Available download formats: json
Dataset updated
Jan 26, 2025
Dataset provided by
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
Authors
GTS
Description
The UCI Heart Disease Dataset with 14 key attributes for machine learning & research. Ideal for predictive modeling.
UCI Machine Learning Datasets 12/2013
academictorrents.com
bittorrent
Updated Dec 20, 2013
Cite
UCI (2013). UCI Machine Learning Datasets 12/2013 [Dataset]. https://academictorrents.com/details/7fafb101f9c7961f9b840daeb4af43039107ddef
Explore at:
Available download formats: bittorrent (16365432846)
Dataset updated
Dec 20, 2013
Dataset authored and provided by
UCI
License
No license specified: https://academictorrents.com/nolicensespecified
Description
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
UCI dataset
ieee-dataport.org
Updated Jun 12, 2024
Cite
Wutao Xiong (2024). UCI dataset [Dataset]. https://ieee-dataport.org/documents/uci-dataset
Explore at:
Dataset updated
Jun 12, 2024
Authors
Wutao Xiong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
and different customers have different starting times
Bike Rental Data Set - UCI
kaggle.com
Updated Nov 30, 2022
Cite
Víctor Aguado (2022). Bike Rental Data Set - UCI [Dataset]. https://www.kaggle.com/datasets/aguado/bike-rental-data-set-uci
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Nov 30, 2022
Dataset provided by
Kaggle http://kaggle.com/
Authors
Víctor Aguado
Description
Description

Bicycle rental systems in large cities automate the collection and return of vehicles through a network of stations distributed throughout the metropolis. With these systems, people can rent a bike at one location and return it at another, depending on their needs. The data generated by these systems are attractive to researchers because of variables such as trip duration, departure and destination points, and travel time. Bike-sharing systems therefore work as a network of sensors that is useful for mobility studies. To improve management, one of these companies needs to anticipate the demand in a given time range depending on factors such as the time slot, the type of day (weekday or holiday), the weather, etc.

The objective of this data set is to predict the demand in a series of specific time slots, using the historical data set as the basis to build a linear model.

Data Description

Two data sets will be delivered containing the number of rented bicycles in different time slots:

Training data. These contain the response variable (number of bicycles rented in that time slot).

Test data. These do not contain the response variable, which must be predicted based on the historical data of the training set.

The variables present in the 2 data sets are:

id: time slot identifier (not related to time order)

year: year (2011 or 2012)

hour: hour of the day (0 to 23)

season: 1 = winter, 2 = spring, 3 = summer, 4 = autumn

holiday: if the day was a holiday

workingday: if the day was a working day (neither a holiday nor a weekend)

weather: four categories (1 to 4) ranging from best to worst weather

temp: temperature in degrees Celsius

atemp: sensation of temperature in degrees Celsius

humidity: relative humidity

windspeed: wind speed (km/h)

count (only in the training set): total number of rentals in that time slot
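
The stated objective is a linear model of demand. As a minimal sketch of what that means for a single predictor, the closed-form least-squares fit below regresses `count` on one numeric variable such as `temp`. The numbers in the usage lines are invented toy values, not taken from the data set, and a realistic model would use several of the variables listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(x, slope, intercept):
    """Predicted response for a single predictor value."""
    return slope * x + intercept


# Invented toy values: temperature (degrees Celsius) and rental counts.
toy_temp = [5.0, 10.0, 15.0, 20.0]
toy_count = [20.0, 40.0, 60.0, 80.0]
slope, intercept = fit_line(toy_temp, toy_count)
print(predict(25.0, slope, intercept))
```

Categorical variables such as `season` or `weather` would normally be one-hot encoded before being used in a linear model, since their numeric codes are labels rather than quantities.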
kr-vs-kp
openml.org
Updated Apr 6, 2014
Cite
Alen Shapiro (2014). kr-vs-kp [Dataset]. https://www.openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=%3D_2&qualities.NumberOfInstances=gte_0&id=3
Explore at:
Croissant
Dataset updated
Apr 6, 2014
Authors
Alen Shapiro
Description
Author: Alen Shapiro
Source: UCI
Please cite: UCI citation policy

Title: Chess End-Game -- King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7). The pawn on a7 means it is one square away from queening. It is the King+Rook's side (white) to move.

Sources: (a) Database originally generated and described by Alen Shapiro. (b) Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk). (c) Date: 1 August 1989

Past Usage:

Alen D. Shapiro (1983,1987), "Structured Induction in Expert Systems", Addison-Wesley. This book is based on Shapiro's Ph.D. thesis (1983) at the University of Edinburgh entitled "The Role of Structured Induction in Expert Systems".

Stephen Muggleton (1987), "Structuring Knowledge by Asking Questions", pp.218-229 in "Progress in Machine Learning", edited by I. Bratko and Nada Lavrac, Sigma Press, Wilmslow, England SK9 5BB.

Robert C. Holte, Liane Acker, and Bruce W. Porter (1989), "Concept Learning and the Problem of Small Disjuncts", Proceedings of IJCAI. Also available as technical report AI89-106, Computer Sciences Department, University of Texas at Austin, Austin, Texas 78712.

Relevant Information: The dataset format is described below. Note: the format of this database was modified on 2/26/90 to conform with the format of all the other databases in the UCI repository of machine learning databases.

Number of Instances: 3196 total

Number of Attributes: 36

Attribute Summaries: Classes (2): -- White-can-win ("won") and White-cannot-win ("nowin"). I believe that White is deemed to be unable to win if the Black pawn can safely advance. Attributes: see Shapiro's book.

Missing Attributes: -- none

Class Distribution: In 1669 of the positions (52%), White can win. In 1527 of the positions (48%), White cannot win.

The format for instances in this database is a sequence of 37 attribute values. Each instance is a board description for this chess endgame. The first 36 attributes describe the board. The last (37th) attribute is the classification: "won" or "nowin". There are 0 missing values. A typical board description is

f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won

The names of the features do not appear in the board descriptions. Instead, each feature corresponds to a particular position in the feature-value list. For example, the head of this list is the value for the feature "bkblk". The following is the list of features, in the order in which their values appear in the feature-value list:

[bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd, hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr, skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg]

In the file, there is one instance (board position) per line.
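
Because the feature names are positional, a parser has to pair each value with its name explicitly. The sketch below does this for the typical board description quoted above; the function name `parse_instance` is illustrative, and the feature order is copied from the list above.

```python
FEATURES = (
    "bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,"
    "dwipd,hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,"
    "skach,skewr,skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg"
).split(",")


def parse_instance(line):
    """Split one line into a feature dictionary and its class label."""
    values = line.strip().split(",")
    if len(values) != len(FEATURES) + 1:
        raise ValueError(f"expected {len(FEATURES) + 1} values, got {len(values)}")
    return dict(zip(FEATURES, values[:-1])), values[-1]


example = (
    "f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,"
    "t,f,f,f,f,f,f,f,t,t,n,won"
)
board, label = parse_instance(example)
print(board["bkblk"], board["dwipd"], label)
```

The length check guards against truncated lines, which would otherwise silently shift the class label into a feature slot.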

Num Instances: 3196
Num Attributes: 37
Num Continuous: 0 (Int 0 / Real 0)
Num Discrete: 37
Missing values: 0 / 0.0%
https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html Dataset
paperswithcode.com
Updated Oct 28, 2024
Cite
(2024). https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html Dataset [Dataset]. https://paperswithcode.com/dataset/https-kdd-ics-uci-edu-databases-kddcup99
Explore at:
Dataset updated
Oct 28, 2024
Obesity DataSet UCI ML
kaggle.com
Updated Feb 23, 2022
Cite
Tathagat Banerjee (2022). Obesity DataSet UCI ML [Dataset]. https://www.kaggle.com/datasets/tathagatbanerjee/obesity-dataset-uci-ml
Explore at:
Croissant
Dataset updated
Feb 23, 2022
Dataset provided by
Kaggle http://kaggle.com/
Authors
Tathagat Banerjee
Description
Estimation of obesity levels based on eating habits and physical condition Data Set

Abstract: This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.

Data Set Characteristics: Multivariate
Number of Instances: 2111
Area: Life
Attribute Characteristics: Integer
Number of Attributes: 17
Date Donated: 2019-08-27
Associated Tasks: Classification, Regression, Clustering
Missing Values? N/A
Number of Web Hits: 70843

Source:

Fabio Mendoza Palechor, Email: fmendoza1 '@' cuc.edu.co, Cellphone: +573182929611
Alexis de la Hoz Manotas, Email: akdelahoz '@' gmail.com, Cellphone: +573017756983

Data Set Information:

This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contain 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; 23% of the data was collected directly from users through a web platform.

Attribute Information:

Read the article ([Web Link]) to see the description of the attributes.

Relevant Papers:

[1] Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.
[2] De-La-Hoz-Correa, E., Mendoza Palechor, F., De-La-Hoz-Manotas, A., Morales Ortega, R., & Sánchez Hernández, A. B. (2019). Obesity level estimation software based on decision trees.

Citation Request:

[1] Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.
UCI Diabetes Data Set
kaggle.com
Updated May 1, 2020
Cite
Ergin Altıntaş (2020). UCI Diabetes Data Set [Dataset]. https://www.kaggle.com/ealtintas/uci-machine-learning-repository-diabetes-data-set/tasks
Explore at:
Croissant
Dataset updated
May 1, 2020
Dataset provided by
Kaggle http://kaggle.com/
Authors
Ergin Altıntaş
Description
About this Dataset

This CSV contains a data set prepared for participants in the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.

Content

Original files were obtained from: https://archive.ics.uci.edu/ml/datasets/diabetes

The archived file diabetes-data.tar.z, which contains 70 sets of data recorded on diabetes patients (several weeks' to months' worth of glucose, insulin, and lifestyle data per patient, plus a description of the problem domain), was extracted, processed, and merged into a single CSV file.

The Code field of the CSV is deciphered as follows:

33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59 = Post-breakfast blood glucose measurement
60 = Pre-lunch blood glucose measurement
61 = Post-lunch blood glucose measurement
62 = Pre-supper blood glucose measurement
63 = Post-supper blood glucose measurement
64 = Pre-snack blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
67 = More-than-usual meal ingestion
68 = Less-than-usual meal ingestion
69 = Typical exercise activity
70 = More-than-usual exercise activity
71 = Less-than-usual exercise activity
72 = Unspecified special event
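
For analysis it is convenient to hold this legend as a lookup table. The sketch below transcribes the codes listed above into a Python dictionary; the helper `is_glucose_measurement` and the constant names are illustrative and not part of the data set.

```python
CODE_MEANINGS = {
    33: "Regular insulin dose",
    34: "NPH insulin dose",
    35: "UltraLente insulin dose",
    48: "Unspecified blood glucose measurement",
    57: "Unspecified blood glucose measurement",
    58: "Pre-breakfast blood glucose measurement",
    59: "Post-breakfast blood glucose measurement",
    60: "Pre-lunch blood glucose measurement",
    61: "Post-lunch blood glucose measurement",
    62: "Pre-supper blood glucose measurement",
    63: "Post-supper blood glucose measurement",
    64: "Pre-snack blood glucose measurement",
    65: "Hypoglycemic symptoms",
    66: "Typical meal ingestion",
    67: "More-than-usual meal ingestion",
    68: "Less-than-usual meal ingestion",
    69: "Typical exercise activity",
    70: "More-than-usual exercise activity",
    71: "Less-than-usual exercise activity",
    72: "Unspecified special event",
}


def is_glucose_measurement(code):
    """True for codes whose meaning is a blood glucose measurement."""
    return "blood glucose measurement" in CODE_MEANINGS.get(code, "")


print(CODE_MEANINGS[58], is_glucose_measurement(58), is_glucose_measurement(33))
```

Codes that are missing from the legend simply return `False` here; whether the CSV contains such codes is not stated in this description.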
Replication Data for: Scalable Kernel Mean Matching
search.dataone.org
Updated Nov 21, 2023
Cite
Chandra, Swarup (2023). Replication Data for: Scalable Kernel Mean Matching [Dataset]. http://doi.org/10.7910/DVN/ELFPEM
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ELFPEM
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Chandra, Swarup
Description
Datasets available at the UCI Machine Learning Repository and other repositories. List of datasets used in the experiment with their sources:
ForestCover dataset @ https://archive.ics.uci.edu/ml/datasets/Covertype
KDD Cup99 dataset @ https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data
PAMAP dataset @ https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring
Powersupply @ http://www.cse.fau.edu/~xqzhu/stream.html
SEA @ http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift
Syn002 & Syn003 (generated) @ http://moa.cms.waikato.ac.nz/details/classification/streams/
MNIST @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
News20 @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

arrhythmia

openml.org

Updated Apr 6, 2014

Cite

H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu (2014). arrhythmia [Dataset]. https://www.openml.org/d/5

Explore at:

Croissant

Dataset updated

Apr 6, 2014

Authors

H. Altay Guvenir; Burak Acar; Haldun Muderrisoglu

Description

Author: H. Altay Guvenir, Burak Acar, Haldun Muderrisoglu
Source: UCI
Please cite: UCI

Cardiac Arrhythmia Database
The aim is to determine the type of arrhythmia from the ECG recordings. This database contains 279 attributes, 206 of which are linear valued and the rest are nominal.

Concerning the study of H. Altay Guvenir: "The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01 refers to 'normal' ECG classes, 02 to 15 refer to different classes of arrhythmia and class 16 refers to the rest of unclassified ones. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard we aim to minimize this difference by means of machine learning tools."

The names and id numbers of the patients were recently removed from the database.

Attribute Information

  1 Age: Age in years , linear
  2 Sex: Sex (0 = male; 1 = female) , nominal
  3 Height: Height in centimeters , linear
  4 Weight: Weight in kilograms , linear
  5 QRS duration: Average of QRS duration in msec., linear
  6 P-R interval: Average duration between onset of P and Q waves
   in msec., linear
  7 Q-T interval: Average duration between onset of Q and offset
   of T waves in msec., linear
  8 T interval: Average duration of T wave in msec., linear
  9 P interval: Average duration of P wave in msec., linear
 Vector angles in degrees on front plane of:, linear
 10 QRS
 11 T
 12 P
 13 QRST
 14 J
 15 Heart rate: Number of heart beats per minute ,linear
 Of channel DI:
  Average width, in msec., of: linear
  16 Q wave
  17 R wave
  18 S wave
  19 R' wave, small peak just after R
  20 S' wave
  21 Number of intrinsic deflections, linear
  22 Existence of ragged R wave, nominal
  23 Existence of diphasic derivation of R wave, nominal
  24 Existence of ragged P wave, nominal
  25 Existence of diphasic derivation of P wave, nominal
  26 Existence of ragged T wave, nominal
  27 Existence of diphasic derivation of T wave, nominal
 Of channel DII: 
  28 .. 39 (similar to 16 .. 27 of channel DI)
 Of channel DIII:
  40 .. 51
 Of channel AVR:
  52 .. 63
 Of channel AVL:
  64 .. 75
 Of channel AVF:
  76 .. 87
 Of channel V1:
  88 .. 99
 Of channel V2:
  100 .. 111
 Of channel V3:
  112 .. 123
 Of channel V4:
  124 .. 135
 Of channel V5:
  136 .. 147
 Of channel V6:
  148 .. 159
 Of channel DI:
  Amplitude, * 0.1 millivolt, of
  160 JJ wave, linear
  161 Q wave, linear
  162 R wave, linear
  163 S wave, linear
  164 R' wave, linear
  165 S' wave, linear
  166 P wave, linear
  167 T wave, linear
  168 QRSA , Sum of areas of all segments divided by 10,
    ( Area= width * height / 2 ), linear
  169 QRSTA = QRSA + 0.5 * width of T wave * 0.1 * height of T
    wave. (If T is diphasic then the bigger segment is
    considered), linear
 Of channel DII:
  170 .. 179
 Of channel DIII:
  180 .. 189
 Of channel AVR:
  190 .. 199
 Of channel AVL:
  200 .. 209
 Of channel AVF:
  210 .. 219
 Of channel V1:
  220 .. 229
 Of channel V2:
  230 .. 239
 Of channel V3:
  240 .. 249
 Of channel V4:
  250 .. 259
 Of channel V5:
  260 .. 269
 Of channel V6:
  270 .. 279
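
The per-channel numbering above follows a regular pattern: attributes 16 to 159 come in blocks of 12 per channel, and attributes 160 to 279 in blocks of 10 per channel, with the channels in the same order both times. A small sketch of that arithmetic follows; the block names "wave_shape" and "amplitude" and the function name are labels chosen here for illustration, not names used by the database.

```python
CHANNELS = ["DI", "DII", "DIII", "AVR", "AVL", "AVF",
            "V1", "V2", "V3", "V4", "V5", "V6"]

# Block name -> (first attribute number for channel DI, attributes per channel).
BLOCKS = {
    "wave_shape": (16, 12),   # widths, deflection count, ragged/diphasic flags
    "amplitude": (160, 10),   # wave amplitudes, QRSA and QRSTA
}


def attribute_range(channel, block):
    """Return the (first, last) attribute numbers for a channel and block."""
    start, size = BLOCKS[block]
    first = start + CHANNELS.index(channel) * size
    return first, first + size - 1


print(attribute_range("V6", "wave_shape"), attribute_range("AVL", "amplitude"))
```

The attribute numbers are 1-based, as in the listing above, so subtract 1 when indexing columns of a zero-based array.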

Class code - class - number of instances:

  01   Normal                                        245
  02   Ischemic changes (Coronary Artery Disease)     44
  03   Old Anterior Myocardial Infarction             15
  04   Old Inferior Myocardial Infarction             15
  05   Sinus tachycardia                              13
  06   Sinus bradycardia                              25
  07   Ventricular Premature Contraction (PVC)         3
  08   Supraventricular Premature Contraction          2
  09   Left bundle branch block                        9
  10   Right bundle branch block                      50
  11   1st degree AtrioVentricular block               0
  12   2nd degree AV block                             0
  13   3rd degree AV block                             0
  14   Left ventricular hypertrophy                    4
  15   Atrial Fibrillation or Flutter                  5
  16   Others                                         22

KOS bag of words data
kaggle.com
Updated May 5, 2017
Cite
PhilipHarmuth (2017). KOS bag of words data [Dataset]. https://www.kaggle.com/datasets/harmuth/bagofwords/discussion
Explore at:
Croissant
Dataset updated
May 5, 2017
Dataset provided by
Kaggle http://kaggle.com/
Authors
PhilipHarmuth
Description
Data Set Information:

Taken from https://archive.ics.uci.edu/ml/datasets/bag+of+words

For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by only keeping words that occurred more than ten times. Individual document names (i.e., an identifier for each docID) are not provided for copyright reasons.

These data sets have no class labels, and for copyright reasons no filenames or other document-level metadata. These data sets are ideal for clustering and topic modeling experiments.

KOS blog entries: orig source: dailykos.com D=3430 W=6906 N=467714

Attribute Information:

The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:

D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
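
A reader for this layout only needs to consume the three header values and then the triples. The sketch below assumes whitespace-separated integers, as the format description suggests; the function name and the tiny in-memory example are invented for illustration.

```python
import io


def read_docword(stream):
    """Parse a docword file: 3 header lines (D, W, NNZ), then NNZ triples."""
    n_docs = int(stream.readline())
    n_words = int(stream.readline())
    nnz = int(stream.readline())
    triples = []
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        doc_id, word_id, count = map(int, line.split())
        triples.append((doc_id, word_id, count))
    if len(triples) != nnz:
        raise ValueError(f"header says {nnz} triples, found {len(triples)}")
    return n_docs, n_words, triples


# Invented miniature example: 2 documents, 3 vocabulary words, 3 nonzero counts.
tiny = io.StringIO("2\n3\n3\n1 1 2\n1 3 1\n2 2 4\n")
n_docs, n_words, triples = read_docword(tiny)
print(n_docs, n_words, sum(count for _, _, count in triples))
```

Summing the third field over all triples gives N, the total number of words in the collection, which can be compared against the N value listed for each collection.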
UCI datasets
data.niaid.nih.gov
zenodo.org
Updated Apr 4, 2023
Cite
Drton, Mathias (2023). UCI datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7681647
Explore at:
Dataset updated
Apr 4, 2023
Dataset provided by
Zadorozhnyi, Oleksandr
Drton, Mathias
Reifferscheidt, David
Haug, Stephan
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding

Air Quality

US census 1990

Size: Two datasets of sizes 9471*17 and 2458285*68, respectively

Number of features: 15-68

Ground truth: No

Type of Graph: No ground truth

More information about the datasets is contained in the dataset_description.html files.
UCI Datasets: "Air quality" and "US Census (1990)"
zenodo.org
bin, csv, html
Updated Jan 27, 2025
Cite
Zenodo (2025). UCI Datasets: "Air quality" and "US Census (1990)" [Dataset]. http://doi.org/10.5281/zenodo.8063512
Explore at:
Available download formats: bin, csv, html
Unique identifier
https://doi.org/10.5281/zenodo.8063512
Dataset updated
Jan 27, 2025
Dataset provided by
Zenodo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
Two preprocessed datasets collected from the UCI repository that can be used for the purpose of structure learning from multivariate data of different types.

Air Quality

This dataset represents hourly averaged measurements of 5 metal oxide chemical sensors embedded in an air quality chemical multisensor device. The certified analyzer was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses [1]. More information about the attributes and their type can be found in airqualitydataset_description.html.

Size of dataset: 9358
Number of Features: 16
Type of data: discrete and continuous
Ground Truth: No

Contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer. There are 15 attributes: Date and Time, as well as discrete and real covariates.

0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
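
Attributes 0 and 1 store the timestamp in two text fields with unusual separators (slashes in the date, dots in the time). A minimal sketch of combining them into a single Python `datetime` is shown below, assuming the fields follow the DD/MM/YYYY and HH.MM.SS patterns exactly as listed; the function name and the example values are illustrative.

```python
from datetime import datetime


def parse_timestamp(date_text, time_text):
    """Combine a DD/MM/YYYY date and an HH.MM.SS time into one datetime."""
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %H.%M.%S")


# Illustrative values within the stated recording period.
stamp = parse_timestamp("10/03/2004", "18.00.00")
print(stamp.isoformat())
```

Note the day-first order: a parser that guesses the format could read 10/03/2004 as October 3rd, so it is safer to state the format explicitly as done here.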

US Census (1990)

This dataset is a discretized version of the USCensus1990raw dataset. The data was collected as part of the 1990 census, and it describes one percent sample of the Public Use Microdata Samples (PUMS) person records drawn from the full 1990 census sample (all fifty states and the District of Columbia but not including "PUMA Cross State Lines One Percent Persons Records") [2]. More information about the attributes and their type can be found in census1990_description.html.

Size of dataset: 2458285
Number of features: 68
Ground truth: No

References:

[1] S. De Vito and E. Massera and M. Piga and L. Martinotto and G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005 https://doi.org/10.1016/j.snb.2007.09.060

[2] Meek, Thiesson and Heckerman (2001), "The Learning Curve Method Applied to Clustering", The Journal of Machine Learning Research. (Also see MSR-TR-2001-34, available at https://www.microsoft.com/en-us/research/wp-content/uploads/2001/01/lc-aistats.pdf)
Open-source data sets for classification task from UCI repository and...
figshare.com
txt
Updated Aug 31, 2024
Cite
xh niu (2024). Open-source data sets for classification task from UCI repository and Scikit-learn in section 4 [Dataset]. http://doi.org/10.6084/m9.figshare.26886055.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.26886055.v1
Dataset updated
Aug 31, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
xh niu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets from Scikit-learn are: ‘Iris’, ‘Wine’, ‘Breast Cancer Wisconsin (Diagnostic)’. Datasets from the UCI repository are: ‘Seeds’, ‘Banknote Authentication’ (‘Banknotes’), ‘Heart disease’, ‘Parkinsons’, ‘Ecoli’, ‘Thyroid (Thyroid gland data)’.
UCI Heart Disease Data Set
kaggle.com
Updated Jan 1, 2021
Cite
Lourens Walters (2021). UCI Heart Disease Data Set [Dataset]. https://www.kaggle.com/lourenswalters/uci-heart-disease-data-set/metadata
Dataset updated
Jan 1, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Lourens Walters
Description
Context

The dataset used can be found on the UCI Machine Learning Repository at the following location:

Heart Disease Dataset

There are several copies of this dataset to be found on Kaggle, with people focusing on different types of analyses of the data. This specific copy can be analysed by anyone interested, but is primarily used by a study group from the Udacity Bertelsmann Technology Scholarship to practice analysis of association between variables as well as implementation and comparison of various Machine Learning models.

Content

According to the paper by Detrano et al. (1989), as found on the UCI dataset webpage, the data was collected for 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The 13 independent (feature) variables can be divided into 3 groups as follows:

Routine evaluation (based on historical data):

ECG at rest

Serum Cholesterol

Fasting blood sugar

Non-invasive test data (informed consent obtained for data as part of research protocol):

Exercise ECG

ST-segment peak slope (upsloping, flat or downsloping)

ST-segment depression

Exercise Thallium scintigraphy (fixed, reversible or none)

Cardiac fluoroscopy (number of vessels that appeared to contain calcium)

Other demographic and clinical variables (based on routine data):

Age

Sex

Chest pain type

Systolic blood pressure

ST-T-wave abnormality (T-wave abnormality)

Probable or definite ventricular hypertrophy (Estes' criteria)

The dependent (response) variable was the angiographic test result indicating a >50% diameter narrowing.
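
A minimal sketch of how such a response variable might be turned into a binary label. It assumes the target is stored as an integer score from 0 to 4, where 0 means less than 50% diameter narrowing and any higher value means more than 50%, as in the processed Cleveland file distributed by UCI. This Kaggle copy may encode the target differently, so check the data dictionary first. The function name and the toy scores are hypothetical and are not real patient records.

```python
def binarize_target(score: int) -> int:
    """Map an assumed 0-4 angiographic score to 0 (<50% narrowing) or 1 (>50% narrowing)."""
    if not 0 <= score <= 4:
        raise ValueError(f"expected a score between 0 and 4, got {score}")
    return int(score > 0)


# Toy scores for illustration only; these are not taken from the dataset.
toy_scores = [0, 2, 1, 0, 4, 3]
labels = [binarize_target(s) for s in toy_scores]
print(labels)
```

Collapsing the score to two classes matches the response variable described above, at the cost of discarding any severity information the original score may carry.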

Data Dictionary

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3632459%2Fa01747fb0158dc51c12bc0824c9c4ae4%2Fdata_dictionary2.png?generation=1609522473018549&alt=media

Acknowledgements

UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Donor:

David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

Inspiration

The objective of the analysis is to use statistical learning to identify factors associated with coronary artery disease, as indicated by a coronary angiography interpreted by a cardiologist (as per the paper by Detrano et al. cited above).
