100+ datasets found

UCI datasets
ieee-dataport.org
Updated May 14, 2025
Cite
Yuan Sun (2025). UCI datasets [Dataset]. https://ieee-dataport.org/documents/uci-datasets
Dataset updated
May 14, 2025
Authors
Yuan Sun
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
biology
UCI dataset
springernature.figshare.com
bin
Updated Mar 13, 2023
Cite
Wan-Ting Hsieh; Sergio González Vázquez; Trista Chen (2023). UCI dataset [Dataset]. http://doi.org/10.6084/m9.figshare.20496258.v1
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.20496258.v1
Dataset updated
Mar 13, 2023
Dataset provided by
Figshare, http://figshare.com/
Authors
Wan-Ting Hsieh; Sergio González Vázquez; Trista Chen
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).

[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
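The segment count and total duration quoted in the description imply an average segment length, which the text does not state. The short check below derives it from the quoted figures only; because the 404 hours is given as "around", the result is an estimate, not a property documented by the dataset authors.

```python
# Figures quoted in the dataset description above.
SAMPLING_RATE_HZ = 125
NUM_SEGMENTS = 291_078
TOTAL_HOURS = 404  # stated as "around 404 hours", so results are approximate


def mean_segment_seconds(total_hours: float, num_segments: int) -> float:
    """Average segment duration in seconds implied by the totals."""
    return total_hours * 3600 / num_segments


seconds = mean_segment_seconds(TOTAL_HOURS, NUM_SEGMENTS)
samples = seconds * SAMPLING_RATE_HZ
print(f"{seconds:.2f} s per segment, about {samples:.0f} samples at 125 Hz")
```

Under these figures each segment averages roughly five seconds of signal.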
uci-ml-repo
huggingface.co
Updated Mar 23, 2025
Cite
Top MBA Applicants (2025). uci-ml-repo [Dataset]. https://huggingface.co/datasets/TopMBAApplicants/uci-ml-repo
Dataset updated
Mar 23, 2025
Dataset authored and provided by
Top MBA Applicants
Description
TopMBAApplicants/uci-ml-repo dataset hosted on Hugging Face and contributed by the HF Datasets community
UCI Machine Learning Repository
kaggle.com
Updated Dec 13, 2024
Cite
MD.Romzan Alom (2024). UCI Machine Learning Repository [Dataset]. https://www.kaggle.com/datasets/mdromzanalom/uci-machine-learning-repository/suggestions
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
Dec 13, 2024
Dataset provided by
Kaggle, http://kaggle.com/
Authors
MD.Romzan Alom
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by MD.Romzan Alom

Released under MIT

Contents
Basic information on 40 datasets from UCI repository used in this study...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Gregor Stiglic; Simon Kocbek; Igor Pernek; Peter Kokol (2023). Basic information on 40 datasets from UCI repository used in this study including information about number of instances, attributes, classes, length of longest attribute name (LAN) and length of the longest nominal attribute value (LAV). [Dataset]. http://doi.org/10.1371/journal.pone.0033812.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0033812.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Gregor Stiglic; Simon Kocbek; Igor Pernek; Peter Kokol
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Basic information on 40 datasets from UCI repository used in this study including information about number of instances, attributes, classes, length of longest attribute name (LAN) and length of the longest nominal attribute value (LAV).
UCI Machine Learning Datasets 12/2013
academictorrents.com
bittorrent
Updated Dec 20, 2013
Cite
UCI (2013). UCI Machine Learning Datasets 12/2013 [Dataset]. https://academictorrents.com/details/7fafb101f9c7961f9b840daeb4af43039107ddef
Available download formats: bittorrent (16365432846)
Dataset updated
Dec 20, 2013
Dataset authored and provided by
UCI
License
https://academictorrents.com/nolicensespecified
Description
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
UCI and OpenML Data Sets for Ordinal Quantification
data.niaid.nih.gov
zenodo.org
Updated Jul 25, 2023
Cite
Moreo, Alejandro (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
Dataset updated
Jul 25, 2023
Dataset provided by
Bunse, Mirko
Moreo, Alejandro
Sebastiani, Fabrizio
Senz, Martin
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
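The artificial prevalence protocol described above can be illustrated with a small sketch. This is not the authors' implementation (their scripts are written in Julia, and the published index files should be used for exact replication). It only shows the idea of drawing a label distribution uniformly from the probability simplex and then sampling data items to match it. The function names and the choice to sample with replacement are assumptions made for illustration, and the smoothness filter of APP-OQ is not shown.

```python
import random
from collections import defaultdict


def sample_prevalences(num_classes, rng):
    """Draw a label distribution uniformly from the probability simplex."""
    # The gaps between sorted uniform cut points on [0, 1] are uniformly
    # distributed over the simplex.
    cuts = sorted(rng.random() for _ in range(num_classes - 1))
    bounds = [0.0] + cuts + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def draw_app_sample(labels, sample_size, rng):
    """Return item indices for one sample that follows a random prevalence vector."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    prevalences = sample_prevalences(len(classes), rng)
    # Assumption: choose a class per slot by prevalence, then an item of that
    # class with replacement.
    chosen = rng.choices(classes, weights=prevalences, k=sample_size)
    return [rng.choice(by_class[c]) for c in chosen]


rng = random.Random(0)
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]  # made-up ordinal labels
sample_indices = draw_app_sample(toy_labels, sample_size=5, rng=rng)
print(sample_indices)
```

Each call produces one sample, which corresponds to one row of indices in the sense used by the index files mentioned above.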

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
heart-disease-data
kaggle.com
zip
Updated Aug 5, 2020
Cite
Nagaveda Reddy (2020). heart-disease-data [Dataset]. https://www.kaggle.com/nagavedareddy/heartdiseasedata
Available download formats: zip (3494 bytes)
Dataset updated
Aug 5, 2020
Authors
Nagaveda Reddy
Description
Dataset

This dataset was created by Nagaveda Reddy

Contents
UCI Diabetes Data Set
kaggle.com
Updated May 1, 2020
Cite
Ergin Altıntaş (2020). UCI Diabetes Data Set [Dataset]. https://www.kaggle.com/ealtintas/uci-machine-learning-repository-diabetes-data-set/tasks
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
May 1, 2020
Dataset provided by
Kaggle, http://kaggle.com/
Authors
Ergin Altıntaş
Description
About this Dataset

This CSV contains a data set prepared for the use of participants in the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.

Content

Original files were obtained from: https://archive.ics.uci.edu/ml/datasets/diabetes

The archived file diabetes-data.tar.z, which contains 70 sets of data recorded on diabetes patients (several weeks' to months' worth of glucose, insulin, and lifestyle data per patient, plus a description of the problem domain), was extracted, processed, and merged into a single CSV file.

The Code field of the CSV is deciphered as follows:

33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59 = Post-breakfast blood glucose measurement
60 = Pre-lunch blood glucose measurement
61 = Post-lunch blood glucose measurement
62 = Pre-supper blood glucose measurement
63 = Post-supper blood glucose measurement
64 = Pre-snack blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
67 = More-than-usual meal ingestion
68 = Less-than-usual meal ingestion
69 = Typical exercise activity
70 = More-than-usual exercise activity
71 = Less-than-usual exercise activity
72 = Unspecified special event
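When working with the Code field, a lookup table is a convenient way to decode values. The sketch below copies the mapping from the list above; the helper names `describe_code` and `is_blood_glucose` are hypothetical and not part of the dataset.

```python
# Code-to-description mapping, copied from the dataset description.
CODE_DESCRIPTIONS = {
    33: "Regular insulin dose",
    34: "NPH insulin dose",
    35: "UltraLente insulin dose",
    48: "Unspecified blood glucose measurement",
    57: "Unspecified blood glucose measurement",
    58: "Pre-breakfast blood glucose measurement",
    59: "Post-breakfast blood glucose measurement",
    60: "Pre-lunch blood glucose measurement",
    61: "Post-lunch blood glucose measurement",
    62: "Pre-supper blood glucose measurement",
    63: "Post-supper blood glucose measurement",
    64: "Pre-snack blood glucose measurement",
    65: "Hypoglycemic symptoms",
    66: "Typical meal ingestion",
    67: "More-than-usual meal ingestion",
    68: "Less-than-usual meal ingestion",
    69: "Typical exercise activity",
    70: "More-than-usual exercise activity",
    71: "Less-than-usual exercise activity",
    72: "Unspecified special event",
}


def describe_code(code: int) -> str:
    """Return the description for a code, or a placeholder for unknown codes."""
    return CODE_DESCRIPTIONS.get(code, f"Unknown code {code}")


def is_blood_glucose(code: int) -> bool:
    """True when the code denotes a blood glucose measurement."""
    return "blood glucose" in CODE_DESCRIPTIONS.get(code, "")


print(describe_code(58))
print(is_blood_glucose(33))
```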
Open-source data sets for classification task from UCI repository and...
figshare.com
txt
Updated Aug 31, 2024
Cite
xh niu (2024). Open-source data sets for classification task from UCI repository and Scikit-learn in section 4 [Dataset]. http://doi.org/10.6084/m9.figshare.26886055.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.26886055.v1
Dataset updated
Aug 31, 2024
Dataset provided by
Figshare, http://figshare.com/
Authors
xh niu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets from Scikit-learn are: ‘Iris’, ‘Wine’, ‘Breast Cancer Wisconsin (Diagnostic)’. Datasets from UCI repository are: ‘Seeds’, ‘Banknote Authentication’ (‘Banknotes’), ‘Heart disease’, ‘Parkinsons’, ‘Ecoli’, ‘Thyroid (Thyroid gland data)’.
uci-uni
networkrepository.com
csv
Updated Feb 28, 2016
Cite
Network Data Repository (2016). uci-uni [Dataset]. https://networkrepository.com/socfb-uci-uni.php
Available download formats: csv
Dataset updated
Feb 28, 2016
Dataset authored and provided by
Network Data Repository
License
https://networkrepository.com/policy.php
Description
Facebook social network - A social friendship network extracted from Facebook consisting of people (nodes) with edges representing friendship ties.
Replication Data for: Scalable Kernel Mean Matching
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Chandra, Swarup (2023). Replication Data for: Scalable Kernel Mean Matching [Dataset]. http://doi.org/10.7910/DVN/ELFPEM
Unique identifier
https://doi.org/10.7910/DVN/ELFPEM
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Chandra, Swarup
Description
Datasets available at UCI Machine Learning Repository and other repositories. List of datasets used in the experiment with their sources.
ForestCover dataset @ https://archive.ics.uci.edu/ml/datasets/Covertype
KDD Cup99 dataset @ https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data
PAMAP dataset @ https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring
Powersupply @ http://www.cse.fau.edu/~xqzhu/stream.html
SEA @ http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift
Syn002 & Syn003 (generated) @ http://moa.cms.waikato.ac.nz/details/classification/streams/
MNIST @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
News20 @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
King Rook King (UCI ML Repo)
kaggle.com
Updated Feb 21, 2021
Cite
MayankDubey (2021). King Rook King (UCI ML Repo) [Dataset]. https://www.kaggle.com/mayankdubey1196/king-rook-king-uci-ml-repo/activity
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
Feb 21, 2021
Dataset provided by
Kaggle, http://kaggle.com/
Authors
MayankDubey
Description
Dataset

This dataset was created by MayankDubey

Contents

Data from: Imbalanced dataset for benchmarking

data.niaid.nih.gov
zenodo.org

Updated Jan 24, 2020

Cite

Nogueira, Fernando (2020). Imbalanced dataset for benchmarking [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_61452

Dataset updated

Jan 24, 2020

Dataset provided by

Lemaitre, Guillaume
Nogueira, Fernando
Aridas, Christos K.
Oliveira, Dayvid V. R.

License

Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically

Description

Imbalanced dataset for benchmarking

The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in [1]. The following section presents the main characteristics of this benchmark.

Characteristics

ID	Name	Repository & Target	Ratio	# samples	# features
1	Ecoli	UCI, target: imU	8.6:1	336	7
2	Optical Digits	UCI, target: 8	9.1:1	5,620	64
3	SatImage	UCI, target: 4	9.3:1	6,435	36
4	Pen Digits	UCI, target: 5	9.4:1	10,992	16
5	Abalone	UCI, target: 7	9.7:1	4,177	8
6	Sick Euthyroid	UCI, target: sick euthyroid	9.8:1	3,163	25
7	Spectrometer	UCI, target: >=44	11:1	531	93
8	Car_Eval_34	UCI, target: good, v good	12:1	1,728	6
9	ISOLET	UCI, target: A, B	12:1	7,797	617
10	US Crime	UCI, target: >0.65	12:1	1,994	122
11	Yeast_ML8	LIBSVM, target: 8	13:1	2,417	103
12	Scene	LIBSVM, target: >one label	13:1	2,407	294
13	Libras Move	UCI, target: 1	14:1	360	90
14	Thyroid Sick	UCI, target: sick	15:1	3,772	28
15	Coil_2000	KDD, CoIL, target: minority	16:1	9,822	85
16	Arrhythmia	UCI, target: 06	17:1	452	279
17	Solar Flare M0	UCI, target: M->0	19:1	1,389	10
18	OIL	UCI, target: minority	22:1	937	49
19	Car_Eval_4	UCI, target: vgood	26:1	1,728	6
20	Wine Quality	UCI, wine, target: <=4	26:1	4,898	11
21	Letter Img	UCI, target: Z	26:1	20,000	16
22	Yeast_ME2	UCI, target: ME2	28:1	1,484	8
23	Webpage	LIBSVM, w7a, target: minority	33:1	49,749	300
24	Ozone Level	UCI, ozone, data	34:1	2,536	72
25	Mammography	UCI, target: minority	42:1	11,183	6
26	Protein homo.	KDD CUP 2004, minority	111:1	145,751	74
27	Abalone_19	UCI, target: 19	130:1	4,177	8
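The Ratio column can be read as the number of non-target samples per target-class sample. The sketch below computes such a ratio from a list of labels. Pooling all non-target classes into a single majority group is an assumption about how the table was built, and the toy labels are made up for illustration.

```python
from collections import Counter


def imbalance_ratio(labels, minority):
    """Ratio of all other samples to samples of the minority (target) class."""
    counts = Counter(labels)
    minority_count = counts[minority]
    if minority_count == 0:
        raise ValueError(f"no samples with label {minority!r}")
    # Assumption: every class other than the target counts as majority.
    majority_count = len(labels) - minority_count
    return majority_count / minority_count


# Made-up labels: 86 "other" samples and 10 samples of the target class.
toy_labels = ["other"] * 86 + ["target"] * 10
ratio = imbalance_ratio(toy_labels, "target")
print(f"{ratio:.1f}:1")
```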

References

[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

UCI Datasets: "Air quality" and "US Census (1990)"
zenodo.org
bin, csv, html
Updated Jan 27, 2025
Cite
Zenodo (2025). UCI Datasets: "Air quality" and "US Census (1990)" [Dataset]. http://doi.org/10.5281/zenodo.8063512
Available download formats: bin, csv, html
Unique identifier
https://doi.org/10.5281/zenodo.8063512
Dataset updated
Jan 27, 2025
Dataset provided by
Zenodo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
Two preprocessed datasets collected from the UCI repository that can be used for the purpose of structure learning from multivariate data of different types.

Air Quality

This dataset represents hourly averaged measurements of 5 metal oxide chemical sensors embedded in an air quality chemical multisensor device. The certified analyzer was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses [1]. More information about the attributes and their type can be found in airqualitydataset_description.html.

Size of dataset: 9358
Number of Features: 16
Type of data: discrete and continuous
Ground Truth: No

Contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer. There are 15 attributes: date and time, as well as discrete and real-valued covariates.

0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
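The Date and Time attributes use the formats DD/MM/YYYY and HH.MM.SS, with dots in the time field, so they need an explicit format string when parsed. A minimal sketch follows; the sample values are made up and only chosen to fall within the recording period mentioned above.

```python
from datetime import datetime


def parse_timestamp(date_text: str, time_text: str) -> datetime:
    """Combine a DD/MM/YYYY date and an HH.MM.SS time into one datetime."""
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %H.%M.%S")


# Hypothetical row values, not taken from the dataset.
stamp = parse_timestamp("10/03/2004", "18.00.00")
print(stamp.isoformat())
```

Combining both fields into one timestamp makes it straightforward to sort the hourly records or check them for gaps.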

US Census (1990)

This dataset is a discretized version of the USCensus1990raw dataset. The data was collected as part of the 1990 census, and it describes one percent sample of the Public Use Microdata Samples (PUMS) person records drawn from the full 1990 census sample (all fifty states and the District of Columbia but not including "PUMA Cross State Lines One Percent Persons Records") [2]. More information about the attributes and their type can be found in census1990_description.html.

Size of dataset: 2458285
Number of features: 68
Ground truth: No

References:

[1] S. De Vito and E. Massera and M. Piga and L. Martinotto and G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005 https://doi.org/10.1016/j.snb.2007.09.060

[2] Meek, Thiesson and Heckerman (2001), "The Learning Curve Method Applied to Clustering", The Journal of Machine Learning Research. (Also see MSR-TR-2001-34 available at https://www.microsoft.com/en-us/research/wp-content/uploads/2001/01/lc-aistats.pdf)
DodgerLoopGame UCR Archive Dataset
zenodo.org
data.niaid.nih.gov
bin
Updated May 14, 2024
Cite
Zenodo (2024). DodgerLoopGame UCR Archive Dataset [Dataset]. http://doi.org/10.5281/zenodo.11186628
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.11186628
Dataset updated
May 14, 2024
Dataset provided by
Zenodo, http://zenodo.org/
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.

The traffic data are collected with a loop sensor installed on a ramp for the 101 North freeway in Los Angeles. This location is close to Dodgers Stadium; therefore the traffic is affected by the volume of visitors to the stadium. Missing values are represented with NaN in the text file.

- Class 1: Normal Day
- Class 2: Game Day

There is nothing to infer from the order of examples in the train and test set. Data created by Ihler, Alexander, Jon Hutchins, and Padhraic Smyth (see [1][2][3]). Data edited by Chin-Chia Michael Yeh.

[1] Ihler, Alexander, Jon Hutchins, and Padhraic Smyth. "Adaptive event detection with time-varying poisson processes." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

[2] “UCI Machine Learning Repository: Dodgers Loop Sensor Data Set.” UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets/dodgers+loop+sensor.

[3] “Caltrans PeMS.” Caltrans, pems.dot.ca.gov/.

Donator: C. Yeh
uci-shopper
huggingface.co
Updated Aug 4, 2023
Cite
John Henning (2023). uci-shopper [Dataset]. https://huggingface.co/datasets/jlh/uci-shopper
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
Aug 4, 2023
Authors
John Henning
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for Online Shoppers Purchasing Intention Dataset

Dataset Summary

This dataset is a reupload of the Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository.

NOTE: The information below is from the original dataset description from UCI's website.

Overview

Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1908) were positive class samples… See the full description on the dataset page: https://huggingface.co/datasets/jlh/uci-shopper.
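The class counts quoted in the overview can be checked with a few lines of arithmetic. The sketch below uses only the figures from the text. The remark about a majority-class baseline is an added observation, not something stated in the dataset card.

```python
# Figures quoted in the overview above.
TOTAL_SESSIONS = 12_330
NEGATIVE_SESSIONS = 10_422

positive_sessions = TOTAL_SESSIONS - NEGATIVE_SESSIONS
negative_share = NEGATIVE_SESSIONS / TOTAL_SESSIONS

print(f"positive sessions: {positive_sessions}")
print(f"negative share: {negative_share:.1%}")
# A classifier that always predicts the negative class would reach this
# accuracy, so accuracy alone is a weak measure on this data.
```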
Default of credit card clients
kaggle.com
Updated Oct 3, 2019
Cite
Marios Michalopoulos (2019). Default of credit card clients [Dataset]. https://www.kaggle.com/mariosfish/default-of-credit-card-clients/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 3, 2019
Dataset provided by
Kaggle
Authors
Marios Michalopoulos
Description
Context

This notebook was created for analysis and prediction making of the Default of credit card clients Data Set from UCI Machine Learning Library. The data set can be accessed separately from the UCI Machine Learning Repository page, here.

Content

In their paper "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. (Yeh I. C. & Lien C. H.,2009)", which can be found here, Yeh I. C. & Lien C. H. review six data mining techniques (discriminant analysis, logistic regression, Bayes classifier, nearest neighbor, artificial neural networks, and classification trees) and their applications on credit scoring. Then, using the real cardholders’ credit risk data in Taiwan, they compare the classification accuracy among them.

Models

We will create 3 models in order to make predictions and compare them with the original paper. These models are:
- Logistic Regression
- Decision tree
- Neural Network

After the initial predictions, each model will be "optimized" by GridSearchCV estimator, which will search for the best set of hyperparameters for every model.

Goal

Using the models we created, we will try to predict the class value of dpnm column with better scores (accuracy and f1) than the scores presented in the original paper.
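The two scores named in the goal, accuracy and F1, can be computed directly from true and predicted labels. The sketch below is a plain-Python illustration with made-up labels. It is not the notebook's code, which may well rely on a library implementation, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def binary_f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: define F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = default, 0 = no default.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
f1 = binary_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

Reporting both scores matters when the classes are unbalanced, since accuracy can look good even when few positive cases are found.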
UCI Heart Disease Data Set
kaggle.com
Updated Jan 1, 2021
Cite
Lourens Walters (2021). UCI Heart Disease Data Set [Dataset]. https://www.kaggle.com/lourenswalters/uci-heart-disease-data-set/metadata
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
Jan 1, 2021
Dataset provided by
Kaggle, http://kaggle.com/
Authors
Lourens Walters
Description
Context

The dataset used can be found on the UCI Machine Learning Repository at the following location:

Heart Disease Dataset

There are several copies of this dataset to be found on Kaggle, with people focusing on different types of analyses of the data. This specific copy can be analysed by anyone interested, but is primarily used by a study group from the Udacity Bertelsmann Technology Scholarship to practice analysis of association between variables as well as implementation and comparison of various Machine Learning models.

Content

According to the paper by Detrano et al. (1989), as found on the UCI dataset webpage, the data were collected for 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The 13 independent (feature) variables can be divided into 3 groups as follows:

Routine evaluation (based on historical data):

ECG at rest

Serum Cholesterol

Fasting blood sugar

Non-invasive test data (informed consent obtained for data as part of research protocol):

Exercise ECG

ST-segment peak slope (upsloping, flat or downsloping)

ST-segment depression

Exercise Thallium scintigraphy (fixed, reversible or none)

Cardiac fluoroscopy (number of vessels that appeared to contain calcium)

Other demographic and clinical variables (based on routine data):

Age

Sex

Chest pain type

Systolic blood pressure

ST-T-wave abnormality (T-wave abnormality)

Probable or definite ventricular hypertrophy (Estes' criteria)

The dependent/ response variable was the angiographic test result indicating a >50% diameter narrowing.

Data Dictionary

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3632459%2Fa01747fb0158dc51c12bc0824c9c4ae4%2Fdata_dictionary2.png?generation=1609522473018549&alt=media

Acknowledgements

UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Donor:

David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

Inspiration

The objective of the analysis is to use statistical learning to identify factors associated with Coronary Artery Disease as indicated by a coronary angiography interpreted by a Cardiologist (as per paper written by Detrano et al cited before).
Details of the datasets from UCI repository used in the experiments.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
QingJun Song; HaiYan Jiang; Qinghui Song; XieGuang Zhao; Xiaoxuan Wu (2023). Details of the datasets from UCI repository used in the experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0184834.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0184834.t003
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
QingJun Song; HaiYan Jiang; Qinghui Song; XieGuang Zhao; Xiaoxuan Wu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Details of the datasets from UCI repository used in the experiments.
