UCI Machine Learning Repository is a collection of over 550 datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository is a subset of the MIMIC-II Waveform Dataset. It contains 12,000 records of simultaneous PPG and ABP from 942 patients, sampled at 125 Hz. The 12,000 records were split uniformly into four parts of 3,000 records each. However, because subject information is lacking, a hold-one-out strategy was used to generate the training, validation, and test sets once the data had been preprocessed. In the end, the UCI dataset comprised 291,078 segments, or around 404 hours of recording, making it by far the largest dataset, with a considerably higher ratio of continuous segments per record (32.15).
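As a rough consistency check on the figures above, the following sketch derives the average segment length implied by the segment count and total duration. The constant and function names are illustrative, and because the source gives "around 404 hours", the result is only approximate.

```python
# Figures quoted in the paragraph above.
SAMPLING_RATE_HZ = 125
TOTAL_SEGMENTS = 291_078
TOTAL_HOURS = 404  # stated as "around 404 hours", so results are approximate


def mean_segment_seconds(total_hours: float, n_segments: int) -> float:
    """Average segment duration in seconds implied by the totals."""
    return total_hours * 3600 / n_segments


seconds = mean_segment_seconds(TOTAL_HOURS, TOTAL_SEGMENTS)
samples = seconds * SAMPLING_RATE_HZ  # samples per segment at 125 Hz
print(f"{seconds:.2f} s per segment, about {samples:.0f} samples")
```

Under these numbers the segments come out at roughly five seconds each, which is an inference from the totals rather than a figure stated by the source.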
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
biology
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
and different customers have different starting times
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding
Air Quality
US census 1990
Size: Two datasets of sizes 9471*17 and 2458285*68, respectively
Number of features: 15-68
Ground truth: No
Type of Graph: No ground truth
More information about the datasets is contained in the dataset_description.html files.
Existing bicycle rental systems in large cities automate the collection and return of vehicles through a network of stations distributed throughout the metropolis. With these systems, people can rent a bike at one location and return it at a different one, depending on their needs. The data generated by these systems are attractive to researchers because of variables such as the duration of the trip, departure and destination points, and travel time. Bicycle-sharing systems therefore work as a network of sensors that is useful for mobility studies. To improve management, one of these companies needs to anticipate the demand in a given time range depending on factors such as the time of day, the type of day (weekday or holiday), the weather, etc.
The objective of this data set is to predict the demand in a series of specific time slots, using the historical data set as the basis to build a linear model.
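To make the modelling goal concrete, here is a minimal sketch of fitting a one-variable linear model by ordinary least squares. The feature (temperature), the toy numbers, and the function name `fit_line` are assumptions made for illustration only; they are not taken from the delivered data sets, whose variables are not listed here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up toy history: temperature (deg C) vs. bikes rented in one time slot.
temperature = [5.0, 10.0, 15.0, 20.0, 25.0]
rentals = [20.0, 30.0, 40.0, 50.0, 60.0]

intercept, slope = fit_line(temperature, rentals)
predicted = intercept + slope * 18.0  # forecast demand for an 18 deg C slot
print(intercept, slope, predicted)
```

A real model would likely use several predictors at once (time slot, type of day, weather), which calls for multiple regression rather than this single-feature fit.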
Two data sets will be delivered containing the number of rented bicycles in different time slots:
The variables present in the 2 data sets are:
https://cubig.ai/store/terms-of-service
1) Data Introduction • The Diabetes UCI Dataset is a structured dataset designed for early-stage diabetes risk prediction, collected through questionnaire-based responses from patients at the Sylhet Diabetes Hospital in Bangladesh.
2) Data Utilization (1) Characteristics of the Diabetes UCI Dataset: • This dataset includes 16 key symptoms of diabetes such as age, gender, sudden weight loss, polyuria, polyphagia, and visual blurring, each recorded as binary indicators (Yes/No). The Class column serves as a binary classification label indicating whether the individual has diabetes (Positive/Negative). • All features are discrete or binary variables, making the dataset highly interpretable and well-structured for medical domain applications.
(2) Applications of the Diabetes UCI Dataset: • Training Early Diabetes Prediction Models: The dataset can be used to train machine learning binary classification models that predict the likelihood of diabetes onset based on various symptom-related features. • Risk Factor Analysis and Clinical Decision Support: It can be applied to statistical analysis of symptom influence on diabetes diagnosis, or to support the development of clinical decision support systems in healthcare environments.
Author: H. Altay Guvenir, Burak Acar, Haldun Muderrisoglu
Source: UCI
Please cite: UCI
Cardiac Arrhythmia Database
The aim is to determine the type of arrhythmia from the ECG recordings. This database contains 279 attributes, 206 of which are linear valued and the rest are nominal.
Concerning the study of H. Altay Guvenir: "The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01 refers to 'normal' ECG, classes 02 to 15 refer to different classes of arrhythmia, and class 16 refers to the rest of the unclassified ones. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard, we aim to minimize this difference by means of machine learning tools."
The names and id numbers of the patients were recently removed from the database.
1 Age: Age in years , linear
2 Sex: Sex (0 = male; 1 = female) , nominal
3 Height: Height in centimeters , linear
4 Weight: Weight in kilograms , linear
5 QRS duration: Average of QRS duration in msec., linear
6 P-R interval: Average duration between onset of P and Q waves
in msec., linear
7 Q-T interval: Average duration between onset of Q and offset
of T waves in msec., linear
8 T interval: Average duration of T wave in msec., linear
9 P interval: Average duration of P wave in msec., linear
Vector angles in degrees on front plane of:, linear
10 QRS
11 T
12 P
13 QRST
14 J
15 Heart rate: Number of heart beats per minute ,linear
Of channel DI:
Average width, in msec., of: linear
16 Q wave
17 R wave
18 S wave
19 R' wave, small peak just after R
20 S' wave
21 Number of intrinsic deflections, linear
22 Existence of ragged R wave, nominal
23 Existence of diphasic derivation of R wave, nominal
24 Existence of ragged P wave, nominal
25 Existence of diphasic derivation of P wave, nominal
26 Existence of ragged T wave, nominal
27 Existence of diphasic derivation of T wave, nominal
Of channel DII:
28 .. 39 (similar to 16 .. 27 of channel DI)
Of channel DIII:
40 .. 51
Of channel AVR:
52 .. 63
Of channel AVL:
64 .. 75
Of channel AVF:
76 .. 87
Of channel V1:
88 .. 99
Of channel V2:
100 .. 111
Of channel V3:
112 .. 123
Of channel V4:
124 .. 135
Of channel V5:
136 .. 147
Of channel V6:
148 .. 159
Of channel DI:
Amplitude, * 0.1 millivolt, of
160 JJ wave, linear
161 Q wave, linear
162 R wave, linear
163 S wave, linear
164 R' wave, linear
165 S' wave, linear
166 P wave, linear
167 T wave, linear
168 QRSA , Sum of areas of all segments divided by 10,
( Area= width * height / 2 ), linear
169 QRSTA = QRSA + 0.5 * width of T wave * 0.1 * height of T
wave. (If T is diphasic then the bigger segment is
considered), linear
Of channel DII:
170 .. 179
Of channel DIII:
180 .. 189
Of channel AVR:
190 .. 199
Of channel AVL:
200 .. 209
Of channel AVF:
210 .. 219
Of channel V1:
220 .. 229
Of channel V2:
230 .. 239
Of channel V3:
240 .. 249
Of channel V4:
250 .. 259
Of channel V5:
260 .. 269
Of channel V6:
270 .. 279
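The numbering above follows a regular pattern: 15 general attributes, then 12 width and morphology attributes for each of the 12 channels (16 to 159), then 10 amplitude attributes for each channel (160 to 279). The sketch below generates one label per attribute from that pattern. All label strings are hypothetical names chosen for this example; the source defines only the numbering and the descriptions.

```python
# Hypothetical short labels for attributes 1..15.
GENERAL = [
    "age", "sex", "height", "weight", "qrs_duration", "pr_interval",
    "qt_interval", "t_interval", "p_interval", "qrs_angle", "t_angle",
    "p_angle", "qrst_angle", "j_angle", "heart_rate",
]
CHANNELS = ["DI", "DII", "DIII", "AVR", "AVL", "AVF",
            "V1", "V2", "V3", "V4", "V5", "V6"]
# 12 attributes per channel (16..27 for DI).
WIDTH_FIELDS = [
    "q_width", "r_width", "s_width", "r_prime_width", "s_prime_width",
    "n_intrinsic_deflections", "ragged_r", "diphasic_r",
    "ragged_p", "diphasic_p", "ragged_t", "diphasic_t",
]
# 10 attributes per channel (160..169 for DI).
AMPLITUDE_FIELDS = [
    "jj_amp", "q_amp", "r_amp", "s_amp", "r_prime_amp", "s_prime_amp",
    "p_amp", "t_amp", "qrsa", "qrsta",
]


def attribute_names():
    """Return labels ordered so that names[i - 1] is attribute number i."""
    names = list(GENERAL)
    for fields in (WIDTH_FIELDS, AMPLITUDE_FIELDS):
        for channel in CHANNELS:
            names.extend(f"{channel}_{field}" for field in fields)
    return names


names = attribute_names()
print(len(names), names.index("DII_q_width") + 1, names.index("V6_jj_amp") + 1)
```

The block sizes add up to 15 + 12 * 12 + 12 * 10 = 279, matching the attribute count given for this database.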
Class code - class - number of instances:
01 Normal 245
02 Ischemic changes (Coronary Artery Disease) 44
03 Old Anterior Myocardial Infarction 15
04 Old Inferior Myocardial Infarction 15
05 Sinus tachycardy 13
06 Sinus bradycardy 25
07 Ventricular Premature Contraction (PVC) 3
08 Supraventricular Premature Contraction 2
09 Left bundle branch block 9
10 Right bundle branch block 50
11 1. degree AtrioVentricular block 0
12 2. degree AV block 0
13 3. degree AV block 0
14 Left ventricule hypertrophy 4
15 Atrial Fibrillation or Flutter 5
16 Others 22
Collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. Datasets approved for the repository are assigned a Digital Object Identifier (DOI) if they do not already have one. Datasets are licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset Card for Online Shoppers Purchasing Intention Dataset
Dataset Summary
This dataset is a reupload of the Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository.
NOTE: The information below is from the original dataset description from UCI's website.
Overview
Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1908) were positive class samples… See the full description on the dataset page: https://huggingface.co/datasets/jlh/uci-shopper.
https://academictorrents.com/nolicensespecified
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The dataset contains the list of databases that can be found in the UCI web repository.
https://academictorrents.com/nolicensespecified
Source: The leaves were taken from plants in the farm of the University of Mauritius and nearby locations.
Donors:
Trishen Munisami trishen.munisami @ gmail.com
Mahess Ramsurn ramsurn.mahess @ umail.uom.ac.mu
Somveer Kishnah s.kishnah @ uom.ac.mu
Sameerchand Pudaruth sameerchand.pudaruth @ gmail.com
Data Set Information:
- The leaves were placed on a white background and then photographed.
- The pictures were taken in broad daylight to ensure optimum light intensity.
Attribute Information:
List of plant species:
1. Beaumier du perou
2. Eggplant
3. Fruitcitere
4. Guava
5. Hibiscus
6. Betel
7. Rose
8. Chrysanthemum
9. Ficus
10. Duranta gold
11. Ashanti blood
12. Bitter Orange
13. Coeur Demoiselle
14. Jackfruit
15. Mulberry Leaf
16. Pimento
17. Pomme Jacquot
18. Star Apple
19. Barbados Cherry
20. Sweet Olive
21. Croton
22. Thevetia
23. Vieux Garcon
24. Chocolate tree
25. Carricature plant
26. Coffee
27. Ketembilla
28. Chinese guava
29. Lychee
30. Geranium
31. Sweet potato
32. Papa
The dataset was collected over 60 days and is a real database from a Brazilian logistics company. It has twelve predictive attributes and a target, which is the total number of orders for daily treatment. The database was used in academic research at the Universidade Nove de Julho.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Comparison of decision tree dimensions on 40 UCI datasets including the number of leaves.
The UCI Heart Disease Dataset with 14 key attributes for machine learning & research. Ideal for predictive modeling.
The SMS Spam Collection is a public set of SMS labeled messages that have been collected for mobile phone spam research.
Estimation of obesity levels based on eating habits and physical condition Data Set
Abstract: This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.
Data Set Characteristics: Multivariate
Number of Instances: 2111
Area: Life
Attribute Characteristics: Integer
Number of Attributes: 17
Date Donated: 2019-08-27
Associated Tasks: Classification, Regression, Clustering
Missing Values? N/A
Number of Web Hits: 70843
Fabio Mendoza Palechor, Email: fmendoza1 '@' cuc.edu.co, Cell phone: +573182929611
Alexis de la Hoz Manotas, Email: akdelahoz '@' gmail.com, Cell phone: +573017756983
This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.
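For readers unfamiliar with SMOTE, the sketch below illustrates its core idea: a synthetic record is placed at a random point on the line between a minority-class record and one of its nearest minority-class neighbours. This is a simplified pure-Python illustration with made-up points and hypothetical function names, not the Weka filter the authors used, and it handles numeric features only.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Made-up minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because each synthetic point is an interpolation between two existing records, it always lies within the range spanned by the original minority samples.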
Read the article ([Web Link]) to see the description of the attributes.
[1] Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.
[2] De-La-Hoz-Correa, E., Mendoza Palechor, F., De-La-Hoz-Manotas, A., Morales Ortega, R., & Sánchez Hernández, A. B. (2019). Obesity level estimation software based on decision trees.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
The algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets with varying degrees of class imbalance. This benchmark was proposed in [1]. The following section presents its main characteristics.
ID | Name | Repository & Target | Ratio | # samples | # features |
---|---|---|---|---|---|
1 | Ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
2 | Optical Digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
3 | SatImage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
4 | Pen Digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
5 | Abalone | UCI, target: 7 | 9.7:1 | 4,177 | 8 |
6 | Sick Euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 25 |
7 | Spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
8 | Car_Eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 6 |
9 | ISOLET | UCI, target: A, B | 12:1 | 7,797 | 617 |
10 | US Crime | UCI, target: >0.65 | 12:1 | 1,994 | 122 |
11 | Yeast_ML8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
12 | Scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
13 | Libras Move | UCI, target: 1 | 14:1 | 360 | 90 |
14 | Thyroid Sick | UCI, target: sick | 15:1 | 3,772 | 28 |
15 | Coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
16 | Arrhythmia | UCI, target: 06 | 17:1 | 452 | 279 |
17 | Solar Flare M0 | UCI, target: M->0 | 19:1 | 1,389 | 10 |
18 | OIL | UCI, target: minority | 22:1 | 937 | 49 |
19 | Car_Eval_4 | UCI, target: vgood | 26:1 | 1,728 | 6 |
20 | Wine Quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
21 | Letter Img | UCI, target: Z | 26:1 | 20,000 | 16 |
22 | Yeast_ME2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
23 | Webpage | LIBSVM, w7a, target: minority | 33:1 | 49,749 | 300 |
24 | Ozone Level | UCI, ozone, data | 34:1 | 2,536 | 72 |
25 | Mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
26 | Protein homo. | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
27 | Abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 8 |
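
The Ratio column can be reproduced from class counts. The sketch below assumes the ratio is the number of all remaining samples divided by the number of target-class samples, and checks that reading against the Arrhythmia row using the class counts listed for the Cardiac Arrhythmia Database earlier in this document. The function and variable names are illustrative.

```python
from collections import Counter


def imbalance_ratio(labels, target):
    """Samples outside the target class divided by samples inside it."""
    counts = Counter(labels)
    minority = counts[target]
    majority = sum(counts.values()) - minority
    return majority / minority


# Arrhythmia class code -> number of instances, as listed earlier.
arrhythmia_counts = {
    1: 245, 2: 44, 3: 15, 4: 15, 5: 13, 6: 25, 7: 3, 8: 2,
    9: 9, 10: 50, 11: 0, 12: 0, 13: 0, 14: 4, 15: 5, 16: 22,
}
labels = [code for code, n in arrhythmia_counts.items() for _ in range(n)]

ratio = imbalance_ratio(labels, target=6)
print(f"{len(labels)} samples, ratio {ratio:.1f}:1")
# prints "452 samples, ratio 17.1:1"
```

Both the sample count and the rounded ratio agree with row 16 of the table (452 samples, 17:1), which supports this interpretation of the column for that row.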
[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.