UCI Machine Learning Repository is a collection of over 550 datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
biology
Collection of databases, domain theories, and data generators that are used by machine learning community for empirical analysis of machine learning algorithms. Datasets approved to be in the repository will be assigned Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0) which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
This CSV contain a data set prepared for the use of participants for the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.
Original files were obtained from: https://archive.ics.uci.edu/ml/datasets/diabetes
Archived file diabetes-data.tar.z which contains 70 sets of data recorded on diabetes patients (several weeks' to months' worth of glucose, insulin, and lifestyle data per patient + a description of the problem domain) is extracted and processed and merged as a CSV file.
The Code field of the CSV is deciphered as follows:
33 = Regular insulin dose 34 = NPH insulin dose 35 = UltraLente insulin dose 48 = Unspecified blood glucose measurement 57 = Unspecified blood glucose measurement 58 = Pre-breakfast blood glucose measurement 59 = Post-breakfast blood glucose measurement 60 = Pre-lunch blood glucose measurement 61 = Post-lunch blood glucose measurement 62 = Pre-supper blood glucose measurement 63 = Post-supper blood glucose measurement 64 = Pre-snack blood glucose measurement 65 = Hypoglycemic symptoms 66 = Typical meal ingestion 67 = More-than-usual meal ingestion 68 = Less-than-usual meal ingestion 69 = Typical exercise activity 70 = More-than-usual exercise activity 71 = Less-than-usual exercise activity 72 = Unspecified special event
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
The original ionosphere dataset from UCI machine learning repository is a binary classification dataset with dimensionality 34. There is one attribute having values all zeros, which is discarded. So the total number of dimensions are 33. The ‘bad’ class is considered as outliers class and the ‘good’ class as inliers.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding
Size: Two datasets of sizes 9471*17 and 2458285*68 correspondingly
Number of features: 15-68
Ground truth: No
Type of Graph: No ground truth
More information about the datasets is contained in the dataset_description.html files.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets from Scikit-learn are: ‘Iris’, ‘Wine’, ‘Breast Cancer Wisconsin (Diagnostic)’. Datasets from UCI repository are: ‘Seeds’ ‘Banknote Authentication’ (‘Banknotes’), ‘Heart disease’ ‘ Parkinsons ‘, ‘Ecoli’, ‘Thyroid (Thyroid gland data)’
Datasets available at UCI Machine Learning Repository and other repositories. List of datasets used in the experiment with their sources. ForestCover dataset @ https://archive.ics.uci.edu/ml/datasets/Covertype KDD Cup99 dataset @ https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data PAMAP dataset @ https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring Powersupply @ http://www.cse.fau.edu/~xqzhu/stream.html SEA @ http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift Syn002 & Syn003 (generated) @ http://moa.cms.waikato.ac.nz/details/classification/streams/ MNIST @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html News20 @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by MD.Romzan Alom
Released under MIT
Classification dataset of 90 molecules according to its effects in the circadian rhythm from the UCI Machine Learning Repository.
This dataset was created by Nagaveda Reddy
This notebook was created for analysis and prediction making of the Default of credit card clients Data Set from UCI Machine Learning Library. The data set can be accessed separately from the UCI Machine Learning Repository page, here.
In their paper "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. (Yeh I. C. & Lien C. H.,2009)", which can be found here, Yeh I. C. & Lien C. H. review six data mining techniques (discriminant analysis, logistic regression, Bayesclassifier, nearest neighbor, artificial neural networks, and classification trees) and their applications on credit scoring. Then, using the real cardholders’ credit risk data in Taiwan, they compare the classification accuracy among them.
We will create 3 models in order to make predictions and compare them with the original paper. These models are: - Logistic Regression - Decision tree - Neural Network
After the initial predictions, each model will be "optimized" by GridSearchCV
estimator, which will search for the best set of hyperparameters for every model.
Using the models we created, we will try to predict the class value of dpnm
column with better scores (accuracy and f1) than the scores presented in the original paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two preprocessed datasets collected from the UCI repository that can be used for the purpose of structure learning from multivariate data of different types.
Air Quality
This dataset represents hourly averaged measurements of 5 metal oxide chemical sensors embedded in an air quality chemical multisensor device. The certified analyzer was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses [1]. More information about the attributes and their type can be found in airqualitydataset_description.html.
Size of dataset: 9358
Number of Features: 16
Type of data: discrete and continuous
Ground Truth: No
Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. There are 15 attributes. Date and Time as well as discrete and real covariates.
0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
US Census (1990)
This dataset is a discretized version of the USCensus1990raw dataset. The data was collected as part of the 1990 census, and it describes one percent sample of the Public Use Microdata Samples (PUMS) person records drawn from the full 1990 census sample (all fifty states and the District of Columbia but not including "PUMA Cross State Lines One Percent Persons Records") [2]. More information about the attributes and their type can be found in census1990_description.html.
Size of dataset: 2458285
Number of features: 68
Ground truth: No
References:
[1] S. De Vito and E. Massera and M. Piga and L. Martinotto and G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005 https://doi.org/10.1016/j.snb.2007.09.060
[2] Meek, Thiesson and Heckerman (2001), "The Learning Curve Method Applied to Clustering",The Journal of Machine Learning Research. (Also see MSR-TR-2001-34 available athttps://www.microsoft.com/en-us/research/wp-content/uploads/2001/01/lc-aistats.pdf)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Online Shoppers Purchasing Intention Dataset
Dataset Summary
This dataset is a reupload of the Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository.
NOTE: The information below is from the original dataset description from UCI's website.
Overview
Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1908) were positive class samples… See the full description on the dataset page: https://huggingface.co/datasets/jlh/uci-shopper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
The traffic data are collected with the loop sensor installed on ramp for the 101 North freeway in Los Angeles. This location is close to Dodgers Stadium; therefore the traffic is affected by volume of visitors to the stadium. Missing values are represented with NaN. - Class 1: Normal Day - Class 2: Game Day There is nothing to infer from the order of examples in the train and test set. Missing values are represented with NaN in the text file. Data created by Ihler, Alexander, Jon Hutchins, and Padhraic Smyth (see [1][2][3]). Data edited by Chin-Chia Michael Yeh.
[1] Ihler, Alexander, Jon Hutchins, and Padhraic Smyth. "Adaptive event detection with time-varying poisson processes." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
[2] “UCI Machine Learning Repository: Dodgers Loop Sensor Data Set.” UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets/dodgers+loop+sensor.
[3] “Caltrans PeMS.” Caltrans, pems.dot.ca.gov/.
Donator: C. Yeh
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is related to direct marketing campaigns conducted by a Portuguese banking institution, with campaigns relying on phone calls. Often multiple contacts with the same client were necessary to determine whether they would subscribe ('yes') or not ('no') to a bank term deposit. The dataset includes four files:
The smaller subsets are designed for testing computationally intensive machine learning algorithms (e.g., SVM). The primary classification objective is to predict whether a client will subscribe to a term deposit ('yes' or 'no'), based on the target variable y.
UCI Machine Learning Repository is a collection of over 550 datasets.