CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
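The binning step mentioned above can be illustrated with a minimal Python sketch. The thresholds and target values below are hypothetical placeholders; the actual bin edges are defined by the provided extraction script, not by this example.

```python
from bisect import bisect_right


def bin_to_ordinal(value, thresholds):
    """Return the 1-based ordinal class of `value` for ascending bin thresholds."""
    # bisect_right counts how many thresholds are <= value,
    # so a value equal to a threshold falls into the higher bin.
    return bisect_right(thresholds, value) + 1


# Hypothetical thresholds and regression targets, for illustration only.
thresholds = [10.0, 20.0, 30.0]
targets = [3.2, 10.0, 19.9, 27.5, 41.0]
labels = [bin_to_ordinal(t, thresholds) for t in targets]
print(labels)
```

Three thresholds yield four ordered classes, so neighboring classes correspond to neighboring ranges of the original regression label.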
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
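As a rough illustration of APP, the following Python sketch draws a label distribution uniformly at random from the probability simplex and then draws item indices according to it. The function names, the sampling with replacement, and the toy labels are assumptions made for this example; the published index files remain the authoritative source for replication. The smoothness filter of APP-OQ is not implemented here because the smoothness measure is not specified in this description.

```python
import random


def sample_prevalences(n_classes, rng):
    """Draw a class distribution uniformly from the probability simplex."""
    # Normalized i.i.d. exponential draws are uniformly distributed on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def draw_app_sample(labels, sample_size, rng):
    """Return (prevalences, item indices) for one APP-style sample."""
    classes = sorted(set(labels))
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    prevalences = sample_prevalences(len(classes), rng)
    # Pick a class for every slot according to the prevalences,
    # then pick one item of that class (with replacement; an assumption).
    chosen = rng.choices(classes, weights=prevalences, k=sample_size)
    return prevalences, [rng.choice(by_class[c]) for c in chosen]


rng = random.Random(0)
toy_labels = [1, 1, 2, 2, 3, 3]  # hypothetical ordinal labels
prevalences, indices = draw_app_sample(toy_labels, 10, rng)
print(prevalences, indices)
```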
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
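Once a CSV file has been extracted, its label distribution (the quantity that a quantifier is meant to predict) can be computed as in the sketch below. The inline CSV text and the `feature_1` column are hypothetical; only the `class_label` column name is taken from the description above.

```python
import csv
import io
from collections import Counter


def class_prevalences(csv_text):
    """Return the relative frequency of each value in the class_label column."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first row is the header
    counts = Counter(row["class_label"] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


# Hypothetical file content; a real file would be read from disk with open().
example = "class_label,feature_1\n1,0.5\n2,0.1\n2,0.7\n3,0.9\n"
print(class_prevalences(example))
```

Note that `csv.DictReader` returns the labels as strings; convert them with `int` if numeric ordering matters.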
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
This dataset was created by Nagaveda Reddy
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Basic information on 40 datasets from UCI repository used in this study including information about number of instances, attributes, classes, length of longest attribute name (LAN) and length of the longest nominal attribute value (LAV).
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by Machine Learning researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
Source: https://archive.ics.uci.edu/ml/datasets/heart+disease
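The presence-versus-absence task described above amounts to a simple binarization of the "goal" field. A minimal sketch, with a hypothetical function name:

```python
def has_heart_disease(goal):
    """Map the integer goal field (0 to 4) to a binary presence label."""
    if goal not in range(5):
        raise ValueError(f"goal must be an integer from 0 to 4, got {goal!r}")
    # 0 means absence; 1, 2, 3, and 4 all count as presence.
    return goal > 0


binary = [has_heart_disease(g) for g in [0, 1, 2, 3, 4]]
print(binary)
```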
The original ionosphere dataset from the UCI Machine Learning Repository is a binary classification dataset with 34 dimensions. One attribute is zero for every instance and is discarded, leaving 33 dimensions. The ‘bad’ class is treated as the outlier class and the ‘good’ class as the inlier class.
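The preprocessing described above can be sketched as follows. The function names and the toy rows are hypothetical, and the class strings follow the wording used here rather than the raw file encoding.

```python
def drop_all_zero_columns(rows):
    """Remove every column whose value is zero in all rows."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if any(row[j] != 0 for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def outlier_labels(classes):
    """Encode the 'bad' class as outlier (1) and every other class as inlier (0)."""
    return [1 if c == "bad" else 0 for c in classes]


# Toy data: the second column is zero everywhere and gets dropped.
rows = [[1.0, 0.0, 0.3], [0.0, 0.0, -0.2], [0.5, 0.0, 0.0]]
reduced = drop_all_zero_columns(rows)
labels = outlier_labels(["good", "bad", "good"])
print(reduced, labels)
```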
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Kukku
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1,908) were positive class samples ending with shopping.
The dataset consists of 10 numerical and 8 categorical attributes. The 'Revenue' attribute can be used as the class label.
The dataset contains 18 columns, each representing specific attributes of online shopping behavior:
- Administrative and Administrative_Duration: Number of pages visited and time spent on administrative pages.
- Informational and Informational_Duration: Number of pages visited and time spent on informational pages.
- ProductRelated and ProductRelated_Duration: Number of pages visited and time spent on product-related pages.
- BounceRates and ExitRates: Metrics indicating user behavior during the session.
- PageValues: Value of the page based on e-commerce metrics.
- SpecialDay: Likelihood of shopping based on special days.
- Month: Month of the session.
- OperatingSystems, Browser, Region, TrafficType: Technical and geographical attributes.
- VisitorType: Categorizes users as returning, new, or others.
- Weekend: Indicates if the session occurred on a weekend.
- Revenue: Target variable indicating whether a transaction was completed (True or False).
The original dataset was taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset
Additional Variable Information
"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another. The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site. The value of the "Bounce Rate" feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews to that page that were the last in the session. The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction. The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) in which the sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for Valentine's Day, this value takes a nonzero value between February 2 and February 12, zero before and after this date unless it is close to another special day, and its maximum value of 1 on February 8. The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.
This dataset was created by somaktukai
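To make the "Bounce Rate" and "Exit Rate" definitions given in the variable information more concrete, the sketch below computes both for a page from a list of sessions. It is a simplification under stated assumptions: each session is modeled as an ordered list of page names, and a bounce is approximated as a session with exactly one pageview. It is not the way Google Analytics computes these metrics internally, and the page names are hypothetical.

```python
def bounce_rate(sessions, page):
    """Percentage of sessions entering at `page` that contain no other pageview."""
    entries = [s for s in sessions if s and s[0] == page]
    if not entries:
        return 0.0
    bounces = sum(1 for s in entries if len(s) == 1)
    return 100.0 * bounces / len(entries)


def exit_rate(sessions, page):
    """Percentage of all pageviews of `page` that were the last in their session."""
    views = sum(s.count(page) for s in sessions)
    if views == 0:
        return 0.0
    exits = sum(1 for s in sessions if s and s[-1] == page)
    return 100.0 * exits / views


# Hypothetical sessions, each an ordered list of visited pages.
sessions = [["home"], ["home", "product"], ["product", "home"], ["product"]]
print(bounce_rate(sessions, "home"), exit_rate(sessions, "home"))
```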
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Details of the datasets from UCI repository used in the experiments.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was downloaded from the UCI repository: https://archive.ics.uci.edu/ml/datasets/nursery. It contains categorical data used to rank nursery school applicants. The original dataset contains 5 classes, which were reorganized into two classes ("recommended" or "not recommended").
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
The traffic data were collected with a loop sensor installed on a ramp for the 101 North freeway in Los Angeles. This location is close to Dodgers Stadium; therefore the traffic is affected by the volume of visitors to the stadium.
- Class 1: Normal Day
- Class 2: Game Day
There is nothing to infer from the order of examples in the train and test set. Missing values are represented with NaN in the text file. Data created by Ihler, Alexander, Jon Hutchins, and Padhraic Smyth (see [1][2][3]). Data edited by Chin-Chia Michael Yeh.
[1] Ihler, Alexander, Jon Hutchins, and Padhraic Smyth. "Adaptive event detection with time-varying poisson processes." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
[2] “UCI Machine Learning Repository: Dodgers Loop Sensor Data Set.” UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets/dodgers+loop+sensor.
[3] “Caltrans PeMS.” Caltrans, pems.dot.ca.gov/.
Donator: C. Yeh
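Because missing values in this dataset are encoded as NaN, they need explicit handling before computing statistics. The sketch below assumes a hypothetical whitespace-separated line of readings; the actual file layout of the archive release may differ.

```python
import math


def parse_series(line):
    """Parse whitespace-separated readings; float() accepts the token 'NaN'."""
    return [float(token) for token in line.split()]


def mean_of_observed(values):
    """Average the readings while skipping missing (NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed) if observed else math.nan


# Hypothetical readings, for illustration only.
values = parse_series("12 NaN 7 NaN 5")
n_missing = sum(1 for v in values if math.isnan(v))
print(n_missing, mean_of_observed(values))
```

Skipping NaN entries is only one option; imputation or interpolation may be preferable depending on the classifier.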
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection consists of six multi-label datasets from the UCI Machine Learning Repository.
Each dataset contains missing values which have been artificially added at the following rates: 5, 10, 15, 20, 25, and 30%. The “amputation” was performed using the “Missing Completely at Random” mechanism.
File names are represented as follows:
amp_DB_MR.arff
where:
DB = original dataset;
MR = missing rate.
For more details, please read:
IEEE Access article (in review process)
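The "Missing Completely at Random" amputation described above can be sketched as follows: every cell is removed independently with the same probability, regardless of its value or of any other attribute. This is an illustration under assumptions, not the procedure used to produce the files. In particular, it yields the missing rate only in expectation, whereas the original amputation may have enforced exact rates. The function name and the use of `None` as the missing marker are hypothetical.

```python
import random


def ampute_mcar(rows, missing_rate, rng):
    """Replace each cell with None independently with probability `missing_rate`."""
    return [
        [None if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]


rng = random.Random(0)
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # hypothetical data
amputed = ampute_mcar(rows, 0.20, rng)  # 20% is one of the listed rates
print(amputed)
```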
The dataset was downloaded from the UCI repository: http://archive.ics.uci.edu/ml/datasets/gas+sensors+for+home+activity+monitoring. It contains only two classes, banana and wine; the background class is not included. The dataset is ordered sequentially by ID.
This dataset was created by Gaurav Sharma
https://choosealicense.com/licenses/cc/
Heart
The Heart dataset from the UCI ML repository. Does the patient have heart disease?
Configurations and tasks
Configuration Task
hungary Binary classification
Usage
from datasets import load_dataset
dataset = load_dataset("mstz/heart", "hungary")["train"]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the publicly-available UCI Machine Learning Repository datasets used for method comparison.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset was downloaded from the UCI repository: https://archive.ics.uci.edu/ml/datasets/covertype. The class label takes values from 1 to 7, one for each forest cover type. The task is to predict the forest cover type from cartographic variables only (no remotely sensed data).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/
The goal of this dataset is to predict total reactive power consumption in a household. This dataset contains 1440 time series obtained from the Individual household electric power consumption dataset from the UCI repository. Each time series has 5 dimensions: measurements of voltage, current, and 3 sub-metering energy usages.
Please refer to https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption for more details
Source Georges Hebrail (georges.hebrail '@' edf.fr), Senior Researcher, EDF R&D, Clamart, France Alice Berard, TELECOM ParisTech Master of Engineering Internship at EDF R&D, Clamart, France