56 datasets found

UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 25, 2023
Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
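
As a rough illustration of how ordinal labels can arise from binning a continuous target, the sketch below assigns each value to one of several equal-frequency bins. The equal-frequency rule, the helper name `bin_to_ordinal`, and the number of bins are assumptions made for illustration; the provided scripts may bin differently.

```python
from bisect import bisect_right

def bin_to_ordinal(values, n_bins=5):
    """Map continuous values to ordinal classes 1..n_bins (hypothetical helper).

    Bin edges are empirical quantiles, so the bins hold roughly equal counts.
    """
    ordered = sorted(values)
    # Interior edges at the 1/n_bins, 2/n_bins, ... quantiles.
    edges = [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]
    # A value equal to an edge falls into the higher class.
    return [bisect_right(edges, v) + 1 for v in values]

# Made-up regression targets, for illustration only.
labels = bin_to_ordinal([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.7, 0.5, 0.05], n_bins=5)
```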

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
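
Replicating a sample from one row of an index file could look roughly like the sketch below. It assumes that each row is a comma-separated list of item indices without a header and that the indices are one-based (as is common in Julia); neither detail is stated in the description, so `index_base` is left as a parameter. The helper name `draw_samples` is hypothetical.

```python
import csv
import io

def draw_samples(index_csv_text, data_rows, index_base=1):
    """Yield one list of data items per row of an index file (sketch)."""
    for row in csv.reader(io.StringIO(index_csv_text)):
        # index_base=1 is an assumption; pass 0 if the files are zero-based.
        yield [data_rows[int(i) - index_base] for i in row if i.strip()]

# Made-up data items and index rows, for illustration only.
data = ["a", "b", "c", "d"]
samples = list(draw_samples("1,3,3\n2,4,1\n", data))
```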

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
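
The two protocols can be sketched as follows. This is a minimal illustration, not the authors' implementation: label distributions are drawn uniformly from the probability simplex, and smoothness is scored with a sum of squared second differences, which is an assumed measure chosen for this example. The function names are hypothetical.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw a label distribution uniformly from the probability simplex."""
    # Normalized i.i.d. exponentials follow a flat Dirichlet(1, ..., 1).
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def roughness(p):
    """Sum of squared second differences; smaller means smoother (assumed measure)."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

rng = random.Random(0)
# APP: every label distribution is equally likely.
app = [sample_prevalences(5, rng) for _ in range(1000)]
# APP-OQ-style filter: keep only the smoothest 20% of the APP draws.
app_oq = sorted(app, key=roughness)[: len(app) // 5]
```

Each retained distribution would then determine how many items of each class are drawn into one evaluation sample.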

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
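
Given that layout, the class prevalences of one extracted file (the quantity a quantifier is meant to estimate) could be computed roughly as below. Only the column name "class_label" comes from the description; the example rows and the helper name `class_prevalences` are made up for illustration.

```python
import csv
import io
from collections import Counter

def class_prevalences(csv_text):
    """Return {class_label: relative frequency} for one extracted CSV (sketch)."""
    reader = csv.DictReader(io.StringIO(csv_text))  # the first row is the header
    counts = Counter(row["class_label"] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}

# Tiny made-up file; labels are read as strings.
example = "class_label,feature_1\n1,0.3\n2,0.7\n2,0.1\n3,0.9\n"
prevalences = class_prevalences(example)
```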

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Regression and Survival Data Sets - Dataset - LDM
service.tib.eu
resodate.org
Updated Jan 3, 2025
Cite
(2025). Regression and Survival Data Sets - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/regression-and-survival-data-sets
Dataset updated
Jan 3, 2025
Description
The dataset used in the paper is a collection of 48 regression and 35 survival data sets from the UCI repository.
Wine Quality by UCI
kaggle.com
zip
Updated Apr 9, 2020
Cite
Huseyin ELCI (2020). Wine Quality by UCI [Dataset]. https://www.kaggle.com/huseyinelci/wne-qualty-by-uci
Available download formats: zip (101809 bytes)
Dataset updated
Apr 9, 2020
Authors
Huseyin ELCI
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The two datasets relate to the red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, or wine selling price).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones).
This dataset is also available from the UCI machine learning repository; I shared it on Kaggle only for convenience. If I am mistaken and the public license does not allow this, I will remove the dataset if requested and notified.

Content

For more information, please read [Cortez et al., 2009].

Number of Instances:

Tables Count
Red Wine 1599
White Wine 4898

Number of Attributes:

11 + output attribute.

Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data): 12. quality (score between 0 and 10)

Acknowledgements

This dataset is also available from the UCI machine learning repository.
I shared it on Kaggle only for convenience. If I am mistaken and the public license does not allow this, I will remove the dataset if requested and notified. I am not the owner of this dataset. If you plan to use this database in an article, in research, or elsewhere, you must read the main source in the UCI machine learning repository.

Inspiration - Relevant Papers:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Additional information about wine: For a good evaluation, I recommend learning a little more about wine; Wikipedia is a good starting point. Source 1: Acids in Wine | Source 2: Chemistry of Wine
Data from: Dataset Description.
figshare.com
xls
Updated Nov 7, 2025
Cite
Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal (2025). Dataset Description. [Dataset]. http://doi.org/10.1371/journal.pone.0336241.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0336241.t002
Dataset updated
Nov 7, 2025
Dataset provided by
PLOS ONE
Authors
Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Urban air pollution remains a critical challenge for public health and environmental sustainability. This study investigates the predictive capabilities of five machine learning (ML) models: Linear Regression (LR), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR) for forecasting the Air Quality Index (AQI) using the widely adopted Air Quality dataset from the UCI ML Repository. Although collected in 2004–2005, the dataset continues to serve as a benchmark in recent literature and provides a reproducible testbed for methodological evaluation. After structured pre-processing, feature engineering, and chronological train–validation–test splitting, models were rigorously tuned and assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R2), with 95% bootstrap confidence intervals and corrected resampled t-tests confirming statistical significance. Ensemble models achieved the best performance, with Random Forest obtaining the lowest RMSE (12.48) and MAE (9.35), and XGBoost achieving the highest R2 (0.89). Feature importance analysis identified NOx, PM2.5, and CO as the most influential predictors. We incorporated Shapley Additive exPlanations (SHAP) analyses and case-level visualizations to support interpretability, providing transparent insights for practical decision-making. While the study is limited by the absence of external validation and genetic variables (e.g., APOE), it establishes a reproducible, interpretable, and computationally efficient ML framework for AQI forecasting. The findings highlight the continuing relevance of benchmark datasets for reproducible evaluation and demonstrate the potential of interpretable ML-based approaches for smart city air quality management and public health policy.
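
For reference, the three error measures named in this description can be written out as in the sketch below. These are the standard textbook definitions, not code from the study, and the example values are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```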


Breast Cancer Wisconsin (Prognostic) Data Set

kaggle.com

zip

Updated Mar 31, 2017

Cite

Sarah VCH (2017). Breast Cancer Wisconsin (Prognostic) Data Set [Dataset]. https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set

Available download formats: zip (49800 bytes)

Dataset updated

Mar 31, 2017

Authors

Sarah VCH

License

http://opendatacommons.org/licenses/dbcl/1.0/

Description

Context

Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names

Content

"Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

The first 30 features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/

The separation described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

The Recurrence Surface Approximation (RSA) method is a linear
programming model which predicts Time To Recur using both
recurrent and nonrecurrent cases. See references (i) and (ii)
above for details of the RSA method. 

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WPBC/

1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)"
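
To make a few of the listed definitions concrete, the sketch below computes radius, perimeter, area, and compactness for a closed polygon that stands in for a nucleus contour. The polygon representation, the use of the vertex centroid as the center, and the helper name `nucleus_features` are assumptions for illustration; the original image-analysis procedure is not described here.

```python
import math

def nucleus_features(contour):
    """Compute some of the listed shape features for a closed polygon (sketch).

    contour: list of (x, y) vertices in order; the last vertex connects to the first.
    """
    n = len(contour)
    # Assumed center: the mean of the contour points.
    cx = sum(x for x, _ in contour) / n
    cy = sum(y for _, y in contour) / n
    # a) radius: mean distance from the center to the points on the perimeter.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in contour) / n
    edges = list(zip(contour, contour[1:] + contour[:1]))
    # c) perimeter: total edge length.
    perimeter = sum(math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in edges)
    # d) area via the shoelace formula.
    area = abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges)) / 2
    # f) compactness: perimeter^2 / area - 1.0
    compactness = perimeter ** 2 / area - 1.0
    return {"radius": radius, "perimeter": perimeter, "area": area, "compactness": compactness}

# Unit square as a toy contour.
features = nucleus_features([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)])
```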

Acknowledgements

Creators:

Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu

W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619

Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu

Inspiration

I'm really interested in trying out various machine learning algorithms on some real life science data.

Model performance (mean ± 95% CI) with significance testing against...
figshare.com
xls
Updated Nov 7, 2025
Cite
Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal (2025). Model performance (mean ± 95% CI) with significance testing against baselines. Bold values denote best results. [Dataset]. http://doi.org/10.1371/journal.pone.0336241.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0336241.t006
Dataset updated
Nov 7, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Model performance (mean ± 95% CI) with significance testing against baselines. Bold values denote best results.
Residual-Bayesian-Attention
huggingface.co
Updated Oct 8, 2025
Cite
Wencan Guan (2025). Residual-Bayesian-Attention [Dataset]. https://huggingface.co/datasets/guanwencan/Residual-Bayesian-Attention
Dataset updated
Oct 8, 2025
Authors
Wencan Guan
Description
This collection contains six commonly used regression datasets from the UCI Machine Learning Repository.

1. California Housing

File: california_housing.csv
Samples: 20,640
Features: 8 (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude)
Target: Median house value
Source: Scikit-learn built-in dataset

2. Household Power Consumption

File: household_power_timeseries.csv
Samples: 17,520
Features: 7 (Global active/reactive power, Voltage…

See the full description on the dataset page: https://huggingface.co/datasets/guanwencan/Residual-Bayesian-Attention.
UCI ML Parkinsons dataset
kaggle.com
zip
Updated Jul 8, 2025
Cite
Elnaz Alikarami (2025). UCI ML Parkinsons dataset [Dataset]. https://www.kaggle.com/datasets/elnazalikarami/uci-ml-parkinsons-dataset
Available download formats: zip (316796 bytes)
Dataset updated
Jul 8, 2025
Authors
Elnaz Alikarami
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Oxford Parkinson's Disease Detection Dataset UCI Machine Learning Repository

Original dataset link: https://archive.ics.uci.edu/dataset/174/parkinsons

Dataset Characteristics: Multivariate

Subject Area: Health and Medicine

Associated Tasks: Classification

Feature Type: Real

Instances: 197

Features: 22

Dataset Information

Additional Information

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

Has Missing Values? No
SGEMM - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). SGEMM - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/sgemm
Dataset updated
Dec 2, 2024
Description
The SGEMM dataset is a regression task from the UCI repository.
Household Reactive Power Consumption Dataset
data.niaid.nih.gov
Updated Mar 24, 2021
Cite
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). Household Reactive Power Consumption Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3902705
Dataset updated
Mar 24, 2021
Dataset provided by
Monash University
Authors
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict the total reactive power consumption of a household. This dataset contains 1440 time series obtained from the Individual household electric power consumption dataset from the UCI repository. Each time series has 5 dimensions: measurements of voltage, current, and 3 sub-metering energy usages.

Please refer to https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption for more details

Source:
Georges Hebrail (georges.hebrail '@' edf.fr), Senior Researcher, EDF R&D, Clamart, France
Alice Berard, TELECOM ParisTech Master of Engineering Internship at EDF R&D, Clamart, France
News Title Sentiment Dataset
zenodo.org
bin
Updated Mar 24, 2021
Cite
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). News Title Sentiment Dataset [Dataset]. http://doi.org/10.5281/zenodo.3902726
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.3902726
Dataset updated
Mar 24, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict the sentiment score of news titles. This dataset contains 83164 time series obtained from the News Popularity in Multiple Social Media Platforms dataset from the UCI repository. This is a large data set of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine. This data set is tailored for evaluative comparisons in predictive analytics tasks, although it also allows for tasks in other research areas such as topic detection and tracking, sentiment analysis in short text, first story detection or news recommendation. Each time series has 3 dimensions.

Please refer to https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms for more details

Citation request
Nuno Moniz and Luis Torgo (2018), Multi-Source Social Feedback of Online News Feeds, CoRR
Appliances Energy Dataset
data-staging.niaid.nih.gov
zenodo.org
Updated Mar 24, 2021
Cite
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). Appliances Energy Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3902636
Dataset updated
Mar 24, 2021
Dataset provided by
Monash University
Authors
Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict the total energy usage in kWh of a house. This dataset contains 138 time series obtained from the Appliances Energy Prediction dataset from the UCI repository. Each time series has 24 dimensions. These include temperature and humidity measurements of 9 rooms in a house, monitored with a ZigBee wireless sensor network, as well as weather and climate data such as temperature, pressure, humidity, wind speed, visibility and dewpoint measured at Chievres airport. The data are averaged over 10-minute periods and span 4.5 months.

Please refer to https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction for more details

Relevant papers: Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788

Citation request: Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788
Appliances Energy Dataset - Vdataset - LDM in NFDI4Energy
service.tib.eu
Updated Nov 17, 2025
Cite
(2025). Appliances Energy Dataset - Vdataset - LDM in NFDI4Energy [Dataset]. https://service.tib.eu/ldm_nfdi4energy/ldmservice/dataset/openaire_7c522fbe-7b1d-42e8-aa1c-274af3d535a3
Dataset updated
Nov 17, 2025
Description
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict the total energy usage in kWh of a house. This dataset contains 138 time series obtained from the Appliances Energy Prediction dataset from the UCI repository. Each time series has 24 dimensions. These include temperature and humidity measurements of 9 rooms in a house, monitored with a ZigBee wireless sensor network, as well as weather and climate data such as temperature, pressure, humidity, wind speed, visibility and dewpoint measured at Chievres airport. The data are averaged over 10-minute periods and span 4.5 months.

Please refer to https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction for more details

Relevant papers: Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788

Citation request: Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788
Comparative Summary of Related Studies on AQI Prediction.
plos.figshare.com
xls
Updated Nov 7, 2025
Cite
Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal (2025). Comparative Summary of Related Studies on AQI Prediction. [Dataset]. http://doi.org/10.1371/journal.pone.0336241.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0336241.t001
Dataset updated
Nov 7, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparative Summary of Related Studies on AQI Prediction.
abalone
huggingface.co
Updated Apr 5, 2023
Cite
Mattia (2023). abalone [Dataset]. https://huggingface.co/datasets/mstz/abalone
Dataset updated
Apr 5, 2023
Authors
Mattia
License
https://choosealicense.com/licenses/cc/
Description
Abalone

The Abalone dataset from the UCI ML repository. Predict the age of the given abalone.

Configurations and tasks

Configuration	Task	Description
abalone	Regression	Predict the age of the abalone.
binary	Binary classification	Does the abalone have more than 9 rings?
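
The binary task described above amounts to thresholding the ring count. A minimal sketch, assuming the ring counts are available as plain integers (the helper name `to_binary_target` and the example values are made up):

```python
def to_binary_target(rings):
    """Return 1 for abalones with more than 9 rings, else 0."""
    return [1 if r > 9 else 0 for r in rings]

# Made-up ring counts, for illustration only.
binary = to_binary_target([7, 9, 10, 15])
```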

Usage

from datasets import load_dataset

dataset = load_dataset("mstz/abalone")["train"]

Features

Target feature in bold.

Feature Type

sex [string]… See the full description on the dataset page: https://huggingface.co/datasets/mstz/abalone.
Details of the datasets from UCI repository used in the experiments.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
QingJun Song; HaiYan Jiang; Qinghui Song; XieGuang Zhao; Xiaoxuan Wu (2023). Details of the datasets from UCI repository used in the experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0184834.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0184834.t003
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
QingJun Song; HaiYan Jiang; Qinghui Song; XieGuang Zhao; Xiaoxuan Wu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Details of the datasets from UCI repository used in the experiments.
Sales Data
kaggle.com
zip
Updated Aug 11, 2023
Cite
Sky (2023). Sales Data [Dataset]. https://www.kaggle.com/datasets/yshailesh/sales-dataset-for-prediction
Available download formats: zip (7223540 bytes)
Dataset updated
Aug 11, 2023
Authors
Sky
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Sales datasets are typically proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions. The dataset is maintained on their site, where it can be found by the title "Durg Store". The data files contain two datasets: the first stores all the historical sales data, while the second contains all the store information.

Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Inspiration: Analyses for this dataset could include regression, time series, clustering, classification and more.
Climate Model Simulation Crashes Data Set
kaggle.com
zip
Updated May 27, 2020
Cite
Anala Keshava (2020). Climate Model Simulation Crashes Data Set [Dataset]. https://www.kaggle.com/analakeshava/climate-model-simulation-crashes-data-set
Available download formats: zip (133132 bytes)
Dataset updated
May 27, 2020
Authors
Anala Keshava
Description
The dataset is taken from the UCI machine learning repository. The link for the dataset is https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes. This dataset contains samples of 18 climate model input parameter values to predict climate model simulation crashes.
Energy Efficiency Data Set
kaggle.com
Updated May 12, 2022
Cite
Ujjwal Chowdhury (2022). Energy Efficiency Data Set [Dataset]. https://www.kaggle.com/datasets/ujjwalchowdhury/energy-efficiency-data-set
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 12, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ujjwal Chowdhury
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This data set was collected from the UCI Machine Learning Repository.

The data set description in UCI is as follows: "Abstract: This study looked into assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters.

We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer. "
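
The rounding mentioned at the end of the description could be sketched as below. The load values are made up, the helper name `to_class` is hypothetical, and rounding halves upward is an assumption (Python's built-in `round` rounds halves to the nearest even integer instead).

```python
import math

def to_class(value):
    """Round a real-valued load to the nearest integer class (halves round up)."""
    # math.floor(x + 0.5) avoids the round-half-to-even behavior of round().
    return math.floor(value + 0.5)

# Made-up heating-load values, for illustration only.
classes = [to_class(v) for v in [15.55, 21.33, 28.5, 10.49]]
```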
Data from: A new hybrid ensemble model with voting-based outlier detection...
figshare.com
txt
Updated Aug 11, 2020
Cite
Wenyu Zhang; Dongqi Yang; Shuai Zhang (2020). A new hybrid ensemble model with voting-based outlier detection and balanced sampling for credit scoring [Dataset]. http://doi.org/10.6084/m9.figshare.12782552.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12782552.v2
Dataset updated
Aug 11, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Wenyu Zhang; Dongqi Yang; Shuai Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Three datasets from the UC Irvine (UCI) machine learning repository, namely the Australian, German, and Japanese datasets, were adopted for the current study. The Australian credit dataset contains 690 samples, of which 307 are positive and 383 are negative, and has 15 input features. The German credit dataset contains 1000 samples, of which 700 are positive and 300 are negative, and has 21 input features. The Japanese credit dataset contains 690 samples, of which 383 are positive and 307 are negative, and has 16 input features.
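
The class balance implied by these counts can be checked with a few lines of arithmetic. The counts are taken from the description above; the variable names are chosen for illustration.

```python
# (positive, negative) sample counts as stated in the description.
credit_counts = {
    "Australian": (307, 383),
    "German": (700, 300),
    "Japanese": (383, 307),
}

# Total samples and share of positive samples per dataset.
totals = {name: pos + neg for name, (pos, neg) in credit_counts.items()}
positive_rate = {name: pos / (pos + neg) for name, (pos, neg) in credit_counts.items()}
```

The German dataset is the most imbalanced of the three (70% positive), while the Australian and Japanese datasets mirror each other.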
