56 datasets found
  1. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
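The binning step can be sketched in a few lines. This is only an illustration: the description does not say how the bins are chosen, so the equal-frequency rule, the function names `equal_frequency_edges` and `to_ordinal`, and the toy target values are all assumptions.

```python
from bisect import bisect_right

def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 cut points so that each bin holds roughly as many values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def to_ordinal(value, edges):
    """Map a continuous value to an ordinal class in 1..len(edges) + 1."""
    return bisect_right(edges, value) + 1

# Hypothetical continuous regression targets (not taken from the data sets).
targets = [3.2, 7.9, 5.1, 1.0, 9.4, 6.3, 2.8, 8.0]
edges = equal_frequency_edges(targets, n_bins=4)
labels = [to_ordinal(t, edges) for t in targets]
print(edges)   # cut points between neighboring classes
print(labels)  # one ordinal class per target value
```

Because the classes come from cut points on a single numeric axis, neighboring classes are similar by construction, which is what makes the labels ordinal.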

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
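A rough sketch of the two protocols, under stated assumptions: label distributions are drawn uniformly from the unit simplex, and "smoothness" is measured by the sum of squared second-order differences. The description above does not define the smoothness measure, so `roughness` is an assumed stand-in, and all function names here are hypothetical. Items are drawn with replacement for simplicity.

```python
import random

def sample_prevalence(n_classes, rng):
    """Draw a label distribution uniformly at random from the unit simplex."""
    # Normalized Exp(1) draws follow a flat Dirichlet(1, ..., 1) distribution.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def roughness(p):
    """Sum of squared second-order differences; smaller means smoother (assumed measure)."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

def draw_sample(indices_by_class, prevalence, size, rng):
    """Pick item indices (with replacement) so class frequencies roughly follow `prevalence`."""
    classes = list(indices_by_class)
    chosen = rng.choices(classes, weights=prevalence, k=size)
    return [rng.choice(indices_by_class[c]) for c in chosen]

rng = random.Random(0)
app = [sample_prevalence(5, rng) for _ in range(1000)]  # APP: all distributions equally likely
app_oq = sorted(app, key=roughness)[: len(app) // 5]    # APP-OQ: keep the smoothest 20%

# Hypothetical mapping from class label to the data items of that class.
items = {1: [0, 1], 2: [2, 3], 3: [4], 4: [5, 6], 5: [7]}
sample = draw_sample(items, app_oq[0], size=10, rng=rng)
```

To replicate the published samples exactly, use the provided index files rather than re-drawing; a fresh draw like this one only follows the same idea.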

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
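As a hedged illustration of how an extracted CSV file and an index file fit together, the sketch below computes the class prevalences of each sample. The file contents are made-up stand-ins; the assumptions that index rows have no header, are comma-separated, and are 1-based (as is usual in Julia) are not confirmed by the description, so check the real files first.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for an extracted data file and an index file.
data_csv = io.StringIO("class_label,x1\n1,0.3\n2,0.9\n2,1.1\n3,2.4\n")
index_csv = io.StringIO("1,2,2,4\n3,3,4,4\n")

# The first column, "class_label", holds the ordinal class of each data item.
labels = [int(row["class_label"]) for row in csv.DictReader(data_csv)]

def prevalence(sample_indices, labels, n_classes, one_based=True):
    """Relative frequency of each class among the items of one sample."""
    offset = 1 if one_based else 0
    counts = Counter(labels[i - offset] for i in sample_indices)
    return [counts[c] / len(sample_indices) for c in range(1, n_classes + 1)]

# Each row of the index file is read as one sample.
samples = [[int(i) for i in row] for row in csv.reader(index_csv)]
prevalences = [prevalence(s, labels, n_classes=3) for s in samples]
print(prevalences)
```

These per-sample prevalence vectors are the ground truth that a quantifier is expected to predict.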

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  2. Regression and Survival Data Sets - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Jan 3, 2025
    Cite
    (2025). Regression and Survival Data Sets - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/regression-and-survival-data-sets
    Dataset updated
    Jan 3, 2025
    Description

    The dataset used in the paper is a collection of 48 regression and 35 survival data sets from the UCI repository.

  3. Wine Quality by UCI

    • kaggle.com
    zip
    Updated Apr 9, 2020
    Cite
    Huseyin ELCI (2020). Wine Quality by UCI [Dataset]. https://www.kaggle.com/huseyinelci/wne-qualty-by-uci
    Available download formats: zip (101809 bytes)
    Dataset updated
    Apr 9, 2020
    Authors
    Huseyin ELCI
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The two datasets are related to the red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

    This dataset is also available from the UCI machine learning repository; I shared it on Kaggle for convenience. If I am mistaken and the public license disallows this, I will remove the dataset upon request.

    Content

    For more information, please read [Cortez et al., 2009].

    Number of Instances:

    Table        Count
    Red Wine     1599
    White Wine   4898

    Number of Attributes:

    11 input attributes + 1 output attribute.

    Input variables (based on physicochemical tests):
    1. fixed acidity
    2. volatile acidity
    3. citric acid
    4. residual sugar
    5. chlorides
    6. free sulfur dioxide
    7. total sulfur dioxide
    8. density
    9. pH
    10. sulphates
    11. alcohol

    Output variable (based on sensory data):
    12. quality (score between 0 and 10)

    Acknowledgements

    This dataset is also available from the UCI machine learning repository. I shared it on Kaggle for convenience and am not the owner of this dataset. If I am mistaken and the public license disallows this, I will remove the dataset upon request. If you plan to use this data in an article or other research, please consult the original source in the UCI machine learning repository.

    Inspiration - Relevant Papers:

    • P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
    • Additional information about wine: For a good evaluation, I recommend learning a little more about wine. Wikipedia is a good starting point. Source 1: Acids in Wine | Source 2: Chemistry of Wine
  4. Data from: Dataset Description.

    • figshare.com
    xls
    Updated Nov 7, 2025
    Cite
    Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal (2025). Dataset Description. [Dataset]. http://doi.org/10.1371/journal.pone.0336241.t002
    Available download formats: xls
    Dataset updated
    Nov 7, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Urban air pollution remains a critical challenge for public health and environmental sustainability. This study investigates the predictive capabilities of five machine learning (ML) models: Linear Regression (LR), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR) for forecasting the Air Quality Index (AQI) using the widely adopted Air Quality dataset from the UCI ML Repository. Although collected in 2004–2005, the dataset continues to serve as a benchmark in recent literature and provides a reproducible testbed for methodological evaluation. After structured pre-processing, feature engineering, and chronological train–validation–test splitting, models were rigorously tuned and assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R2), with 95% bootstrap confidence intervals and corrected resampled t-tests confirming statistical significance. Ensemble models achieved the best performance, with Random Forest obtaining the lowest RMSE (12.48) and MAE (9.35), and XGBoost achieving the highest R2 (0.89). Feature importance analysis identified NOx, PM2.5, and CO as the most influential predictors. We incorporated Shapley Additive exPlanations (SHAP) analyses and case-level visualizations to support interpretability, providing transparent insights for practical decision-making. While the study is limited by the absence of external validation and genetic variables (e.g., APOE), it establishes a reproducible, interpretable, and computationally efficient ML framework for AQI forecasting. The findings highlight the continuing relevance of benchmark datasets for reproducible evaluation and demonstrate the potential of interpretable ML-based approaches for smart city air quality management and public health policy.
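For readers unfamiliar with the three evaluation metrics named in this abstract, here is a minimal sketch of how RMSE, MAE, and the coefficient of determination are computed. The function name `regression_metrics` and the toy values are illustrative assumptions and do not reproduce the study's data or code.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired lists of true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy values, chosen only to make the arithmetic easy to follow.
rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(rmse, mae, r2)
```

RMSE penalizes large errors more strongly than MAE, which is why the two can rank models differently.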

  5. Breast Cancer Wisconsin (Prognostic) Data Set

    • kaggle.com
    zip
    Updated Mar 31, 2017
    Cite
    Sarah VCH (2017). Breast Cancer Wisconsin (Prognostic) Data Set [Dataset]. https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set
    Available download formats: zip (49800 bytes)
    Dataset updated
    Mar 31, 2017
    Authors
    Sarah VCH
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names

    Content

    "Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

    The first 30 features are computed from a digitized image of a
    fine needle aspirate (FNA) of a breast mass. They describe
    characteristics of the cell nuclei present in the image.
    A few of the images can be found at
    http://www.cs.wisc.edu/~street/images/
    
    The separation described above was obtained using
    Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
    Construction Via Linear Programming." Proceedings of the 4th
    Midwest Artificial Intelligence and Cognitive Science Society,
    pp. 97-101, 1992], a classification method which uses linear
    programming to construct a decision tree. Relevant features
    were selected using an exhaustive search in the space of 1-4
    features and 1-3 separating planes.
    
    The actual linear program used to obtain the separating plane
    in the 3-dimensional space is that described in:
    [K. P. Bennett and O. L. Mangasarian: "Robust Linear
    Programming Discrimination of Two Linearly Inseparable Sets",
    Optimization Methods and Software 1, 1992, 23-34].
    
    The Recurrence Surface Approximation (RSA) method is a linear
    programming model which predicts Time To Recur using both
    recurrent and nonrecurrent cases. See references (i) and (ii)
    above for details of the RSA method. 
    
    This database is also available through the UW CS ftp server:
    
    ftp ftp.cs.wisc.edu
    cd math-prog/cpo-dataset/machine-learn/WPBC/
    

    1) ID number
    2) Outcome (R = recur, N = nonrecur)
    3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
    4-33) Ten real-valued features are computed for each cell nucleus:

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry 
    j) fractal dimension ("coastline approximation" - 1)"
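The compactness feature (f) is the only one in the list above that is given as an explicit formula, so a small worked example may help. The function name and the test shapes are illustrative assumptions; only the formula perimeter^2 / area - 1.0 comes from the description.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle of any radius gives 4*pi - 1, the smallest value any shape can have.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A square with side 2 gives 8^2 / 4 - 1 = 15; less round shapes score higher.
square = compactness(4 * 2.0, 2.0 ** 2)
print(circle, square)
```

The value does not depend on the size of the shape, only on its outline, which makes it comparable across nuclei of different sizes.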
    

    Acknowledgements

    Creators:

    Dr. William H. Wolberg, General Surgery Dept., University of
    Wisconsin, Clinical Sciences Center, Madison, WI 53792
    wolberg@eagle.surgery.wisc.edu
    
    W. Nick Street, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    street@cs.wisc.edu 608-262-6619
    
    Olvi L. Mangasarian, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    olvi@cs.wisc.edu 
    

    Inspiration

    I'm really interested in trying out various machine learning algorithms on some real life science data.

  6. Model performance (mean ± 95% CI) with significance testing against...

    • figshare.com
    xls
    Updated Nov 7, 2025
    Cite
    Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal (2025). Model performance (mean ± 95% CI) with significance testing against baselines. Bold values denote best results. [Dataset]. http://doi.org/10.1371/journal.pone.0336241.t006
    Available download formats: xls
    Dataset updated
    Nov 7, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model performance (mean ± 95% CI) with significance testing against baselines. Bold values denote best results.

  7. Residual-Bayesian-Attention

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Wencan Guan (2025). Residual-Bayesian-Attention [Dataset]. https://huggingface.co/datasets/guanwencan/Residual-Bayesian-Attention
    Dataset updated
    Oct 8, 2025
    Authors
    Wencan Guan
    Description

    This collection contains six commonly used regression datasets from the UCI Machine Learning Repository.

    1. California Housing

    File: california_housing.csv
    Samples: 20,640
    Features: 8 (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude)
    Target: Median house value
    Source: Scikit-learn built-in dataset

    2. Household Power Consumption

    File: household_power_timeseries.csv
    Samples: 17,520
    Features: 7 (Global active/reactive power, Voltage…

    See the full description on the dataset page: https://huggingface.co/datasets/guanwencan/Residual-Bayesian-Attention.

  8. UCI ML Parkinsons dataset

    • kaggle.com
    zip
    Updated Jul 8, 2025
    Cite
    Elnaz Alikarami (2025). UCI ML Parkinsons dataset [Dataset]. https://www.kaggle.com/datasets/elnazalikarami/uci-ml-parkinsons-dataset
    Available download formats: zip (316796 bytes)
    Dataset updated
    Jul 8, 2025
    Authors
    Elnaz Alikarami
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Oxford Parkinson's Disease Detection Dataset (UCI Machine Learning Repository)

    Original link: https://archive.ics.uci.edu/dataset/174/parkinsons

    Dataset Characteristics: Multivariate
    Subject Area: Health and Medicine
    Associated Tasks: Classification
    Feature Type: Real
    Instances: 197
    Features: 22

    Additional Information

    This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

    The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

    Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

    Has Missing Values?

    No

  9. SGEMM - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). SGEMM - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/sgemm
    Dataset updated
    Dec 2, 2024
    Description

    The SGEMM dataset is a regression task from the UCI repository.

  10. Household Reactive Power Consumption Dataset

    • data.niaid.nih.gov
    Updated Mar 24, 2021
    Cite
    Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). Household Reactive Power Consumption Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3902705
    Dataset updated
    Mar 24, 2021
    Dataset provided by
    Monash University
    Authors
    Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

    The goal of this dataset is to predict total reactive power consumption in a household. This dataset contains 1440 time series obtained from the Individual household electric power consumption dataset from the UCI repository. Each time series has 5 dimensions: measurements of voltage, current, and 3 sub-metering energy usages.

    Please refer to https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption for more details

    Source: Georges Hebrail (georges.hebrail '@' edf.fr), Senior Researcher, EDF R&D, Clamart, France; Alice Berard, TELECOM ParisTech Master of Engineering Internship at EDF R&D, Clamart, France

  11. News Title Sentiment Dataset

    • zenodo.org
    bin
    Updated Mar 24, 2021
    Cite
    Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). News Title Sentiment Dataset [Dataset]. http://doi.org/10.5281/zenodo.3902726
    Available download formats: bin
    Dataset updated
    Mar 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

    The goal of this dataset is to predict the sentiment score of news titles. This dataset contains 83164 time series obtained from the News Popularity in Multiple Social Media Platforms dataset from the UCI repository. This is a large data set of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine. This data set is tailored for evaluative comparisons in predictive analytics tasks, while also allowing for tasks in other research areas such as topic detection and tracking, sentiment analysis in short text, first story detection or news recommendation. Each time series has 3 dimensions.

    Please refer to https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms for more details

    Citation request
    Nuno Moniz and Luis Torgo (2018), Multi-Source Social Feedback of Online News Feeds, CoRR

  12. Appliances Energy Dataset

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2021
    Cite
    Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb (2021). Appliances Energy Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3902636
    Dataset updated
    Mar 24, 2021
    Dataset provided by
    Monash University
    Authors
    Chang Wei Tan; Christoph Bergmeir; Francois Petitjean; Geoffrey I Webb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

    The goal of this dataset is to predict the total energy usage in kWh of a house. This dataset contains 138 time series obtained from the Appliances Energy Prediction dataset from the UCI repository. Each time series has 24 dimensions. These include temperature and humidity measurements of 9 rooms in a house, monitored with a ZigBee wireless sensor network, as well as weather and climate data such as temperature, pressure, humidity, wind speed, visibility and dewpoint measured at Chievres airport. The data is averaged over 10-minute periods and spans 4.5 months.

    Please refer to https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction for more details

    Relevant papers Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788

    Citation request Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788

  13. Appliances Energy Dataset - Vdataset - LDM in NFDI4Energy

    • service.tib.eu
    Updated Nov 17, 2025
    Cite
    (2025). Appliances Energy Dataset - Vdataset - LDM in NFDI4Energy [Dataset]. https://service.tib.eu/ldm_nfdi4energy/ldmservice/dataset/openaire_7c522fbe-7b1d-42e8-aa1c-274af3d535a3
    Dataset updated
    Nov 17, 2025
    Description

    This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

    The goal of this dataset is to predict the total energy usage in kWh of a house. This dataset contains 138 time series obtained from the Appliances Energy Prediction dataset from the UCI repository. Each time series has 24 dimensions. These include temperature and humidity measurements of 9 rooms in a house, monitored with a ZigBee wireless sensor network, as well as weather and climate data such as temperature, pressure, humidity, wind speed, visibility and dewpoint measured at Chievres airport. The data is averaged over 10-minute periods and spans 4.5 months.

    Please refer to https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction for more details

    Relevant papers Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788

    Citation request Luis M. Candanedo, Veronique Feldheim, Dominique Deramaix, Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788

  14. Comparative Summary of Related Studies on AQI Prediction.

    • plos.figshare.com
    xls
    Updated Nov 7, 2025
    Cite
    Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal (2025). Comparative Summary of Related Studies on AQI Prediction. [Dataset]. http://doi.org/10.1371/journal.pone.0336241.t001
    Available download formats: xls
    Dataset updated
    Nov 7, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rana Muhammad Amir Latif; Tahir Iqbal; Ismaeel Abdel Qader; Atif Ikram; Hadeel Alsolai; Bayan Alabdullah; Fatimah Alhayan; Taher M. Ghazal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparative Summary of Related Studies on AQI Prediction.

  15. abalone

    • huggingface.co
    Updated Apr 5, 2023
    Cite
    Mattia (2023). abalone [Dataset]. https://huggingface.co/datasets/mstz/abalone
    Dataset updated
    Apr 5, 2023
    Authors
    Mattia
    License

    https://choosealicense.com/licenses/cc/

    Description

    Abalone

    The Abalone dataset from the UCI ML repository. Predict the age of the given abalone.

    Configurations and tasks

    Configuration   Task                    Description
    abalone         Regression              Predict the age of the abalone.
    binary          Binary classification   Does the abalone have more than 9 rings?

      Usage
    

    from datasets import load_dataset

    dataset = load_dataset("mstz/abalone")["train"]

    Features

    Target feature in bold.

    Feature   Type
    sex       [string]…

    See the full description on the dataset page: https://huggingface.co/datasets/mstz/abalone.

  16. Details of the datasets from UCI repository used in the experiments.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    QingJun Song; HaiYan Jiang; Qinghui Song; XieGuang Zhao; Xiaoxuan Wu (2023). Details of the datasets from UCI repository used in the experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0184834.t003
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    QingJun Song; HaiYan Jiang; Qinghui Song; XieGuang Zhao; Xiaoxuan Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Details of the datasets from UCI repository used in the experiments.

  17. Sales Data

    • kaggle.com
    zip
    Updated Aug 11, 2023
    Cite
    Sky (2023). Sales Data [Dataset]. https://www.kaggle.com/datasets/yshailesh/sales-dataset-for-prediction
    Available download formats: zip (7223540 bytes)
    Dataset updated
    Aug 11, 2023
    Authors
    Sky
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Typically, sales datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions. The dataset is maintained on their site, where it can be found by the title "Durg Store". The data files contain two datasets: one stores all the historical sales data, while the second contains all the store information.

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Inspiration: Analyses for this dataset could include regression, time series, clustering, classification and more.

  18. Climate Model Simulation Crashes Data Set

    • kaggle.com
    zip
    Updated May 27, 2020
    Cite
    Anala Keshava (2020). Climate Model Simulation Crashes Data Set [Dataset]. https://www.kaggle.com/analakeshava/climate-model-simulation-crashes-data-set
    Available download formats: zip (133132 bytes)
    Dataset updated
    May 27, 2020
    Authors
    Anala Keshava
    Description

    The dataset is taken from the UCI machine learning repository. The link for the dataset is https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes. This dataset contains samples of 18 climate model input parameter values to predict climate model simulation crashes.

  19. Energy Efficiency Data Set

    • kaggle.com
    Updated May 12, 2022
    Cite
    Ujjwal Chowdhury (2022). Energy Efficiency Data Set [Dataset]. https://www.kaggle.com/datasets/ujjwalchowdhury/energy-efficiency-data-set
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ujjwal Chowdhury
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This Data Set is collected from UCI Machine Learning Repository.

    Data Set Description in UCI as follows: "Abstract: This study looked into assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters.

    We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer."
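The rounding trick mentioned at the end of this description can be sketched as follows. The load values are made up for illustration, and the helper `round_half_up` is an assumption about how ties should be handled; the description does not specify a tie-breaking rule.

```python
import math

def round_half_up(x):
    """Round a non-negative value to the nearest integer, with .5 going up."""
    return math.floor(x + 0.5)

# Hypothetical heating-load values (not taken from the data set).
heating_loads = [15.55, 20.84, 21.46, 28.5, 32.31]
classes = [round_half_up(v) for v in heating_loads]
print(classes)  # one integer class label per building
```

Python's built-in `round` resolves ties to the nearest even integer (`round(28.5)` is 28), so an explicit helper keeps the class boundaries predictable.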

  20. Data from: A new hybrid ensemble model with voting-based outlier detection...

    • figshare.com
    txt
    Updated Aug 11, 2020
    Cite
    Wenyu Zhang; Dongqi Yang; Shuai Zhang (2020). A new hybrid ensemble model with voting-based outlier detection and balanced sampling for credit scoring [Dataset]. http://doi.org/10.6084/m9.figshare.12782552.v2
    Available download formats: txt
    Dataset updated
    Aug 11, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Wenyu Zhang; Dongqi Yang; Shuai Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three datasets from the UC Irvine (UCI) machine learning repository, namely the Australian, German, and Japanese credit datasets, were adopted for the current study. The Australian credit dataset contains 690 samples, of which 307 are positive and 383 are negative; its input features have 15 dimensions. The German credit dataset contains 1000 samples, of which 700 are positive and 300 are negative; its input features have 21 dimensions. The Japanese credit dataset contains 690 samples, of which 383 are positive and 307 are negative; its input features have 16 dimensions.
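
The class balance implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the sample counts stated in the description; the dictionary layout and the function name are illustrative choices.

```python
# Sample counts as stated in the description: (positive, negative).
credit_datasets = {
    "Australian": (307, 383),
    "German": (700, 300),
    "Japanese": (383, 307),
}


def positive_rate(positive: int, negative: int) -> float:
    """Fraction of positive samples in a data set."""
    return positive / (positive + negative)


# Map each data set to (total sample count, positive rate).
summary = {
    name: (pos + neg, positive_rate(pos, neg))
    for name, (pos, neg) in credit_datasets.items()
}

for name, (total, rate) in summary.items():
    print(f"{name}: {total} samples, {rate:.1%} positive")
```

The totals match the stated sample sizes (690, 1000, and 690), and the German dataset is the most imbalanced of the three.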

Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302

UCI and OpenML Data Sets for Ordinal Quantification

Available download formats: zip
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
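
Replicating a sample from one row of an index file could look roughly like the sketch below. It rests on assumptions that the description does not confirm: that each row is a comma-separated list of integer item positions without a header, and that positions are 1-based (as is conventional in Julia). The helper names and the toy data are hypothetical.

```python
import csv
import io


def read_index_rows(text: str) -> list[list[int]]:
    """Parse index rows; each non-empty row is assumed to describe one sample."""
    return [[int(cell) for cell in row] for row in csv.reader(io.StringIO(text)) if row]


def draw_sample(items: list, indices: list[int], one_based: bool = True) -> list:
    """Select the listed items; 1-based indexing is an assumption."""
    offset = 1 if one_based else 0
    return [items[i - offset] for i in indices]


# Hypothetical stand-ins for the extracted data items and for an index file.
items = ["a", "b", "c", "d", "e"]
index_text = "1,1,3\n5,2,4\n"

samples = [draw_sample(items, row) for row in read_index_rows(index_text)]
print(samples)
```

Before relying on such a loader, check the actual files for a header row and for the indexing base, and adjust `one_based` accordingly.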

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
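
The two protocols can be illustrated with a small sketch. Drawing all label distributions with equal probability corresponds to sampling uniformly from the probability simplex, which is done here by normalizing exponential draws. The smoothness measure (sum of squared second differences) is an assumption made for illustration; the description does not state which measure the authors use, and the function names are hypothetical.

```python
import random


def sample_prevalences(n_classes: int, rng: random.Random) -> list[float]:
    """Draw one label distribution uniformly from the probability simplex."""
    # Normalizing i.i.d. Exponential(1) draws yields a flat Dirichlet sample.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def jaggedness(p: list[float]) -> float:
    """Sum of squared second differences; smaller is smoother (assumed measure)."""
    return sum((-p[i - 1] + 2 * p[i] - p[i + 1]) ** 2 for i in range(1, len(p) - 1))


def app_oq(samples: list[list[float]], keep: float = 0.2) -> list[list[float]]:
    """Keep the smoothest fraction `keep` of the APP samples."""
    n_keep = max(1, int(len(samples) * keep))
    return sorted(samples, key=jaggedness)[:n_keep]


rng = random.Random(0)
app_samples = [sample_prevalences(5, rng) for _ in range(1000)]  # APP
oq_samples = app_oq(app_samples)                                 # APP-OQ
print(len(app_samples), len(oq_samples))
```

To reproduce the published evaluation exactly, use the provided index files rather than a re-implementation like this one.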

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
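
Given this layout, the label distribution of an extracted file, which is the quantity that quantification aims to predict, can be computed as sketched below. Only the "class_label" column name comes from the description; the feature column names and the rows are hypothetical, and an in-memory string stands in for a real file.

```python
import csv
import io
from collections import Counter


def class_prevalences(csv_text: str) -> dict[str, float]:
    """Return the relative frequency of each value in the class_label column."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first row is the header
    counts = Counter(row["class_label"] for row in reader)
    total = sum(counts.values())
    # Labels stay strings here, so sorting is lexicographic, not numeric.
    return {label: counts[label] / total for label in sorted(counts)}


# Hypothetical file content; real files come from extract-oq.jl.
csv_text = (
    "class_label,feature_1,feature_2\n"
    "1,0.5,1.2\n"
    "2,0.1,0.7\n"
    "2,0.3,0.9\n"
    "3,0.8,0.4\n"
)

prevalences = class_prevalences(csv_text)
print(prevalences)
```

For a file on disk, the same function can be fed with the text obtained from `open(path, newline="").read()`.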

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
