This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. Through EDA on the merged dataset, we can get a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (this is a copy of the description as provided in the actual dataset)
**Train.csv**
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- family: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store on a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including city, state, type, and cluster. cluster is a grouping of similar stores.
- Holidays and events, with metadata. NOTE: pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up for by a Work Day, a day not normally scheduled for work (e.g., a Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)
**Note:** *There is a transactions column in the training dataset which displays the sales transactions on that particular date.*

**Test.csv**
- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.

**Note:** *There is no transactions column in the test dataset as there was in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.*
submission.csv - A sample submission file in the correct format.
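For a quick start with the merged file, a minimal pandas sketch is shown below; the file name and the presence of the merged columns (e.g., dcoilwtico) are assumptions, so adjust them to the actual upload.

```python
# Minimal EDA sketch for the merged dataset (file name assumed; adjust to the actual upload).
import pandas as pd

df = pd.read_csv("train_merged.csv", parse_dates=["date"])

# Average sales per product family, and a quick look at how oil price moves with sales.
print(df.groupby("family")["sales"].mean().sort_values(ascending=False).head(10))
print(df.groupby("date")[["sales", "dcoilwtico"]].mean().corr())
```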
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing the model.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
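As a minimal sketch of how the files might be loaded and used together (the "target" label column name is an assumption; adjust it to the actual schema):

```python
# Minimal loading/usage sketch; the "target" label column is an assumption.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("train_data.csv")
valid = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

X_train, y_train = train.drop(columns=["target"]), train["target"]
X_valid, y_valid = valid.drop(columns=["target"]), valid["target"]

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))
# The held-out test_data.csv should only be used for the final evaluation.
```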
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Multiple modeling frameworks were used to predict daily temperatures at 0.5m depth intervals for a set of diverse lakes in the U.S. states of Minnesota and Wisconsin. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean squared error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations.
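As a rough illustration of the PGDL idea, the loss below combines a supervised term with a physics penalty; this is a generic sketch under assumed names, not the authors' implementation.

```python
# Conceptual sketch (not the authors' code) of a process-guided loss: supervised error
# plus a penalty on violations of an energy-conservation residual. "energy_residual"
# is a placeholder for a physics-based term computed from the predictions.
import torch

def pgdl_loss(pred, obs, energy_residual, lam=0.1):
    supervised = torch.mean((pred - obs) ** 2)      # fit to temperature observations
    physics = torch.mean(energy_residual ** 2)      # penalize energy-balance violations
    return supervised + lam * physics

# Dummy usage with stand-in tensors:
pred, obs = torch.rand(100), torch.rand(100)
residual = torch.rand(100) * 0.01                   # placeholder energy-balance residual
print(pgdl_loss(pred, obs, residual))
```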
Entity Prediction Training Data
Training dataset for the Entity Prediction Model (Model A) of the Synthetic Data Pipeline.
Dataset Description
This dataset contains training examples for predicting plausible Actors and Recipients in UN Peacekeeping scenarios based on structured event definitions.
Features
- Mission Context: UN mission name and acronym
- Year: year of the scenario
- Event Classification: PLOVER event type (Assault, Aid, Consult, etc.)
- Mode: Event…

See the full description on the dataset page: https://huggingface.co/datasets/DavePiv/entity-prediction-training-data.
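A minimal sketch for loading the dataset from the Hugging Face Hub is given below; the split name ("train") is an assumption.

```python
# Minimal loading sketch; split names are an assumption, check the dataset page.
from datasets import load_dataset

ds = load_dataset("DavePiv/entity-prediction-training-data")
print(ds)                 # shows the available splits and features
print(ds["train"][0])     # first training example, assuming a "train" split exists
```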
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The application of machine learning has become commonplace for problems in modern data science. The democratization of the decision process when choosing a machine learning algorithm has also received considerable attention through the use of meta features and automated machine learning for both classification and regression type problems. However, this is not the case for multistep-ahead time series problems. Time series models generally rely upon the series itself to make future predictions, as opposed to independent features used in regression and classification problems. The structure of a time series is generally described by features such as trend, seasonality, cyclicality, and irregularity. In this research, we demonstrate how time series metrics for these features, in conjunction with an ensemble-based regression learner, were used to predict the standardized mean square error of candidate time series prediction models. These experiments used datasets that cover a wide feature space and enable researchers to select the single best-performing model or the top N performing models. A robust evaluation was carried out to test the learner's performance on both synthetic and real time series.
Proposed Dataset
The dataset proposed here gives the results of 20-step-ahead predictions for eight machine learning / multi-step-ahead prediction strategies applied to 5,842 time series datasets. It was used as the training data for the meta-learners in this research. The meta-features used are in columns C to AE. Column AH gives the method/strategy used, and columns AI to BB contain the error (the outcome variable) for each prediction step. The description of the methods/strategies is as follows:
Machine Learning methods:
Multistep ahead prediction strategy:
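Independent of the specific methods and strategies listed above, the sketch below illustrates the general meta-learning setup: compute series-level meta-features and train an ensemble regressor to predict a candidate model's forecast error. It is a conceptual sketch on stand-in data, not the authors' pipeline.

```python
# Conceptual sketch (not the authors' pipeline): derive simple series-level meta-features
# and fit an ensemble regressor that predicts a candidate method's forecast error.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def meta_features(y):
    t = np.arange(len(y))
    slope = np.polyfit(t, y, 1)[0]             # crude trend estimate
    acf1 = np.corrcoef(y[:-1], y[1:])[0, 1]    # lag-1 autocorrelation
    return [slope, acf1, np.std(y), np.mean(np.abs(np.diff(y)))]

rng = np.random.default_rng(0)
series = [rng.normal(size=100).cumsum() for _ in range(50)]   # stand-in time series
X = np.array([meta_features(s) for s in series])              # one row of meta-features per series
y_err = rng.random(50)                                        # stand-in prediction errors
learner = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y_err)
print(learner.predict(X[:3]))
```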
This dataset was created by Jpfitzger
Yolk-shell nanostructures are widely used in optoelectronic devices due to their excellent optical properties. Compared to single metal nanostructures, yolk shells have more controllable degrees of freedom, which may make experiments and simulations more complex. Using neural networks can efficiently simplify the computational process for yolk-shell structures. In our work, the relationship between the size and the absorption efficiency of the yolk-shell structure is established using a backpropagation neural network (BPNN), significantly simplifying the calculation process while ensuring accuracy equivalent to discrete dipole scattering (DDSCAT). The absorption efficiency of the yolk shell was comprehensively described through the forward and reverse prediction processes. In forward prediction, the absorption spectrum of a yolk shell is obtained from its size parameters. In reverse prediction, the size parameters of yolk shells are predicted from absorption spectra. A comparison with the traditional DDSCAT demonstrated the high-precision prediction capability and fast computation of this method, with minimal memory consumption.
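The sketch below shows the shape of such a forward/reverse mapping with a generic multilayer perceptron on random stand-in data; the size parameters, spectral grid, and network architecture are assumptions, not the authors' BPNN.

```python
# Generic sketch (not the authors' network): an MLP mapping yolk-shell size parameters
# to an absorption spectrum sampled at fixed wavelengths, and the reverse. Random stand-in data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
sizes = rng.uniform(10, 100, size=(500, 3))    # e.g. core radius, shell radius, gap (assumed)
spectra = rng.random((500, 50))                # absorption efficiency at 50 wavelengths (stand-in)

forward = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
forward.fit(sizes, spectra)                    # forward prediction: size -> spectrum

inverse = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
inverse.fit(spectra, sizes)                    # reverse prediction: spectrum -> size
```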
This dataset contains model outputs that were analyzed to produce the main results of the paper.
This dataset includes evaluation data ("test" data) and performance metrics for water temperature predictions from multiple modeling frameworks. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean squared error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations. Performance was measured as root-mean squared errors relative to temperature observations during the test period. Test data include compiled water temperature data from a variety of sources, including the Water Quality Portal (Read et al. 2017), the North Temperate Lakes Long-Term Ecological Research Program (https://lter.limnology.wisc.edu/), the Minnesota Department of Natural Resources, and the Global Lake Ecological Observatory Network (gleon.org). This dataset is part of a larger data release of lake temperature model inputs and outputs for 68 lakes in the U.S. states of Minnesota and Wisconsin (http://dx.doi.org/10.5066/P9AQPIVD).
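For reference, the evaluation metric itself is straightforward; a minimal sketch of computing RMSE between predicted and observed temperatures:

```python
# Minimal sketch of the evaluation metric: RMSE of predicted vs. observed temperatures.
import numpy as np

def rmse(pred, obs):
    pred, obs = np.asarray(pred), np.asarray(obs)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

print(rmse([4.1, 10.2, 17.8], [4.0, 11.0, 18.5]))
```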
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This dataset includes compiled water temperature data from a variety of sources, including the Water Quality Portal (Read et al. 2017), the North Temperate Lakes Long-Term Ecological Research Program (https://lter.limnology.wisc.edu/), the Minnesota Department of Natural Resources, and the Global Lake Ecological Observatory Network (gleon.org). This dataset is part of a larger data release of lake temperature model inputs and outputs for 68 lakes in the U.S. states of Minnesota and Wisconsin (http://dx.doi.org/10.5066/P9AQPIVD).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The goal of this work is to generate large, statistically representative datasets to train machine learning models for disruption prediction from data of only a few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results with artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state-space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to that of the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
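The coloring-transformation step can be illustrated in isolation: independent samples per dimension are given a desired cross-correlation via the Cholesky factor of a target covariance. This is a minimal sketch of that general idea, not the paper's full augmentation procedure.

```python
# Minimal sketch of a coloring transformation: uncorrelated samples per dimension are
# given a target cross-correlation via the Cholesky factor of the desired covariance.
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_dims = 1000, 3
white = rng.standard_normal((n_steps, n_dims))      # independent (uncorrelated) draws

target_cov = np.array([[1.0, 0.6, 0.2],
                       [0.6, 1.0, 0.4],
                       [0.2, 0.4, 1.0]])
L = np.linalg.cholesky(target_cov)
colored = white @ L.T                               # now has covariance close to target_cov

print(np.round(np.cov(colored, rowvar=False), 2))
```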
This data archive contains datasets developed for the purpose of training and applying random forest models to the Mississippi Embayment Regional Aquifer. The random forest models are designed to predict total stream flow and baseflow as a function of a combination of watershed characteristics and monthly weather data. These datasets are associated with a report (SIR 2022-xxxx) and code contained in a USGS GitLab repository. The GitLab repository (https://code.usgs.gov/map/maprandomforest/) contains much more information about how these data may be used to supply predictions of stream flow and baseflow.
Credit to the original author: the dataset was originally published here
Hands-on teaching of modern machine learning and deep learning techniques heavily relies on the use of well-suited datasets. The "weather prediction dataset" is a novel tabular dataset that was specifically created for teaching machine learning and deep learning to an academic audience. The dataset contains intuitively accessible weather observations from 18 locations in Europe. It was designed to be suitable for a large variety of different training goals, many of which do not easily give way to unrealistically high prediction accuracy. Teachers or instructors can thus choose the difficulty of the training goals and thereby match it with the respective learner audience or lesson objective. The compact size and complexity of the dataset make it possible to quickly train common machine learning and deep learning models on a standard laptop so that they can be used in live hands-on sessions.
The dataset can be found in the `\dataset` folder and be downloaded from zenodo: https://doi.org/10.5281/zenodo.4980359
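A possible quick-start sketch is shown below; the CSV file name and the target column are assumptions for illustration, so adjust them to the files actually contained in the dataset folder.

```python
# Quick-start sketch; file name and target column are assumptions, adjust to the actual files.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset/weather_prediction_dataset.csv")

# Example training goal (assumed): predict one location's temperature from the other columns.
y = df["BASEL_temp_mean"]                                   # hypothetical target column
X = df.drop(columns=["BASEL_temp_mean"]).select_dtypes("number")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))
```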
If you make use of this dataset, in particular if this is in form of an academic contribution, then please cite the following two references:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the "ground truth") for each passenger. Your model will be based on "features" like passengers' gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
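A minimal end-to-end sketch is given below: train a simple model on train.csv and write a submission in the same format as gender_submission.csv. Column names follow the standard Kaggle Titanic files (PassengerId, Survived, Pclass, Sex, SibSp, Parch); adjust the capitalization if this upload differs.

```python
# Minimal modeling sketch: train on train.csv, predict Survived for test.csv, and write a
# submission in the same format as gender_submission.csv.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "SibSp", "Parch"]
X = pd.concat([train[features], train["Sex"].map({"male": 0, "female": 1})], axis=1)
X_test = pd.concat([test[features], test["Sex"].map({"male": 0, "female": 1})], axis=1)

model = LogisticRegression(max_iter=1000).fit(X, train["Survived"])
pd.DataFrame({"PassengerId": test["PassengerId"],
              "Survived": model.predict(X_test)}).to_csv("submission.csv", index=False)
```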
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy, coming from robust statistics and optimization, is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.'s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
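For orientation, the sketch below shows the standard split-conformal baseline that such procedures build on (calibrate a residual quantile, then form prediction intervals); the f-divergence-robust adjustment from the paper is not reproduced here.

```python
# Sketch of the split-conformal baseline (not the paper's f-divergence-robust procedure):
# calibrate a conformity-score quantile on held-out data, then form prediction intervals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)

X_fit, y_fit = X[:300], y[:300]          # model-fitting split
X_cal, y_cal = X[300:], y[300:]          # calibration split (exchangeable with test data)

model = LinearRegression().fit(X_fit, y_fit)
scores = np.abs(y_cal - model.predict(X_cal))                     # conformity scores
alpha = 0.1
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

x_new = rng.normal(size=(1, 3))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```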
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Machine learning (ML) algorithms, as an early branch of artificial intelligence technology, can effectively simulate human behavior by training on data from the training set. Machine learning algorithms were used in this study to predict patient choice tendencies in medical decision-making. The goal was to help physicians understand patient preferences and to serve as a resource for the development of decision-making schemes in clinical treatment. As a result, physicians and patients can have better conversations at lower expense, leading to better medical decisions.

Method: Patient medical decision-making tendencies were predicted from primary survey data obtained from 248 participants at third-level grade-A hospitals in China. Specifically, 12 predictor variables were set according to the literature review, and four types of outcome variables were set based on the optimization principle of clinical diagnosis and treatment, that is, the patient's medical decision-making tendency, classified as treatment effect, treatment cost, treatment side effect, and treatment experience. In conjunction with the study's data characteristics, three ML classification algorithms, decision tree (DT), k-nearest neighbor (KNN), and support vector machine (SVM), were used to predict patients' medical decision-making tendency, and the performance of the three algorithms was compared.

Results: The accuracy of the DT algorithm for predicting patients' choice tendency in medical decision making is 80% for treatment effect, 60% for treatment cost, 56% for treatment side effects, and 60% for treatment experience, followed by the KNN algorithm at 78%, 66%, 74%, and 84%, and the SVM algorithm at 82%, 76%, 80%, and 94%. At the same time, the comprehensive evaluation index F1-scores of the DT algorithm are 0.80, 0.61, 0.58, and 0.60, those of the KNN algorithm are 0.75, 0.65, 0.71, and 0.84, and those of the SVM algorithm are 0.81, 0.74, 0.73, and 0.94.

Conclusion: Among the three ML classification algorithms, SVM has the highest accuracy and the best performance. Therefore, the prediction results have certain reference value and guiding significance for physicians formulating clinical treatment plans. The research results help promote the development and application of a patient-centered medical decision assistance system, resolve conflicts of interest between physicians and patients, and assist them in scientific decision-making.
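The comparison of the three classifiers follows a standard scikit-learn pattern; the sketch below illustrates it on synthetic stand-in data with the study's dimensions (248 samples, 12 predictors), not on the actual survey data.

```python
# Conceptual sketch (synthetic stand-in data, not the study's survey) comparing the three classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=248, n_features=12, random_state=0)  # 12 predictors, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("DT", DecisionTreeClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier()),
                  ("SVM", SVC())]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="weighted"))
```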
Microsoft Excel
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicting the execution time of model transformations can help to understand how a transformation reacts to a given input model without creating and transforming the respective model.
In our previous data set (https://doi.org/10.5281/zenodo.8385957), we documented our experiments in which we predict the performance of ATL transformations using predictive models obtained by training linear regression, random forest, and support vector regression. As input for the prediction, our approach uses a characterization of the input model. In these experiments, we only used data from real models.

However, a common problem is that transformation developers do not have enough models available to use such a prediction approach. Therefore, in a new variant of our experiments, we investigated whether the three considered machine learning approaches can predict the performance of transformations if we use data from generated models for training. We also investigated whether it is possible to achieve good predictions with smaller training data sets. The dataset provided here offers the corresponding raw data, scripts, and results.
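The general prediction setup can be sketched as follows; the feature columns and runtimes here are random stand-ins under assumed names, not the raw data or scripts provided in this dataset.

```python
# Conceptual sketch (not the authors' setup): predict transformation execution time from
# input-model characterization features using the three learners mentioned above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(1, 1000, size=(200, 5)).astype(float)   # e.g. element counts per metamodel type (assumed)
y = X @ np.array([0.8, 0.1, 0.05, 0.3, 0.0]) + rng.normal(scale=5, size=200)  # stand-in runtimes

for name, reg in [("linear regression", LinearRegression()),
                  ("random forest", RandomForestRegressor(random_state=0)),
                  ("SVR", SVR())]:
    print(name, cross_val_score(reg, X, y, cv=5, scoring="r2").mean())
```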
Detailed documentation is available in documentaion.pdf.
This dataset contains the predicted prices of the asset "Not in Employment, Education, or Training" over the next 16 years. The prices are initially calculated using a default 5 percent annual growth rate; after page load, a sliding-scale component lets the user further adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
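The projection arithmetic is simple compound growth; a minimal sketch (the starting value of 100 is only an illustration):

```python
# Minimal sketch of the projection arithmetic: value after t years at annual growth rate r.
def project(initial, rate, years):
    return [initial * (1 + rate) ** t for t in range(years + 1)]

print(project(100.0, 0.05, 16))    # default 5% growth over the 16-year horizon
```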
Machine learning (ML) is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed to do so. ML is closely related to (and often overlaps with) computational statistics, which also focuses on making predictions through the use of computers. In general, ML explores algorithms that can learn from current data and make predictions on new data by building a model from sample inputs. The fields of statistics and ML have a common root and will continue to come closer together in the future. In this paper we explore the novel deep learning (DL) methodology in the context of genomic selection. DL models with densely connected network architectures were compared with one of the most often used genome-enabled prediction models, genomic best linear unbiased prediction (GBLUP). We used nine published real genomic data sets to compare the models and obtain a "meta picture" of the performance of DL models with a densely connected network architecture.