This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. Through EDA on the merged dataset, we can get a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (this is a copy of the description as provided in the actual dataset)
**Train.csv**
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- family: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store on a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including city, state, type, and cluster. cluster is a grouping of similar stores.
- Holidays and events, with metadata. NOTE: pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up for by a Work Day, a day not normally scheduled for work (e.g., a Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)
**Note:** *There is a transactions column in the training dataset which displays the sales transactions on that particular date.*

**Test.csv**
- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.

**Note:** *There is no transactions column in the test dataset as there was in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.*
submission.csv - A sample submission file in the correct format.
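For a quick start with the merged file, a minimal pandas sketch is shown below; the file name and the presence of the merged columns (e.g., dcoilwtico) are assumptions, so adjust them to the actual upload.

```python
# Minimal EDA sketch for the merged dataset (file name assumed; adjust to the actual upload).
import pandas as pd

df = pd.read_csv("train_merged.csv", parse_dates=["date"])

# Average sales per product family, and a quick look at how oil price moves with sales.
print(df.groupby("family")["sales"].mean().sort_values(ascending=False).head(10))
print(df.groupby("date")[["sales", "dcoilwtico"]].mean().corr())
```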
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing the model.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
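As a minimal sketch of how the files might be loaded and used together (the "target" label column name is an assumption; adjust it to the actual schema):

```python
# Minimal loading/usage sketch; the "target" label column is an assumption.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("train_data.csv")
valid = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

X_train, y_train = train.drop(columns=["target"]), train["target"]
X_valid, y_valid = valid.drop(columns=["target"]), valid["target"]

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))
# The held-out test_data.csv should only be used for the final evaluation.
```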
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Multiple modeling frameworks were used to predict daily temperatures at 0.5m depth intervals for a set of diverse lakes in the U.S. states of Minnesota and Wisconsin. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean squared error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations.
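As a rough illustration of the PGDL idea, the loss below combines a supervised term with a physics penalty; this is a generic sketch under assumed names, not the authors' implementation.

```python
# Conceptual sketch (not the authors' code) of a process-guided loss: supervised error
# plus a penalty on violations of an energy-conservation residual. "energy_residual"
# is a placeholder for a physics-based term computed from the predictions.
import torch

def pgdl_loss(pred, obs, energy_residual, lam=0.1):
    supervised = torch.mean((pred - obs) ** 2)      # fit to temperature observations
    physics = torch.mean(energy_residual ** 2)      # penalize energy-balance violations
    return supervised + lam * physics

# Dummy usage with stand-in tensors:
pred, obs = torch.rand(100), torch.rand(100)
residual = torch.rand(100) * 0.01                   # placeholder energy-balance residual
print(pgdl_loss(pred, obs, residual))
```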
Entity Prediction Training Data
Training dataset for the Entity Prediction Model (Model A) of the Synthetic Data Pipeline.
Dataset Description
This dataset contains training examples for predicting plausible Actors and Recipients in UN Peacekeeping scenarios based on structured event definitions.
Features
- Mission Context: UN mission name and acronym
- Year: year of the scenario
- Event Classification: PLOVER event type (Assault, Aid, Consult, etc.)
- Mode: Event…

See the full description on the dataset page: https://huggingface.co/datasets/DavePiv/entity-prediction-training-data.
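A minimal sketch for loading the dataset from the Hugging Face Hub is given below; the split name ("train") is an assumption.

```python
# Minimal loading sketch; split names are an assumption, check the dataset page.
from datasets import load_dataset

ds = load_dataset("DavePiv/entity-prediction-training-data")
print(ds)                 # shows the available splits and features
print(ds["train"][0])     # first training example, assuming a "train" split exists
```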
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The application of machine learning has become commonplace for problems in modern data science. The democratization of the decision process when choosing a machine learning algorithm has also received considerable attention through the use of meta features and automated machine learning for both classification and regression type problems. However, this is not the case for multistep-ahead time series problems. Time series models generally rely upon the series itself to make future predictions, as opposed to independent features used in regression and classification problems. The structure of a time series is generally described by features such as trend, seasonality, cyclicality, and irregularity. In this research, we demonstrate how time series metrics for these features, in conjunction with an ensemble-based regression learner, were used to predict the standardized mean square error of candidate time series prediction models. These experiments used datasets that cover a wide feature space and enable researchers to select the single best-performing model or the top N performing models. A robust evaluation was carried out to test the learner's performance on both synthetic and real time series.
Proposed Dataset
The dataset proposed here gives the results of 20-step-ahead predictions for eight machine learning / multi-step-ahead prediction strategies applied to 5,842 time series datasets. It was used as the training data for the meta-learners in this research. The meta-features used are in columns C to AE. Column AH gives the method/strategy used, and columns AI to BB contain the error (the outcome variable) for each prediction step. The description of the methods/strategies is as follows:
Machine Learning methods:
Multistep ahead prediction strategy:
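Independent of the specific methods and strategies listed above, the sketch below illustrates the general meta-learning setup: compute series-level meta-features and train an ensemble regressor to predict a candidate model's forecast error. It is a conceptual sketch on stand-in data, not the authors' pipeline.

```python
# Conceptual sketch (not the authors' pipeline): derive simple series-level meta-features
# and fit an ensemble regressor that predicts a candidate method's forecast error.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def meta_features(y):
    t = np.arange(len(y))
    slope = np.polyfit(t, y, 1)[0]             # crude trend estimate
    acf1 = np.corrcoef(y[:-1], y[1:])[0, 1]    # lag-1 autocorrelation
    return [slope, acf1, np.std(y), np.mean(np.abs(np.diff(y)))]

rng = np.random.default_rng(0)
series = [rng.normal(size=100).cumsum() for _ in range(50)]   # stand-in time series
X = np.array([meta_features(s) for s in series])              # one row of meta-features per series
y_err = rng.random(50)                                        # stand-in prediction errors
learner = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y_err)
print(learner.predict(X[:3]))
```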
This dataset was created by Jpfitzger
Yolk-shell nanostructures are widely used in optoelectronic devices due to their excellent optical properties. Compared to single metal nanostructures, yolk shells have more controllable degrees of freedom, which may make experiments and simulations more complex. Using neural networks can efficiently simplify the computational process for yolk-shell structures. In our work, the relationship between the size and the absorption efficiency of the yolk-shell structure is established using a backpropagation neural network (BPNN), significantly simplifying the calculation process while ensuring accuracy equivalent to discrete dipole scattering (DDSCAT). The absorption efficiency of the yolk shell was comprehensively described through the forward and reverse prediction processes. In forward prediction, the absorption spectrum of a yolk shell is obtained from its size parameters. In reverse prediction, the size parameters of yolk shells are predicted from absorption spectra. A comparison with the traditional DDSCAT demonstrated the high-precision prediction capability and fast computation of this method, with minimal memory consumption.
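The sketch below shows the shape of such a forward/reverse mapping with a generic multilayer perceptron on random stand-in data; the size parameters, spectral grid, and network architecture are assumptions, not the authors' BPNN.

```python
# Generic sketch (not the authors' network): an MLP mapping yolk-shell size parameters
# to an absorption spectrum sampled at fixed wavelengths, and the reverse. Random stand-in data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
sizes = rng.uniform(10, 100, size=(500, 3))    # e.g. core radius, shell radius, gap (assumed)
spectra = rng.random((500, 50))                # absorption efficiency at 50 wavelengths (stand-in)

forward = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
forward.fit(sizes, spectra)                    # forward prediction: size -> spectrum

inverse = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
inverse.fit(spectra, sizes)                    # reverse prediction: spectrum -> size
```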
This dataset contains model outputs that were analyzed to produce the main results of the paper.
This dataset includes evaluation data ("test" data) and performance metrics for water temperature predictions from multiple modeling frameworks. Process-Based (PB) models were configured and calibrated with training data to reduce root-mean squared error. Uncalibrated models used default configurations (PB0; see Winslow et al. 2016 for details) and no parameters were adjusted according to model fit with observations. Deep Learning (DL) models were Long Short-Term Memory artificial recurrent neural network models which used training data to adjust model structure and weights for temperature predictions (Jia et al. 2019). Process-Guided Deep Learning (PGDL) models were DL models with an added physical constraint for energy conservation as a loss term. These models were pre-trained with uncalibrated Process-Based model outputs (PB0) before training on actual temperature observations. Performance was measured as root-mean squared errors relative to temperature observations during the test period. Test data include compiled water temperature data from a variety of sources, including the Water Quality Portal (Read et al. 2017), the North Temperate Lakes Long-Term Ecological Research Program (https://lter.limnology.wisc.edu/), the Minnesota Department of Natural Resources, and the Global Lake Ecological Observatory Network (gleon.org). This dataset is part of a larger data release of lake temperature model inputs and outputs for 68 lakes in the U.S. states of Minnesota and Wisconsin (http://dx.doi.org/10.5066/P9AQPIVD).
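For reference, the evaluation metric itself is straightforward; a minimal sketch of computing RMSE between predicted and observed temperatures:

```python
# Minimal sketch of the evaluation metric: RMSE of predicted vs. observed temperatures.
import numpy as np

def rmse(pred, obs):
    pred, obs = np.asarray(pred), np.asarray(obs)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

print(rmse([4.1, 10.2, 17.8], [4.0, 11.0, 18.5]))
```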
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This dataset includes compiled water temperature data from a variety of sources, including the Water Quality Portal (Read et al. 2017), the North Temperate Lakes Long-Term Ecological Research Program (https://lter.limnology.wisc.edu/), the Minnesota Department of Natural Resources, and the Global Lake Ecological Observatory Network (gleon.org). This dataset is part of a larger data release of lake temperature model inputs and outputs for 68 lakes in the U.S. states of Minnesota and Wisconsin (http://dx.doi.org/10.5066/P9AQPIVD).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The goal of this work is to generate large, statistically representative datasets to train machine learning models for disruption prediction from data of only a few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results with artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state-space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to that of the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
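The coloring-transformation step can be illustrated in isolation: independent samples per dimension are given a desired cross-correlation via the Cholesky factor of a target covariance. This is a minimal sketch of that general idea, not the paper's full augmentation procedure.

```python
# Minimal sketch of a coloring transformation: uncorrelated samples per dimension are
# given a target cross-correlation via the Cholesky factor of the desired covariance.
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_dims = 1000, 3
white = rng.standard_normal((n_steps, n_dims))      # independent (uncorrelated) draws

target_cov = np.array([[1.0, 0.6, 0.2],
                       [0.6, 1.0, 0.4],
                       [0.2, 0.4, 1.0]])
L = np.linalg.cholesky(target_cov)
colored = white @ L.T                               # now has covariance close to target_cov

print(np.round(np.cov(colored, rowvar=False), 2))
```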
This data archive contains datasets developed for the purpose of training and applying random forest models to the Mississippi Embayment Regional Aquifer. The random forest models are designed to predict total stream flow and baseflow as a function of a combination of watershed characteristics and monthly weather data. These datasets are associated with a report (SIR 2022-xxxx) and code contained in a USGS GitLab repository. The GitLab repository (https://code.usgs.gov/map/maprandomforest/) contains much more information about how these data may be used to supply predictions of stream flow and baseflow.
Credit to the original author: the dataset was originally published here
Hands-on teaching of modern machine learning and deep learning techniques heavily relies on the use of well-suited datasets. The "weather prediction dataset" is a novel tabular dataset that was specifically created for teaching machine learning and deep learning to an academic audience. The dataset contains intuitively accessible weather observations from 18 locations in Europe. It was designed to be suitable for a large variety of different training goals, many of which do not easily give way to unrealistically high prediction accuracy. Teachers or instructors can thus choose the difficulty of the training goals and thereby match it with the respective learner audience or lesson objective. The compact size and complexity of the dataset make it possible to quickly train common machine learning and deep learning models on a standard laptop so that they can be used in live hands-on sessions.
The dataset can be found in the `\dataset` folder and be downloaded from zenodo: https://doi.org/10.5281/zenodo.4980359
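A possible quick-start sketch is shown below; the CSV file name and the target column are assumptions for illustration, so adjust them to the files actually contained in the dataset folder.

```python
# Quick-start sketch; file name and target column are assumptions, adjust to the actual files.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset/weather_prediction_dataset.csv")

# Example training goal (assumed): predict one location's temperature from the other columns.
y = df["BASEL_temp_mean"]                                   # hypothetical target column
X = df.drop(columns=["BASEL_temp_mean"]).select_dtypes("number")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))
```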
If you make use of this dataset, in particular if this is in form of an academic contribution, then please cite the following two references:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the "ground truth") for each passenger. Your model will be based on "features" like passengers' gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
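A minimal end-to-end sketch is given below: train a simple model on train.csv and write a submission in the same format as gender_submission.csv. Column names follow the standard Kaggle Titanic files (PassengerId, Survived, Pclass, Sex, SibSp, Parch); adjust the capitalization if this upload differs.

```python
# Minimal modeling sketch: train on train.csv, predict Survived for test.csv, and write a
# submission in the same format as gender_submission.csv.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "SibSp", "Parch"]
X = pd.concat([train[features], train["Sex"].map({"male": 0, "female": 1})], axis=1)
X_test = pd.concat([test[features], test["Sex"].map({"male": 0, "female": 1})], axis=1)

model = LogisticRegression(max_iter=1000).fit(X, train["Survived"])
pd.DataFrame({"PassengerId": test["PassengerId"],
              "Survived": model.predict(X_test)}).to_csv("submission.csv", index=False)
```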
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy, coming from robust statistics and optimization, is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.'s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
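For orientation, the sketch below shows the standard split-conformal baseline that such procedures build on (calibrate a residual quantile, then form prediction intervals); the f-divergence-robust adjustment from the paper is not reproduced here.

```python
# Sketch of the split-conformal baseline (not the paper's f-divergence-robust procedure):
# calibrate a conformity-score quantile on held-out data, then form prediction intervals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)

X_fit, y_fit = X[:300], y[:300]          # model-fitting split
X_cal, y_cal = X[300:], y[300:]          # calibration split (exchangeable with test data)

model = LinearRegression().fit(X_fit, y_fit)
scores = np.abs(y_cal - model.predict(X_cal))                     # conformity scores
alpha = 0.1
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

x_new = rng.normal(size=(1, 3))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```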
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Machine learning (ML) algorithms, as an early branch of artificial intelligence technology, can effectively simulate human behavior by training on data from the training set. Machine learning algorithms were used in this study to predict patient choice tendencies in medical decision-making. The goal was to help physicians understand patient preferences and to serve as a resource for the development of decision-making schemes in clinical treatment. As a result, physicians and patients can have better conversations at lower expense, leading to better medical decisions.

Method: Patient medical decision-making tendencies were predicted from primary survey data obtained from 248 participants at third-level grade-A hospitals in China. Specifically, 12 predictor variables were set according to the literature review, and four types of outcome variables were set based on the optimization principle of clinical diagnosis and treatment, that is, the patient's medical decision-making tendency, classified as treatment effect, treatment cost, treatment side effect, and treatment experience. In conjunction with the study's data characteristics, three ML classification algorithms, decision tree (DT), k-nearest neighbor (KNN), and support vector machine (SVM), were used to predict patients' medical decision-making tendency, and the performance of the three algorithms was compared.

Results: The accuracy of the DT algorithm for predicting patients' choice tendency in medical decision making is 80% for treatment effect, 60% for treatment cost, 56% for treatment side effects, and 60% for treatment experience, followed by the KNN algorithm at 78%, 66%, 74%, and 84%, and the SVM algorithm at 82%, 76%, 80%, and 94%. At the same time, the comprehensive evaluation index F1-scores of the DT algorithm are 0.80, 0.61, 0.58, and 0.60, those of the KNN algorithm are 0.75, 0.65, 0.71, and 0.84, and those of the SVM algorithm are 0.81, 0.74, 0.73, and 0.94.

Conclusion: Among the three ML classification algorithms, SVM has the highest accuracy and the best performance. Therefore, the prediction results have certain reference value and guiding significance for physicians formulating clinical treatment plans. The research results help promote the development and application of a patient-centered medical decision assistance system, resolve conflicts of interest between physicians and patients, and assist them in scientific decision-making.
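The comparison of the three classifiers follows a standard scikit-learn pattern; the sketch below illustrates it on synthetic stand-in data with the study's dimensions (248 samples, 12 predictors), not on the actual survey data.

```python
# Conceptual sketch (synthetic stand-in data, not the study's survey) comparing the three classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=248, n_features=12, random_state=0)  # 12 predictors, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("DT", DecisionTreeClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier()),
                  ("SVM", SVC())]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="weighted"))
```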
Microsoft Excel
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicting the execution time of model transformations can help to understand how a transformation reacts to a given input model without creating and transforming the respective model.
In our previous data set (https://doi.org/10.5281/zenodo.8385957), we documented our experiments in which we predict the performance of ATL transformations using predictive models obtained by training linear regression, random forest, and support vector regression. As input for the prediction, our approach uses a characterization of the input model. In these experiments, we only used data from real models.

However, a common problem is that transformation developers do not have enough models available to use such a prediction approach. Therefore, in a new variant of our experiments, we investigated whether the three considered machine learning approaches can predict the performance of transformations if we use data from generated models for training. We also investigated whether it is possible to achieve good predictions with smaller training data sets. The dataset provided here offers the corresponding raw data, scripts, and results.
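The general prediction setup can be sketched as follows; the feature columns and runtimes here are random stand-ins under assumed names, not the raw data or scripts provided in this dataset.

```python
# Conceptual sketch (not the authors' setup): predict transformation execution time from
# input-model characterization features using the three learners mentioned above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(1, 1000, size=(200, 5)).astype(float)   # e.g. element counts per metamodel type (assumed)
y = X @ np.array([0.8, 0.1, 0.05, 0.3, 0.0]) + rng.normal(scale=5, size=200)  # stand-in runtimes

for name, reg in [("linear regression", LinearRegression()),
                  ("random forest", RandomForestRegressor(random_state=0)),
                  ("SVR", SVR())]:
    print(name, cross_val_score(reg, X, y, cv=5, scoring="r2").mean())
```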
Detailed documentation is available in documentaion.pdf.
This dataset contains the predicted prices of the asset "Not in Employment, Education, or Training" over the next 16 years. The prices are initially calculated using a default 5 percent annual growth rate; after page load, a sliding-scale component lets the user further adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
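The projection arithmetic is simple compound growth; a minimal sketch (the starting value of 100 is only an illustration):

```python
# Minimal sketch of the projection arithmetic: value after t years at annual growth rate r.
def project(initial, rate, years):
    return [initial * (1 + rate) ** t for t in range(years + 1)]

print(project(100.0, 0.05, 16))    # default 5% growth over the 16-year horizon
```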
Machine learning (ML) is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed to do so. ML is closely related to (and often overlaps with) computational statistics, which also focuses on making predictions through the use of computers. In general, ML explores algorithms that can learn from current data and make predictions on new data by building a model from sample inputs. The fields of statistics and ML have a common root and will continue to come closer together in the future. In this paper we explore the novel deep learning (DL) methodology in the context of genomic selection. DL models with densely connected network architectures were compared with one of the most often used genome-enabled prediction models, genomic best linear unbiased prediction (GBLUP). We used nine published real genomic data sets to compare the models and obtain a "meta picture" of the performance of DL models with a densely connected network architecture.