Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
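Because the forecasting horizon is six weeks, one reasonable way to check a model locally is to hold out the last six weeks of the training data. The sketch below illustrates that idea on a toy frame; the helper name split_last_six_weeks and the column name "Date" are assumptions made for illustration, not part of the published project.

```python
import pandas as pd

def split_last_six_weeks(df, date_col="Date", weeks=6):
    """Split df into (train, holdout); holdout is the final `weeks` weeks."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(weeks=weeks)
    return df[dates <= cutoff], df[dates > cutoff]

# Toy frame standing in for the real training data: one store, 100 days.
toy = pd.DataFrame({
    "Store": 1,
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})

train_part, holdout_part = split_last_six_weeks(toy)
print(len(train_part), len(holdout_part))
```

Splitting by date rather than at random keeps future observations out of the training portion, which mirrors how the forecasts would be used in practice.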
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is organized into features such as store type, promotion details, whether the store was open or closed, and holiday indicators. The dataset was originally created for the Kaggle Rossmann Store Sales competition and remains publicly available on Kaggle. Here the data is made accessible via the DBRepo API for further analysis and modeling, and it is structured so that machine learning models can predict future sales from these input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as features that could influence sales (e.g., promotions and holidays).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions with the trained machine learning models in the setting where the true sales are unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, such as its type, assortment level, nearest competitor, and Promo2 participation. This data is essential for understanding the context in which the sales data was gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
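Since the store metadata lives in a separate file, it usually has to be joined onto the daily sales rows before modeling. The following is a minimal sketch of such a join with pandas, assuming both tables share the Store key; the rows are made-up stand-ins, not values from the dataset.

```python
import pandas as pd

# Tiny stand-ins for the Train and Store tables (values are made up).
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2015-07-30", "2015-07-31", "2015-07-31"],
    "Sales": [5020, 5263, 6064],
    "Open": [1, 1, 1],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "Assortment": ["a", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# A left join keeps every sales row; validate raises an error if the
# store table unexpectedly contains duplicate Store ids.
merged = train.merge(store, on="Store", how="left", validate="many_to_one")
print(merged[["Store", "Date", "Sales", "StoreType"]])
```

The same join can be applied to the test table so that both tables end up with identical feature columns.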
Data Fields:
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Lists the months in which a new round of Promo2 starts, e.g., "Feb,May,Aug,Nov" means a round starts in February, May, August, and November of each year for that store.
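The coded fields above typically need decoding before they can be used as features. The sketch below shows one possible way to turn PromoInterval into a per-month flag and to map StateHoliday codes to labels. The function name promo2_round_starts is hypothetical, and matching on the first three letters of each month name is a precaution so that spelling variants such as "Sept" are tolerated.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Labels taken from the StateHoliday description above.
STATE_HOLIDAY = {
    "a": "public holiday",
    "b": "Easter holiday",
    "c": "Christmas",
    "0": "no holiday",
}

def promo2_round_starts(promo2, promo_interval, month):
    """Return True if a new Promo2 round starts in `month` (1-12)."""
    # Stores that do not participate, or have no interval, never start a round.
    if promo2 != 1 or not isinstance(promo_interval, str) or not promo_interval:
        return False
    starts = {name.strip()[:3].lower() for name in promo_interval.split(",")}
    return MONTHS[month - 1].lower() in starts

print(promo2_round_starts(1, "Feb,May,Aug,Nov", 5))
print(promo2_round_starts(1, "Feb,May,Aug,Nov", 6))
print(STATE_HOLIDAY[str(0)])
```

Converting the holiday code with str() before the lookup handles files in which the "no holiday" value is read as the number 0 rather than the string '0'.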
To work with this dataset, you will need the following access and software:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include:
pandas for data manipulation,
numpy for numerical operations,
matplotlib and seaborn for data visualization,
scikit-learn for machine learning algorithms.
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
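For reference, the two error metrics named above can be computed as in the following sketch. It uses only the Python standard library and is not the notebook's own implementation. Skipping rows with zero actual sales in MAPE (for example, days on which a store is closed) is an assumption made here to avoid division by zero.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent, ignoring zero actuals."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

# Made-up values; the third day has zero sales and is skipped by mape().
actual = [100.0, 200.0, 0.0, 400.0]
predicted = [110.0, 180.0, 0.0, 400.0]

print(round(rmse(actual, predicted), 2))  # 11.18
print(round(mape(actual, predicted), 2))  # 6.67
```

RMSE is expressed in the same units as Sales and penalizes large misses more heavily, while MAPE is relative to the actual value, so the two can rank models differently.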
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest and Linear Regression), which can be loaded and used to make predictions without retraining from scratch.
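Loading a pickled scikit-learn model generally follows the pattern sketched below. To stay self-contained, the example fits a tiny stand-in LinearRegression and round-trips it through a temporary file. The file name example_model.pkl is hypothetical, and the sketch assumes the project's models were saved with Python's pickle module.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny stand-in model (y = 2x + 1) so the example runs on its own.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_model.pkl"  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    # Loading restores the fitted estimator without retraining.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

prediction = restored.predict(np.array([[4.0]]))
print(prediction)
```

Pickle files can execute arbitrary code when loaded, so only open .pkl files from a trusted source, and ideally use the same scikit-learn version that was used to save them.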
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
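A submission file of this kind can be produced along the lines of the sketch below. It assumes the file has exactly two columns, Id and Sales, matching the field descriptions in this document. The helper name build_submission is hypothetical, and clipping negative predictions to zero is an added safeguard rather than a documented step of the project.

```python
import csv
import io

def build_submission(ids, predictions):
    """Return CSV text with a header and one 'Id,Sales' row per test record."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "Sales"])
    for row_id, pred in zip(ids, predictions):
        # Sales cannot be negative, so clip at zero before writing.
        writer.writerow([row_id, round(max(pred, 0.0), 2)])
    return buffer.getvalue()

# Made-up ids and predictions for illustration.
csv_text = build_submission([1, 2, 3], [4520.7, -3.2, 6100.0])
print(csv_text)
```

Writing the returned text to a file named sample_submission.csv would give a file in the assumed two-column layout.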
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
Prolonged stress and high mental workload can have detrimental long-term effects and contribute to several stress-related diseases. Existing stress detection techniques are often uni-modal and limited to controlled setups. A single sensing modality can be unobtrusive but often yields unreliable sensor readings, especially in uncontrolled environments. Our study recorded multi-modal physiological signals from twenty-five participants in controlled and uncontrolled environments while they performed given and self-chosen tasks of high and low mental demand. In this version, we processed and published a subset of the dataset from six participants while working on the rest. The subset is used to check the feasibility of our study by engineering features from electroencephalography (EEG), photoplethysmography (PPG), electrodermal activity (EDA), and temperature sensor data. Machine learning methods were used for the binary classification of the tasks. Personalized models in the uncontrolled environment achieved a mean classification accuracy of up to 83% while using one of the four labels, unveiling some unintentional mislabeling by participants. In controlled environments, multi-modality improved the accuracy by at least 7%. Generalized machine learning models achieved close to chance-level performance. This work underlines the importance of multi-modal recordings and provides the research community with an experimental paradigm to take studies of mental workload and stress out of controlled and into uncontrolled environments.