taken from this Kaggle competition:
Dataset Description
In this competition, you will predict sales for the thousands of product families sold at Favorita stores located in Ecuador. The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. Additional files include supplementary information that may be useful in building your models.
File Descriptions and Data Field Information
train.csv… See the full description on the dataset page: https://huggingface.co/datasets/t4tiana/store-sales-time-series-forecasting.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains hourly sensor data collected over a period of time. The primary objective is to forecast future sensor values using various time series forecasting methods, such as SARIMA, Prophet, and machine learning models. The dataset includes an ID column, a Datetime column and a Count column, where the Count represents the sensor reading at each timestamp.
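Before fitting SARIMA or Prophet, a seasonal-naive baseline is a useful yardstick for such a series. A minimal sketch in pandas, assuming the Datetime and Count columns from the description (the synthetic series below stands in for the real file):

```python
import numpy as np
import pandas as pd

def seasonal_naive_forecast(counts: pd.Series, season: int = 24,
                            horizon: int = 24) -> pd.Series:
    """Forecast the next `horizon` points by repeating the last full season."""
    last_season = counts.iloc[-season:].to_numpy()
    reps = np.tile(last_season, int(np.ceil(horizon / season)))[:horizon]
    idx = pd.date_range(counts.index[-1], periods=horizon + 1, freq="60min")[1:]
    return pd.Series(reps, index=idx, name="Count")

if __name__ == "__main__":
    # Synthetic stand-in for the dataset's (ID, Datetime, Count) rows
    rng = pd.date_range("2014-01-01", periods=24 * 7, freq="60min")
    counts = pd.Series(np.arange(len(rng)) % 24, index=rng, name="Count")
    print(seasonal_naive_forecast(counts, horizon=48).head(3))
```

Any SARIMA, Prophet, or machine learning model fitted to the series should beat this baseline before its forecasts are trusted.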
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Time Series PILE
The Time-series Pile is a large collection of publicly available data from diverse domains, ranging from healthcare to engineering and finance. It comprises more than 5 public time-series databases, drawn from several diverse domains, for time series foundation model pre-training and evaluation.
Time Series PILE Description
We compiled a large collection of publicly available datasets from diverse domains into the Time Series Pile. It has 13 unique domains of data… See the full description on the dataset page: https://huggingface.co/datasets/AutonLab/Timeseries-PILE.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Weather is recorded every 10 minutes throughout the entire year of 2020, comprising 20 meteorological indicators measured at a Max Planck Institute weather station. The dataset provides comprehensive atmospheric measurements including air temperature, humidity, wind patterns, radiation, and precipitation. With over 52,560 data points per variable (365 days × 24 hours × 6 measurements per hour), this high-frequency sampling offers detailed insights into weather patterns and atmospheric conditions. The measurements include both basic weather parameters and derived quantities such as vapor pressure deficit and potential temperature, making it suitable for both meteorological research and practical applications. You can find some initial analysis using this dataset here: "Weather Long-term Time Series Forecasting Analysis".
The dataset is provided in a CSV format with the following columns:
Column Name | Description |
---|---|
date | Date and time of the observation. |
p | Atmospheric pressure in millibars (mbar). |
T | Air temperature in degrees Celsius (°C). |
Tpot | Potential temperature in Kelvin (K), representing the temperature an air parcel would have if moved to a standard pressure level. |
Tdew | Dew point temperature in degrees Celsius (°C), indicating the temperature at which air becomes saturated with moisture. |
rh | Relative humidity as a percentage (%), showing the amount of moisture in the air relative to the maximum it can hold at that temperature. |
VPmax | Maximum vapor pressure in millibars (mbar), representing the maximum pressure exerted by water vapor at the given temperature. |
VPact | Actual vapor pressure in millibars (mbar), indicating the current water vapor pressure in the air. |
VPdef | Vapor pressure deficit in millibars (mbar), measuring the difference between maximum and actual vapor pressure, used to gauge drying potential. |
sh | Specific humidity in grams per kilogram (g/kg), showing the mass of water vapor per kilogram of air. |
H2OC | Concentration of water vapor in millimoles per mole (mmol/mol) of dry air. |
rho | Air density in grams per cubic meter (g/m³), reflecting the mass of air per unit volume. |
wv | Wind speed in meters per second (m/s), measuring the horizontal motion of air. |
max. wv | Maximum wind speed in meters per second (m/s), indicating the highest recorded wind speed over the period. |
wd | Wind direction in degrees (°), representing the direction from which the wind is blowing. |
rain | Total rainfall in millimeters (mm), showing the amount of precipitation over the observation period. |
raining | Duration of rainfall in seconds (s), recording the time for which rain occurred during the observation period. |
SWDR | Short-wave downward radiation in watts per square meter (W/m²), measuring incoming solar radiation. |
PAR | Photosynthetically active radiation in micromoles per square meter per second (µmol/m²/s), indicating the amount of light available for photosynthesis. |
max. PAR | Maximum photosynthetically active radiation recorded in the observation period in µmol/m²/s. |
Tlog | Temperature logged in degrees Celsius (°C), potentially from a secondary sensor or logger. |
OT | Likely refers to an "operational timestamp" or an offset in time, but may need clarification depending on the dataset's context. |
This high-resolution meteorological dataset enables applications across multiple domains. For weather forecasting, the frequent measurements support development of prediction models, while climate researchers can study microclimate variations and seasonal patterns. In agriculture, temperature and vapor pressure deficit data aids crop modeling and irrigation planning. The wind and radiation measurements benefit renewable energy planning, while the comprehensive atmospheric data supports environmental monitoring. The dataset's detailed nature makes it particularly suitable for machine learning applications and educational purposes in meteorology and data science.
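For many of these applications, the first preprocessing step is downsampling the 10-minute records to a coarser grid. A minimal pandas sketch, assuming the `date`, `T`, and `rh` column names from the table above (the synthetic frame stands in for the real CSV):

```python
import numpy as np
import pandas as pd

def to_hourly(df: pd.DataFrame) -> pd.DataFrame:
    """Average the 10-minute records into hourly means on a datetime index."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    return out.set_index("date").resample("60min").mean()

if __name__ == "__main__":
    # Synthetic stand-in for one day of 10-minute weather records
    idx = pd.date_range("2020-01-01", periods=6 * 24, freq="10min")
    df = pd.DataFrame({"date": idx,
                       "T": np.linspace(-5.0, 5.0, len(idx)),
                       "rh": np.full(len(idx), 80.0)})
    print(to_hourly(df).shape)  # 144 ten-minute rows become 24 hourly rows
```

Mean aggregation suits state variables such as temperature and humidity; accumulating variables such as `rain` would instead be summed.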
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is rich in network anomalies, containing all common anomaly types, which ensures a comprehensive evaluation of anomaly detection methods.
Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series at the institutional and IP-subnet levels to cover all possible anomaly detection or forecasting scopes. Overall, the time series dataset was created from 66 billion IP flows containing 4 trillion packets that carry approximately 3.7 petabytes of data. CESNET-TimeSeries24 is a complex real-world dataset that brings much-needed insight into the evaluation of forecasting models in real-world environments.
Please cite the usage of our dataset as:
Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x
@Article{cesnettimeseries24,
author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
journal={Scientific Data},
year={2025},
month={Feb},
day={26},
volume={12},
number={1},
pages={338},
issn={2052-4463},
doi={10.1038/s41597-025-04603-x},
url={https://doi.org/10.1038/s41597-025-04603-x}
}
We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.
Datapoints created by the aggregation of IP flows contain the following time-series metrics:
Multiple time aggregation: The original datapoints in the dataset are aggregated over 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets: 1 hour and 1 day.
Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series aggregated per each institution ID provide a view of the institution's data.
Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per subnet, provide a view of each institution subnet's data.
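The re-aggregation into 1-hour and 1-day intervals can be sketched with pandas: additive metrics are simply summed into coarser windows. The metric names below (`n_flows`, `n_bytes`) are illustrative assumptions; note that unique-count metrics such as n_dest_ip cannot be re-aggregated by summation, which is presumably why the coarser files replace them (see below).

```python
import pandas as pd

def reaggregate(ts: pd.DataFrame, interval: str) -> pd.DataFrame:
    """Sum additive traffic metrics into a coarser interval ("60min" or "1D")."""
    return ts.resample(interval).sum()

if __name__ == "__main__":
    # One day of synthetic 10-minute datapoints for a single IP address
    idx = pd.date_range("2023-10-09", periods=144, freq="10min")
    ts = pd.DataFrame({"n_flows": [1] * 144, "n_bytes": [100] * 144}, index=idx)
    print(reaggregate(ts, "60min")["n_flows"].iloc[0],
          reaggregate(ts, "1D")["n_bytes"].iloc[0])  # 6 flows/hour, 14400 bytes/day
```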
The file hierarchy is described below:
cesnet-timeseries24/
|- institution_subnets/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- institutions/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_full/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_sample/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- times/
| |- times_10_minutes.csv
| |- times_1_hour.csv
| |- times_1_day.csv
|- ids_relationship.csv
|- weekends_and_holidays.csv
The following list describes time series data fields in CSV files:
Moreover, the time series created by re-aggregation contains following time series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Monash Time Series Forecasting Repository which contains 30+ datasets of related time series for global forecasting research. This repository includes both real-world and competition time series datasets covering varied domains.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of COVID-19 time series data for India since 24 March 2020. The data set covers all the States and Union Territories of India and is divided into five parts: i) confirmed cases; ii) death count; iii) recovered cases; iv) temperature of the region; and v) percentage humidity in the region. The data set also provides basic details of confirmed cases and death counts for all countries of the world, updated daily since 30 January 2020. The end user can contact the corresponding author (Rohit Salgotra: nicresearchgroup@gmail.com) for more details. [Dataset is updated twice a week.]
The Authors can Refer to and CITE our latest Papers on COVID: 1. Rohit Salgotra, Mostafa Gandomi, Amir H Gandomi. "Evolutionary Modelling of the COVID-19 Pandemic in Fifteen Most Affected Countries" Chaos, Solitons & Fractals: (2020). https://doi.org/10.1016/j.chaos.2020.110118 2. Rohit Salgotra, Mostafa Gandomi, Amir H Gandomi. "Time Series Analysis and Forecast of the COVID-19 Pandemic in India using Genetic Programming" Chaos, Solitons & Fractals: (2020). https://doi.org/10.1016/j.chaos.2020.109945
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The described database was created using data obtained from the California Independent System Operator (CAISO) and the National Renewable Energy Laboratory (NREL). All data was collected at five-minute intervals, and subsequently cleaned and modified to create a database comprising three time series: solar energy production, wind energy production, and electricity demand. The database contains 12 columns: date, station (most likely "season": 1: Winter, 2: Spring, 3: Summer, 4: Autumn), day of the week (0: Monday, ..., 6: Sunday), DHI (W/m²), DNI (W/m²), GHI (W/m²), wind speed (m/s), humidity (%), temperature (degrees), solar energy production (MW), wind energy production (MW), and electricity demand (MW).
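A typical first look at this database groups demand by the seasonal code. A minimal pandas sketch; the exact column headers (`station`, `demand_mw`) are assumptions paraphrased from the description:

```python
import pandas as pd

def mean_demand_by_station(df: pd.DataFrame) -> pd.Series:
    """Average electricity demand per seasonal code (1: Winter ... 4: Autumn)."""
    return df.groupby("station")["demand_mw"].mean()

if __name__ == "__main__":
    # Tiny synthetic stand-in for the five-minute records
    df = pd.DataFrame({"station": [1, 1, 3, 3],
                       "demand_mw": [22000.0, 24000.0, 30000.0, 32000.0]})
    print(mean_demand_by_station(df))  # Winter mean 23000, Summer mean 31000
```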
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Dynamical System Multivariate Time Series (DSMTS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system under fully nominal conditions (no outliers or anomalies).
The DSMTS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Multivariate Time Series Forecasting especially for industrial processes of complex systems:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LOTSA Data
The Large-scale Open Time Series Archive (LOTSA) is a collection of open time series datasets for time series forecasting. It was collected for the purpose of pre-training Large Time Series Models. See the paper and codebase for more information.
Citation
If you're using LOTSA data in your research or applications, please cite it using this BibTeX: BibTeX: @article{woo2024unified, title={Unified Training of Universal Time Series Forecasting Transformers}… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/lotsa_data.
All datasets contain univariate time series, and they are available in a new format that we name .tsf, inspired by the sktime .ts format.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📶 Beam-Level (5G) Time-Series Dataset
This dataset introduces a novel multivariate time series specifically curated to support research in enabling accurate prediction of KPIs across communication networks, as illustrated below:
Precise forecasting of network traffic is critical for optimizing network management and enhancing resource allocation efficiency. This task is of both practical and theoretical importance to researchers in networking and machine learning, offering a… See the full description on the dataset page: https://huggingface.co/datasets/netop/Beam-Level-Traffic-Timeseries-Dataset.
This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on time series analysis.
Introduction
Time series are a special class of dataset, where a response variable is tracked over time. The frequency of measurement and the timespan of the dataset can vary widely. At its most simple, a time series model includes an explanatory time component and a response variable. Mixed models can include additional explanatory variables (check out the nlme and lme4 R packages). We will be covering a few simple applications of time series analysis in these lessons.
Opportunities
Analysis of time series presents several opportunities. In aquatic sciences, some of the most common questions we can answer with time series modeling are:
Can we forecast conditions in the future?
Challenges
Time series datasets come with several caveats, which need to be addressed in order to effectively model the system. A few common challenges that arise (and can occur together within a single dataset) are:
Autocorrelation: Data points are not independent from one another (i.e., the measurement at a given time point is dependent on previous time point(s)).
Data gaps: Data are not collected at regular intervals, necessitating interpolation between measurements. There are often gaps between monitoring periods. For many time series analyses, we need equally spaced points.
Seasonality: Cyclic patterns in variables occur at regular intervals, impeding clear interpretation of a monotonic (unidirectional) trend. For example, summer temperatures can be assumed to be higher than winter temperatures.
Heteroscedasticity: The variance of the time series is not constant over time.
Covariance: The covariance of the time series is not constant over time. Many time series models assume that both the variance and the covariance remain constant over time (see heteroscedasticity above).
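Two of these challenges, data gaps and autocorrelation, are easy to probe programmatically. A sketch in Python on synthetic data (the lesson itself uses R, so this is only an illustration of the concepts): interpolation restores the equally spaced points many analyses require, and the lag-1 correlation quantifies dependence on the previous time point.

```python
import numpy as np
import pandas as pd

def lag1_autocorr(x: pd.Series) -> float:
    """Lag-1 autocorrelation: how strongly each point depends on its predecessor."""
    return x.autocorr(lag=1)

if __name__ == "__main__":
    # Synthetic daily series with a seasonal-looking cycle
    idx = pd.date_range("2019-01-01", periods=100, freq="1D")
    y = pd.Series(np.sin(np.arange(100) / 5.0), index=idx)
    y.iloc[10:15] = np.nan                    # a gap between monitoring periods
    y_filled = y.interpolate(method="time")   # equally spaced points restored
    print(y_filled.isna().sum(), lag1_autocorr(y_filled) > 0.9)
```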
Learning Objectives
After successfully completing this notebook, you will be able to:
Choose appropriate time series analyses for trend detection and forecasting
Discuss the influence of seasonality on time series analysis
Interpret and communicate results of time series analyses
https://www.gnu.org/licenses/gpl-3.0.html
Supplementary material for the paper entitled "One-step ahead forecasting of geophysical processes within a purely statistical framework". Abstract: The simplest way to forecast geophysical processes, an engineering problem with a widely recognised challenging character, is the so-called "univariate time series forecasting" that can be implemented using stochastic or machine learning regression models within a purely statistical framework. Regression models are in general fast-implemented, in contrast to the computationally intensive Global Circulation Models, which constitute the most frequently used alternative for precipitation and temperature forecasting. For their simplicity and easy applicability, the former have been proposed as benchmarks for the latter by forecasting scientists. Herein, we assess the one-step ahead forecasting performance of 20 univariate time series forecasting methods, when applied to a large number of geophysical and simulated time series of 91 values. We use two real-world annual datasets, a dataset composed of 112 time series of precipitation and another composed of 185 time series of temperature, as well as their respective standardized datasets, to conduct several real-world experiments. We further conduct large-scale experiments using 12 simulated datasets. These datasets contain 24 000 time series in total, which are simulated using stochastic models from the families of Autoregressive Moving Average and Autoregressive Fractionally Integrated Moving Average. We use the first 50, 60, 70, 80 and 90 data points for model-fitting and model-validation, and make predictions corresponding to the 51st, 61st, 71st, 81st and 91st respectively. The total number of forecasts produced herein is 2 177 520, among which 47 520 are obtained using the real-world datasets. The assessment is based on eight error metrics and accuracy statistics.
The simulation experiments reveal the most and least accurate methods for long-term forecasting applications, also suggesting that the simple methods may be competitive in specific cases. Regarding the results of the real-world experiments using the original (standardized) time series, the minimum and maximum medians of the absolute errors are found to be 68 mm (0.55) and 189 mm (1.42) respectively for precipitation, and 0.23 °C (0.33) and 1.10 °C (1.46) respectively for temperature. Since there is an absence of relevant information in the literature, the numerical results obtained using the standardized real-world datasets could be used as rough benchmarks for the one-step ahead predictability of annual precipitation and temperature.
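The evaluation scheme (fit on the first n points, predict point n+1, for n in {50, 60, 70, 80, 90}) can be sketched with the naive last-value forecast, one of the simple benchmark methods of the kind the abstract says can be competitive. This is an illustration of the protocol, not the paper's code:

```python
import numpy as np

def one_step_naive(series: np.ndarray, fit_lengths=(50, 60, 70, 80, 90)):
    """For each fit length n, forecast point n+1 with the last observed value."""
    # series is 0-indexed: the (n+1)-th value sits at index n
    return {n: (series[n - 1], series[n]) for n in fit_lengths}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = rng.normal(size=91)          # a simulated series of 91 values
    errors = [abs(f - a) for f, a in one_step_naive(series).values()]
    print(float(np.median(errors)))       # median absolute error of the naive method
```

The paper's assessment replaces the naive forecaster with 20 stochastic and machine learning methods and repeats this over thousands of series.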
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The sp500stock_data_description.csv file provides detailed information on the existence of four modalities (text, image, time series, and table) for 4,213 S&P 500 stocks. The hs300stock_data_description.csv file provides detailed information on the existence of four modalities (text, image, time series, and table) for 858 HS 300 stocks.
If you find our research helpful, please cite our paper:
@article{xu2025finmultitime, title={FinMultiTime: A Four-Modal Bilingual Dataset for… See the full description on the dataset page: https://huggingface.co/datasets/Wenyan0110/Multimodal-Dataset-Image_Text_Table_TimeSeries-for-Financial-Time-Series-Forecasting.
https://creativecommons.org/publicdomain/zero/1.0/
Alcohol_Sales.csv: This dataset was taken from https://fred.stlouisfed.org/series/S4248SM144NCEN (old URL: https://fred.stlouisfed.org/series/.)
energydata_complete.csv: Experimental data used to create regression models of appliances energy use in a low-energy building. Data Set Information: The data set is sampled at 10-minute intervals for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions roughly every 3.3 minutes; the wireless data was then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and to filter out non-predictive attributes (parameters). The original source of the dataset: http://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction
Because of the sheer number of products available, the German book market is one of the largest businesses trading today. In order to present a highly individual profile to customers while keeping the effort involved in selecting and ordering as low as possible, the key to success for a bookshop lies in effective purchasing from a choice of roughly 96,000 new titles each year. The challenge for the bookseller is to buy the right amount of the right books at the right time.
It is with this in mind that this year's DATA MINING CUP Competition will be held in cooperation with Libri, Germany's leading book wholesaler. Among Libri's many successful support measures for booksellers, purchase recommendations give the bookshop a competitive advantage. Accordingly, the DATA MINING CUP 2009 challenge will be to forecast purchase quantities of a clearly defined title portfolio per location, using simulated data.
The task of the DATA MINING CUP Competition 2009 is to forecast purchase quantities for 8 titles for 2,418 different locations. In order to create the model, simulated purchase data from an additional 2,394 locations will be supplied. All data refers to a fixed period of time. The object is to forecast the purchase quantities of these 8 different titles for the 2,418 locations as exactly as possible.
There are two text files available to assist in solving the problem: dmc2009_train.txt (train data file) and dmc2009_forecast.txt (data of 2,418 locations for whom a prediction is to be made).
This data is publicly available on the data mining competition website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The research conducted using this univariate data set is on time series decomposition and a review of how to implement four decomposition methods, namely: classical decomposition, X11, Signal Extraction in ARIMA Time Series (SEATS), and Seasonal-Trend decomposition based on Loess (STL). Following decomposition, forecasting with decomposition is implemented on the monthly electricity available for distribution to South Africa by Eskom time series data set. RStudio was used for the research. Other data sets, including some built into R, were used in the second section of the work to illustrate the components of a time series and moving averages. The monthly Eskom electricity series was then used for the third and fourth sections of the research: to implement the time series decomposition methods, analyze the random component of each method, forecast with decomposition, and compute the forecast accuracy of four different forecasting methods.
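Of the four methods, classical decomposition is simple enough to sketch directly: a 2×12 centred moving average estimates the trend of a monthly series, monthly means of the detrended series estimate the seasonal component, and the rest is the remainder. A simplified sketch on synthetic data, not the Eskom series (X11, SEATS, and STL need dedicated packages):

```python
import numpy as np
import pandas as pd

def classical_decompose(y: pd.Series, period: int = 12):
    """Additive classical decomposition with a 2xMA centred moving average."""
    w = np.r_[0.5, np.ones(period - 1), 0.5] / period          # 2x12 MA weights
    trend = y.rolling(period + 1, center=True).apply(
        lambda v: float(np.dot(v, w)), raw=True)
    detrended = y - trend
    seasonal = detrended.groupby(y.index.month).transform("mean")
    remainder = y - trend - seasonal
    return trend, seasonal, remainder

if __name__ == "__main__":
    # Synthetic monthly series: pure linear trend, so seasonal/remainder ~ 0
    idx = pd.to_datetime([f"{2010 + i // 12}-{i % 12 + 1:02d}-01" for i in range(48)])
    y = pd.Series(np.arange(48, dtype=float), index=idx)
    trend, seasonal, remainder = classical_decompose(y)
    print(float(np.nanmax(np.abs(remainder.to_numpy()))))
```

Forecasting with decomposition then forecasts the seasonally adjusted series (trend + remainder) and re-adds the seasonal component.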
https://creativecommons.org/publicdomain/zero/1.0/
🗂 Dataset Description Title: Custom Sales Forecasting Dataset
This dataset contains a synthetic yet realistic representation of product sales across multiple stores and time periods. It is designed for use in time series forecasting, retail analytics, or machine learning experiments focusing on demand prediction and inventory planning. Each row corresponds to daily sales data for a given product at a particular store, enriched with contextual information like promotions and holidays.
This dataset is ideal for:
Building and testing time series models (ARIMA, Prophet, LSTM, etc.)
Forecasting product demand
Evaluating store-level sales trends
Training machine learning models with tabular time series data
Column Name | Description |
---|---|
order_id | Unique identifier for the order placed by a customer. |
customer_id | Unique identifier for the customer making the purchase. |
order_date | Date on which the order was placed (YYYY-MM-DD). |
product_category | Category of the product purchased (e.g., Sports, Home, Beauty). |
product_price | Original price of a single unit of the product (before discount). |
quantity | Number of units of the product ordered. |
payment_method | Method used for payment (e.g., PayPal, Cash on Delivery). |
delivery_status | Current delivery status of the order (e.g., Delivered, Pending). |
city | City to which the order was delivered. |
state | U.S. state where the customer is located. |
zipcode | Postal code of the delivery location. |
product_id | Unique identifier for the purchased product. |
discount_applied | Fractional discount applied to the order (e.g., 0.20 for 20% off). |
order_value | Total value of the order after discount (product_price * quantity * (1 - discount_applied)). |
review_rating | Customer’s review rating of the order on a 1–5 scale. |
return_requested | Boolean value indicating if the customer requested a return (True/False). |
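For the forecasting use cases listed above, the order-level rows are usually collapsed into a daily per-product series first. A minimal pandas sketch using the column names from the table; the choice to sum `quantity` and `order_value` per day is an assumption about the desired targets:

```python
import pandas as pd

def daily_product_sales(orders: pd.DataFrame) -> pd.DataFrame:
    """Collapse order-level rows into one row per product per day."""
    out = orders.copy()
    out["order_date"] = pd.to_datetime(out["order_date"])
    return (out.groupby(["product_id", "order_date"], as_index=False)
               .agg(units=("quantity", "sum"), revenue=("order_value", "sum")))

if __name__ == "__main__":
    # Tiny synthetic stand-in for the order table
    orders = pd.DataFrame({
        "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "product_id": ["P1", "P1", "P1"],
        "quantity": [2, 3, 1],
        "order_value": [20.0, 30.0, 9.0],
    })
    print(daily_product_sales(orders))
```

The resulting frame can be reindexed onto a complete calendar (filling missing days with zero sales) before feeding it to ARIMA, Prophet, or an LSTM.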
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Improving the accuracy of predictions of future values based on past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken, namely increasing the number of inputs in the dataset. This approach is useful especially for shorter time series data. By filling in the in-between values in the time series, the size of the training set can be increased, thus improving the generalization capability of the predictor. The algorithm used to make predictions is a neural network, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The dataset used in the experiment is the frequency of USPTO patents and PubMed scientific publications in the field of health, namely on apnea, arrhythmia, and sleep stages. Another time series dataset, designated for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in in-between data in the time series. Furthermore, detrending and deseasonalization, which separate the data into trend, seasonal, and stationary components, also improve prediction performance on both the original and the filled datasets. The optimal amount of enlargement in this experiment is about five times the length of the original dataset.
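The filling step can be sketched with linear interpolation (the paper's exact filling scheme may differ): with k = 4 inserted points per gap, the series grows to roughly five times its original length, matching the optimum reported above.

```python
import numpy as np

def fill_in_between(y: np.ndarray, k: int = 4) -> np.ndarray:
    """Insert k linearly interpolated values between each pair of observations."""
    x_old = np.arange(len(y))
    x_new = np.linspace(0, len(y) - 1, (len(y) - 1) * (k + 1) + 1)
    return np.interp(x_new, x_old, y)

if __name__ == "__main__":
    y = np.array([1.0, 3.0, 2.0])
    filled = fill_in_between(y, k=4)
    print(len(filled))  # original values survive at every 5th position
```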