The UCR Time Series Archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 to 85 data sets. This paper introduces and will focus on the new data expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a much simpler modification, requiring just a single line of code.
Five datasets used in NeurTraL-AD paper: \textit{RacketSports (RS).} Accelerometer and gyroscope recording of players playing four different racket sports. Each sport is designated as a different class. \textit{Epilepsy (EPSY).} Accelerometer recording of healthy actors simulating four different activity classes, one of them being an epileptic shock. \textit{Naval air training and operating procedures standardization (NAT).} Positions of sensors mounted on different body parts of a person performing activities. There are six different activity classes in the dataset. \textit{Character trajectories (CT).} Velocity trajectories of a pen on a WACOM tablet. There are $20$ different characters in this dataset. \textit{Spoken Arabic Digits (SAD).} MFCC features of ten arabic digits spoken by $88$ different speakers.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This code implements the TimeSeriesForest algorithm to classify different types of changes in gully environments. i)gully topographical change, ii)no change outside gully, iii) no change inside gully, and iv) non-topographical change. The algorithm is specifically designed for time series classification tasks, where the input data represents the characteristics of gullies over time. The code follows a series of steps to prepare the data, train the classifier, calculate performance metrics, and generate predictions. The data preparation phase involves importing training and testing data from CSV files. The training data is then divided into classes based on their labels, and a subset of the top rows is selected for each class to create a balanced training dataset. Time series data and corresponding labels are extracted from the training data, while only the time series data is extracted from the testing data. Next, the code calculates various performance metrics to evaluate the trained classifier. It splits the training data into training and testing sets, initializes the TimeSeriesForest classifier, and trains it using the training set. The accuracy of the classifier is calculated on the testing set, and feature importances are determined. Predictions are generated for both the testing set and new data using the trained classifier. The code then computes a confusion matrix to analyze the classification results, visualizing it using Seaborn and Matplotlib. Performance metrics such as True Accuracy, Kappa, Producer's Accuracy, and User's Accuracy are calculated and printed to assess the classifier's effectiveness in classifying gully changes. Lastly, the code performs ensemble predictions by combining the testing data with the generated predictions. The results, including predictions and associated probabilities, are saved to an output file. Overall, this code provides a practical implementation of the TimeSeriesForest algorithm for classifying types of changes in gully environments, demonstrating its potential for environmental monitoring and management.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using the hctsa toolbox v0.97 (link in References below), we computed 7,500+ time-series features on each of the time-series classification tasks contained in the UEA/UCR Time Series Classification Repository. This repository provides the computed hctsa output files (.mat-files) for each classification task.We used the computed feature matrices to select a small subset of 22 hctsa estimators (termed catch22) that were the most useful for the UEA/UCR datasets:C.H. Lubba, S.S. Sethi, P. Knaute, S.R. Schultz, B.D. Fulcher, N.S. Jones. catch22: CAnonical Time-series CHaracteristics. arXiv (2019). https://arxiv.org/abs/1901.10200The matrices can be read in from Python as well using the Matlab_IO interface for which examples can be found in our selection pipeline for catch22 ("op_importance" in References) and in the "hctsaAnalysisPython" GitHub repository.
This is the required files to run the experiment published in the paper "Indexing and classifying gigabytes of time series under time warping". It contains the nearest neighbour indices for each query in each dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AcTBeCalf Dataset Description
The AcTBeCalf dataset is a comprehensive dataset designed to support the classification of pre-weaned calf behaviors from accelerometer data. It contains detailed accelerometer readings aligned with annotated behaviors, providing a valuable resource for research in multivariate time-series classification and animal behavior analysis. The dataset includes accelerometer data collected from 30 pre-weaned Holstein Friesian and Jersey calves, housed in group pens at the Teagasc Moorepark Research Farm, Ireland. Each calf was equipped with a 3D accelerometer sensor (AX3, Axivity Ltd, Newcastle, UK) sampling at 25 Hz and attached to a neck collar from one week of birth over 13 weeks.
This dataset encompasses 27.4 hours of accelerometer data aligned with calf behaviors, including both prominent behaviors like lying, standing, and running, as well as less frequent behaviors such as grooming, social interaction, and abnormal behaviors.
The dataset consists of a single CSV file with the following columns:
dateTime: Timestamp of the accelerometer reading, sampled at 25 Hz.
calfid: Identification number of the calf (1-30).
accX: Accelerometer reading for the X axis (top-bottom direction)*.
accY: Accelerometer reading for the Y axis (backward-forward direction)*.
accZ: Accelerometer reading for the Z axis (left-right direction)*.
behavior: Annotated behavior based on an ethogram of 23 behaviors.
segId: Segment identification number associated with each accelerometer reading/row, representing all readings of the same behavior segment.
Code Files Description
The dataset is accompanied by several code files to facilitate the preprocessing and analysis of the accelerometer data and to support the development and evaluation of machine learning models. The main code files included in the dataset repository are:
accelerometer_time_correction.ipynb: This script corrects the accelerometer time drift, ensuring the alignment of the accelerometer data with the reference time.
shake_pattern_detector.py: This script includes an algorithm to detect shake patterns in the accelerometer signal for aligning the accelerometer time series with reference times.
aligning_accelerometer_data_with_annotations.ipynb: This notebook aligns the accelerometer time series with the annotated behaviors based on timestamps.
manual_inspection_ts_validation.ipynb: This notebook provides a manual inspection process for ensuring the accurate alignment of the accelerometer data with the annotated behaviors.
additional_ts_generation.ipynb: This notebook generates additional time-series data from the original X, Y, and Z accelerometer readings, including Magnitude, ODBA (Overall Dynamic Body Acceleration), VeDBA (Vectorial Dynamic Body Acceleration), pitch, and roll.
genSplit.py: This script provides the logic used for the generalized subject separation for machine learning model training, validation and testing.
active_inactive_classification.ipynb: This notebook details the process of classifying behaviors into active and inactive categories using a RandomForest model, achieving a balanced accuracy of 92%.
four_behv_classification.ipynb: This notebook employs the mini-ROCKET feature derivation mechanism and a RidgeClassifierCV to classify behaviors into four categories: drinking milk, lying, running, and other, achieving a balanced accuracy of 84%.
Kindly cite one of the following papers when using this data:
Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Evaluating ROCKET and Catch22 features for calf behaviour classification from accelerometer data using Machine Learning models. arXiv preprint arXiv:2404.18159.
Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars. arXiv preprint arXiv:2406.17352
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The electric grid is a key enabling infrastructure for the ambitious transition towards carbon neutrality as we grapple with climate change. With deepening penetration of renewable energy resources and electrified transportation, the reliable and secure operation of the electric grid becomes increasingly challenging. In this paper, we present PSML, a first-of-its-kind open-access multi-scale time-series dataset, to aid in the development of data-driven machine learning (ML) based approaches towards reliable operation of future electric grids. The dataset is generated through a novel transmission + distribution (T+D) co-simulation designed to capture the increasingly important interactions and uncertainties of the grid dynamics, containing electric load, renewable generation, weather, voltage and current measurements at multiple spatio-temporal scales. Using PSML, we provide state-of-the-art ML baselines on three challenging use cases of critical importance to achieve: (i) early detection, accurate classification and localization of dynamic disturbance events; (ii) robust hierarchical forecasting of load and renewable energy with the presence of uncertainties and extreme events; and (iii) realistic synthetic generation of physical-law-constrained measurement time series. We envision that this dataset will enable advances for ML in dynamic systems, while simultaneously allowing ML researchers to contribute towards carbon-neutral electricity and mobility.
Data Navigation
Please download, unzip and put somewhere for later benchmark results reproduction and data loading and performance evaluation for proposed methods.
wget https://zenodo.org/record/5130612/files/PSML.zip?download=1
7z x 'PSML.zip?download=1' -o./
Minute-level Load and Renewable
Minute-level PMU Measurements
Millisecond-level PMU Measurements
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets created by Single Flow Time Series Analysis
Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:
J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.
This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf
In the following table is a description of each dataset file:
File name | Detection problem | Citation of original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH |
Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020 |
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org /datasets-iot23 |
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The application of machine learning has become commonplace for problems in modern data science. The democratization of the decision process when choosing a machine learning algorithm has also received considerable attention through the use of meta features and automated machine learning for both classification and regression type problems. However, this is not the case for multistep-ahead time series problems. Time series models generally rely upon the series itself to make future predictions, as opposed to independent features used in regression and classification problems. The structure of a time series is generally described by features such as trend, seasonality, cyclicality, and irregularity. In this research, we demonstrate how time series metrics for these features, in conjunction with an ensemble based regression learner, were used to predict the standardized mean square error of candidate time series prediction models. These experiments used datasets that cover a wide feature space and enable researchers to select the single best performing model or the top N performing models. A robust evaluation was carried out to test the learner's performance on both synthetic and real time series.
Proposed Dataset
The dataset proposed here gives the results for 20 step ahead predictions for eight Machine Learning/Multi-step ahead prediction strategies for 5,842 time series datasets outlined here. It was used as the training data for the Meta Learners in this research. The meta features used are columns C to AE. Columns AH outlines the method/strategy used and columns AI to BB (the error) is the outcome variable for each prediction step. The description of the method/strategies is as follows:
Machine Learning methods:
NN: Neural Network
ARIMA: Autoregressive Integrated Moving Average
SVR: Support Vector Regression
LSTM: Long Short Term Memory
RNN: Recurrent Neural Network
Multistep ahead prediction strategy:
OSAP: One Step ahead strategy
MRFA: Multi Resolution Forecast Aggregation
https://www.gnu.org/licenses/gpl-3.0.htmlhttps://www.gnu.org/licenses/gpl-3.0.html
This is the source code for the paper "Efficient search of the best warping window for Dynamic Time Warping".This work focused on fast learning/searching for the best warping window for Dynamic Time Warping and Time Series Classification.For more info, visit https://github.com/ChangWeiTan/FastWWSearch
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of MONSTER: https://arxiv.org/abs/2502.15122.
S2Agri-17
Category Satellite
Num. Examples 59,268,823
Num. Channels 10
Length 24
Sampling Freq. 10 days
Num. Classes 17
License CC BY 4.0
Citations [1] [2]
S2Agri is a land cover classification dataset and contains a single tile of Sentinel-2 data (T31TFM), which covers a 12,100 km2area in France: see Figure [1, 2]. Ten spectral bands covering the visible and infrared frequencies are used, and these are provided… See the full description on the dataset page: https://huggingface.co/datasets/monster-monash/S2Agri-17.
The original dataset was provided by Orange telecom in France, which contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset comprises information from 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with time granularity of 30 minutes and spatial granularity of 6 coarse regions in Paris, France. In other words, it represents a multivariate time series dataset.
This dataset can be used for several time-series tasks such as univariate/multivariate forecasting/classification with classic, machine learning, and privacy-preserving machine learning techniques.
The Census data API provides access to the most comprehensive set of data on current month and cumulative year-to-date exports using the North American Industry Classification System (NAICS). The NAICS endpoint in the Census data API also provides value, shipping weight, and method of transportation totals at the district level for all U.S. trading partners. The Census data API will help users research new markets for their products, establish pricing structures for potential export markets, and conduct economic planning. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
FiftyWords is a data set of word outlines taken from the George Washington library by T. Rath and used in the paper "Word image matching using dynamic time warping", CVPR 2003. Each case is a word. A series is formed by taking the height profile of the word.
Donator: T. Rath, R. Manmatha
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All accuracy value is mentioned in percentage (%).
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This compressed folder contains annual CONUS-wide LCZ maps ranging from 1986 to 2020, which is the main and final LCZ product of this dataset. The maps are derived from a lightweight contextual Random Forest model with spatial and temporal post-classification processing. Each map is provided in the Geo TIFF file format with representing year indicated in the file name. For example, the file "TP_2020.tif" represents the LCZ map for 2020. All LCZ maps have a spatial resolution at 100m and projection of USA Contiguous Albers Equal Area Conic (EPSG=5070). The LCZ classes are indicated by numbers 1-17. Note that LCZ class 7 (Lightweight low-rise) is not present in this dataset. Pixels of value 0 represents NoData.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
The traffic data are collected with the loop sensor installed on ramp for the 101 North freeway in Los Angeles. This location is close to Dodgers Stadium; therefore the traffic is affected by volume of visitors to the stadium. Missing values are represented with NaN. - Class 1: Normal Day - Class 2: Game Day There is nothing to infer from the order of examples in the train and test set. Missing values are represented with NaN in the text file. Data created by Ihler, Alexander, Jon Hutchins, and Padhraic Smyth (see [1][2][3]). Data edited by Chin-Chia Michael Yeh.
[1] Ihler, Alexander, Jon Hutchins, and Padhraic Smyth. "Adaptive event detection with time-varying poisson processes." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
[2] “UCI Machine Learning Repository: Dodgers Loop Sensor Data Set.” UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets/dodgers+loop+sensor.
[3] “Caltrans PeMS.” Caltrans, pems.dot.ca.gov/.
Donator: C. Yeh
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RMSE value of different methods for different test percents are grouped together and best RMSE values are highlighted.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for the paper
Hankui K. Zhang, Dong Luo, Zhongbin Li, Classifying raw irregular Landsat time series (CRIT) for large area land cover mapping by adapting Transformer model.
It stores the daily raw Landsat ARD annual good quality surface reflectance time series for 1985, 2006 and 2018 for CONUS with 7 land cover classes. Details are in the paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The original data was retrieved from http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
Original data description:
The data was created by university students plyaing badminton or squash whilst wearing a smart watch (Sony Smart watch 35). The watch relayed the x-y-z coordinates for
both the gyroscope and accelerometer to an android phone (One Plus 56). The phone
wrote these values to an Attribute-Relation File Format (arff) file using an app developed
by a UEA computer science masters student. The problem is to identify which sport and which stroke the players are making. The data was collected at a rate of 10 HZ over 3 seconds whilst the player played
either a forehand/backhand in squash or a clear/smash in badminton.
The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.
Pre-processing
Data processing was done as described in: https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_racketsports.py
The original data was split into train and test set. Here the data was loaded and further divided into train, test, validation sets.
To keep it simple we here simply divided the original test part into test and validation.
The resulting data was stored as numpy .npy files.
The zip file contains three sets of time series data (X_train, X_test, X_valid) and the respective labels (y_train, y_test, y_valid).
Reference:
http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
(The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.)
The UCR Time Series Archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 to 85 data sets. This paper introduces and will focus on the new data expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a much simpler modification, requiring just a single line of code.