Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work supports only queries of the same length as the stored data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS), which uses Minimum Bounding Rectangles (MBRs) to organize the subsequences, and (2) a List Based Search (LBS) algorithm, which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several million observations. Both tests show that our algorithms have very high prune rates (>95%), requiring actual disk access for less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
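The RBS idea of organizing subsequences with Minimum Bounding Rectangles suggests a simple pruning pattern. The sketch below is only a hedged illustration of MBR-based lower-bound pruning, not the paper's actual RBS or LBS algorithms; the function names and the in-memory list of segments are assumptions made for the example.

```python
import numpy as np

def mbr(segment):
    """Minimum Bounding Rectangle of a stored subsequence: per-variable (min, max)."""
    return segment.min(axis=0), segment.max(axis=0)

def lower_bound_dist(query, lo, hi):
    """Cheap lower bound on the Euclidean distance between the query and any
    subsequence whose values all lie inside the box [lo, hi]."""
    below = np.clip(lo - query, 0, None)   # amount the query falls below the box
    above = np.clip(query - hi, 0, None)   # amount the query exceeds the box
    return np.sqrt(((below + above) ** 2).sum())

def candidate_ids(segments, query, threshold):
    """Indices of stored segments that survive MBR pruning; only these
    would need a full (disk-resident) distance computation."""
    keep = []
    for i, seg in enumerate(segments):
        lo, hi = mbr(seg)
        if lower_bound_dist(query, lo, hi) <= threshold:
            keep.append(i)
    return keep
```

Because the lower bound never overestimates the true distance, pruning with it cannot discard a true match, which is the sense in which such a search can be provably correct.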
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed for research and development in Structural Health Monitoring (SHM) using embedded systems and machine learning techniques. It contains simulated time-series sensor data representing the physical state of building structures under various operational and environmental conditions.
The data includes measurements from:
Accelerometers (X, Y, Z axes)
Strain gauges (microstrain readings)
Temperature sensors (in degrees Celsius)
Each record is timestamped and labeled with one of three structural conditions:
0: Healthy
1: Minor Damage
2: Severe Damage
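Given the accelerometer, strain, and temperature channels and the three condition labels listed above, a quick baseline classifier can be sketched as below. The file name and column names are assumptions; adjust them to the actual CSV layout.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

FEATURES = ["accel_x", "accel_y", "accel_z", "strain", "temperature"]  # assumed column names

df = pd.read_csv("shm_sensor_data.csv", parse_dates=["timestamp"])     # assumed file name
X, y = df[FEATURES], df["label"]    # label: 0 = Healthy, 1 = Minor Damage, 2 = Severe Damage

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```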
This resource contains a video recording for a presentation given as part of the National Water Quality Monitoring Council conference in April 2021. The presentation covers the motivation for performing quality control for sensor data, the development of PyHydroQC, a Python package with functions for automating sensor quality control including anomaly detection and correction, and the performance of the algorithms applied to data from multiple sites in the Logan River Observatory.
The initial abstract for the presentation: Water quality sensors deployed to aquatic environments make measurements at high frequency and commonly include artifacts that do not represent the environmental phenomena targeted by the sensor. Sensors are subject to fouling from environmental conditions, often exhibit drift and calibration shifts, and report anomalies and erroneous readings due to issues with datalogging, transmission, and other unknown causes. The suitability of data for analyses and decision making often depends on subjective and time-consuming quality control processes consisting of manual review and adjustment of data. Data-driven and machine learning techniques have the potential to automate identification and correction of anomalous data, streamlining the quality control process. We explored documented approaches and selected several for implementation in a reusable, extensible Python package designed for anomaly detection for aquatic sensor data. Implemented techniques include regression approaches that estimate values in a time series, flag a point as anomalous if the difference between the sensor measurement and the model estimate exceeds a threshold, and offer replacement values for correcting anomalies. Additional algorithms that scaffold the central regression approaches include rules-based preprocessing, thresholds for determining anomalies that adjust with data variability, and the ability to detect and correct anomalies using forecasted and backcasted estimation. The techniques were developed and tested based on several years of data from aquatic sensors deployed at multiple sites in the Logan River Observatory in northern Utah, USA. Performance was assessed based on labels and corrections applied previously by trained technicians. In this presentation, we describe the techniques for detection and correction, report their performance, illustrate the workflow for applying them to high frequency aquatic sensor data, and demonstrate the possibility for additional approaches to help increase automation of aquatic sensor data post processing.
The Global Monthly and Seasonal Urban and Land Backscatter Time Series, 1993-2020, is a multi-sensor, multi-decadal, data set of global microwave backscatter, for 1993 to 2020. It assembles data from C-band sensors onboard the European Remote Sensing Satellites (ERS-1 and ERS-2) covering 1993-2000, Advanced Scatterometer (ASCAT) onboard EUMETSAT satellites for 2007-2020, and the Ku-band sensor onboard the QuikSCAT satellite for 1999-2009, onto a common spatial grid (0.05 degree latitude/longitude resolution) and time step (both monthly and seasonal). Data are provided for all land (except high latitudes and islands), and for urban grid cells, based on a specific masking that removes grid cells with > 50% open water or < 20% built land. The all-land data allows users to choose and evaluate other urban masks. There is an offset between C-band and Ku-band backscatter from both vegetated and urban surfaces that is not spatially constant. There is a strong linear correlation (overall R-squared value = 0.69) between 2015 ASCAT urban backscatter and a continental-scale gridded product of building volume, across 8,450 urban grid cells (0.05 degree resolution) from large cities in Europe, China, and the United States.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains high-frequency time-series data collected from a coal-fired industrial boiler operating in a chemical plant in Zhejiang, China. The boiler is equipped with multiple sensors capturing parameters such as pressure, temperature, flow rate, and oxygen levels. The dataset reflects a real-world industrial scenario, where 8.6% of the data represents abnormal operating conditions (outliers), making it particularly suitable for long-tailed distribution studies, anomaly detection, and robust forecasting tasks in industrial time-series modeling.
This resource contains the supporting data and code files for the analyses presented in "Toward automating post processing of aquatic sensor data," an article published in the journal Environmental Modelling and Software. This paper describes pyhydroqc, a Python package developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information on pyhydroqc, see the code repository (https://github.com/AmberSJones/pyhydroqc) and the documentation (https://ambersjones.github.io/pyhydroqc/). The package may be installed from the Python Package Index (more info: https://packaging.python.org/tutorials/installing-packages/).
Included in this resource are input data, Python scripts to run the package on the input data (anomaly detection and correction), results from running the algorithm, and Python scripts for generating the figures in the manuscript. The organization and structure of the files are described in detail in the readme file. The input data were collected as part of the Logan River Observatory (LRO). The data in this resource represent a subset of data available for the LRO and were compiled by querying the LRO’s operational database. All available data for the LRO can be sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.
There are two sets of scripts in this resource: 1.) Scripts that reproduce plots for the paper using saved results, and 2.) Code used to generate the complete results for the series in the case study. While all figures can be reproduced, there are challenges to running the code for the complete results (it is computationally intensive, different results will be generated due to the stochastic nature of the models, and the code was developed with an early version of the package), which is why the saved results are included in this resource. For a simple example of running pyhydroqc functions for anomaly detection and correction on a subset of data, see this resource: https://www.hydroshare.org/resource/92f393cbd06b47c398bdd2bbb86887ac/.
This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.
This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.
Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when the difference exceeds a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction:
- Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable.
- Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables.
- Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest.
- Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.
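As a hedged illustration of the "Vanilla LSTM" variant only (not pyhydroqc's implementation), the sketch below trains a one-step-ahead model on a univariate series and flags points whose residual exceeds a fixed threshold; the window length, layer size, and threshold are arbitrary choices for the example.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

def make_windows(series, lookback):
    """X = the past `lookback` values, y = the next observation."""
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., np.newaxis], series[lookback:]          # X shape: (samples, lookback, 1)

def fit_vanilla_lstm(series, lookback=48, epochs=10):
    X, y = make_windows(np.asarray(series, dtype="float32"), lookback)
    model = Sequential([LSTM(32, input_shape=(lookback, 1)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=epochs, batch_size=64, verbose=0)
    return model, X, y

def flag_anomalies(model, X, y, threshold):
    residuals = np.abs(model.predict(X, verbose=0).ravel() - y)
    return residuals > threshold                          # one boolean flag per estimated point
```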
The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.
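A minimal sketch of the blending idea described above, assuming statsmodels is available and the valid points before and after the anomalous group are passed as arrays; the ARIMA order and the linear weighting are placeholders rather than pyhydroqc's actual settings.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def correct_gap(before, after, gap_len, order=(1, 1, 1)):
    """Estimate replacements for a group of consecutive anomalous points by
    blending a forward forecast (from valid points before the group) with a
    backward 'backcast' (from valid points after it)."""
    fwd = np.asarray(ARIMA(np.asarray(before), order=order).fit().forecast(steps=gap_len))
    bwd = np.asarray(ARIMA(np.asarray(after)[::-1], order=order).fit().forecast(steps=gap_len))[::-1]
    w = np.linspace(1.0, 0.0, gap_len)    # trust the forward model near the start of the gap
    return w * fwd + (1.0 - w) * bwd
```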
The anomaly detection and correction workflow involves the following steps:
1. Retrieving data
2. Applying rules-based detection to screen data and apply initial corrections
3. Identifying and correcting sensor drift and calibration (if applicable)
4. Developing a model (i.e., ARIMA or LSTM)
5. Applying the model to make time series predictions
6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
7. Widening the window over which an anomaly is identified
8. Aggregating detections resulting from multiple models
9. Making corrections for anomalous events
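Steps 6 and 7 can be pictured with the small sketch below: a threshold that scales with local residual variability, followed by a widening pass around each detection. This is an assumed illustration only; pyhydroqc's own functions and parameter names differ.

```python
import pandas as pd

def detect_and_widen(observed, modeled, window=96, k=4.0, widen=2):
    """Flag points whose residual exceeds a variability-scaled threshold (step 6),
    then widen each detection by `widen` points on both sides (step 7).
    `observed` and `modeled` are assumed to be aligned pandas Series."""
    resid = (observed - modeled).abs()
    threshold = k * resid.rolling(window, min_periods=1).std().fillna(resid.std())
    flags = resid > threshold
    widened = flags.astype(float).rolling(2 * widen + 1, center=True, min_periods=1).max()
    return widened.astype(bool)
```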
Instructions to run the notebook through the CUAHSI JupyterHub: 1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into CUAHSI JupyterHub using your HydroShare credentials. 2. Select 'Python 3.8 - Scientific' as the server and click Start. 3. From your JupyterHub directory, click on the ExampleNotebook.ipynb file. 4. Execute each cell in the code by clicking the Run button.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We release the DOO-RE dataset, which consists of data streams from 11 types of ambient sensors collected 24/7 from a real-world meeting room. 4 types of sensors, called environment-driven sensors, measure continuous state changes in the environment (e.g. sound), and 4 types, called user-driven sensors, capture user state changes (e.g. motion). The remaining 3 types, called actuator-driven sensors, check whether the attached actuators are active (e.g. projector on/off). The values of each sensor are automatically collected by IoT agents, each responsible for one sensor in our IoT system. A part of the collected sensor data stream representing a user activity is extracted as an activity episode in the DOO-RE dataset. Each episode's activity label is annotated and validated by cross-checking and the agreement of multiple annotators. A total of 9 activity types appear in the space: 3 based on single users and 6 based on group (i.e. 2 or more people) users. As a result, DOO-RE contains 696 labeled episodes of single and group activities from the meeting room. DOO-RE is a novel dataset created in a public space that captures the properties of a real-world environment and has the potential to be a valuable resource for developing powerful activity recognition approaches.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains time series measurements from three distinct case studies, each provided in separate CSV files. The data was collected as part of the research detailed in the accompanying paper "Multi-Parameter Multi-Sensor Data Fusion for Drinking Water Distribution System Water Quality Management" by Gleeson et al. (2025).
Important Notes
Users should exercise extreme caution when analysing these datasets:
- Case study 3 contains notable data quality issues
- Operational activities preceding the data collection period in case study 3 resulted in unusual patterns that require careful consideration during analysis
- While the accompanying paper discusses four case studies, case study 4 data is not included in this open dataset due to Non-Disclosure Agreement restrictions with the water company involved
https://www.licenses.ai/ai-licenses
This dataset comprises sensor readings collected from various sensors deployed in an environment. Each entry in the dataset includes the following information:
The dataset also includes additional information in the form of histograms and time series data:
This dataset is valuable for tasks such as anomaly detection, predictive maintenance, and environmental monitoring.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper presents a benchmark dataset called EO4WildFires: a multi-sensor time-series dataset (multispectral: Sentinel-2; Synthetic Aperture Radar (SAR): Sentinel-1; meteorological parameters: NASA Power) spanning 45 countries, which can be used for developing machine learning and deep learning methods targeted at estimating the area that a forest wildfire might cover.
This novel EO4WildFires dataset is annotated using EFFIS (the European Forest Fire Information System) as the forest fire detection and size estimation data source. A total of 31,742 wildfire events were gathered from 2018 to 2022. For each event, Sentinel-2 (multispectral), Sentinel-1 (SAR) and meteorological data are assembled into a single data cube. The meteorological parameters included in the data cube are: ratio of actual partial pressure of water vapor to the partial pressure at saturation, average temperature, bias-corrected average total precipitation, average wind speed, fraction of land covered by snowfall, percent of root zone soil wetness, snow depth, snow precipitation, and percent of soil moisture.
The main problem this dataset is designed to address is forecasting severity before wildfires occur. The dataset is not used to predict wildfire events, but rather to predict the severity (the size of the area damaged by fire) of a wildfire event, should one happen in a specific place, given the current and historical forest status as recorded by multispectral and SAR images and meteorological data.
Using the data cube for the collected wildfire events, the EO4WildFires dataset is used to realize three (3) different preliminary experiments, in order to evaluate the contributing factors for wildfire severity prediction. The first experiment evaluates wildfire size using only the meteorological parameters, the second one utilizes both the multispectral and SAR parts of the dataset, while the third exploits all dataset parts. In each experiment, machine learning models are developed, and their accuracy is evaluated.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of this dataset is to support analysis of data produced by fishing ships in order to reduce fuel consumption. The data has been used in WP3 of the DataBio project. Data were processed and supplied by the VTT and UPV/EHU DataBio project team.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Fatigue is a broad, multifactorial concept encompassing feelings of reduced physical and mental energy levels. Fatigue strongly impacts patient health-related quality of life across a huge range of conditions, yet, to date, tools available to understand fatigue are severely limited. Methods: After using a recurrent neural network-based algorithm to impute missing time series data from a multisensor wearable device, we compared supervised and unsupervised machine learning approaches to gain insights into the relationship between self-reported non-pathological fatigue and multimodal sensor data. Results: A total of 27 healthy subjects and 405 recording days were analyzed. Recorded data included continuous multimodal wearable sensor time series on physical activity, vital signs, and other physiological parameters, and daily questionnaires on fatigue. The best results were obtained when using the causal convolutional neural network model for unsupervised representation learning of multivariate sensor data, and random forest as a classifier trained on subject-reported physical fatigue labels (weighted precision of 0.70 ± 0.03 and recall of 0.73 ± 0.03). When using manually engineered features on sensor data to train our random forest (weighted precision of 0.70 ± 0.05 and recall of 0.72 ± 0.01), both physical activity (energy expenditure, activity counts, and steps) and vital signs (heart rate, heart rate variability, and respiratory rate) were important parameters to measure. Furthermore, vital signs contributed the most as top features for predicting mental fatigue compared to physical ones. These results support the idea that fatigue is a highly multimodal concept. Analysis of clusters from sensor data highlighted a digital phenotype indicating the presence of fatigue (95% of observations) characterized by a high intensity of physical activity. Mental fatigue followed similar trends but was less predictable. Potential future directions could focus on anomaly detection assuming longer individual monitoring periods. Conclusion: Taken together, these results are the first demonstration that multimodal digital data can be used to inform, quantify, and augment subjectively captured non-pathological fatigue measures.
https://www.datainsightsmarket.com/privacy-policy
Market Overview: The global one-stop time series database solution market is projected to reach a value of USD 3.6 billion by 2033, exhibiting a CAGR of 12.7% during the forecast period (2023-2033). The demand for time series databases has been surging due to the exponential growth of IoT devices, sensor networks, and industrial automation, resulting in an unprecedented volume of time-stamped data. The need for real-time data analysis, forecasting, and anomaly detection across various sectors, including manufacturing, finance, healthcare, and transportation, has further fueled the proliferation of one-stop time series database solutions.
Market Segmentation and Key Trends: The market is segmented based on application (individual and enterprise), type (cloud-based and on-premises), and region. North America currently holds the largest market share, while the Asia Pacific region is anticipated to witness significant growth in the coming years. Key trends shaping the market include the rise of cloud-based time series databases, increased adoption of machine learning and AI for advanced data analysis, and the integration of time series databases with other big data technologies such as data lakes and data warehouses. Major companies operating in the market include InfluxData, Timescale, Chronosphere, OpenTSDB, VictoriaMetrics, QuestDB, and DataStax.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises 524 recordings of 6-dimensional time series data, capturing forces in three directions and torques in three directions during the assembly of small car model wheels. The data was collected using an equidistant sampling method with a sampling period of 0.004 seconds. Each time series represents the process of assembling one wheel, specifically the placement of a tire onto a rim, and includes a label indicating whether the assembly was successful (OK). The wheels were assembled in batches of four, and the recordings were obtained over six different days. The labels of recordings from two of the six days (days 3 and 4) are invalid, as described in [1]. The labels presented in this data set are only binary (they do not describe the reason for the failure). The labels of recordings from days 5 and 6 were created by a human, while the other labels came from a convolutional neural network-based computer vision classifier and can be inaccurate, as described in section 5.4 of [1].
Dataset Structure:
File: ForceTorqueTimeSeries.csv
Columns:
idx (1-524): Index of the recording corresponding to the assembly of one wheel.
label (true/false): Indicates whether the assembly was successful (TRUE = product is OK).
meas_id (1-6): Identifier for the day on which the recording was made (refer to Table 2.1 in [1]).
force_x: X-component of the force measured by the sensor mounted on the delta robot's end effector.
force_y: Y-component of the force.
force_z: Z-component of the force.
torque_x: X-component of the torque.
torque_y: Y-component of the torque.
torque_z: Z-component of the torque.
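Based on the column layout listed above, a hedged loading sketch with pandas might look like the following; the file path is taken from the dataset structure, while the day-filtering logic simply reflects the note that labels from days 3 and 4 (meas_id 3 and 4) are invalid.

```python
import pandas as pd

df = pd.read_csv("ForceTorqueTimeSeries.csv")

# One group per recording; idx identifies the assembly of a single wheel.
channels = ["force_x", "force_y", "force_z", "torque_x", "torque_y", "torque_z"]
recordings = {idx: g[channels].to_numpy() for idx, g in df.groupby("idx")}
labels = df.groupby("idx")["label"].first()

# Keep only recordings from days with usable labels (days 1, 2, 5, 6).
valid = df.groupby("idx")["meas_id"].first().isin([1, 2, 5, 6])
print(f"{valid.sum()} of {len(valid)} recordings have usable labels")
```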
Additional Files:
IMG_3351.MOV: A video demonstrating the assembly process for one batch of four wheels.
F3-BP-2024-Trna-Ales-Ales Trna - 2024 - Anomaly detection in robotic assembly process using force and torque sensors.pdf: Bachelor thesis [1] detailing the dataset and preliminary experiments on fault detection.
F3-BP-2024-Hanzlik-Vojtech-Anomaly_Detection_Bachelors_Thesis.pdf: Bachelor thesis [2] describing the data acquisition process.
References:
Trna, A. (2024). Anomaly detection in robotic assembly process using force and torque sensors [Bachelor’s thesis, Czech Technical University in Prague].
Hanzlik, V. (2024). Edge AI integration for anomaly detection in assembly using Delta robot [Bachelor’s thesis, Czech Technical University in Prague].
https://www.nist.gov/open/license
This data set includes the time series data from 16 electrochemical, optical, temperature and humidity sensors in 60 experiments to characterize the conditions preceding cooktop ignition compared to the conditions of normal cooking. The sensors are placed in the exhaust duct above a mock-up kitchen cooktop. Experiments cover a broad range of conditions, including both unattended cooking and normal cooking scenarios, where 39 experiments led to auto-ignition. The experiments involve a variety of cooking oils and foods and were conducted using either an electric coil cooktop, gas-fueled cooktop, or electric oven.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Spatial and temporal flow dynamics within the hyporheic zone, particularly hyporheic transport and exchange, are important processes and key to ecosystem health, with enhanced degradation of micro-pollutants and stream denitrification. The application of heat is widely used as a tracer to determine flux in the hyporheic zone; however, most applications only consider 1D flow. Hence, this dataset demonstrates how 3D flow fields in the hyporheic zone can be measured.
Time series temperature data are provided from the 56-sensor array of the active heat pulse sensor (HPS) Hot Rod, collected from a sand tank experiment under different flow scenarios as well as from a field site in the Mount Lofty Ranges, South Australia.
Date coverage: 2017
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains time series data of human joint angles collected using wearable inertial sensor systems and processed with the MoJoXlab software during various activities such as walking, jumping, squatting, and leg exercises. The dataset includes data for basic activities such as walking, sitting and relaxing, standing, lying down (supine position), jumping, and squatting. Additionally, it includes data for seated active and assisted knee extension/flexion and heel slide exercises. The data were collected from 15 participants using two sensor systems, Xsens (https://www.xsens.com/) and NGIMU (https://x-io.co.uk/ngimu/), at a sampling frequency of 50 Hz and exported as CSV files from each system's software. The joint angles were calculated using MoJoXlab software, which utilizes quaternion values to estimate the orientation of the sensors. The dataset includes quaternion values for orientation, specifically for the left thigh (LT), right thigh (RT), left shank (LS), right shank (RS), left foot (LF), right foot (RF), and pelvis. The sensor positions follow the Xsens lower limb protocol. The column headers indicate the orientation (i.e., W, X, Y, Z and q0, q1, q2, q3), and P01_LT denotes participant 1 data for the left thigh sensor position. The dataset is useful for researchers and practitioners interested in studying human movement and developing algorithms for joint angle estimation. The data can be used to compare and validate different sensor systems and algorithms for estimating joint angles and to develop and test new algorithms. The data can be downloaded and used for non-commercial research purposes with proper attribution to the authors and the data source.
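As a hedged illustration of working with segment-orientation quaternions like those described above (this is not MoJoXlab's algorithm), the sketch below computes the total rotation angle between two segments, which gives a crude knee angle; the file name and exact column names are hypothetical and should be matched to the real CSV headers.

```python
import numpy as np
import pandas as pd

def quat_conj(q):
    """Conjugate of quaternions stored as columns (w, x, y, z)."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_mul(a, b):
    """Hamilton product of two arrays of quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = a.T
    w2, x2, y2, z2 = b.T
    return np.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ], axis=1)

def relative_angle_deg(q_proximal, q_distal):
    """Total rotation angle between two segment orientations, in degrees."""
    q_rel = quat_mul(quat_conj(q_proximal), q_distal)
    w = np.clip(np.abs(q_rel[:, 0]), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(w))

# Hypothetical file and column names (P01 = participant 1, RT = right thigh, RS = right shank).
df = pd.read_csv("P01_Xsens_walking.csv")
thigh = df[["P01_RT_q0", "P01_RT_q1", "P01_RT_q2", "P01_RT_q3"]].to_numpy()
shank = df[["P01_RS_q0", "P01_RS_q1", "P01_RS_q2", "P01_RS_q3"]].to_numpy()
knee_angle = relative_angle_deg(thigh, shank)   # crude right-knee angle, one value per sample
```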
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NGA Flight Time Series Dataset
The NGA Flight Time Series Dataset contains detailed sensor measurements collected during flights of general aviation aircraft. Each file represents a complete flight, with time-series data recorded throughout the flight.
Dataset Overview
Number of Examples: 7,681 flights
Format: CSV (Comma-Separated Values)
Domain: Aviation, Time Series Analysis
Intended Use: Time-series forecasting, anomaly detection, aviation analytics
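Since each file represents one flight, a minimal loading pattern might look like the sketch below; the folder name is an assumption, and the column names vary by aircraft, so only generic handling is shown.

```python
import glob
import pandas as pd

# Assumed layout: one CSV per flight in a local folder.
paths = sorted(glob.glob("nga_flight_time_series/*.csv"))
flights = [pd.read_csv(p) for p in paths]

if flights:
    print(f"Loaded {len(flights)} flights; first flight: "
          f"{flights[0].shape[0]} samples, {flights[0].shape[1]} columns")
```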