A fleet is a group of systems (e.g., cars, aircraft) that are designed and manufactured the same way and are intended to be used the same way. For example, a fleet of delivery trucks may consist of one hundred instances of a particular model of truck, each of which is intended for the same type of service—almost the same amount of time and distance driven every day, approximately the same total weight carried, etc. For this reason, one may imagine that data mining for fleet monitoring may merely involve collecting operating data from the multiple systems in the fleet and developing some sort of model, such as a model of normal operation that can be used for anomaly detection. However, one then may realize that each member of the fleet will be unique in some ways—there will be minor variations in manufacturing, quality of parts, and usage. For this reason, the typical machine learning and statistics algorithm's assumption that all the data are independent and identically distributed is not correct. One may realize that data from each system in the fleet must be treated as unique so that one can notice significant changes in the operation of that system.
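To make the point concrete, here is a minimal sketch with entirely hypothetical numbers (not drawn from any real fleet): a small drift in one unit can be invisible against the fleet-wide spread yet obvious against that unit's own history.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fleet: 100 trucks, each with its own baseline fuel burn (unit-to-unit variation).
baselines = rng.normal(loc=10.0, scale=1.0, size=100)            # litres/100 km per truck
history = baselines[:, None] + rng.normal(0, 0.1, (100, 365))    # a year of daily readings

# Truck 7 drifts upward by 0.5: small relative to fleet spread, large relative to its own noise.
today = baselines + rng.normal(0, 0.1, 100)
today[7] += 0.5

# Fleet-wide z-score (pools all trucks): the drift is buried in unit-to-unit variation.
fleet_z = (today - history.mean()) / history.std()

# Per-unit z-score (each truck against its own history): the drift stands out (about 5 sigma).
unit_z = (today - history.mean(axis=1)) / history.std(axis=1)

print(f"truck 7 fleet-wide z: {fleet_z[7]:.1f}")
print(f"truck 7 per-unit z:  {unit_z[7]:.1f}")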
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1].
[1] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Several different unsupervised anomaly detection algorithms have been applied to Space Shuttle Main Engine (SSME) data to serve the purpose of developing a comprehensive suite of Integrated Systems Health Management (ISHM) tools. As the theoretical bases for these methods vary considerably, it is reasonable to conjecture that the resulting anomalies detected by them may differ quite significantly as well. As such, it would be useful to apply a common metric with which to compare the results. However, for such a quantitative analysis to be statistically significant, a sufficient number of examples of both nominally categorized and anomalous data must be available. Due to the lack of sufficient examples of anomalous data, use of any statistics that rely upon a statistically significant sample of anomalous data is infeasible. Therefore, the main focus of this paper will be to compare actual examples of anomalies detected by the algorithms via the sensors in which they appear, as well as the times at which they appear. We find that there is enough overlap in detection of the anomalies among all of the different algorithms tested for them to corroborate the severity of these anomalies. In certain cases, the severity of these anomalies is supported by their categorization as failures by experts, with realistic physical explanations. For those anomalies that cannot be corroborated by at least one other method, this overlap says less about the severity of the anomaly, and more about their technical nuances, which will also be discussed.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
### Ano-AAD Dataset: Comprehensive Anomalous Human Action Detection in Videos
The Ano-AAD dataset is a groundbreaking resource designed to advance the field of anomaly detection in video surveillance. Compiled from an extensive array of sources, including popular social media platforms and various websites, this dataset captures a wide range of human behaviors, both normal and anomalous. By providing a rich and diverse set of video data, the Ano-AAD dataset is poised to significantly enhance the capabilities of surveillance systems and contribute to the development of more sophisticated safety protocols.
#### Inception and Objective
The primary objective behind the creation of the Ano-AAD dataset was to address the pressing need for a comprehensive, well-annotated collection of video footage that can be used to train and evaluate models for detecting anomalous human actions. Recognizing the limitations of existing datasets, which often lack diversity and sufficient examples of real-world scenarios, we embarked on a meticulous process to gather, annotate, and validate a diverse array of videos. Our goal was to ensure that the dataset encompasses a wide variety of environments and actions, thereby providing a robust foundation for the development of advanced anomaly detection algorithms.
#### Data Collection Process
The data collection process for the Ano-AAD dataset was both extensive and methodical. We identified and selected videos from various social media platforms, such as Facebook and YouTube, as well as other online sources. These videos were chosen to represent a broad spectrum of real-world scenarios, including both typical daily activities and less frequent, but critical, anomalous events. Each video was carefully reviewed to ensure it met our criteria for relevance, clarity, and authenticity.
#### Categorization and Annotation
A cornerstone of the Ano-AAD dataset is its detailed categorization and annotation of human actions. Each video clip was meticulously labeled to differentiate between normal activities—such as walking, sitting, and working—and anomalous behaviors, which include arrests, burglaries, explosions, fighting, fire raising, ill treatment, traffic irregularities, attacks, and other violent acts. This comprehensive annotation process was essential to creating a dataset that accurately reflects the complexities of real-world surveillance challenges. Our team of annotators underwent rigorous training to ensure consistency and reliability in the labeling process, and multiple rounds of validation were conducted to maintain high-quality annotations.
#### Ethical Considerations
Throughout the data collection and annotation process, we adhered to strict ethical guidelines and privacy regulations. All videos were sourced from publicly available content, and efforts were made to anonymize individuals to protect their privacy. We prioritized compliance with data protection principles, ensuring that our work not only advanced technological capabilities but also respected the rights and privacy of individuals depicted in the footage.
#### Technical Specifications
The Ano-AAD dataset comprises a total of 354 abnormal videos, amounting to 8.7 GB of abnormal data with a cumulative duration of 11 hours and 25 minutes, plus 41 normal videos with a total duration of 41 minutes. Each video was processed to maintain a uniform format and resolution, standardized to MP4. This consistency in video quality ensures that the dataset can be seamlessly integrated into various machine learning models and computer vision algorithms, facilitating the development and testing of anomaly detection systems.
#### Dataset Breakdown
| Serial Number | Anomaly Class | Total Number of Videos | Size | Duration (HH:MM) |
|---------------|------------------------|------------------------|----------|------------------|
| 1 | Arrest | 49 | 1.7 GB | 2:10 |
| 2 | Burglary | 48 | 948.7 MB | 1:26 |
| 3 | Explosion | 49 | 773 MB | 1:01 |
| 4 | Fighting | 50 | 2.0 GB | 2:23 |
| 5 | Fire Raising | 49 | 999.4 MB | 1:20 |
| 6 | Ill Treatment | 32 | 812.5 MB | 1:07 |
| 7 | Traffic Irregularities | 13 | 79.3 MB | 0:05 |
| 8 | Attack | 38 | 543.8 MB | 0:41 |
| 9 | Violence | 26 | 836 MB | 1:08 |
| Total | | 354 | 8.7 GB | 11:25 |
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Summary
This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.
They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD
How to Get Started
All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the Beach Dataset, assuming it is placed in a folder called "data" alongside the Python script:
import numpy as np
hsi_array = np.load("data/beach_hsi.npy")
n_pixels, n_lines, n_bands = hsi_array.shape
print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

mask_array = np.load("data/beach_mask.npy")
m_pixels, m_lines = mask_array.shape
print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")
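Beyond loading, the pixel masks allow quantitative evaluation. The following is a hedged sketch (not part of the repository above) that scores every pixel with the classic global RX detector and computes ROC AUC against the beach mask; it assumes only the arrays loaded in the example.

import numpy as np
from sklearn.metrics import roc_auc_score

hsi = np.load("data/beach_hsi.npy")      # (pixels, lines, bands)
mask = np.load("data/beach_mask.npy")    # (pixels, lines), 1 = anomaly

X = hsi.reshape(-1, hsi.shape[-1]).astype(np.float64)
mu = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))

# Global RX score: Mahalanobis distance of each pixel spectrum from the scene mean.
d = X - mu
scores = np.einsum("ij,jk,ik->i", d, cov_inv, d)

print("ROC AUC:", roc_auc_score(mask.ravel(), scores))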
Citing the Datasets
If you use any of these datasets, please cite the following paper:
@article{garske2024erx,
  title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
  author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
  journal={arXiv preprint arXiv:2408.14947},
  year={2024}
}
If you use the beach dataset please cite the following paper as well (original source):
@article{mao2022openhsi,
  title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
  author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
  journal={Remote Sensing},
  volume={14},
  number={9},
  pages={2244},
  year={2022},
  publisher={MDPI}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
We present a large-scale anomaly detection dataset collected from IBM Cloud's Console over approximately 4.5 months. This high-dimensional dataset captures telemetry data from multiple data centers and is specifically designed to aid researchers in developing and benchmarking anomaly detection methods in large-scale cloud environments. It contains 39,365 entries, each representing a 5-minute interval, with 117,448 features/attributes (interval_start is used as the index). The dataset includes detailed information on request counts, HTTP response codes, and various aggregated statistics, as well as labeled anomaly events identified through IBM's internal monitoring tools, providing a comprehensive resource for real-world anomaly detection research and evaluation.
File Descriptions
location_downtime.csv
- Details planned and unplanned downtimes for IBM Cloud data centers, including start and end times in ISO 8601 format.
unpivoted_data.parquet
- Contains raw telemetry data with 413 million+ rows, covering details like location, HTTP status codes, request types, and aggregated statistics (min, max, median response times).
anomaly_windows.csv
- Ground truth for anomalies, listing start and end times of recorded anomalies, categorized by source (Issue Tracker, Instant Messenger, Test Log).
pivoted_data_all.parquet
- Pivoted version of the telemetry dataset with 39,365 rows and 117,449 columns, including aggregated statistics across multiple metrics and intervals.
demo/demo.[ipynb|html]
- Provides examples of how to access data in the Parquet files, available in Jupyter Notebook (.ipynb) and HTML (.html) formats.

Further details of the dataset can be found in Appendix B: Dataset Characteristics of the paper titled "Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset." Sample code for training anomaly detectors using this data is provided in this package.
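As a quick orientation for working with the files, the sketch below labels each 5-minute interval in the pivoted data using the ground-truth windows. The window column names ("start", "end") are assumptions; consult the demo notebook for the actual schema.

import pandas as pd

# Pivoted telemetry: one row per 5-minute interval, indexed by interval_start.
df = pd.read_parquet("pivoted_data_all.parquet")

# Ground-truth anomaly windows; the "start"/"end" column names are assumed here.
windows = pd.read_csv("anomaly_windows.csv", parse_dates=["start", "end"])

idx = pd.to_datetime(df.index)
labels = pd.Series(False, index=df.index)
for _, w in windows.iterrows():
    labels |= (idx >= w["start"]) & (idx <= w["end"])

print(f"{labels.sum()} of {len(labels)} intervals fall inside an anomaly window")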
When using the dataset, please cite it as follows:
@misc{islam2024anomaly,
title={Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset},
author={Mohammad Saiful Islam and Mohamed Sami Rakha and William Pourmajidi and Janakan Sivaloganathan and John Steinbacher and Andriy Miranskyy},
year={2024},
eprint={2411.09047},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2411.09047}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This item is part of the collection "AIS Trajectories from Danish Waters for Abnormal Behavior Detection"
DOI: https://doi.org/10.11583/DTU.c.6287841
Using Deep Learning for detection of maritime abnormal behaviour in spatio-temporal trajectories is a relatively new and promising application. Open access to the Automatic Identification System (AIS) has made large amounts of maritime trajectories publicly available. However, these trajectories are unannotated when it comes to the detection of abnormal behaviour.
The lack of annotated datasets for abnormality detection on maritime trajectories makes it difficult to evaluate and compare suggested models quantitatively. With this dataset, we attempt to provide a way for researchers to evaluate and compare performance.
We have manually labelled trajectories which showcase abnormal behaviour following a collision accident. The annotated dataset consists of 521 data points with 25 abnormal trajectories. The abnormal trajectories cover, among others: colliding vessels, vessels engaged in Search-and-Rescue activities, law enforcement, and commercial maritime traffic forced to deviate from its normal course.
These datasets consist of labelled trajectories for the purpose of evaluating unsupervised models for detection of abnormal maritime behaviour. For unlabelled datasets for training, please refer to the collection (link in Related publications).
The dataset is an example of a SAR event and cannot be considered representative of a large population of all SAR events.
The dataset consists of a total of 521 trajectories, of which 25 are labelled as abnormal. The data is captured on a single day in a specific region. The remaining normal traffic is representative of the traffic during the winter season. The normal traffic in the ROI has a fairly high seasonality related to fishing and leisure sailing traffic.
The data is saved using the pickle format for Python. Each dataset is split into 2 files with naming convention:
datasetInfo_XXX
data_XXX
Files named "data_XXX" contains the extracted trajectories serialized sequentially one at a time and must be read as such. Please refer to provided utility functions for examples. Files named "datasetInfo" contains Metadata related to the dataset and indecies at which trajectories begin in "data_XXX" files.
The data are sequences of maritime trajectories defined by their; timestamp, latitude/longitude position, speed, course, and unique ship identifer MMSI. In addition, the dataset contains metadata related to creation parameters. The dataset has been limited to a specific time period, ship types, moving AIS navigational statuses, and filtered within an region of interest (ROI). Trajectories were split if exceeding an upper limit and short trajectories were discarded. All values are given as metadata in the dataset and used in the naming syntax.
Naming syntax: data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl
See datasheet for more detailed information and we refer to provided utility functions for examples on how to read and plot the data.
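Because the trajectories are pickled sequentially, one straightforward reading pattern (a sketch only; the provided utility functions remain the reference) is to unpickle repeatedly until end-of-file:

import pickle

def read_trajectories(path):
    """Read sequentially pickled trajectories from a data_XXX file."""
    trajectories = []
    with open(path, "rb") as f:
        while True:
            try:
                trajectories.append(pickle.load(f))
            except EOFError:
                break
    return trajectories

# Example (hypothetical file name following the documented syntax):
# trajs = read_trajectories("data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl")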
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bologna, Bologna, Italy) in the period March-May 2018.
The data set has been used to train an autoencoder-based model to automatically detect anomalies in a semi-supervised fashion on a real HPC system.
This work is described in:
1) "Anomaly Detection using Autoencoders in High Performance Computing Systems", Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini, IAAI19 (proceedings in process) -- https://arxiv.org/abs/1902.08447
2) "Online Anomaly Detection in HPC Systems", Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini, AICAS19 (proceedings in process) -- https://arxiv.org/abs/1811.05269
See the git repository for usage examples & details --> https://github.com/AndreaBorghesi/anomaly_detection_HPC
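The repository above is the authoritative implementation. Purely as a hedged sketch of the semi-supervised idea (train an autoencoder on data from normal operation, then flag samples with high reconstruction error), assuming feature vectors in a NumPy array:

import numpy as np
import torch
from torch import nn

def train_autoencoder(X_normal, epochs=50, lr=1e-3):
    """Train on (presumed) normal data only; anomalies are flagged by reconstruction error."""
    d = X_normal.shape[1]
    model = nn.Sequential(
        nn.Linear(d, 32), nn.ReLU(),
        nn.Linear(32, 8), nn.ReLU(),
        nn.Linear(8, 32), nn.ReLU(),
        nn.Linear(32, d),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    x = torch.tensor(X_normal, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), x)
        loss.backward()
        opt.step()
    return model

def reconstruction_error(model, X):
    with torch.no_grad():
        x = torch.tensor(X, dtype=torch.float32)
        return ((model(x) - x) ** 2).mean(dim=1).numpy()

# Samples whose error exceeds a threshold chosen on held-out normal data are treated as anomalies.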
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
IMAD-DS is a dataset developed for multi-rate, multi-sensor anomaly detection (AD) in industrial environments that considers varying operational and environmental conditions, known as domain shifts.
Dataset Overview:
This dataset includes data from two scaled industrial machines: a robotic arm and a brushless motor.
It includes both normal and abnormal data recorded under various operating conditions to account for domain shifts, which fall into the operational and environmental categories described below.
Robotic Arm: The robotic arm is a scaled version of a robotic arm used to move silicon wafers in a factory. Anomalies are created by removing bolts at the nodes of the arm, resulting in an imbalance in the machine.
Brushless Motor: The brushless motor is a scaled representation of an industrial brushless motor. Two anomalies are introduced: first, a magnet is moved closer to the motor load, causing oscillations by interacting with two symmetrical magnets on the load; second, a belt that rotates in unison with the motor shaft is tightened, creating mechanical stress.
The following domain shifts are included in the dataset:
Operational Domain Shifts: Variations caused by changes in machine conditions (e.g., load changes for the robotic arm and speed changes for the brushless motor).
Environmental Domain Shifts: Variations due to changes in background noise levels.
Combinations of operating and environmental conditions divide each machine's dataset into two subsets: the source domain and the target domain. The source domain has a large number of training examples. The target domain, instead, has limited training data. This discrepancy highlights a common issue in the industry where sufficient training data is often unavailable for the target domain, as machine data is collected under controlled environments that do not fully represent the deployment environments.
Data Collection and Processing:
Data is collected using the STEVAL-STWINBX1 IoT Sensor Industrial Node. The sensors used to record the dataset are the following:
· Analog Microphone (16 kHz)
· 3-axis Accelerometer (6.7 kHz)
· 3-axis Gyroscope (6.7 kHz)
Recordings are conducted in an anechoic chamber to control acoustic conditions precisely.
Data Format:
Files are already divided into train and test sets. Inside each folder, each sensor's data is stored in a separate '.parquet' file.
Sensor files related to the same segment of machine data share a unique ID. The mapping of each machine data segment to the sensor files is given in .csv files inside the train and test folders. Those .csv files also contain metadata denoting the operational and environmental conditions of a specific segment.
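A hedged sketch of the access pattern this implies; the exact file names, folder layout, and column names below are assumptions, so adapt them to the actual mapping .csv files:

import pandas as pd

# Metadata CSV maps each machine-data segment ID to its sensor files and conditions;
# the file name and "segment_id" column below are assumptions based on the description.
meta = pd.read_csv("train/metadata.csv")

for _, seg in meta.iterrows():
    seg_id = seg["segment_id"]  # assumed column name
    mic = pd.read_parquet(f"train/microphone/{seg_id}.parquet")      # 16 kHz audio
    acc = pd.read_parquet(f"train/accelerometer/{seg_id}.parquet")   # 6.7 kHz IMU
    # ... process one multi-rate segment here ...
    break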
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for the Outlier Detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/). We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g., digit 1) and we sample with uniform probability from the remaining images as outliers, such that their number is equal to 10% of that of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors. Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9). The statistics of each dataset are listed below (Name | Instances | Dimensions | Number of Outliers in %):

MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
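A minimal loading sketch under the layout described above (gzip-compressed CSV, outlier label in the last column); the assumption that the 784 pixel values occupy the first columns is ours:

import pandas as pd

df = pd.read_csv("MNIST_1.csv.gz")   # inlier class = digit 1; pandas handles .gz transparently
y = df.iloc[:, -1]                   # outlier label (yes/no), documented as the last column
X = df.iloc[:, :784]                 # assumed: the 784 flattened pixel values come first
print(X.shape, y.value_counts(normalize=True))  # outliers should be about 10%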
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
The authors of MVTec AD, the MVTec Anomaly Detection dataset, addressed the critical task of detecting anomalous structures within natural image data, a crucial aspect of computer vision applications. To facilitate the development of methods for unsupervised anomaly detection, they introduced the MVTec AD dataset, comprising 5354 high-resolution color images encompassing various object and texture categories. The dataset contains both normal images, intended for training, and images with anomalies, designed for testing. These anomalies manifest in over 70 distinct types of defects, including scratches, dents, contaminations, and structural alterations. The authors also provided pixel-precise ground truth annotations for all anomalies.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.
Key Features:
This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
Description
This dataset is the "evaluation dataset" for the DCASE 2020 Challenge Task 2 "Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring" [task description].
In the task, three datasets have been released: "development dataset", "additional training dataset", and "evaluation dataset". This evaluation dataset was the last of the three released. This dataset includes around 400 samples for each Machine Type and Machine ID used in the evaluation dataset, none of which have a condition label (i.e., normal or anomaly).
The recording procedure and data format are the same as the development dataset and additional training dataset. The Machine IDs in this dataset are the same as those in the additional training dataset. For more information, please see the pages of the development dataset and the task description.
After the DCASE 2020 Challenge, we released the ground truth for this evaluation dataset.
Directory structure
Once you unzip the downloaded files from Zenodo, you can see the following directory structure. Machine Type information is given by directory name, and Machine ID and condition information are given by file name, as:
/eval_data
/ToyCar
/test (Normal and anomaly data for all Machine IDs are included, but they do not have a condition label.)
/id_05_00000000.wav
...
/id_05_00000514.wav
/id_06_00000000.wav
...
/id_07_00000514.wav
/ToyConveyor (The other Machine Types have the same directory structure as ToyCar.)
/fan
/pump
/slider
/valve
The paths of audio files are:
"/eval_data//test/id_[0-9]+.wav"
For example, the Machine Type and Machine ID of "/ToyCar/test/id_05_00000000.wav" are "ToyCar" and "05", respectively. Unlike the development dataset and additional training dataset, its condition label is hidden.
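A small sketch for recovering the Machine Type and Machine ID from a file path under this convention:

import re
from pathlib import Path

def parse_eval_path(path):
    """E.g. '/eval_data/ToyCar/test/id_05_00000000.wav' -> ('ToyCar', '05')."""
    p = Path(path)
    machine_type = p.parts[-3]  # directory above 'test'
    machine_id = re.match(r"id_(\d+)_\d+\.wav", p.name).group(1)
    return machine_type, machine_id

print(parse_eval_path("/eval_data/ToyCar/test/id_05_00000000.wav"))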
Baseline system
A simple baseline system is available on the GitHub repository [URL]. The baseline system provides a simple entry-level approach that gives reasonable performance on the Task 2 dataset. It is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was created jointly by NTT Corporation and Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite all the following three papers:
Yuma Koizumi, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto, "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection," in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019. [pdf]
Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019. [pdf]
Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, Toshiki Nakamura, Yuki Nikaido, Ryo Tanabe, Harsh Purohit, Kaori Suefusa, Takashi Endo, Masahiro Yasuda, and Noboru Harada, "Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring," in Proc. 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2020. [pdf]
Feedback
If there is any problem, please contact us:
Yuma Koizumi, koizumi.yuma@ieee.org
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper "What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy" by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

Context and methodology
Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance prior to establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess the dynamics and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations related to: cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters, and local outliers. It collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3].
This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

References
[1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).
[2] Perini, L., Galvin, C., Vercruyssen, V.: A ranking stability measure for quantifying the robustness of anomaly detection methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).
[3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

Technical details
Experiments are tested in Python 3.9.6. The provided scripts generate all synthetic data and results. We keep them in the repo for the sake of comparability and replicability ("outputs.zip" file). The file and folder structure is as follows:
"compare_scores_group.py" is a Python script to extract the new dynamic indices proposed in the paper.
"generate_data.py" is a Python script to generate the datasets used for evaluation.
"latex_table.py" is a Python script to show results in a latex-table format.
"merge_indices.py" is a Python script to merge accuracy and dynamic indices in the same table-structured summary.
"metric_corr.py" is a Python script to calculate correlation estimations between indices.
"outdet.py" is a Python script that runs outlier detection with different algorithms on diverse datasets.
"perini_tests.py" is a Python script to run Perini's confidence and stability on all datasets and algorithms' performances.
"scatterplots.py" is a Python script that generates scatter plots for comparing accuracy and dynamic performances.
"README.md" provides explanations and step-by-step instructions for replication.
"requirements.txt" contains references to required Python libraries and versions.
"outputs.zip" contains all result tables, plots and synthetic data generated with the scripts.
[data/real_data] contains CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

License
The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.
ToyADMOS2 dataset is a large-scale dataset for anomaly detection in machine operating sounds (ADMOS), designed for evaluating systems under domain-shift conditions. It consists of two sub-datasets for machine-condition inspection: fault diagnosis of machines with geometrically fixed tasks ("toy car") and fault diagnosis of machines with moving tasks ("toy train"). Domain shifts are represented by introducing several differences in operating conditions, such as the use of the same machine type but with different machine models and part configurations, different operating speeds, microphone arrangements, etc. Each sub-dataset contains over 27 k samples of normal machine-operating sounds and over 8 k samples of anomalous sounds recorded at a 48-kHz sampling rate. A subset of the ToyADMOS2 dataset was used in the DCASE 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions.
What makes this dataset different from others is that it is not used as is, but in conjunction with the tool provided on GitHub. The mixer tool lets you create datasets with any combination of recordings by describing the amount you need in a recipe file.
The samples are compressed as MPEG-4 ALS (MPEG-4 Audio Lossless Coding) with a suffix of '.mp4' that you can load by using the audioread or librosa python module.
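For example, a minimal loading sketch (the file name is hypothetical; sr=None preserves the native 48-kHz rate):

import librosa

audio, sr = librosa.load("ToyADMOS2/toy_car_example.mp4", sr=None)  # hypothetical path
print(f"{len(audio)} samples at {sr} Hz")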
The total size of files under a folder ToyADMOS2 is 149 GB, and the total size of example benchmark datasets that are created from the ToyADMOS2 dataset is 13.2 GB.
The details of the dataset are described in [1] and on GitHub: https://github.com/nttcslab/ToyADMOS2-dataset
License: see LICENSE.pdf for the details of the license.
[1] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions," 2021. https://arxiv.org/abs/2106.02369
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Industrial Screw Driving Datasets
Overview
This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.
| Scenario name | Number of work pieces | Repetitions (screw cycles) per workpiece | Individual screws per workpiece | Observations | Unique classes | Purpose |
|---|---|---|---|---|---|---|
| s01_thread-degradation | 100 | 25 | 2 | 5,000 | 1 | Investigation of thread degradation through repeated fastening |
| s02_surface-friction | 250 | 25 | 2 | 12,500 | 8 | Surface friction effects on screw driving operations |
| s03_error-collection-1 | | 1 | 2 | | 20 | |
| s04_error-collection-2 | 2,500 | 1 | 2 | 5,000 | 25 | |
| s05_injection-molding-manipulations-upper-workpiece | 1,200 | 1 | 2 | 2,400 | 44 | Investigation of changes in the injection molding process of the workpieces |
Dataset Collection
The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:
Currently Available Datasets:
s01_thread-degradation
Focus: Investigation of thread degradation through repeated fastening
Samples: 5,000 screw operations (4,089 normal, 911 faulty)
Features: Natural degradation patterns, no artificial error induction
Equipment: Delta PT 40x12 screws, thermoplastic components
Process: 25 cycles per location, two locations per workpiece
First published in: HICSS 2024 (West & Deuse, 2024)
s02_surface-friction
Focus: Surface friction effects on screw driving operations
Samples: 12,500 screw operations (9,512 normal, 2,988 faulty)
Features: Eight distinct surface conditions (baseline to mechanical damage)
Equipment: Delta PT 40x12 screws, thermoplastic components, surface treatment materials
Process: 25 cycles per location, two locations per workpiece
First published in: CIE51 2024 (West & Deuse, 2024)
s05_injection-molding-manipulations-upper-workpiece
Focus: Manipulations of the injection molding process with no changes during tightening
Samples: 2,400 screw operations (2,397 normal, 3 faulty)
Features: 44 classes in five distinct groups:
Mold temperature
Glass fiber content
Recyclate content
Switching point
Injection velocity
Equipment: Delta PT 40x12 screws, thermoplastic components
Unpublished, work in progress
Upcoming Datasets:
s03_error-collection-1
Focus: Various manipulations of the screw driving process
Features: More than 20 different errors recorded
First published in: Publication planned
Status: In preparation
s04_error-collection-2
Focus: Various manipulations of the screw driving process
Features: 25 distinct errors recorded over the course of a week
First published in: Publication planned
Status: In preparation
Additional scenarios may be added to this collection as they become available.
Data Format
Each dataset follows a standardized structure:
JSON files containing individual screw operation data
CSV files with operation metadata and labels
Comprehensive documentation in README files
Example code for data loading and processing is available in the companion library PyScrew
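For readers not using PyScrew, a hedged sketch with standard libraries only (the paths and field layout here are assumptions, not documented specifics):

import json
import csv

# Labels CSV: one row of metadata/label per screw operation (assumed layout and file name).
with open("labels.csv", newline="") as f:
    labels = list(csv.DictReader(f))

# Each JSON file holds one screw operation's time series (torque, angle, ...); path is hypothetical.
with open("json/ok/run_0001.json") as f:
    operation = json.load(f)

print(len(labels), "labelled operations; top-level JSON keys:", list(operation)[:5])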
Research Applications
These datasets are suitable for various research purposes:
Machine learning model development and validation
Process monitoring and control systems
Quality assurance methodology development
Manufacturing analytics research
Anomaly detection algorithm benchmarking
Usage Notes
All datasets include both normal operations and process anomalies
Complete time series data for torque, angle, and additional parameters available
Detailed documentation of experimental conditions and setup
Data collection procedures and equipment specifications available
Access and Citation
These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.
Related Tools
We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries. Common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.
Contact and Support
For questions, issues, or collaboration interests regarding these datasets, either:
Open an issue in our GitHub repository PyScrew
Contact us directly via email
Acknowledgments
These datasets were collected and prepared by:
RIF Institute for Research and Transfer e.V.
University of Kassel, Institute of Material Engineering
Technical University Dortmund, Institute for Production Systems
The preparation and provision of the research was supported by:
German Federal Ministry of Education and Research (BMBF)
European Union's "NextGenerationEU" program
Change Log
| Version | Date | Features |
|---|---|---|
| v1.1.3 | 18.02.2025 | |
| v1.1.2 | 12.02.2025 | label.csv and README.md in all scenarios |
| v1.1.1 | 12.02.2025 | Reupload of both s01 and s02 as zip (smaller size) and tar (faster extraction) files; change to the data structure (now organized as subdirectories per class in json/) |
| v1.1.0 | 30.01.2025 | s02_surface-friction |
| v1.0.0 | 24.01.2025 | s01_thread-degradation |
CC0 1.0 Universal (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening on Oct 9th, 2023, using Wireshark. The dataset consists of 394,137 instances, stored in a CSV (Comma-Separated Values) file. This large dataset can be utilised for a variety of machine learning tasks, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
Content :
This network traffic dataset consists of 7 features. Each instance contains the information of source and destination IP addresses; the majority of the properties are numeric in nature, but there are also nominal and date kinds due to the Timestamp.
The network traffic flow statistics (No., Time, Source, Destination, Protocol, Length, Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
No: number of the instance
Timestamp: timestamp of the instance of network traffic
Source IP: IP address of the source
Destination IP: IP address of the destination
Protocol: protocol used by the instance
Length: length of the instance
Info: information about the traffic instance
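A loading sketch for this layout (the CSV file name is hypothetical; the columns follow the Wireshark export described above):

import pandas as pd

df = pd.read_csv("network_traffic.csv")  # hypothetical file name
# Assumes the export has exactly the 7 columns documented above, in this order.
df.columns = ["No", "Timestamp", "Source IP", "Destination IP",
              "Protocol", "Length", "Info"]
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")

# Example: per-protocol traffic volume, a typical first step before anomaly detection.
print(df.groupby("Protocol")["Length"].agg(["count", "sum"]))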
Acknowledgements :
I would like to thank the University of Cincinnati for providing the infrastructure for the generation of this network traffic dataset.
Ravikumar Gattu , Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques benefits from this Dataset :
This dataset is highly useful because it consists of 394,137 instances of network traffic data obtained by using the 25 applications on public, private and enterprise networks. The dataset also contains very important features that can be used for most applications of machine learning in cybersecurity. Here are a few of the potential machine learning applications that could benefit from this dataset:
1. Network Performance Monitoring: This large network traffic dataset can be utilised for analysing the network traffic to identify patterns in the network. This helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: A large network traffic dataset can be utilised for training machine learning models to find irregularities in the traffic, which could help identify cyber attacks.
3. Network Intrusion Detection: This large dataset could be utilised for training machine learning algorithms and designing models for detection of traffic issues, malicious network traffic, and DoS attacks as well.
A new gravity compilation has been assembled for the Lake Superior region. The compilation includes survey stations available from Natural Resources Canada, National Centers for Environmental Information (formerly National Geophysical Data Center), Minnesota Geological Survey, and U.S. Geological Survey. Individual databases were combined and duplicates were removed, resulting in a database of 63,880 gravity stations. The gravity station data were reprocessed from observed gravity to simple Bouguer anomaly following standard methods that depend on the station type (for example, land, lake surface, or lake bottom observation) and used a reduction density of 2,670 kg/m3. The compilation provides a consistent dataset appropriate for gravity modeling that extends across Lake Superior shores.
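For reference, the simple Bouguer reduction for a land station can be sketched as follows; it combines a free-air correction (about 0.3086 mGal/m) with an infinite-slab correction 2*pi*G*rho*h, which at the stated 2,670 kg/m3 is roughly 0.1119 mGal/m. Lake-surface and lake-bottom stations require different terms, as the text notes.

def simple_bouguer_anomaly(g_obs_mgal, g_theoretical_mgal, elevation_m, density=2670.0):
    """Simple Bouguer anomaly (mGal) for a land station.

    Free-air correction: +0.3086 mGal/m; Bouguer slab: 2*pi*G*rho*h
    (~0.1119 mGal/m at 2670 kg/m^3). Lake stations need different terms.
    """
    two_pi_g = 2 * 3.141592653589793 * 6.674e-11        # m^3 kg^-1 s^-2
    slab_mgal = two_pi_g * density * elevation_m * 1e5  # 1 m/s^2 = 1e5 mGal
    free_air_mgal = 0.3086 * elevation_m
    return g_obs_mgal - g_theoretical_mgal + free_air_mgal - slab_mgal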