Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
ESA Anomaly Dataset is the first large-scale, real-life satellite telemetry dataset with curated anomaly annotations, originating from three ESA missions. We hope that this unique dataset will allow researchers and scientists from academia, research institutes, national and international space agencies, and industry to benchmark models and approaches on a common baseline, as well as to research and develop novel, computationally efficient approaches for anomaly detection in satellite telemetry data.
The dataset results from the work of an 18-month project carried out by an industry consortium composed of Airbus Defence and Space, KP Labs, and the European Space Agency's European Space Operations Centre. The project, funded by the European Space Agency (ESA), is part of the Artificial Intelligence for Automation (A²I) Roadmap (De Canio et al., 2023), a large endeavour started in 2021 to automate space operations by leveraging artificial intelligence.
Further details can be found on arXiv and GitHub.
References
De Canio, G. et al. (2023) Development of an actionable AI roadmap for automating mission operations. In, 2023 SpaceOps Conference. American Institute of Aeronautics and Astronautics, Dubai, United Arab Emirates.
The Numenta Anomaly Benchmark (NAB) is a novel benchmark for evaluating algorithms for anomaly detection in streaming, online applications. It comprises over 50 labeled real-world and artificial timeseries data files plus a novel scoring mechanism designed for real-time applications. All of the data and code are fully open-source, with extensive documentation and a scoreboard of anomaly detection algorithms: github.com/numenta/NAB. The full dataset is included here, but please go to the repo for details on how to evaluate anomaly detection algorithms on NAB.
The NAB corpus of 58 timeseries data files is designed to provide data for research in streaming anomaly detection. It comprises both real-world and artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.
The majority of the data is real-world, from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. All data is included in the repository, with more details in the data readme. We are in the process of adding more data and actively searching for additional sources. Please contact us at nab@numenta.org if you have similar data (ideally with known anomalies) that you would like to see incorporated into NAB.
The NAB version will be updated whenever new data (and corresponding labels) is added to the corpus; NAB is currently in v1.0.
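Since NAB files are plain CSVs of ordered, timestamped, single-valued metrics, a minimal reader needs only the standard library. A sketch (the sample rows below are illustrative, not taken from a real NAB file):

```python
import csv
import io

# NAB data files are CSVs with a header row of "timestamp,value".
# The inline sample here stands in for reading an actual file from disk.
sample = io.StringIO(
    "timestamp,value\n"
    "2014-04-01 00:00:00,10844\n"
    "2014-04-01 00:30:00,8127\n"
)
rows = [(r["timestamp"], float(r["value"])) for r in csv.DictReader(sample)]
print(rows[0])  # ('2014-04-01 00:00:00', 10844.0)
```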
realAWSCloudwatch/
AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes.
realAdExchange/
Online advertisement clicking rates, where the metrics are cost-per-click (CPC) and cost per thousand impressions (CPM). One of the files is normal, without anomalies.
realKnownCause/
This is data for which we know the anomaly causes; no hand labeling.
ambient_temperature_system_failure.csv: The ambient temperature in an office setting.
cpu_utilization_asg_misconfiguration.csv: From Amazon Web Services (AWS) monitoring of CPU usage, i.e. average CPU usage across a given cluster. When usage is high, AWS spins up a new machine, and uses fewer machines when usage is low.
ec2_request_latency_system_failure.csv: CPU usage data from a server in Amazon's East Coast datacenter. The dataset ends with complete system failure resulting from a documented failure of AWS API servers. There's an interesting story behind this data on the Numenta blog: http://numenta.com/blog/anomaly-of-the-week.html
machine_temperature_system_failure.csv: Temperature sensor data from an internal component of a large industrial machine. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine.
nyc_taxi.csv: The number of NYC taxi passengers, where the five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snowstorm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
rogue_agent_key_hold.csv: Timing of key holds for several users of a computer, where the anomalies represent a change in user.
rogue_agent_key_updown.csv: Timing of key strokes for several users of a computer, where the anomalies represent a change in user.
realTraffic/
Real-time traffic data from the Twin Cities Metro area in Minnesota, collected by the Minnesota Department of Transportation. Metrics include occupancy, speed, and travel time from specific sensors.
realTweets/
A collection of Twitter mentions of large publicly-traded companies such as Google and IBM. The metric value represents the number of mentions for a given ticker symbol every 5 minutes.
artificialNoAnomaly/
Artificially-generated data without any anomalies.
artificialWithAnomaly/
Artificially-generated data with varying types of anomalies.
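As an illustration of the 30-minute bucketing used for nyc_taxi.csv, here is a minimal pure-Python sketch (the ride counts below are made up, not NAB data):

```python
from collections import defaultdict
from datetime import datetime

def bucket_counts(records, minutes=30):
    """Aggregate (timestamp, count) records into fixed-width time buckets.
    `minutes` must divide 60 evenly (e.g. 5, 15, 30)."""
    buckets = defaultdict(int)
    for ts, n in records:
        # Snap each timestamp down to the start of its bucket.
        start = ts.replace(minute=ts.minute - ts.minute % minutes,
                           second=0, microsecond=0)
        buckets[start] += n
    return dict(sorted(buckets.items()))

# Hypothetical per-ride counts aggregated into 30-minute buckets.
rides = [(datetime(2014, 7, 1, 0, 5), 10),
         (datetime(2014, 7, 1, 0, 20), 7),
         (datetime(2014, 7, 1, 0, 40), 4)]
print(bucket_counts(rides))  # two buckets: 00:00 -> 17, 00:30 -> 4
```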
We encourage you to publish your results on running NAB, and share them with us at nab@numenta.org. Please cite the following publication when referring to NAB:
Lavin, Alexander and Ahmad, Subutai. "Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark", Fourteenth International Conference on Machine Learning and Applications, December 2015. [PDF]
ToyADMOS dataset is a machine operating sounds dataset of approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds, collected with four microphones at a 48 kHz sampling rate, prepared by Yuma Koizumi and members of NTT Media Intelligence Laboratories. The dataset consists of three sub-datasets: "toy car" for a product inspection task, "toy conveyor" for fault diagnosis of a fixed machine, and "toy train" for fault diagnosis of a moving machine.
Since the total size of the ToyADMOS dataset is over 440 GB, each sub-dataset is split into 7-9 files with 7-zip (7z format). The total size of the compressed dataset is approximately 180 GB, and that of each sub-dataset is approximately 60 GB. Download the files corresponding to the sub-datasets of interest and use your favorite compression tool to extract these split archives.
The detail of the dataset is described in [1] and GitHub: https://github.com/YumaKoizumi/ToyADMOS-dataset
License: see the file named LICENSE.pdf
[1] Yuma Koizumi, Shoichiro Saito, Noboru Harada, Hisashi Uematsu and Keisuke Imoto, "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection," in Proc of Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers took on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.
General Information:
This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.
You may find details of this dataset from the original paper:
Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".
If you use the data, implementation, or any details of the paper, please cite!
BIBTEX:
@inproceedings{nedelkoski2020multi, title={Multi-source Distributed System Data for AI-Powered Analytics}, author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej}, booktitle={European Conference on Service-Oriented and Cloud Computing}, pages={161--176}, year={2020}, organization={Springer} }
The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed. The sequential_data is generated by executing a workload of sequential user requests. The concurrent_data is generated by executing a workload of concurrent user requests.
The raw logs in both datasets contain the same files. Users who want the logs filtered by time with respect to the two datasets should refer to the timestamps in the metrics (they provide the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.
Important: The logs and the metrics are synchronized with respect to time, and they are both recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, i.e. two hours behind CEST). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
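A minimal sketch of aligning trace timestamps (UTC) with the logs and metrics (CEST), assuming ISO-like timestamp strings; the exact timestamp formats in the dataset may differ:

```python
from datetime import datetime, timedelta, timezone

# Logs/metrics are recorded in CEST (UTC+2); traces are in UTC.
# Converting trace timestamps to CEST puts all three sources on one clock.
CEST = timezone(timedelta(hours=2))

def trace_to_cest(ts_utc: str) -> str:
    """Convert a trace timestamp string from UTC to CEST."""
    dt = datetime.fromisoformat(ts_utc).replace(tzinfo=timezone.utc)
    return dt.astimezone(CEST).isoformat()

print(trace_to_cest("2019-11-25 10:00:00"))  # 2019-11-25T12:00:00+02:00
```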
Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
See the official website: https://autovi.utc.fr
Modern industrial production lines must be set up with robust defect inspection modules that are able to withstand high product variability. This means that in a context of industrial production, new defects that are not yet known may appear, and must therefore be identified.
On industrial production lines, the typology of potential defects is vast (texture, part failure, logical defects, etc.). Inspection systems must therefore be able to detect non-listed defects, i.e. defects not yet observed at the time the inspection system was developed. Solving this problem requires research and development of unsupervised AI algorithms on real-world data.
Renault Group and the Université de technologie de Compiègne (Roberval and Heudiasyc Laboratories) have jointly developed the Automotive Visual Inspection Dataset (AutoVI), the purpose of which is to be used as a scientific benchmark to compare and develop advanced unsupervised anomaly detection algorithms under real production conditions. The images were acquired on Renault Group's automotive production lines, in a genuine industrial production line environment, with variations in brightness and lighting on constantly moving components. This dataset is representative of actual data acquisition conditions on automotive production lines.
The dataset contains 3950 images, split into 1530 training images and 2420 testing images.
The evaluation code can be found at https://github.com/phcarval/autovi_evaluation_code.
Disclaimer
All defects shown were intentionally created on Renault Group's production lines for the purpose of producing this dataset. The images were examined and labeled by Renault Group experts, and all defects were corrected after shooting.
License
Copyright © 2023-2024 Renault Group
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of the license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.
For using the data in a way that falls under the commercial use clause of the license, please contact us.
Attribution
Please use the following for citing the dataset in scientific work:
Carvalho, P., Lafou, M., Durupt, A., Leblanc, A., & Grandvalet, Y. (2024). The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection [Dataset]. https://doi.org/10.5281/zenodo.10459003
Contact
If you have any questions or remarks about this dataset, please contact us at philippe.carvalho@utc.fr, meriem.lafou@renault.com, alexandre.durupt@utc.fr, antoine.leblanc@renault.com, yves.grandvalet@utc.fr.
Changelog
v1.0.0
Cropped engine_wiring, pipe_clip and pipe_staple images
Reduced tank_screw, underbody_pipes and underbody_screw image sizes
v0.1.1
Added ground truth segmentation maps
Fixed categorization of some images
Added new defect categories
Removed tube_fastening and kitting_cart
Removed duplicates in pipe_clip
The Squirrel Cage Induction Motor Fault Diagnosis Dataset is a multi-sensor data collection gathered to expand research on anomaly detection, fault diagnosis, and predictive maintenance, mainly using non-invasive methods such as thermal observation or vibration measurement. The measurements were gathered using an advanced Wrocław University of Science and Technology laboratory designed to simulate and study motor defects. The collected dataset is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Available data:
thermal images
An example of dataset utilization is presented in the GitHub repository: motor-fault-diagnosis
Related publications:
Unraveling Induction Motor State through Thermal Imaging and Edge Processing: A Step towards Explainable Fault Diagnosis
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated from four types of industrial machines, i.e. valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contains normal sounds (from 5000 seconds to 10000 seconds) and anomalous sounds (about 1000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). Also, the background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded by an eight-channel microphone array at a 16 kHz sampling rate and 16 bits per sample. The MIMII dataset supports benchmarking of sound-based machine fault diagnosis. Users can test performance for specific functions, e.g., unsupervised anomaly detection, transfer learning, noise robustness, etc. The detail of the dataset is described in [1][2].
This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/
*1: This version "public 1.0" contains four models (model IDs 00, 02, 04, and 06). The remaining three models will be released in a future edition.
[1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.
[2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.
This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.
Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when they exceed a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction:
- Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable.
- Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables.
- Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest.
- Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.
The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.
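The exact weighting pyhydroqc uses may differ, but blending forward and backward piecewise estimates across an anomalous gap can be sketched as a linear crossfade (function name is ours):

```python
def blend_corrections(forecast, backcast):
    """Linearly blend estimates across an anomalous gap.

    forecast: estimates from the ARIMA model fit on valid points before
              the gap, projected forward over the gap.
    backcast: estimates from the model fit on valid points after the gap,
              projected backward over the gap.
    Weight shifts from the forecast at the start of the gap to the
    backcast at the end."""
    n = len(forecast)
    if n == 1:
        return [0.5 * (forecast[0] + backcast[0])]
    return [((n - 1 - i) * forecast[i] + i * backcast[i]) / (n - 1)
            for i in range(n)]

print(blend_corrections([1.0, 1.0, 1.0], [3.0, 3.0, 3.0]))  # [1.0, 2.0, 3.0]
```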
The anomaly detection and correction workflow involves the following steps:
1. Retrieving data
2. Applying rules-based detection to screen data and apply initial corrections
3. Identifying and correcting sensor drift and calibration (if applicable)
4. Developing a model (i.e., ARIMA or LSTM)
5. Applying the model to make time series predictions
6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
7. Widening the window over which an anomaly is identified
8. Aggregating detections resulting from multiple models
9. Making corrections for anomalous events
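The threshold-based detection step of the workflow (comparing sensor observations to model estimates) can be sketched as follows; the function name and sample values are illustrative, not pyhydroqc's API:

```python
def detect_anomalies(observed, predicted, threshold):
    """Flag points where the absolute residual between a sensor
    observation and a model estimate exceeds a threshold."""
    return [abs(o - p) > threshold for o, p in zip(observed, predicted)]

# Hypothetical sensor readings vs. model (e.g. ARIMA/LSTM) estimates.
obs = [10.0, 10.2, 14.9, 10.1]
pred = [10.0, 10.1, 10.2, 10.1]
print(detect_anomalies(obs, pred, threshold=1.0))  # third point flagged
```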
Instructions to run the notebook through the CUAHSI JupyterHub:
1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into CUAHSI JupyterHub using your HydroShare credentials.
2. Select 'Python 3.8 - Scientific' as the server and click Start.
3. From your JupyterHub directory, click on the ExampleNotebook.ipynb file.
4. Execute each cell in the code by clicking the Run button.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
Mudestreda Multimodal Device State Recognition Dataset
obtained from a real industrial milling device, with Time Series and Image Data for Classification, Regression, Anomaly Detection, Remaining Useful Life (RUL) estimation, Signal Drift measurement, Zero-Shot Flank Tool Wear, and Feature Engineering purposes.
The official dataset used in the paper "Multimodal Isotropic Neural Architecture with Patch Embedding" ICONIP23.
Official repository: https://github.com/hubtru/Minape
Conference paper: https://link.springer.com/chapter/10.1007/978-981-99-8079-6_14
Mudestreda (MD) | Size: 512 samples (instances, observations) | Modalities: 4 | Classes: 3
Future research: Regression, Remaining Useful Life (RUL) estimation, Signal Drift detection, Anomaly Detection, Multivariate Time Series Prediction, and Feature Engineering.
Notice: Tables and images do not render properly.
Recommended: README.md, which includes the Mudestreda description and the images Mudestreda.png and Mudestreda_Stage.png.
Data Overview
Task: Uni/Multi-Modal Classification
Domain: Industrial Flank Tool Wear of the Milling Machine
Input (sample): 4 Images: 1 Tool Image, 3 Spectrograms (X, Y, Z axis)
Output: Machine state classes: Sharp, Used, Dulled
Evaluation: Accuracy, Precision, Recall, F1-score, ROC curve
Each tool's wear is categorized sequentially: Sharp → Used → Dulled.
The dataset includes measurements from ten tools: T1 to T10.
Data splitting options include random or chronological distribution, without shuffling.
Options:
Original data or Augmented data
Random distribution or Tool Distribution (see Dataset Splitting)
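The chronological (no shuffling) versus random splitting options above can be sketched as follows; the helper and the 8/2 split size are illustrative, not the official Mudestreda splits:

```python
import random

tools = [f"T{i}" for i in range(1, 11)]  # T1 .. T10

def split_tools(tools, n_train, chronological=True, seed=0):
    """Split tools into train/test either chronologically (preserving
    the Sharp -> Used -> Dulled wear order, no shuffling) or randomly."""
    order = list(tools)
    if not chronological:
        random.Random(seed).shuffle(order)
    return order[:n_train], order[n_train:]

train, test = split_tools(tools, n_train=8)
print(train, test)  # chronological: T1..T8 for training, T9-T10 for testing
```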
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection for the domain generalization task (MIMII DG). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, bearings, slide rails, and valves. The data for each machine type includes three subsets called "sections", and each section roughly corresponds to a type of domain shift. This dataset is a subset of the dataset for DCASE 2022 Challenge Task 2, so it is entirely the same as the data included in the development dataset. For more information, please see the pages of the development dataset and the task description for DCASE 2022 Challenge Task 2.
Baseline system
Two simple baseline systems are available on the GitHub repositories: an autoencoder-based baseline and a MobileNetV2-based baseline. The baseline systems provide a simple entry-level approach that gives reasonable performance on the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
We will publish a paper on the dataset and announce the citation information, so please make sure to cite it if you use this dataset.
Feedback
If there is any problem, please contact us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information: This dataset was created for research on blockchain anomaly and fraud detection and donated to the IEEE DataPort online community. https://github.com/epicprojects/blockchain-anomaly-detection
ToyADMOS2 dataset is a large-scale dataset for anomaly detection in machine operating sounds (ADMOS), designed for evaluating systems under domain-shift conditions. It consists of two sub-datasets for machine-condition inspection: fault diagnosis of machines with geometrically fixed tasks ("toy car") and fault diagnosis of machines with moving tasks ("toy train"). Domain shifts are represented by introducing several differences in operating conditions, such as the use of the same machine type but with different machine models and part configurations, different operating speeds, microphone arrangements, etc. Each sub-dataset contains over 27 k samples of normal machine-operating sounds and over 8 k samples of anomalous sounds recorded at a 48-kHz sampling rate. A subset of the ToyADMOS2 dataset was used in the DCASE 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions.
What makes this dataset different from others is that it is not used as is, but in conjunction with the tool provided on GitHub. The mixer tool lets you create datasets with any combination of recordings by describing the amount you need in a recipe file.
The samples are compressed as MPEG-4 ALS (MPEG-4 Audio Lossless Coding) with a suffix of '.mp4', which you can load using the audioread or librosa Python modules.
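A minimal loading sketch; the file path below is hypothetical, and librosa must be installed with an MP4-capable backend (e.g. ffmpeg via audioread), so the call is guarded by a file-existence check:

```python
import os

def load_admos_clip(path):
    """Load a ToyADMOS2 '.mp4' (MPEG-4 ALS) clip as a waveform.
    Requires librosa with an audioread/ffmpeg backend for MP4 input."""
    import librosa
    # sr=None keeps the native 48 kHz sampling rate instead of resampling.
    y, sr = librosa.load(path, sr=None)
    return y, sr

clip = "ToyADMOS2/toy_car/normal_0001.mp4"  # hypothetical file name
if os.path.exists(clip):
    y, sr = load_admos_clip(clip)
    print(len(y), sr)
```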
The total size of files under a folder ToyADMOS2 is 149 GB, and the total size of example benchmark datasets that are created from the ToyADMOS2 dataset is 13.2 GB.
The detail of the dataset is described in [1] and GitHub: https://github.com/nttcslab/ToyADMOS2-dataset
License: see LICENSE.pdf for the detail of the license.
[1] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions," 2021. https://arxiv.org/abs/2106.02369
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource contains the supporting data and code files for the analyses presented in "Toward automating post processing of aquatic sensor data," an article published in the journal Environmental Modelling and Software. This paper describes pyhydroqc, a Python package developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information on pyhydroqc, see the code repository (https://github.com/AmberSJones/pyhydroqc) and the documentation (https://ambersjones.github.io/pyhydroqc/). The package may be installed from the Python Package Index (more info: https://packaging.python.org/tutorials/installing-packages/).
Included in this resource are input data, Python scripts to run the package on the input data (anomaly detection and correction), results from running the algorithm, and Python scripts for generating the figures in the manuscript. The organization and structure of the files are described in detail in the readme file. The input data were collected as part of the Logan River Observatory (LRO). The data in this resource represent a subset of data available for the LRO and were compiled by querying the LRO’s operational database. All available data for the LRO can be sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.
There are two sets of scripts in this resource: 1.) Scripts that reproduce plots for the paper using saved results, and 2.) Code used to generate the complete results for the series in the case study. While all figures can be reproduced, there are challenges to running the code for the complete results (it is computationally intensive, different results will be generated due to the stochastic nature of the models, and the code was developed with an early version of the package), which is why the saved results are included in this resource. For a simple example of running pyhydroqc functions for anomaly detection and correction on a subset of data, see this resource: https://www.hydroshare.org/resource/92f393cbd06b47c398bdd2bbb86887ac/.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "evaluation dataset" for the DCASE 2020 Challenge Task 2 "Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring" [task description].
In the task, three datasets have been released: "development dataset", "additional training dataset", and "evaluation dataset". This evaluation dataset was the last of the three released. This dataset includes around 400 samples for each Machine Type and Machine ID used in the evaluation dataset, none of which have a condition label (i.e., normal or anomaly).
The recording procedure and data format are the same as the development dataset and additional training dataset. The Machine IDs in this dataset are the same as those in the additional training dataset. For more information, please see the pages of the development dataset and the task description.
After the DCASE 2020 Challenge, we released the ground truth for this evaluation dataset.
Directory structure
Once you unzip the downloaded files from Zenodo, you can see the following directory structure. Machine Type information is given by directory name, and Machine ID and condition information are given by file name, as:
/eval_data
/ToyCar
/test (Normal and anomaly data for all Machine IDs are included, but they do not have a condition label.)
/id_05_00000000.wav
...
/id_05_00000514.wav
/id_06_00000000.wav
...
/id_07_00000514.wav
/ToyConveyor (The other Machine Types have the same directory structure as ToyCar.)
/fan
/pump
/slider
/valve
The paths of audio files are:
"/eval_data/
For example, the Machine Type and Machine ID of "/ToyCar/test/id_05_00000000.wav" are "ToyCar" and "05", respectively. Unlike the development dataset and additional training dataset, its condition label is hidden.
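Extracting the Machine Type and Machine ID from a file path, following the naming convention above (the helper name is ours, not part of the dataset's tooling):

```python
def parse_eval_path(path):
    """Extract Machine Type and Machine ID from an evaluation-set path,
    e.g. '/eval_data/ToyCar/test/id_05_00000000.wav' -> ('ToyCar', '05')."""
    parts = path.strip("/").split("/")
    machine_type = parts[1]        # directory after the 'eval_data' root
    file_name = parts[-1]          # e.g. 'id_05_00000000.wav'
    machine_id = file_name.split("_")[1]
    return machine_type, machine_id

print(parse_eval_path("/eval_data/ToyCar/test/id_05_00000000.wav"))
```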
Baseline system
A simple baseline system is available on the Github repository [URL]. The baseline system provides a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. It is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was created jointly by NTT Corporation and Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite all the following three papers:
Yuma Koizumi, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto, "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection," in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019. [pdf]
Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019. [pdf]
Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, Toshiki Nakamura, Yuki Nikaido, Ryo Tanabe, Harsh Purohit, Kaori Suefusa, Takashi Endo, Masahiro Yasuda, and Noboru Harada, "Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring," in Proc. 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2020. [pdf]
Feedback
If there is any problem, please contact us:
Yuma Koizumi, koizumi.yuma@ieee.org
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bologna, Bologna, Italy) in the period March-May 2018.
The data set has been used to train an autoencoder-based model to automatically detect anomalies in a semi-supervised fashion on a real HPC system.
This work is described in:
1) "Anomaly Detection using Autoencoders in High Performance Computing Systems", Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini, IAAI19 (proceedings in press) -- https://arxiv.org/abs/1902.08447
2) "Online Anomaly Detection in HPC Systems", Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini, AICAS19 (proceedings in press) -- https://arxiv.org/abs/1811.05269
See the Git repository for usage examples and details: https://github.com/AndreaBorghesi/anomaly_detection_HPC
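The semi-supervised recipe described in the papers (train an autoencoder on healthy telemetry, then flag samples whose reconstruction error exceeds a threshold) can be sketched as follows. This is a minimal illustration using a linear autoencoder (PCA-style) in plain NumPy with synthetic data; the deep architecture, feature set, and threshold choice of the actual work differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "healthy" telemetry: 20 correlated metrics per sample
normal = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20))

# Linear autoencoder via SVD: encode to k components, decode back
mu = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mu, full_matrices=False)
k = 5
enc = vt[:k].T          # 20 -> 5 encoder
dec = vt[:k]            # 5 -> 20 decoder

def recon_error(x):
    """Per-sample mean squared reconstruction error."""
    z = (x - mu) @ enc
    return ((x - mu - z @ dec) ** 2).mean(axis=1)

# Threshold chosen from errors on healthy data (e.g. 99th percentile)
threshold = np.percentile(recon_error(normal), 99)

# An off-model sample reconstructs poorly and is flagged as anomalous
anomaly = rng.normal(size=(1, 20)) * 10
print(recon_error(anomaly)[0] > threshold)  # True
```

The key design point is that the threshold is computed from healthy runs only, so no anomalous labels are needed at training time.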
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets were originally created for the following publications:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Multiple Clustering Solutions. In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011, Athens, Greece, 2011.
The outlier data set versions were introduced in:
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel: On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.
They are derived from the original image data available at https://aloi.science.uva.nl/
The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders: The Amsterdam Library of Object Images. Int. J. Comput. Vision, 61(1), 103-112, January 2005.
Additional information is available at: https://elki-project.github.io/datasets/multi_view
The following views are currently available (feature type, description, files):
Object number: sparse 1000-dimensional vectors that give the true object assignment. Files: objs.arff.gz
RGB color histograms: standard RGB color histograms (uniform binning). Files: aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz
HSV color histograms: standard HSV/HSB color histograms in various binnings. Files: aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz
Color similarity: average similarity to 77 reference colors (not histograms): 18 colors x 2 saturations x 2 brightnesses + 5 grey values (incl. white and black). Files: aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features: first 13 Haralick features (radius 1 pixel). Files: aloi-haralick-1.csv.gz
Front to back: vectors representing front faces vs. back faces of individual objects. Files: front.arff.gz
Basic light: vectors indicating basic light situations. Files: light.arff.gz
Manual annotations: manually annotated groups of semantically related objects such as cups. Files: manual1.arff.gz
Outlier Detection Versions
Additionally, we generated a number of subsets for outlier detection:
RGB histograms, downsampled to 100000 objects (553 outliers). Files: aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
RGB histograms, downsampled to 75000 objects (717 outliers). Files: aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
RGB histograms, downsampled to 50000 objects (1508 outliers). Files: aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz
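The RGB color histogram views above use uniform binning per channel (e.g. 3 bins per channel gives the 27-dimensional variant). A rough reproduction in NumPy is shown below; the exact binning and normalization of the original ELKI extraction may differ, so treat this as an illustrative sketch.

```python
import numpy as np

def rgb_histogram(image, bins_per_channel=3):
    """Uniformly binned RGB color histogram, flattened to
    bins_per_channel**3 dimensions and normalized to sum to 1."""
    pixels = image.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

# A random 8-bit RGB image stands in for an ALOI object image
rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=(96, 128, 3), dtype=np.uint8)

feat = rgb_histogram(img, bins_per_channel=3)   # 27-d, like aloi-27d
print(feat.shape, round(float(feat.sum()), 6))  # (27,) 1.0
```

Changing `bins_per_channel` to 2 through 10 yields the 8-d through 1000-d variants listed above.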
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created for research on blockchain anomaly and fraud detection, and was donated to the IEEE DataPort online community. https://github.com/epicprojects/blockchain-anomaly-detection
Files:
bitcoin_hacks_2010_2013.csv: known hashes of Bitcoin theft/malicious transactions from 2010-2013.
malicious_tx_in.csv: hashes of input transactions flowing into malicious transactions.
malicious_tx_out.csv: hashes of output transactions flowing out of malicious transactions.
anomalies_theft_tx.csv: known Bitcoin theft transaction hashes.
anomalies_loss_tx.csv: known Bitcoin loss transaction hashes.
anomalies_misc_tx.csv: known Bitcoin hack transaction hashes.
anomalies_seizure1_tx.csv: known Bitcoin transaction hashes involved in the first FBI Silk Road seizure (https://en.wikipedia.org/wiki/Silk_Road_(marketplace)).
anomalies_seizure2_tx.csv: known Bitcoin transaction hashes involved in the second FBI Silk Road seizure (https://en.wikipedia.org/wiki/Silk_Road_(marketplace)).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two hyperspectral and one multispectral anomaly detection images, together with their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.
They are in .npy file format (tiff or geotiff variants will be added in the future), with the image arrays ordered as (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD
All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the Beach dataset, assuming it is placed in a folder called "data" alongside the Python script:
import numpy as np
# Load image file
hsi_array = np.load("data/beach_hsi.npy")
n_pixels, n_lines, n_bands = hsi_array.shape
print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")
# Load image mask
mask_array = np.load("data/beach_mask.npy")
m_pixels, m_lines = mask_array.shape
print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")
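As a toy illustration of the kind of hyperspectral anomaly detection these images support, here is a global RX (Reed-Xiaoli) detector in plain NumPy: each pixel is scored by its Mahalanobis distance from the image-wide background statistics. This is a generic baseline, not the ERX algorithm from the paper below, and the synthetic cube stands in for the real .npy files.

```python
import numpy as np

def rx_scores(hsi):
    """Global RX: Mahalanobis distance of every pixel spectrum
    from the image mean, using the image-wide covariance."""
    h, w, b = hsi.shape
    flat = hsi.reshape(-1, b).astype(float)
    mu = flat.mean(axis=0)
    cov = np.cov(flat, rowvar=False)
    inv = np.linalg.inv(cov + 1e-6 * np.eye(b))  # regularized inverse
    d = flat - mu
    scores = np.einsum("ij,jk,ik->i", d, inv, d)
    return scores.reshape(h, w)

# Synthetic cube with one injected anomalous pixel
rng = np.random.default_rng(0)
cube = rng.normal(size=(32, 32, 10))
cube[5, 7] += 8.0  # anomaly: bright across all bands

scores = rx_scores(cube)
peak = tuple(int(i) for i in np.unravel_index(scores.argmax(), scores.shape))
print(peak)  # (5, 7)
```

Thresholding the score map then yields a binary detection mask that can be compared against the provided ground-truth masks.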
If you use any of these datasets, please cite the following paper:
@article{garske2024erx,
title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
journal={arXiv preprint arXiv:2408.14947},
year={2024},
}
If you use the beach dataset please cite the following paper as well (original source):
@article{mao2022openhsi,
title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
journal={Remote Sensing},
volume={14},
number={9},
pages={2244},
year={2022},
publisher={MDPI}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains a small set of application runs from the Eclipse supercomputer. The applications were run with and without synthetic HPC performance anomalies. More detailed information about the synthetic anomalies can be found at: https://github.com/peaclab/HPAS.
We chose four applications, namely LAMMPS, sw4, sw4Lite, and ExaMiniMD, to encompass both real and proxy applications. We executed each application five times on four compute nodes without introducing any anomalies. For the anomalous runs, we selected the "memleak" anomaly, as it is one of the most commonly occurring types, and again executed each application five times. The collected dataset consists of 160 samples in total: 80 labeled as anomalous and 80 labeled as healthy. For details of the applications, please refer to the paper.
The applications were run on Eclipse, which is situated at Sandia National Laboratories. Eclipse comprises 1488 compute nodes, each equipped with 128GB of memory and two sockets. Each socket contains 18 E5-2695 v4 CPU cores with 2-way hyperthreading, providing substantial computational power for scientific and engineering applications.
Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Support data for our paper:
Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions
ArXiv preprint can be found here. Code for the experiment software pipeline described in the paper can be found here. The pipeline requires and generates different forms of data. Here we provide the following:
AudioSet_wav_fragments.zip: This is a custom selection of 39437 wav files (32kHz, mono, 10 seconds) randomly extracted from AudioSet (originally released under CC-BY). In addition to this custom subset, the paper also uses the following ones, which can be downloaded at their respective websites:
DCASE2021 Task 2 Development Dataset
DCASE2021 Task 2 Additional Training Dataset
Fraunhofer's IDMT-ISA-ELECTRIC-ENGINE Dataset
dcase2021_uads_umaps.zip: To compute the UMAPs, first the log-STFT, log-mel and L3 representations must be extracted, and then the UMAPs must be computed. This can take a substantial amount of time and resources. For convenience, we provide here the 72 UMAPs discussed in the paper.
dcase2021_uads_umap_plots.zip: Also for convenience, we provide here the 198 high-resolution scatter plots rendered from the UMAPs.
For a comprehensive visual inspection of the computed representations, it is sufficient to download the plots only. Users interested in exploring the plots interactively will need to download all the audio datasets and compute the log-STFT, log-mel and L3 representations as well as the UMAPs themselves (code provided in the GitHub repository). UMAPs for further representations can also be computed and plotted.
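The first step of the pipeline, extracting a log-STFT representation from a wav fragment, can be sketched in plain NumPy as follows. The frame length, hop size, window, and log floor here are illustrative choices, not necessarily those used in the paper's pipeline, and the synthetic tone stands in for a real 32 kHz AudioSet fragment.

```python
import numpy as np

def log_stft(signal, frame_len=1024, hop=512, eps=1e-10):
    """Log-magnitude STFT: Hann-windowed frames -> rFFT -> log."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + eps)   # shape: (n_frames, frame_len // 2 + 1)

# One second of a synthetic 32 kHz mono tone
sr = 32000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

feats = log_stft(audio)
print(feats.shape)  # (61, 513)
```

Each file's (frames x bins) matrix is then flattened or pooled into a feature vector before the UMAP embedding is computed over the whole collection.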