48 datasets found

Z
Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms
data.niaid.nih.gov
zenodo.org
Updated Aug 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Garske, Samuel (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
Explore at:
Dataset updated
Aug 29, 2024
Dataset provided by
Garske, Samuel
Mao, Yiwei
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary

This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

How to Get Started

All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the Beach Dataset if you put it in a folder called "data" with the python script is:

import numpy as np

Load image file

hsi_array = np.load("data/beach_hsi.npy") n_pixels, n_lines, n_bands = hsi_array.shape print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands}.")

Load image mask

mask_array = np.load("data/beach_mask.npy") m_pixels, m_lines = mask_array.shape print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

Citing the Datasets

If you use any of these datasets, please cite the following paper:

@article{garske2024erx, title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning}, author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC}, journal={arXiv preprint arXiv:2408.14947}, year={2024},}

If you use the beach dataset please cite the following paper as well (original source):

@article{mao2022openhsi, title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone}, author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy}, journal={Remote Sensing}, volume={14}, number={9}, pages={2244}, year={2022}, publisher={MDPI} }

HAI Security Dataset

kaggle.com

zip

Updated Apr 27, 2022

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

ICS Security Dataset (2022). HAI Security Dataset [Dataset]. https://www.kaggle.com/icsdataset/hai-security-dataset

Explore at:

zip(487855254 bytes)Available download formats

Dataset updated

Apr 27, 2022

Authors

ICS Security Dataset

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

HIL-based Augmented ICS (HAI) Security Dataset

The HAI dataset was collected from a realistic industiral control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

Click here to find out more about the HAI dataset.

Please e-mail us here if you have any questions about the dataset.

Background

In 2017, three laboratory-scale CPS testbeds were initially launched, namely GE’s turbine testbed, Emerson’s boiler testbed, and FESTO’s modular production system (MPS) water-treatment testbed. These testbeds are related to relatively simple processes, and were operated independently of each other.
In 2018, a complex process system was built to combine the three systems using a HIL simulator, where generation of thermal power and pumped-storage hydropower was simulated. This ensured that the variables were highly coupled and correlated for a richer dataset. In addition, an open platform communications united architecture (OPC-UA) gateway was installed to facilitate data collection from heterogeneous devices.
The first version of HAI dataset, HAI 1.0, was made available on GitHub and Kaggle in February 2020. This dataset included ICS operational data from normal and anomalous situations for 38 attacks. Subsequently, a debugged version of HAI 1.0, namely HAI 20.07, was released for the HAICon 2020 competition in August 2020.
HAI 21.03 was released in 2021, and was based on a more tightly coupled HIL simulator to produce clearer attack effects with additional attacks. This version provides more quantitative information and covers a variety of operational situations, and provides better insights into the dynamic changes of the physical system.
HAI 22.04 contained more sophisticated attacks that are significantly more difficult to detect than those in the previous versions. Comparing only the baseline TaPRs of HAICon 2020 and HAICon 2021, detection difficulty in HAI 22.04 is approximately four times higher than HAI 21.03.

HAI Testbed

The testbed consists of four different processes: boiler process, turbine process, water treatement process and HIL simulation:

Boiler Process (P1): This includes water-to-water heat trasfer at a low pressure and a moderate temperature. This process is controlled using Emerson Ovation DCS.
Turbine Process (P2): A rotor kit process that closely simulates the behavior of an actual rotating machine. It is controlled by GE's Mark VIe DCS.
Water treatment Process (P3): This process includes pumping water to the upper reservoir and releasing it back into the lower reservoir. It is controlled by Siemens's S7-300 PLC.
HIL Simulation(P4): Both the boiler and turbine processes are interconnected to synchronize with the rotating speed of the virtual steam-turbine power generation model. The pump and value in the water-treatment process are controlled by the pumped-storage hydropower generation model. The dSPACE's SCALEXIO system is used for the HIL simulations and is interconnected with the real-world processes through a Siemens S7-1500 PLC and ET200 remote IO devices for data-acquisition system based on the OPC gateway.

HAI Datasets

Two major versions of HAI datasets have been released thus far. Each dataset consists of several CSV files, and each file satisfies time continuity. The quantitative summary of each version are as follows:

Note: The version numbering follows a date-based scheme, where the version number indicates the released date of the HAI dataset. HAI 20.07 is the bug-fixed version of HAI v1.0 released in February 2020.

version	Data Points (points/sec)	Normal Datset Files(interval, size)	Attack Dataset Files (interval, size, attack count)
HAI 22.04	86	train1.csv ( 26 hours, 51 MB) train2.csv ( 56 hours, 109 MB) train3.csv (35 hours, 67 MB) train4.csv (24 hours, 46 MB) train5.csv ( 66 hours, 125 MB) train6.csv (72 hours, 137 MB))	test1.csv (24 hours, 48 MB, 07 attacks) test2.csv (23 hours, 45 MB, 17 attacks) test3.csv (17 hours, 33 MB, 10 attacks) test4.csv (36hours, 70MB, 24 attacks)

|HAI 21.03|78|train1.csv ( 60 hours, 100 MB)
train2.csv ( 63 hours, 116 MB)
train3.csv (229 hours, 246 MB) | test1.csv (12 hours, 22 MB, 05 attacks)
test2.csv (33 hours, 62 MB, 20 attacks)
test3.csv (30 hours, 56 MB, 08 attacks)
test4.csv (11 hours, 20MB, 05 attacks)
test5.csv (26 hours, 48MB, 12 attacks)| |HAI 20.07
(HAI 1.0)| 59| train1.csv (86 hours, 127 MB)
train2.csv (91 hours, 98 MB) | test1.csv (81 hours, 119 MB)
test2.csv (42 hours, 62 MB)|

Data fields

The time-series data in each CSV file satisfies time continuity. The first column represents the observed time as “yyyy-MM-dd hh:mm:ss,” while the rest columns provide the recorded SCADA data points. The last four columns provide data labels for whether an attack occurred or not, where the attack column was applicable to all process and the other three columns were for the corresponding control processes.

Refer to the latest technical manual for the details for each column.

time	P1_B2004	P2_B2016	...	attack	attack_P1	...	attack_P3
20190926 13:00:00	0.09830	1.07370	...	0	0	...	0
20190926 13:00:01	0.09830	1.07410	...	1	0	...	1
20190926 13:00:02	0.09830	1.07380	...	1	0	...	1
20190926 13:00:03	0.09830	1.07360	...	1	1	...	1
20190926 13:00:04	0.09830	1.07430	...	1	1	...	1

Getting the dataset

Type git clone, and the paste the below URL. $ git clone https://github.com/icsdataset/hai To unzip multiple gzip files, you can use: $ gunzip *.gz

Performance Evaluation

Use of eTaPR (Enhanced Time-series Aware Precision and Recall) metric is strongly recommended to evaluate your anomaly detection model, which provides fairness to performance comparisons with other studies. Got something to suggest? Let us know!

Projects using the dataset

Here are some projects and experiments that are using or featuring the dataset in interesting ways. Got something to add? Let us know!

The related projects so far are as follows.

Anomaly Detection

Year 2022

Year 2020

Testbed/Dataset

Year 2021

Probabilistic attack sequence generation and execution based on mitre att&ck for ics datasets

Year 2020

[Expansion of ICS testbed for security validation based on MITRE ATT&CK techniques][TB_20_01]
[Expanding a programmable cps testbed for network attack analysis][TB_20_02]
[Co-occurrence based security event analysis and visualization for cyber physical systems][TB_20_03]

The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial...
zenodo.org
autovi.utc.fr
+1more
bin, txt, zip
Updated Jun 5, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Philippe Carvalho; Philippe Carvalho; Meriem Lafou; Alexandre Durupt; Alexandre Durupt; Antoine Leblanc; Yves Grandvalet; Yves Grandvalet; Meriem Lafou; Antoine Leblanc (2024). The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection [Dataset]. http://doi.org/10.5281/zenodo.10459003
Explore at:
zip, txt, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10459003
Dataset updated
Jun 5, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Philippe Carvalho; Philippe Carvalho; Meriem Lafou; Alexandre Durupt; Alexandre Durupt; Antoine Leblanc; Yves Grandvalet; Yves Grandvalet; Meriem Lafou; Antoine Leblanc
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
See the official website: https://autovi.utc.fr

Modern industrial production lines must be set up with robust defect inspection modules that are able to withstand high product variability. This means that in a context of industrial production, new defects that are not yet known may appear, and must therefore be identified.

On industrial production lines, the typology of potential defects is vast (texture, part failure, logical defects, etc.). Inspection systems must therefore be able to detect non-listed defects, i.e. not-yet-observed defects upon the development of the inspection system. To solve this problem, research and development of unsupervised AI algorithms on real-world data is required.

Renault Group and the Université de technologie de Compiègne (Roberval and Heudiasyc Laboratories) have jointly developed the Automotive Visual Inspection Dataset (AutoVI), the purpose of which is to be used as a scientific benchmark to compare and develop advanced unsupervised anomaly detection algorithms under real production conditions. The images were acquired on Renault Group's automotive production lines, in a genuine industrial production line environment, with variations in brightness and lighting on constantly moving components. This dataset is representative of actual data acquisition conditions on automotive production lines.

The dataset contains 3950 images, split into 1530 training images and 2420 testing images.

The evaluation code can be found at https://github.com/phcarval/autovi_evaluation_code.

Disclaimer
All defects shown were intentionally created on Renault Group's production lines for the purpose of producing this dataset. The images were examined and labeled by Renault Group experts, and all defects were corrected after shooting.

License
Copyright © 2023-2024 Renault Group

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of the license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.

For using the data in a way that falls under the commercial use clause of the license, please contact us.

Attribution
Please use the following for citing the dataset in scientific work:

Carvalho, P., Lafou, M., Durupt, A., Leblanc, A., & Grandvalet, Y. (2024). The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection [Dataset]. https://doi.org/10.5281/zenodo.10459003

Contact
If you have any questions or remarks about this dataset, please contact us at philippe.carvalho@utc.fr, meriem.lafou@renault.com, alexandre.durupt@utc.fr, antoine.leblanc@renault.com, yves.grandvalet@utc.fr.

Changelog

v1.0.0

Cropped engine_wiring, pipe_clip and pipe_staple images

Reduced tank_screw, underbody_pipes and underbody_screw image sizes

v0.1.1

Added ground truth segmentation maps

Fixed categorization of some images

Added new defect categories

Removed tube_fastening and kitting_cart

Removed duplicates in pipe_clip
ESA Anomaly Dataset
kaggle.com
data.niaid.nih.gov
+1more
Updated Jul 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sam Mahoney (2024). ESA Anomaly Dataset [Dataset]. https://www.kaggle.com/datasets/sammahoney/esa-anomaly-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 9, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sam Mahoney
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Uploader is not affiliated with ESA, if Authors or ESA require removal of public access please contact.

ESA Anomaly Dataset is the first large-scale, real-life satellite telemetry dataset with curated anomaly annotations originated from three ESA missions. We hope that this unique dataset will allow researchers and scientists from academia, research institutes, national and international space agencies, and industry to benchmark models and approaches on a common baseline as well as research and develop novel, computational-efficient approaches for anomaly detection in satellite telemetry data.

The dataset results from the work of an 18-month project carried by an industry Consortium composed of Airbus Defence and Space, KP Labs and the European Space Agency’s European Space Operations Centre. The project, funded by the European Space Agency (ESA), is part of the Artificial Intelligence for Automation (A²I) Roadmap (De Canio et al., 2023), a large endeavour started in 2021 to automate space operations by leveraging artificial intelligence.

Further details can be found on: - arXiv: https://arxiv.org/abs/2406.17826 - Github: https://github.com/kplabs-pl/ESA-ADB

DOI 10.5281/zenodo.12528696 Resource type Dataset Publisher European Space Agency Languages English

De Canio, G., Kotowski, K., & Haskamp, C. (2024). ESA Anomaly Dataset (1.0) [Data set]. European Space Agency. https://doi.org/10.5281/zenodo.12528696
Z
Mudestreda Multimodal Device State Recognition Dataset
data.niaid.nih.gov
Updated Jul 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Admadi, Zahra (2024). Mudestreda Multimodal Device State Recognition Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8238652
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Admadi, Zahra
Truchan, Hubert
License
https://www.gnu.org/licenses/gpl-3.0-standalone.htmlhttps://www.gnu.org/licenses/gpl-3.0-standalone.html
Description
Mudestreda Multimodal Device State Recognition Dataset

obtained from real industrial milling device with Time Series and Image Data for Classification, Regression, Anomaly Detection, Remaining Useful Life (RUL) estimation, Signal Drift measurement, Zero Shot Flank Took Wear, and Feature Engineering purposes.

The official dataset used in the paper "Multimodal Isotropic Neural Architecture with Patch Embedding" ICONIP23.

Official repository: https://github.com/hubtru/Minape

Conference paper: https://link.springer.com/chapter/10.1007/978-981-99-8079-6_14

Mudestreda (MD) | Size 512 Samples (Instances, Observations)| Modalities 4 | Classes 3 |

Future research: Regression, Remaining Useful Life (RUL) estimation, Signal Drift detection, Anomaly Detection, Multivariate Time Series Prediction, and Feature Engineering.

Notice: Tables and images do not render properly.

Recommended: README.md includes the Mudestreda description and images Mudestreda.png and Mudestreda_Stage.png.

Data Overview

Task: Uni/Multi-Modal Classification

Domain: Industrial Flank Tool Wear of the Milling Machine

Input (sample): 4 Images: 1 Tool Image, 3 Spectrograms (X, Y, Z axis)

Output: Machine state classes: Sharp, Used, Dulled

Evaluation: Accuracies, Precision, Recal, F1-score, ROC curve

Each tool's wear is categorized sequentially: Sharp → Used → Dulled.

The dataset includes measurements from ten tools: T1 to T10.

Data splitting options include random or chronological distribution, without shuffling.

Options:

Original data or Augmented data

Random distribution or Tool Distribution (see Dataset Splitting)
Z
Data set for anomaly detection on a HPC system
data.niaid.nih.gov
zenodo.org
Updated Apr 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrea Borghesi (2023). Data set for anomaly detection on a HPC system [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3251872
Explore at:
Dataset updated
Apr 19, 2023
Dataset provided by
Andrea Borghesi
Francesco Beneventi
Andrea Bartolini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bologna, Bologna, Italy) in the period March-May 2018.

The data set has been used to train a autoencoder-based model to automatically detect anomalies in a semi-supervised fashion, on a real HPC system.

This work is described in:

1) "Anomaly Detection using Autoencoders in High Performance Computing Systems", Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini, IAAI19 (proceedings in process) -- https://arxiv.org/abs/1902.08447

2) "Online Anomaly Detection in HPC Systems", Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini, AICAS19 (proceedings in process) -- https://arxiv.org/abs/1811.05269

See the git repository for usage examples & details --> https://github.com/AndreaBorghesi/anomaly_detection_HPC
SMD_OnmiAD
kaggle.com
Updated Nov 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mitch (2023). SMD_OnmiAD [Dataset]. https://www.kaggle.com/datasets/mgusat/smd-onmiad
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 7, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mitch
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Source: OmniAD paper, code and https://github.com/NetManAIOps/OmniAnomaly

SMD (Server Machine Dataset) is a new 5-week-long dataset. We collected it from a large Internet company. This dataset contains 3 groups of entities [EN: inherent / native clustering] Each of them is named by machine-
Z
ToyADMOS2 dataset: Another dataset of miniature-machine operating sounds for...
data.niaid.nih.gov
zenodo.org
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ohishi, Yasunori (2024). ToyADMOS2 dataset: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4580269
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Niizumi, Daisuke
Harada, Noboru
Takeuchi, Daiki
Ohishi, Yasunori
Saito, Shoichiro
Yasuda, Masahiro
Description
ToyADMOS2 dataset is a large-scale dataset for anomaly detection in machine operating sounds (ADMOS), designed for evaluating systems under domain-shift conditions. It consists of two sub-datasets for machine-condition inspection: fault diagnosis of machines with geometrically fixed tasks ("toy car") and fault diagnosis of machines with moving tasks ("toy train"). Domain shifts are represented by introducing several differences in operating conditions, such as the use of the same machine type but with different machine models and part configurations, different operating speeds, microphone arrangements, etc. Each sub-dataset contains over 27 k samples of normal machine-operating sounds and over 8 k samples of anomalous sounds recorded at a 48-kHz sampling rate. A subset of the ToyADMOS2 dataset was used in the DCASE 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions.

What makes this dataset different from others is that it is not used as is, but in conjunction with the tool provided on GitHub. The mixer tool lets you create datasets with any combination of recordings by describing the amount you need in a recipe file.

The samples are compressed as MPEG-4 ALS (MPEG-4 Audio Lossless Coding) with a suffix of '.mp4' that you can load by using the audioread or librosa python module.

The total size of files under a folder ToyADMOS2 is 149 GB, and the total size of example benchmark datasets that are created from the ToyADMOS2 dataset is 13.2 GB.

The detail of the dataset is described in [1] and GitHub: https://github.com/nttcslab/ToyADMOS2-dataset

License: see LICENSE.pdf for the detail of the license.

[1] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions," 2021. https://arxiv.org/abs/2106.02369
MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation...
zenodo.org
data.niaid.nih.gov
zip
Updated May 11, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kota Dohi; Kota Dohi; Tomoya Nishida; Harsh Purohit; Ryo Tanabe; Takashi Endo; Masaaki Yamamoto; Yuki Nikaido; Yohei Kawaguchi; Yohei Kawaguchi; Tomoya Nishida; Harsh Purohit; Ryo Tanabe; Takashi Endo; Masaaki Yamamoto; Yuki Nikaido (2022). MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation for Domain Generalization Task [Dataset]. http://doi.org/10.5281/zenodo.6529888
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6529888
Dataset updated
May 11, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Kota Dohi; Kota Dohi; Tomoya Nishida; Harsh Purohit; Ryo Tanabe; Takashi Endo; Masaaki Yamamoto; Yuki Nikaido; Yohei Kawaguchi; Yohei Kawaguchi; Tomoya Nishida; Harsh Purohit; Ryo Tanabe; Takashi Endo; Masaaki Yamamoto; Yuki Nikaido
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description

This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task (MIMII DG). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, bearing, slide rails, and valves. The data for each machine type includes three subsets called "sections", and each section roughly corresponds to a type of domain shift. This dataset is a subset of the dataset for DCASE 2022 Challenge Task 2, so the dataset is entirely the same as data included in the development dataset. For more information, please see the pages of the development dataset and the task description for DCASE 2022 Challenge Task 2.

Baseline system

Two simple baseline systems are available on the Github repositories autoencoder-based baseline and MobileNetV2-based baseline. The baseline systems provide a simple entry-level approach that gives a reasonable performance in the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

Conditions of use

This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Citation

We will publish a paper on the dataset and will announce the citation information for them, so please make sure to cite them if you use this dataset.

Feedback

If there is any problem, pease contact us

Kota Dohi, kota.dohi.gr@hitachi.com

Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
d
pyhydroqc Sensor Data QC: Single Site Example
search.dataone.org
hydroshare.org
+1more
Updated Dec 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amber Spackman Jones (2023). pyhydroqc Sensor Data QC: Single Site Example [Dataset]. http://doi.org/10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac
Explore at:
Unique identifier
https://doi.org/10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac
Dataset updated
Dec 30, 2023
Dataset provided by
Hydroshare
Authors
Amber Spackman Jones
Time period covered
Jan 1, 2017 - Dec 31, 2017
Description
This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.

This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.

Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when they exceed a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction. - Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable. - Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables. - Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest. - Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.

The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.

The anomaly detection and correction workflow involves the following steps: 1. Retrieving data 2. Applying rules-based detection to screen data and apply initial corrections 3. Identifying and correcting sensor drift and calibration (if applicable) 4. Developing a model (i.e., ARIMA or LSTM) 5. Applying model to make time series predictions 6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results 7. Widening the window over which an anomaly is identified 8. Aggregating detections resulting from multiple models 9. Making corrections for anomalous events

Instructions to run the notebook through the CUAHSI JupyterHub: 1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into CUAHSI JupyterHub using your HydroShare credentials. 2. Select 'Python 3.8 - Scientific' as the server and click Start. 2. From your JupyterHub directory, click on the ExampleNotebook.ipynb file. 3. Execute each cell in the code by clicking the Run button.
Dataset Artifact for Prodigy: Towards Unsupervised Anomaly Detection in...
zenodo.org
data.niaid.nih.gov
json, tar
Updated Nov 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Burak Aksar; Burak Aksar; Efe Sencan; Efe Sencan; Benjamin Schwaller; Benjamin Schwaller; Omar Aaziz; Vitus Leung; Jim Brandt; Brian Kulis; Manuel Egele; Ayse Coskun; Omar Aaziz; Vitus Leung; Jim Brandt; Brian Kulis; Manuel Egele; Ayse Coskun (2023). Dataset Artifact for Prodigy: Towards Unsupervised Anomaly Detection in Production HPC Systems [Dataset]. http://doi.org/10.5281/zenodo.8079388
Explore at:
tar, jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8079388
Dataset updated
Nov 12, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Burak Aksar; Burak Aksar; Efe Sencan; Efe Sencan; Benjamin Schwaller; Benjamin Schwaller; Omar Aaziz; Vitus Leung; Jim Brandt; Brian Kulis; Manuel Egele; Ayse Coskun; Omar Aaziz; Vitus Leung; Jim Brandt; Brian Kulis; Manuel Egele; Ayse Coskun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains a small set of application runs from Eclipse supercomputer. The applications run with and without synthetic HPC performance anomalies. More detailed information regarding synthetic anomalies can be found at: https://github.com/peaclab/HPAS.

We have chosen four applications, namely LAMMPS, sw4, sw4Lite, and ExaMiniMD, to encompass both real and proxy applications. We have executed each application five times on four compute nodes without introducing any anomalies. To showcase our experiment, we have specifically selected the "memleak" anomaly as it is one of the most commonly occurring types. Additionally, we have also executed each application five times with the chosen anomaly. The dataset we have collected consists of a total of 160 samples, with 80 samples labeled as anomalous and 80 samples labeled as healthy. For the details of applications please refer to the paper.

The applications were run on Eclipse, which is situated at Sandia National Laboratories. Eclipse comprises 1488 compute nodes, each equipped with 128GB of memory and two sockets. Each socket contains 18 E5-2695 v4 CPU cores with 2-way hyperthreading, providing substantial computational power for scientific and engineering applications.
BETH Dataset
kaggle.com
Updated Jul 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kate Highnam (2021). BETH Dataset [Dataset]. https://www.kaggle.com/katehighnam/beth-dataset/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 29, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kate Highnam
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset corresponds to the paper "BETH Dataset: Real Cybersecurity Data for Anomaly Detection Research" by Kate Highnam* (@jinxmirror13), Kai Arulkumaran* (@kaixhin), Zachary Hanif*, and Nicholas R. Jennings (@LboroVC).

This paper was published in the ICML Workshop on Uncertainty and Robustness in Deep Learning 2021 and Conference on Applied Machine Learning for Information Security (CAMLIS 2021)

THIS DATASET IS STILL BEING UPDATED

Context

When deploying machine learning (ML) models in the real world, anomalous data points and shifts in the data distribution are inevitable. From a cyber security perspective, these anomalies and dataset shifts are driven by both defensive and adversarial advancement. To withstand the cost of critical system failure, the development of robust models is therefore key to the performance, protection, and longevity of deployed defensive systems.

We present the BPF-extended tracking honeypot (BETH) dataset as the first cybersecurity dataset for uncertainty and robustness benchmarking. Collected using a novel honeypot tracking system, our dataset has the following properties that make it attractive for the development of robust ML methods: 1. At over eight million data points, this is one of the largest cyber security datasets available 2. It contains modern host activity and attacks 3. It is fully labelled 4. It contains highly structured but heterogeneous features 5. Each host contains benign activity and at most a single attack, which is ideal for behavioural analysis and other research tasks. In addition to the described dataset

Further data is currently being collected and analysed to add alternative attack vectors to the dataset.

There are several existing cyber security datasets used in ML research, including the KDD Cup 1999 Data (Hettich & Bay, 1999), the 1998 DARPA Intrusion Detection Evaluation Dataset (Labs, 1998; Lippmann et al., 2000), the ISCX IDS 2012 dataset (Shiravi et al., 2012), and NSL-KDD (Tavallaee et al., 2009), which primarily removes duplicates from the KDD Cup 1999 Data. Each includes millions of records of realistic activity for enterprise applications, with labels for attacks or benign activity. The KDD1999, NSLKDD, and ISCX datasets contain network traffic, while the DARPA1998 dataset also includes limited process calls. However, these datasets are at best almost a decade old, and are collected on in-premise servers. In contrast, BETH contains modern host activity and activity collected from cloud services, making it relevant for current real-world deployments. In addition, some datasets include artificial user activity (Shiravi et al., 2012) while BETH contains only real activity. BETH is also one of the few datasets to include both kernel-process and network logs, providing a holistic view of malicious behaviour.

Content

The BETH dataset currently represents 8,004,918 events collected over 23 honeypots, running for about five noncontiguous hours on a major cloud provider. For benchmarking and discussion, we selected the initial subset of the process logs. This subset was further divided into training, validation, and testing sets with a rough 60/20/20 split based on host, quantity of logs generated, and the activity logged—only the test set includes an attack

The dataset is composed of two sensor logs: kernel-level process calls and network traffic. The initial benchmark subset only includes process logs. Each process call consists of 14 raw features and 2 hand-crafted labels.

See the paper for more details. For details on the events recorded within the logs, see this report.

Benchmarks

Code for our benchmarks, as detailed in the paper, are available through Github at: https://github.com/jinxmirror13/BETH_Dataset_Analysis

Acknowledgements

Thank you to Dr. Arinbjörn Kolbeinsson for his assistance in analysing the data and the reviewers for their positive feedback.
Data from: Multi-Source Distributed System Data for AI-powered Analytics
zenodo.org
explore.openaire.eu
zip
Updated Nov 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3549604
Dataset updated
Nov 10, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract:

In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers took the challenge of exploring one specific type of observability data sources, such as application logs, metrics, and distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

General Information:

This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.

You may find details of this dataset from the original paper:

Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

If you use the data, implementation, or any details of the paper, please cite!

BIBTEX:

_

@inproceedings{nedelkoski2020multi, title={Multi-source Distributed System Data for AI-Powered Analytics}, author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej}, booktitle={European Conference on Service-Oriented and Cloud Computing}, pages={161--176}, year={2020}, organization={Springer} }

_

The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

The raw logs in both datasets contain the same files. If the user wants the logs filetered by time with respect to the two datasets, should refer to the timestamps at the metrics (they provide the time window). In addition, we suggest to use the provided aggregated time ranged logs for both datasets in CSV format.

Important: The logs and the metrics are synchronized with respect time and they are both recorded on CEST (central european standard time). The traces are on UTC (Coordinated Universal Time -2 hours). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.

Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
P
SMAP Dataset
paperswithcode.com
Updated Jun 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kyle Hundman; Valentino Constantinou; Christopher Laporte; Ian Colwell; Tom Soderstrom (2024). SMAP Dataset [Dataset]. https://paperswithcode.com/dataset/smap
Explore at:
Dataset updated
Jun 6, 2024
Authors
Kyle Hundman; Valentino Constantinou; Christopher Laporte; Ian Colwell; Tom Soderstrom
Description
Soil Moisture Active Passive (SMAP) dataset is a dataset of soil samples and telemetry information using the Mars rover by NASA. Originally published in https://arxiv.org/abs/1802.04431 and used for the unsupervised anomaly detection task in time series data. Later it was used in many popular anomaly detection methods and benchmarks that distribute it in their repositories e.g., https://github.com/OpsPAI/MTAD
g
ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...
elki-project.github.io
zenodo.org
Updated Sep 2, 2011
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erich Schubert; Arthur Zimek (2011). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6355684
Dataset updated
Sep 2, 2011
Dataset provided by
TU Dortmund University
University of Southern Denmark, Denmark
Authors
Erich Schubert; Arthur Zimek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
Data from: MIMII Dataset: Sound Dataset for Malfunctioning Industrial...
zenodo.org
zip
Updated Feb 29, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido; Kaori Suefusa; Kaori Suefusa; Yohei Kawaguchi; Yohei Kawaguchi; Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido (2020). MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection [Dataset]. http://doi.org/10.5281/zenodo.3384388
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3384388
Dataset updated
Feb 29, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido; Kaori Suefusa; Kaori Suefusa; Yohei Kawaguchi; Yohei Kawaguchi; Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated from four types of industrial machines, i.e. valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contains normal sounds (from 5000 seconds to 10000 seconds) and anomalous sounds (about 1000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). Also, the background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded by eight-channel microphone array with 16 kHz sampling rate and 16 bit per sample. The MIMII dataset assists benchmark for sound-based machine fault diagnosis. Users can test the performance for specific functions e.g., unsupervised anomaly detection, transfer learning, noise robustness, etc. The detail of the dataset is described in [1][2].

This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/

*1: This version "public 1.0" contains four models (model ID 00, 02, 04, and 06). The rest three models will be released in a future edition.

[1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.

[2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
DCASE2021 UAD-S UMAP Data
zenodo.org
data.niaid.nih.gov
zip
Updated Aug 23, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andres Fernandez Rodriguez; Andres Fernandez Rodriguez; Mark D. Plumbley; Mark D. Plumbley (2021). DCASE2021 UAD-S UMAP Data [Dataset]. http://doi.org/10.5281/zenodo.5123024
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5123024
Dataset updated
Aug 23, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andres Fernandez Rodriguez; Andres Fernandez Rodriguez; Mark D. Plumbley; Mark D. Plumbley
License
Attribution 2.0 (CC BY 2.0)https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
Support data for our paper:

USING UMAP TO INSPECT AUDIO DATA FOR UNSUPERVISED ANOMALY DETECTION UNDER DOMAIN-SHIFT CONDITIONS

ArXiv preprint can be found here. Code for the experiment software pipeline described in the paper can be found here. The pipeline requires and generates different forms of data. Here we provide the following:

AudioSet_wav_fragments.zip: This is a custom selection of 39437 wav files (32kHz, mono, 10 seconds) randomly extracted from AudioSet (originally released under CC-BY). In addition to this custom subset, the paper also uses the following ones, which can be downloaded at their respective websites:

DCASE2021 Task 2 Development Dataset

DCASE2021 Task 2 Additional Training Dataset

Fraunhofer's IDMT-ISA-ELECTRIC-ENGINE Dataset

dcase2021_uads_umaps.zip: To compute the UMAPs, first the log-STFT, log-mel and L3 representations must be extracted, and then the UMAPs must be computed. This can take a substantial amount of time and resources. For convenience, we provide here the 72 UMAPs discussed in the paper.

dcase2021_uads_umap_plots.zip: Also for convenience, we provide here the 198 high-resolution scatter plots rendered from the UMAPs.

For a comprehensive visual inspection of the computed representations, it is sufficient to download the plots only. Users interested in exploring the plots interactively will need to download all the audio datasets and compute the log-STFT, log-mel and L3 representations as well as the UMAPs themselves (code provided in the GitHub repository). UMAPs for further representations can also be computed and plotted.
P
Amazon-Fraud Dataset
paperswithcode.com
Updated Dec 23, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yingtong Dou; Zhiwei Liu; Li Sun; Yutong Deng; Hao Peng; Philip S. Yu (2024). Amazon-Fraud Dataset [Dataset]. https://paperswithcode.com/dataset/amazon-fraud
Explore at:
Dataset updated
Dec 23, 2024
Authors
Yingtong Dou; Zhiwei Liu; Li Sun; Yutong Deng; Hao Peng; Philip S. Yu
Description
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.

Dataset Statistics

# Nodes %Fraud Nodes (Class=1)
11,944 9.5

Relation # Edges
U-P-U
U-S-U
U-V-U 1,036,737
All

Graph Construction

The Amazon dataset includes product reviews under the Musical Instruments category. Similar to this paper, we label users with more than 80% helpful votes as benign entities and users with less than 20% helpful votes as fraudulent entities. we conduct a fraudulent user detection task on the Amazon-Fraud dataset, which is a binary classification task. We take 25 handcrafted features from this paper as the raw node features for Amazon-Fraud. We take users as nodes in the graph and design three relations: 1) U-P-U: it connects users reviewing at least one same product; 2) U-S-V: it connects users having at least one same star rating within one week; 3) U-V-U: it connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users.

To download the dataset, please visit this Github repo. For any other questions, please email ytongdou(AT)gmail.com for inquiry.
P
UCF-Crime Dataset
paperswithcode.com
Updated Nov 26, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Waqas Sultani; Chen Chen; Mubarak Shah (2023). UCF-Crime Dataset [Dataset]. https://paperswithcode.com/dataset/ucf-crime
Explore at:
Dataset updated
Nov 26, 2023
Authors
Waqas Sultani; Chen Chen; Mubarak Shah
Description
The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. These anomalies are selected because they have a significant impact on public safety.

This dataset can be used for two tasks. First, general anomaly detection considering all anomalies in one group and all normal activities in another group. Second, for recognizing each of 13 anomalous activities.
R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge
zenodo.org
explore.openaire.eu
bin
Updated Apr 17, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gregor Kasieczka; Ben Nachman; David Shih; Gregor Kasieczka; Ben Nachman; David Shih (2022). R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge [Dataset]. http://doi.org/10.5281/zenodo.4287694
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4287694
Dataset updated
Apr 17, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gregor Kasieczka; Ben Nachman; David Shih; Gregor Kasieczka; Ben Nachman; David Shih
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the first R&D dataset for the LHC Olympics 2020 Anomaly Detection Challenge. It consists of 1M QCD dijet events and 100k W'->XY events, with X->qq and Y->qq. The W', X, and Y masses are 3.5 TeV, 500 GeV and 100 GeV respectively. The events are produced using Pythia8 and Delphes 3.4.1, with no pileup or MPI included. They are selected using a single fat-jet (R=1) trigger with pT threshold of 1.2 TeV.

The events are randomly shuffled together, but for the purposes of testing and development, we provide the user with a signal/background truth bit for each event. Obviously, the truth bit will not be included in the actual challenge.

These events are stored as pandas dataframes saved to compressed h5 format. For each event, all Delphes reconstructed particles in the event are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information such as particle charge is not included. Events are zero padded to constant size arrays of 700 particles, with the truth bit appended at the end. The array format is therefore (Nevents=1.1M, 2101).

For more information, including an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.

https://lhco2020.github.io/homepage/

UPDATE May 18 2020

We have uploaded a second signal dataset for R&D, consisting of 100k W'->XY with X,Y->qqq (i.e. 3-prong substructure). Everything else about this signal dataset (particle masses, trigger, Pythia configuration, detector simulation) is the same as the previous one described above.

UPDATE November 23 2020

We now include high-level feature files for the background and 2-prong signal (events_anomalydetection_v2.features.h5) and for the 3-prong signal (events_anomalydetection_Z_XY_qqq.features.h5). To produce the features, we have clustered every event into R=1 jets using the anti-kT algorithm. The features (calculated using fastjet plugins) are the 3-momenta, invariant masses, and n-jettiness variables tau1, tau2 and tau3 for the highest pT jet (j1) and the second highest pT jet (j2):

'pxj1', 'pyj1', 'pzj1', 'mj1', 'tau1j1', 'tau2j1', 'tau3j1', 'pxj2', 'pyj2', 'pzj2', 'mj2', 'tau1j2', 'tau2j2', 'tau3j2'

The rows (events) in each feature file should be ordered exactly the same as in their corresponding raw event file. For convenience, we have also included the label (1 for signal and 0 for background) as an additional column in the first feature file (events_anomalydetection_v2.features.h5).

# Nodes	%Fraud Nodes (Class=1)
11,944	9.5

Relation	# Edges
	U-P-U
	U-S-U
U-V-U	1,036,737
	All

Facebook

Twitter

Click to copy link

Link copied

Cite

Garske, Samuel (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799

Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

Explore at:

Dataset updated

Aug 29, 2024

Dataset provided by

Garske, Samuel
Mao, Yiwei

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Summary

This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

How to Get Started

All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the Beach Dataset if you put it in a folder called "data" with the python script is:

import numpy as np

Load image file

hsi_array = np.load("data/beach_hsi.npy") n_pixels, n_lines, n_bands = hsi_array.shape print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands}.")

Load image mask

mask_array = np.load("data/beach_mask.npy") m_pixels, m_lines = mask_array.shape print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

Citing the Datasets

If you use any of these datasets, please cite the following paper:

@article{garske2024erx, title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning}, author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC}, journal={arXiv preprint arXiv:2408.14947}, year={2024},}

If you use the beach dataset please cite the following paper as well (original source):

@article{mao2022openhsi, title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone}, author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy}, journal={Remote Sensing}, volume={14}, number={9}, pages={2244}, year={2022}, publisher={MDPI} }

Clear search

Close search

Google apps

Main menu

Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

Load image file

Load image mask

HAI Security Dataset

HIL-based Augmented ICS (HAI) Security Dataset

Background

HAI Testbed

HAI Datasets

Data fields

Getting the dataset

Performance Evaluation

Projects using the dataset

Anomaly Detection

Year 2022

Year 2021

Year 2020

Testbed/Dataset

Year 2021

Year 2020

The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial...

ESA Anomaly Dataset

Mudestreda Multimodal Device State Recognition Dataset

Data set for anomaly detection on a HPC system

SMD_OnmiAD

ToyADMOS2 dataset: Another dataset of miniature-machine operating sounds for...

MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation...

pyhydroqc Sensor Data QC: Single Site Example

Dataset Artifact for Prodigy: Towards Unsupervised Anomaly Detection in...

BETH Dataset

THIS DATASET IS STILL BEING UPDATED

Context

Content

Benchmarks

Acknowledgements

Data from: Multi-Source Distributed System Data for AI-powered Analytics

SMAP Dataset

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

Data from: MIMII Dataset: Sound Dataset for Malfunctioning Industrial...

DCASE2021 UAD-S UMAP Data

Amazon-Fraud Dataset

UCF-Crime Dataset

R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge

Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

Load image file

Load image mask