48 datasets found
  1. Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Garske, Samuel (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Garske, Samuel
    Mao, Yiwei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains two hyperspectral images and one multispectral image for anomaly detection, along with their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

    They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

    How to Get Started

    All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. For example, to load the Beach dataset after placing it in a folder called "data" next to your Python script:

    import numpy as np

    # Load image file
    hsi_array = np.load("data/beach_hsi.npy")
    n_pixels, n_lines, n_bands = hsi_array.shape
    print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

    # Load image mask
    mask_array = np.load("data/beach_mask.npy")
    m_pixels, m_lines = mask_array.shape
    print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")
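    As an illustrative next step (a sketch using plain NumPy, not part of the dataset or the HyperAD repository), a basic global RX (Reed-Xiaoli) detector can score each pixel of the loaded image:

    def rx_scores(hsi):
        # Global RX: Mahalanobis distance of every pixel from the
        # image-wide background mean and covariance.
        h, w, b = hsi.shape
        flat = hsi.reshape(-1, b).astype(np.float64)
        mu = flat.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(flat, rowvar=False))  # pseudo-inverse for stability
        centered = flat - mu
        scores = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
        return scores.reshape(h, w)

    # Pixels with the highest scores are the strongest anomaly candidates,
    # and can be compared against mask_array.
    scores = rx_scores(hsi_array)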

    Citing the Datasets

    If you use any of these datasets, please cite the following paper:

    @article{garske2024erx,
      title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
      author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
      journal={arXiv preprint arXiv:2408.14947},
      year={2024}
    }

    If you use the beach dataset please cite the following paper as well (original source):

    @article{mao2022openhsi,
      title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
      author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
      journal={Remote Sensing},
      volume={14},
      number={9},
      pages={2244},
      year={2022},
      publisher={MDPI}
    }

  2. HAI Security Dataset

    • kaggle.com
    Updated Apr 27, 2022
    Cite
    ICS Security Dataset (2022). HAI Security Dataset [Dataset]. https://www.kaggle.com/icsdataset/hai-security-dataset
    Available download formats: zip (487855254 bytes)
    Dataset updated
    Apr 27, 2022
    Authors
    ICS Security Dataset
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    HIL-based Augmented ICS (HAI) Security Dataset

    The HAI dataset was collected from a realistic industrial control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

    Click here to find out more about the HAI dataset.

    Please e-mail us here if you have any questions about the dataset.

    Background

    • In 2017, three laboratory-scale CPS testbeds were initially launched, namely GE’s turbine testbed, Emerson’s boiler testbed, and FESTO’s modular production system (MPS) water-treatment testbed. These testbeds are related to relatively simple processes, and were operated independently of each other.

    • In 2018, a complex process system was built to combine the three systems using a HIL simulator, where generation of thermal power and pumped-storage hydropower was simulated. This ensured that the variables were highly coupled and correlated for a richer dataset. In addition, an open platform communications unified architecture (OPC-UA) gateway was installed to facilitate data collection from heterogeneous devices.

    • The first version of HAI dataset, HAI 1.0, was made available on GitHub and Kaggle in February 2020. This dataset included ICS operational data from normal and anomalous situations for 38 attacks. Subsequently, a debugged version of HAI 1.0, namely HAI 20.07, was released for the HAICon 2020 competition in August 2020.

    • HAI 21.03 was released in 2021, and was based on a more tightly coupled HIL simulator to produce clearer attack effects with additional attacks. This version provides more quantitative information and covers a variety of operational situations, and provides better insights into the dynamic changes of the physical system.

    • HAI 22.04 contains more sophisticated attacks that are significantly more difficult to detect than those in the previous versions. Comparing the baseline TaPRs of HAICon 2020 and HAICon 2021, the detection difficulty of HAI 22.04 is approximately four times that of HAI 21.03.

    HAI Testbed

    The testbed consists of four different processes: the boiler process, the turbine process, the water treatment process, and the HIL simulation:

    • Boiler Process (P1): This includes water-to-water heat transfer at a low pressure and a moderate temperature. This process is controlled using Emerson's Ovation DCS.
    • Turbine Process (P2): A rotor kit process that closely simulates the behavior of an actual rotating machine. It is controlled by GE's Mark VIe DCS.
    • Water Treatment Process (P3): This process includes pumping water to the upper reservoir and releasing it back into the lower reservoir. It is controlled by a Siemens S7-300 PLC.

    • HIL Simulation (P4): Both the boiler and turbine processes are interconnected to synchronize with the rotating speed of the virtual steam-turbine power generation model. The pump and valve in the water-treatment process are controlled by the pumped-storage hydropower generation model. dSPACE's SCALEXIO system is used for the HIL simulations and is interconnected with the real-world processes through a Siemens S7-1500 PLC and ET200 remote IO devices for data acquisition based on the OPC gateway.

    HAI Datasets

    Several major versions of the HAI dataset have been released thus far. Each dataset consists of several CSV files, and each file satisfies time continuity. The quantitative summary of each version is as follows:

    Note: The version numbering follows a date-based scheme, where the version number indicates the released date of the HAI dataset. HAI 20.07 is the bug-fixed version of HAI v1.0 released in February 2020.

    HAI 22.04 (86 data points/sec)
      Normal dataset files (interval, size): train1.csv (26 hours, 51 MB), train2.csv (56 hours, 109 MB), train3.csv (35 hours, 67 MB), train4.csv (24 hours, 46 MB), train5.csv (66 hours, 125 MB), train6.csv (72 hours, 137 MB)
      Attack dataset files (interval, size, attack count): test1.csv (24 hours, 48 MB, 7 attacks), test2.csv (23 hours, 45 MB, 17 attacks), test3.csv (17 hours, 33 MB, 10 attacks), test4.csv (36 hours, 70 MB, 24 attacks)

    HAI 21.03 (78 data points/sec)
      Normal dataset files (interval, size): train1.csv (60 hours, 100 MB), train2.csv (63 hours, 116 MB), train3.csv (229 hours, 246 MB)
      Attack dataset files (interval, size, attack count): test1.csv (12 hours, 22 MB, 5 attacks), test2.csv (33 hours, 62 MB, 20 attacks), test3.csv (30 hours, 56 MB, 8 attacks), test4.csv (11 hours, 20 MB, 5 attacks), test5.csv (26 hours, 48 MB, 12 attacks)

    HAI 20.07 / HAI 1.0 (59 data points/sec)
      Normal dataset files (interval, size): train1.csv (86 hours, 127 MB), train2.csv (91 hours, 98 MB)
      Attack dataset files (interval, size): test1.csv (81 hours, 119 MB), test2.csv (42 hours, 62 MB)

    Data fields

    The time-series data in each CSV file satisfies time continuity. The first column represents the observed time as “yyyy-MM-dd hh:mm:ss,” while the remaining columns provide the recorded SCADA data points. The last four columns provide data labels for whether an attack occurred or not, where the attack column applies to all processes and the other three columns to the corresponding control processes.

    Refer to the latest technical manual for the details for each column.

    time | P1_B2004 | P2_B2016 | ... | P4_HT_LD | attack | attack_P1 | ... | attack_P3
    20190926 13:00:00 | 0.09830 | 1.07370 | ... | 0 | 0 | 0 | ... | 0
    20190926 13:00:01 | 0.09830 | 1.07410 | ... | 0 | 1 | 0 | ... | 1
    20190926 13:00:02 | 0.09830 | 1.07380 | ... | 0 | 1 | 0 | ... | 1
    20190926 13:00:03 | 0.09830 | 1.07360 | ... | 0 | 1 | 1 | ... | 1
    20190926 13:00:04 | 0.09830 | 1.07430 | ... | 0 | 1 | 1 | ... | 1
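    As a quick-start sketch (not part of the official tooling; the file path is hypothetical and the column names follow the data fields described above), the CSV files can be loaded with pandas:

    import pandas as pd

    # Hypothetical path; adjust to the version and file you downloaded.
    df = pd.read_csv("hai/hai-22.04/train1.csv")

    df["time"] = pd.to_datetime(df["time"])  # first column: observation time
    label_cols = ["attack", "attack_P1", "attack_P2", "attack_P3"]  # last four columns
    sensors = df.drop(columns=["time"] + label_cols)  # remaining SCADA data points

    print(f"{len(df)} rows, {sensors.shape[1]} sensor columns")
    print("attack ratio:", df["attack"].mean())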

    Getting the dataset

    Type git clone, and then paste the URL below:

    $ git clone https://github.com/icsdataset/hai

    To unzip multiple gzip files, you can use:

    $ gunzip *.gz

    Performance Evaluation

    Use of the eTaPR (Enhanced Time-series Aware Precision and Recall) metric is strongly recommended to evaluate your anomaly detection model, as it provides fair performance comparisons with other studies. Got something to suggest? Let us know!

    Projects using the dataset

    Here are some projects and experiments that are using or featuring the dataset in interesting ways. Got something to add? Let us know!

    The related projects so far are as follows.

    Anomaly Detection

    Year 2022

    1. Benchmarking machine learning based detection of cyber attacks for critical infrastructure
    2. A Hybrid Algorithm Incorporating Vector Quantization and One-Class Support Vector Machine for industrial Anomaly Detection
    3. Variational restricted Boltzmann machines to automated anomaly detection

    Year 2021

    1. Research on improvement of anomaly detection performance in industrial control systems
    2. E-sfd: Explainable sensor fault detection in the ics anomaly detection system
    3. Stacked-autoencoder based anomaly detection with industrial control system
    4. Improved mitigation of cyber threats in iiot for smart cities: A new-era approach and scheme
    5. Towards building intrusion detection systems for multivariate time-series data
    6. Revitalizing self-organizing map: Anomaly detection using forecasting error patterns
    7. Cluster-based deep one-class classification model for anomaly detection
    8. Measurement data intrusion detection in industrial control systems based on unsupervised learning
    9. A machine learning approach for anomaly detection in industrial control systems based on measurement data

    Year 2020

    1. Anomaly detection in time-series data environment
    2. Detecting anomalies in time-series data using unsupervised learning and analysis on infrequent signatures

    Testbed/Dataset

    Year 2021

    1. Probabilistic attack sequence generation and execution based on mitre att&ck for ics datasets

    Year 2020

    1. Expansion of ICS testbed for security validation based on MITRE ATT&CK techniques
    2. Expanding a programmable cps testbed for network attack analysis
    3. Co-occurrence based security event analysis and visualization for cyber physical systems
  3. The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection

    • zenodo.org
    • autovi.utc.fr
    Updated Jun 5, 2024
    Cite
    Philippe Carvalho; Meriem Lafou; Alexandre Durupt; Antoine Leblanc; Yves Grandvalet (2024). The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection [Dataset]. http://doi.org/10.5281/zenodo.10459003
    Available download formats: zip, txt, bin
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Philippe Carvalho; Meriem Lafou; Alexandre Durupt; Antoine Leblanc; Yves Grandvalet
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    See the official website: https://autovi.utc.fr

    Modern industrial production lines must be set up with robust defect inspection modules that are able to withstand high product variability. This means that in a context of industrial production, new defects that are not yet known may appear, and must therefore be identified.

    On industrial production lines, the typology of potential defects is vast (texture, part failure, logical defects, etc.). Inspection systems must therefore be able to detect non-listed defects, i.e., defects not yet observed when the inspection system was developed. Solving this problem requires research and development of unsupervised AI algorithms on real-world data.

    Renault Group and the Université de technologie de Compiègne (Roberval and Heudiasyc Laboratories) have jointly developed the Automotive Visual Inspection Dataset (AutoVI), the purpose of which is to be used as a scientific benchmark to compare and develop advanced unsupervised anomaly detection algorithms under real production conditions. The images were acquired on Renault Group's automotive production lines, in a genuine industrial production line environment, with variations in brightness and lighting on constantly moving components. This dataset is representative of actual data acquisition conditions on automotive production lines.

    The dataset contains 3950 images, split into 1530 training images and 2420 testing images.

    The evaluation code can be found at https://github.com/phcarval/autovi_evaluation_code.
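    As a minimal loading sketch (the category name and folder layout below are assumptions; adjust them to the archive you downloaded):

    from pathlib import Path
    from PIL import Image
    import numpy as np

    # Assumed layout: <root>/<category>/{train,test}/.../*.png
    root = Path("autovi/engine_wiring")
    train_paths = sorted((root / "train").rglob("*.png"))
    test_paths = sorted((root / "test").rglob("*.png"))

    # Stack a few training images into an array as a sanity check.
    batch = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in train_paths[:8]])
    print(batch.shape, len(test_paths))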

    Disclaimer
    All defects shown were intentionally created on Renault Group's production lines for the purpose of producing this dataset. The images were examined and labeled by Renault Group experts, and all defects were corrected after shooting.

    License
    Copyright © 2023-2024 Renault Group

    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of the license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.

    For using the data in a way that falls under the commercial use clause of the license, please contact us.

    Attribution
    Please use the following for citing the dataset in scientific work:

    Carvalho, P., Lafou, M., Durupt, A., Leblanc, A., & Grandvalet, Y. (2024). The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection [Dataset]. https://doi.org/10.5281/zenodo.10459003

    Contact
    If you have any questions or remarks about this dataset, please contact us at philippe.carvalho@utc.fr, meriem.lafou@renault.com, alexandre.durupt@utc.fr, antoine.leblanc@renault.com, yves.grandvalet@utc.fr.

    Changelog

    • v1.0.0
      • Cropped engine_wiring, pipe_clip and pipe_staple images
      • Reduced tank_screw, underbody_pipes and underbody_screw image sizes
    • v0.1.1
      • Added ground truth segmentation maps
      • Fixed categorization of some images
      • Added new defect categories
      • Removed tube_fastening and kitting_cart
      • Removed duplicates in pipe_clip
  4. ESA Anomaly Dataset

    • kaggle.com
    • data.niaid.nih.gov
    Updated Jul 9, 2024
    Cite
    Sam Mahoney (2024). ESA Anomaly Dataset [Dataset]. https://www.kaggle.com/datasets/sammahoney/esa-anomaly-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sam Mahoney
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The uploader is not affiliated with ESA; if the authors or ESA require removal of public access, please get in contact.

    The ESA Anomaly Dataset is the first large-scale, real-life satellite telemetry dataset with curated anomaly annotations, originating from three ESA missions. We hope that this unique dataset will allow researchers and scientists from academia, research institutes, national and international space agencies, and industry to benchmark models and approaches on a common baseline, as well as research and develop novel, computationally efficient approaches for anomaly detection in satellite telemetry data.

    The dataset results from the work of an 18-month project carried out by an industry consortium composed of Airbus Defence and Space, KP Labs, and the European Space Agency's European Space Operations Centre. The project, funded by the European Space Agency (ESA), is part of the Artificial Intelligence for Automation (A²I) Roadmap (De Canio et al., 2023), a large endeavour started in 2021 to automate space operations by leveraging artificial intelligence.

    Further details can be found at:
    • arXiv: https://arxiv.org/abs/2406.17826
    • GitHub: https://github.com/kplabs-pl/ESA-ADB

    DOI: 10.5281/zenodo.12528696
    Resource type: Dataset
    Publisher: European Space Agency
    Languages: English

    De Canio, G., Kotowski, K., & Haskamp, C. (2024). ESA Anomaly Dataset (1.0) [Data set]. European Space Agency. https://doi.org/10.5281/zenodo.12528696

  5. Mudestreda Multimodal Device State Recognition Dataset

    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Admadi, Zahra (2024). Mudestreda Multimodal Device State Recognition Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8238652
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Admadi, Zahra
    Truchan, Hubert
    License

    GNU General Public License 3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    Mudestreda Multimodal Device State Recognition Dataset

    The dataset was obtained from a real industrial milling device and provides time series and image data for classification, regression, anomaly detection, remaining useful life (RUL) estimation, signal drift measurement, zero-shot flank tool wear, and feature engineering purposes.

    The official dataset used in the paper "Multimodal Isotropic Neural Architecture with Patch Embedding" ICONIP23.

    Official repository: https://github.com/hubtru/Minape

    Conference paper: https://link.springer.com/chapter/10.1007/978-981-99-8079-6_14

    Mudestreda (MD) | Size: 512 samples (instances, observations) | Modalities: 4 | Classes: 3

    Future research: Regression, Remaining Useful Life (RUL) estimation, Signal Drift detection, Anomaly Detection, Multivariate Time Series Prediction, and Feature Engineering.

    Notice: Tables and images do not render properly.

    Recommended: README.md includes the Mudestreda description and images Mudestreda.png and Mudestreda_Stage.png.

    Data Overview

    Task: Uni/Multi-Modal Classification

    Domain: Industrial Flank Tool Wear of the Milling Machine

    Input (sample): 4 Images: 1 Tool Image, 3 Spectrograms (X, Y, Z axis)

    Output: Machine state classes: Sharp, Used, Dulled

    Evaluation: Accuracy, Precision, Recall, F1-score, ROC curve

    Each tool's wear is categorized sequentially: Sharp → Used → Dulled.

    The dataset includes measurements from ten tools: T1 to T10.

    Data splitting options include random or chronological distribution, without shuffling.

    Options:

    Original data or Augmented data

    Random distribution or Tool Distribution (see Dataset Splitting)

  6. Data set for anomaly detection on a HPC system

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 19, 2023
    Cite
    Andrea Borghesi (2023). Data set for anomaly detection on a HPC system [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3251872
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    Andrea Borghesi
    Francesco Beneventi
    Andrea Bartolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bologna, Bologna, Italy) in the period March-May 2018.

    The data set has been used to train an autoencoder-based model to automatically detect anomalies in a semi-supervised fashion, on a real HPC system.
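    A minimal sketch of that semi-supervised idea (train an autoencoder on normal data only, then flag samples with high reconstruction error; random stand-in data, not the authors' model):

    import torch
    import torch.nn as nn

    x_normal = torch.randn(1024, 16)  # stand-in for normal monitoring features

    model = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 16))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x_normal), x_normal)
        loss.backward()
        opt.step()

    with torch.no_grad():
        errors = ((model(x_normal) - x_normal) ** 2).mean(dim=1)
    threshold = errors.quantile(0.95)  # flag anything above the 95th percentile of normal error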

    This work is described in:

    1) "Anomaly Detection using Autoencoders in High Performance Computing Systems", Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini, IAAI19 (proceedings in process) -- https://arxiv.org/abs/1902.08447

    2) "Online Anomaly Detection in HPC Systems", Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini, AICAS19 (proceedings in process) -- https://arxiv.org/abs/1811.05269

    See the git repository for usage examples & details --> https://github.com/AndreaBorghesi/anomaly_detection_HPC

  7. SMD_OnmiAD

    • kaggle.com
    Updated Nov 7, 2023
    Cite
    Mitch (2023). SMD_OnmiAD [Dataset]. https://www.kaggle.com/datasets/mgusat/smd-onmiad
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mitch
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Source: OmniAD paper, code and https://github.com/NetManAIOps/OmniAnomaly

    SMD (Server Machine Dataset) is a 5-week-long dataset that we collected from a large Internet company. It contains 3 groups of entities (an inherent, native clustering); each entity is named machine-<group_index>-<index>.
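    A loading sketch, assuming the ServerMachineDataset layout from the OmniAnomaly repository (train/, test/, and test_label/ folders of comma-separated .txt files):

    import numpy as np

    train = np.genfromtxt("ServerMachineDataset/train/machine-1-1.txt", delimiter=",")
    test = np.genfromtxt("ServerMachineDataset/test/machine-1-1.txt", delimiter=",")
    labels = np.genfromtxt("ServerMachineDataset/test_label/machine-1-1.txt", delimiter=",")

    print(train.shape, test.shape)   # (timestamps, metrics)
    print("anomaly ratio:", labels.mean())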

  8. ToyADMOS2 dataset: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Ohishi, Yasunori (2024). ToyADMOS2 dataset: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4580269
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Niizumi, Daisuke
    Harada, Noboru
    Takeuchi, Daiki
    Ohishi, Yasunori
    Saito, Shoichiro
    Yasuda, Masahiro
    Description

    ToyADMOS2 dataset is a large-scale dataset for anomaly detection in machine operating sounds (ADMOS), designed for evaluating systems under domain-shift conditions. It consists of two sub-datasets for machine-condition inspection: fault diagnosis of machines with geometrically fixed tasks ("toy car") and fault diagnosis of machines with moving tasks ("toy train"). Domain shifts are represented by introducing several differences in operating conditions, such as the use of the same machine type but with different machine models and part configurations, different operating speeds, microphone arrangements, etc. Each sub-dataset contains over 27 k samples of normal machine-operating sounds and over 8 k samples of anomalous sounds recorded at a 48-kHz sampling rate. A subset of the ToyADMOS2 dataset was used in the DCASE 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions.

    What makes this dataset different from others is that it is not used as is, but in conjunction with the tool provided on GitHub. The mixer tool lets you create datasets with any combination of recordings by describing the amount you need in a recipe file.

    The samples are compressed as MPEG-4 ALS (MPEG-4 Audio Lossless Coding) with a suffix of '.mp4', and can be loaded using the audioread or librosa Python modules.
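    For example (the file name below is hypothetical), a clip can be loaded at its native rate with librosa, which falls back to audioread for .mp4 containers:

    import librosa

    # Hypothetical file name; sr=None preserves the native 48 kHz sampling rate.
    y, sr = librosa.load("ToyADMOS2/toy_car/example_clip.mp4", sr=None, mono=True)
    print(y.shape, sr)  # sr should be 48000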

    The total size of files under a folder ToyADMOS2 is 149 GB, and the total size of example benchmark datasets that are created from the ToyADMOS2 dataset is 13.2 GB.

    The detail of the dataset is described in [1] and GitHub: https://github.com/nttcslab/ToyADMOS2-dataset

    License: see LICENSE.pdf for the detail of the license.

    [1] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, "ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions," 2021. https://arxiv.org/abs/2106.02369

  9. MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation for Domain Generalization Task

    • zenodo.org
    • data.niaid.nih.gov
    Updated May 11, 2022
    Cite
    Kota Dohi; Tomoya Nishida; Harsh Purohit; Ryo Tanabe; Takashi Endo; Masaaki Yamamoto; Yuki Nikaido; Yohei Kawaguchi (2022). MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation for Domain Generalization Task [Dataset]. http://doi.org/10.5281/zenodo.6529888
    Available download formats: zip
    Dataset updated
    May 11, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kota Dohi; Tomoya Nishida; Harsh Purohit; Ryo Tanabe; Takashi Endo; Masaaki Yamamoto; Yuki Nikaido; Yohei Kawaguchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection for the domain generalization task (MIMII DG). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, bearings, slide rails, and valves. The data for each machine type includes three subsets called "sections", and each section roughly corresponds to a type of domain shift. This dataset is a subset of the dataset for DCASE 2022 Challenge Task 2, so it is entirely the same as the data included in the development dataset. For more information, please see the pages of the development dataset and the task description for DCASE 2022 Challenge Task 2.

    Baseline system

    Two simple baseline systems are available in the GitHub repositories: an autoencoder-based baseline and a MobileNetV2-based baseline. The baseline systems provide a simple entry-level approach that gives reasonable performance on the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Conditions of use

    This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    We will publish a paper on the dataset and will announce the citation information for it, so please make sure to cite it if you use this dataset.

    Feedback

    If there is any problem, please contact us.

  10. pyhydroqc Sensor Data QC: Single Site Example

    • search.dataone.org
    • hydroshare.org
    Updated Dec 30, 2023
    Cite
    Amber Spackman Jones (2023). pyhydroqc Sensor Data QC: Single Site Example [Dataset]. http://doi.org/10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Amber Spackman Jones
    Time period covered
    Jan 1, 2017 - Dec 31, 2017
    Description

    This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.

    This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.

    Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when they exceed a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction:

    • Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable.
    • Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables.
    • Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest.
    • Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.
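    A sketch of that threshold-based detection rule with a generic ARIMA model (statsmodels on synthetic data; this illustrates the idea, not the pyhydroqc API):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    series = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.1, 500))
    series.iloc[250] += 2.0  # inject one anomalous point

    fit = ARIMA(series, order=(2, 0, 2)).fit()
    residuals = series - fit.predict()  # observation minus model estimate
    anomalies = residuals.abs() > 4 * residuals.std()  # threshold on the residuals
    print(anomalies.sum(), "points flagged")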

    The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.

    The anomaly detection and correction workflow involves the following steps:

    1. Retrieving data
    2. Applying rules-based detection to screen data and apply initial corrections
    3. Identifying and correcting sensor drift and calibration (if applicable)
    4. Developing a model (i.e., ARIMA or LSTM)
    5. Applying the model to make time series predictions
    6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
    7. Widening the window over which an anomaly is identified
    8. Aggregating detections resulting from multiple models
    9. Making corrections for anomalous events

    Instructions to run the notebook through the CUAHSI JupyterHub:

    1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into the CUAHSI JupyterHub using your HydroShare credentials.
    2. Select 'Python 3.8 - Scientific' as the server and click Start.
    3. From your JupyterHub directory, click on the ExampleNotebook.ipynb file.
    4. Execute each cell in the code by clicking the Run button.

  11. Dataset Artifact for Prodigy: Towards Unsupervised Anomaly Detection in Production HPC Systems

    • zenodo.org
    • data.niaid.nih.gov
    Updated Nov 12, 2023
    Cite
    Burak Aksar; Efe Sencan; Benjamin Schwaller; Omar Aaziz; Vitus Leung; Jim Brandt; Brian Kulis; Manuel Egele; Ayse Coskun (2023). Dataset Artifact for Prodigy: Towards Unsupervised Anomaly Detection in Production HPC Systems [Dataset]. http://doi.org/10.5281/zenodo.8079388
    Available download formats: tar, json
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Burak Aksar; Efe Sencan; Benjamin Schwaller; Omar Aaziz; Vitus Leung; Jim Brandt; Brian Kulis; Manuel Egele; Ayse Coskun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains a small set of application runs from the Eclipse supercomputer. The applications were run with and without synthetic HPC performance anomalies. More detailed information regarding the synthetic anomalies can be found at: https://github.com/peaclab/HPAS.

    We have chosen four applications, namely LAMMPS, sw4, sw4Lite, and ExaMiniMD, to encompass both real and proxy applications. We executed each application five times on four compute nodes without introducing any anomalies. To showcase our experiment, we specifically selected the "memleak" anomaly, as it is one of the most commonly occurring types, and also executed each application five times with this anomaly. The collected dataset consists of a total of 160 samples, with 80 samples labeled as anomalous and 80 samples labeled as healthy. For details of the applications, please refer to the paper.

    The applications were run on Eclipse, which is situated at Sandia National Laboratories. Eclipse comprises 1488 compute nodes, each equipped with 128GB of memory and two sockets. Each socket contains 18 E5-2695 v4 CPU cores with 2-way hyperthreading, providing substantial computational power for scientific and engineering applications.

  12. BETH Dataset

    • kaggle.com
    Updated Jul 29, 2021
    Cite
    Kate Highnam (2021). BETH Dataset [Dataset]. https://www.kaggle.com/katehighnam/beth-dataset/code
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kate Highnam
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset corresponds to the paper "BETH Dataset: Real Cybersecurity Data for Anomaly Detection Research" by Kate Highnam* (@jinxmirror13), Kai Arulkumaran* (@kaixhin), Zachary Hanif*, and Nicholas R. Jennings (@LboroVC).

    This paper was published in the ICML Workshop on Uncertainty and Robustness in Deep Learning 2021 and at the Conference on Applied Machine Learning for Information Security (CAMLIS 2021).

    THIS DATASET IS STILL BEING UPDATED

    Context

    When deploying machine learning (ML) models in the real world, anomalous data points and shifts in the data distribution are inevitable. From a cyber security perspective, these anomalies and dataset shifts are driven by both defensive and adversarial advancement. To withstand the cost of critical system failure, the development of robust models is therefore key to the performance, protection, and longevity of deployed defensive systems.

    We present the BPF-extended tracking honeypot (BETH) dataset as the first cybersecurity dataset for uncertainty and robustness benchmarking. Collected using a novel honeypot tracking system, our dataset has the following properties that make it attractive for the development of robust ML methods:

    1. At over eight million data points, this is one of the largest cyber security datasets available
    2. It contains modern host activity and attacks
    3. It is fully labelled
    4. It contains highly structured but heterogeneous features
    5. Each host contains benign activity and at most a single attack, which is ideal for behavioural analysis and other research tasks.

    In addition to the described dataset, further data is currently being collected and analysed to add alternative attack vectors to the dataset.

    There are several existing cyber security datasets used in ML research, including the KDD Cup 1999 Data (Hettich & Bay, 1999), the 1998 DARPA Intrusion Detection Evaluation Dataset (Labs, 1998; Lippmann et al., 2000), the ISCX IDS 2012 dataset (Shiravi et al., 2012), and NSL-KDD (Tavallaee et al., 2009), which primarily removes duplicates from the KDD Cup 1999 Data. Each includes millions of records of realistic activity for enterprise applications, with labels for attacks or benign activity. The KDD1999, NSLKDD, and ISCX datasets contain network traffic, while the DARPA1998 dataset also includes limited process calls. However, these datasets are at best almost a decade old, and were collected on on-premises servers. In contrast, BETH contains modern host activity and activity collected from cloud services, making it relevant for current real-world deployments. In addition, some datasets include artificial user activity (Shiravi et al., 2012) while BETH contains only real activity. BETH is also one of the few datasets to include both kernel-process and network logs, providing a holistic view of malicious behaviour.

    Content

    The BETH dataset currently represents 8,004,918 events collected over 23 honeypots, running for about five noncontiguous hours on a major cloud provider. For benchmarking and discussion, we selected the initial subset of the process logs. This subset was further divided into training, validation, and testing sets with a rough 60/20/20 split based on host, quantity of logs generated, and the activity logged; only the test set includes an attack.

    The dataset is composed of two sensor logs: kernel-level process calls and network traffic. The initial benchmark subset only includes process logs. Each process call consists of 14 raw features and 2 hand-crafted labels.

    See the paper for more details. For details on the events recorded within the logs, see this report.
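    As a quick look at the data (the file name below is an assumption based on the Kaggle listing; adjust to the files you downloaded):

    import pandas as pd

    # Assumed file name; the benchmark subset records 14 raw features
    # plus 2 hand-crafted labels per process call.
    train = pd.read_csv("labelled_training_data.csv")
    print(train.shape)
    print(train.columns.tolist())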

    Benchmarks

    Code for our benchmarks, as detailed in the paper, are available through Github at: https://github.com/jinxmirror13/BETH_Dataset_Analysis

    Acknowledgements

    Thank you to Dr. Arinbjörn Kolbeinsson for his assistance in analysing the data and the reviewers for their positive feedback.

  13. Data from: Multi-Source Distributed System Data for AI-powered Analytics

    • zenodo.org
    • explore.openaire.eu
    Updated Nov 10, 2022
    Cite
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
    Available download formats: zip
    Dataset updated
    Nov 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems. The major contributions have been materialized in the form of novel algorithms. Typically, researchers took on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms. Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance. Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research. Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

    General Information:

    This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.

    You may find details of this dataset from the original paper:

    Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

    If you use the data, implementation, or any details of the paper, please cite!

    BIBTEX:


    @inproceedings{nedelkoski2020multi,
     title={Multi-source Distributed System Data for AI-Powered Analytics},
     author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
     booktitle={European Conference on Service-Oriented and Cloud Computing},
     pages={161--176},
     year={2020},
     organization={Springer}
    }
    


    The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

    The raw logs in both datasets contain the same files. If the user wants the logs filtered by time with respect to the two datasets, they should refer to the timestamps in the metrics (these provide the time window). In addition, we suggest using the provided aggregated time-ranged logs for both datasets in CSV format.

    Important: The logs and the metrics are synchronized with respect to time, and both are recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, i.e., 2 hours behind). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.

    Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/

  14. SMAP Dataset

    • paperswithcode.com
    Updated Jun 6, 2024
    Cite
    Kyle Hundman; Valentino Constantinou; Christopher Laporte; Ian Colwell; Tom Soderstrom (2024). SMAP Dataset [Dataset]. https://paperswithcode.com/dataset/smap
    Dataset updated
    Jun 6, 2024
    Authors
    Kyle Hundman; Valentino Constantinou; Christopher Laporte; Ian Colwell; Tom Soderstrom
    Description

    The Soil Moisture Active Passive (SMAP) dataset is a dataset of soil moisture measurements and telemetry information from NASA's SMAP satellite mission. Originally published in https://arxiv.org/abs/1802.04431 and used for the unsupervised anomaly detection task in time series data, it was later used in many popular anomaly detection methods and benchmarks that distribute it in their repositories, e.g., https://github.com/OpsPAI/MTAD

  15. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI)

    • elki-project.github.io
    • zenodo.org
    Updated Sep 2, 2011
    Cite
    Erich Schubert; Arthur Zimek (2011). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Dataset updated
    Sep 2, 2011
    Dataset provided by
    TU Dortmund University
    University of Southern Denmark, Denmark
    Authors
    Erich Schubert; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.

  16. Data from: MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection

    • zenodo.org
    Updated Feb 29, 2020
    Cite
    Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido; Kaori Suefusa; Yohei Kawaguchi (2020). MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection [Dataset]. http://doi.org/10.5281/zenodo.3384388
    Available download formats: zip
    Dataset updated
    Feb 29, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Harsh Purohit; Ryo Tanabe; Kenji Ichige; Takashi Endo; Yuki Nikaido; Kaori Suefusa; Yohei Kawaguchi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated from four types of industrial machines, i.e., valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contains normal sounds (from 5000 seconds to 10000 seconds) and anomalous sounds (about 1000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). Also, the background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded by an eight-channel microphone array at a 16 kHz sampling rate and 16 bits per sample. The MIMII dataset serves as a benchmark for sound-based machine fault diagnosis. Users can test performance for specific functions, e.g., unsupervised anomaly detection, transfer learning, noise robustness, etc. The details of the dataset are described in [1][2].

    This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

    A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/

    *1: This version "public 1.0" contains four models (model ID 00, 02, 04, and 06). The remaining three models will be released in a future edition.

    [1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.

    [2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.

  17. DCASE2021 UAD-S UMAP Data

    • zenodo.org
    • data.niaid.nih.gov
    Updated Aug 23, 2021
    Cite
    Andres Fernandez Rodriguez; Mark D. Plumbley (2021). DCASE2021 UAD-S UMAP Data [Dataset]. http://doi.org/10.5281/zenodo.5123024
    Available download formats: zip
    Dataset updated
    Aug 23, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andres Fernandez Rodriguez; Mark D. Plumbley
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Support data for our paper:

    USING UMAP TO INSPECT AUDIO DATA FOR UNSUPERVISED ANOMALY DETECTION UNDER DOMAIN-SHIFT CONDITIONS

    ArXiv preprint can be found here. Code for the experiment software pipeline described in the paper can be found here. The pipeline requires and generates different forms of data. Here we provide the following:

    1. AudioSet_wav_fragments.zip: This is a custom selection of 39437 wav files (32kHz, mono, 10 seconds) randomly extracted from AudioSet (originally released under CC-BY). In addition to this custom subset, the paper also uses the following ones, which can be downloaded at their respective websites:
      1. DCASE2021 Task 2 Development Dataset
      2. DCASE2021 Task 2 Additional Training Dataset
      3. Fraunhofer's IDMT-ISA-ELECTRIC-ENGINE Dataset
    2. dcase2021_uads_umaps.zip: To compute the UMAPs, first the log-STFT, log-mel and L3 representations must be extracted, and then the UMAPs must be computed. This can take a substantial amount of time and resources. For convenience, we provide here the 72 UMAPs discussed in the paper.
    3. dcase2021_uads_umap_plots.zip: Also for convenience, we provide here the 198 high-resolution scatter plots rendered from the UMAPs.

    For a comprehensive visual inspection of the computed representations, it is sufficient to download the plots only. Users interested in exploring the plots interactively will need to download all the audio datasets and compute the log-STFT, log-mel and L3 representations as well as the UMAPs themselves (code provided in the GitHub repository). UMAPs for further representations can also be computed and plotted.
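    For orientation, computing one such UMAP from a feature matrix is a single call to umap-learn (random stand-in features here; the real pipeline first extracts the log-STFT, log-mel, or L3 representations):

    import numpy as np
    import umap  # pip install umap-learn

    features = np.random.rand(2000, 128)  # stand-in for per-clip audio features
    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(features)
    print(embedding.shape)  # (2000, 2)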

  18. Amazon-Fraud Dataset

    • paperswithcode.com
    Updated Dec 23, 2024
    Cite
    Yingtong Dou; Zhiwei Liu; Li Sun; Yutong Deng; Hao Peng; Philip S. Yu (2024). Amazon-Fraud Dataset [Dataset]. https://paperswithcode.com/dataset/amazon-fraud
    Dataset updated
    Dec 23, 2024
    Authors
    Yingtong Dou; Zhiwei Liu; Li Sun; Yutong Deng; Hao Peng; Philip S. Yu
    Description

    Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.

    Dataset Statistics

    # Nodes | % Fraud Nodes (Class=1)
    11,944 | 9.5

    Relation | # Edges
    U-P-U |
    U-S-U |
    U-V-U | 1,036,737
    All |

    Graph Construction

    The Amazon dataset includes product reviews under the Musical Instruments category. Similar to this paper, we label users with more than 80% helpful votes as benign entities and users with less than 20% helpful votes as fraudulent entities. We conduct a fraudulent user detection task on the Amazon-Fraud dataset, which is a binary classification task. We take 25 handcrafted features from this paper as the raw node features for Amazon-Fraud. We take users as nodes in the graph and design three relations: 1) U-P-U: connects users reviewing at least one same product; 2) U-S-U: connects users having at least one same star rating within one week; 3) U-V-U: connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users.
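    A toy sketch of the U-V-U rule (top 5% pairwise TF-IDF similarity; four stand-in reviews, one per user):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    reviews = ["great sound quality", "great sound, decent quality",
               "arrived broken", "totally broken on arrival"]
    sim = cosine_similarity(TfidfVectorizer().fit_transform(reviews))

    iu = np.triu_indices_from(sim, k=1)      # unique user pairs
    threshold = np.quantile(sim[iu], 0.95)   # top-5% similarity cutoff
    edges = [(i, j) for i, j in zip(*iu) if sim[i, j] >= threshold]
    print(edges)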

    To download the dataset, please visit this Github repo. For any other questions, please email ytongdou(AT)gmail.com for inquiry.

  19. UCF-Crime Dataset

    • paperswithcode.com
    Updated Nov 26, 2023
    Cite
    Waqas Sultani; Chen Chen; Mubarak Shah (2023). UCF-Crime Dataset [Dataset]. https://paperswithcode.com/dataset/ucf-crime
    Dataset updated
    Nov 26, 2023
    Authors
    Waqas Sultani; Chen Chen; Mubarak Shah
    Description

    The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. These anomalies are selected because they have a significant impact on public safety.

    This dataset can be used for two tasks. First, general anomaly detection considering all anomalies in one group and all normal activities in another group. Second, recognizing each of the 13 anomalous activities.

  20. R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge

    • zenodo.org
    • explore.openaire.eu
    Updated Apr 17, 2022
    Cite
    Gregor Kasieczka; Ben Nachman; David Shih (2022). R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge [Dataset]. http://doi.org/10.5281/zenodo.4287694
    Available download formats: bin
    Dataset updated
    Apr 17, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gregor Kasieczka; Ben Nachman; David Shih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the first R&D dataset for the LHC Olympics 2020 Anomaly Detection Challenge. It consists of 1M QCD dijet events and 100k W'->XY events, with X->qq and Y->qq. The W', X, and Y masses are 3.5 TeV, 500 GeV and 100 GeV respectively. The events are produced using Pythia8 and Delphes 3.4.1, with no pileup or MPI included. They are selected using a single fat-jet (R=1) trigger with pT threshold of 1.2 TeV.

    The events are randomly shuffled together, but for the purposes of testing and development, we provide the user with a signal/background truth bit for each event. Obviously, the truth bit will not be included in the actual challenge.

    These events are stored as pandas dataframes saved to compressed h5 format. For each event, all Delphes reconstructed particles in the event are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information such as particle charge is not included. Events are zero padded to constant size arrays of 700 particles, with the truth bit appended at the end. The array format is therefore (Nevents=1.1M, 2101).
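    A reading sketch in pandas (the raw-events file name is assumed from the Zenodo record; start/stop read a slice rather than all 1.1M rows):

    import pandas as pd

    df = pd.read_hdf("events_anomalydetection.h5", start=0, stop=10000)

    events = df.to_numpy()
    truth = events[:, -1]  # signal/background truth bit
    particles = events[:, :-1].reshape(len(events), 700, 3)  # zero-padded (pT, eta, phi)
    print(particles.shape, "signal fraction:", truth.mean())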

    For more information, including an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.

    https://lhco2020.github.io/homepage/

    UPDATE May 18 2020

    We have uploaded a second signal dataset for R&D, consisting of 100k W'->XY with X,Y->qqq (i.e. 3-prong substructure). Everything else about this signal dataset (particle masses, trigger, Pythia configuration, detector simulation) is the same as the previous one described above.

    UPDATE November 23 2020

    We now include high-level feature files for the background and 2-prong signal (events_anomalydetection_v2.features.h5) and for the 3-prong signal (events_anomalydetection_Z_XY_qqq.features.h5). To produce the features, we have clustered every event into R=1 jets using the anti-kT algorithm. The features (calculated using fastjet plugins) are the 3-momenta, invariant masses, and n-jettiness variables tau1, tau2 and tau3 for the highest pT jet (j1) and the second highest pT jet (j2):

    'pxj1', 'pyj1', 'pzj1', 'mj1', 'tau1j1', 'tau2j1', 'tau3j1', 'pxj2', 'pyj2', 'pzj2', 'mj2', 'tau1j2', 'tau2j2', 'tau3j2'

    The rows (events) in each feature file should be ordered exactly the same as in their corresponding raw event file. For convenience, we have also included the label (1 for signal and 0 for background) as an additional column in the first feature file (events_anomalydetection_v2.features.h5).
