100+ datasets found
  1. Controlled Anomalies Time Series (CATS) Dataset

    • kaggle.com
    Updated Sep 14, 2023
    Cite
    astro_pat (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/controlled-anomalies-time-series-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are sampled at 1 Hz.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
      • Contamination level of 0.038. This means about 3.8% of the observations (rows) are anomalous.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithms in tracing anomalies back to the right root-cause channel.
    • Affected channels. In addition to the knowledge of the root-cause channel in which the anomaly first developed, we provide information on channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect by eye (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect these obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being on or off is nominal, and the same goes for the switch, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli to test the usefulness of context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type of noise, at any amplitude, on top of the provided series. This makes it well suited to test how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
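    As a rough sketch of how the properties above are typically used (the column names, split logic, and noise level here are assumptions for illustration, not part of the dataset): the first fifth of the rows serves as the nominal-only training split, and noise can be layered on top for robustness studies. A small synthetic frame stands in for the real 5-million-row table:

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical stand-in for the CATS table: the real dataset has
    # 5,000,000 rows and 17 channels; here a small synthetic frame.
    rng = np.random.default_rng(0)
    n_rows, n_channels = 10_000, 17
    df = pd.DataFrame(rng.standard_normal((n_rows, n_channels)),
                      columns=[f"channel_{i}" for i in range(n_channels)])

    # The first fifth of the series is nominal-only (1M of 5M rows in CATS),
    # suitable for learning "normal" behaviour; the rest mixes nominal and
    # anomalous segments.
    split = n_rows // 5
    train, test = df.iloc[:split], df.iloc[split:]

    # The signals ship noise-free, so robustness studies add their own noise.
    noisy_test = test + rng.normal(scale=0.1, size=test.shape)

    print(train.shape, test.shape)
    ```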

    [1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. "Anomaly Detection in Time Series: A Comprehensive Evaluation." PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602

    About Solenix

    The dataset provider, Solenix, is an international company providing software e...

  2. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    bin (available download formats)
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are sampled at 1 Hz.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect by eye (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect these obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being on or off is nominal, and the same goes for the switch, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli to test the usefulness of context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type of noise, at any amplitude, on top of the provided series. This makes it well suited to test how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    [1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. "Anomaly Detection in Time Series: A Comprehensive Evaluation." PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  3. Medical Out-of-Distribution Analysis Challenge 2022

    • explore.openaire.eu
    Updated Mar 16, 2022
    + more versions
    Cite
    David Zimmerer; Jens Petersen; Gregor Köhler; Paul Jäger; Peter Full; Klaus Maier-Hein; Tobias Roß; Tim Adler; Annika Reinke; Lena Maier-Hein (2022). Medical Out-of-Distribution Analysis Challenge 2022 [Dataset]. http://doi.org/10.5281/zenodo.6362313
    Explore at:
    Dataset updated
    Mar 16, 2022
    Authors
    David Zimmerer; Jens Petersen; Gregor Köhler; Paul Jäger; Peter Full; Klaus Maier-Hein; Tobias Roß; Tim Adler; Annika Reinke; Lena Maier-Hein
    Description

    Despite overwhelming successes in recent years, progress in the field of biomedical image computing still largely depends on the availability of annotated training examples. This annotation process is often prohibitively expensive because it requires the valuable time of domain experts. Additionally, this approach simply does not scale well: whenever a new imaging modality is created, acquisition parameters change, or even something as basic as the target demographic shifts, new annotated cases have to be created to allow methods to cope with the resulting images. Image labeling is thus bound to become the major bottleneck in the coming years. Furthermore, it has been shown that many algorithms used in image analysis are vulnerable to out-of-distribution samples, resulting in wrong and overconfident decisions [20, 21, 22, 23]. In addition, physicians can overlook unexpected conditions in medical images, often termed "inattentional blindness". In [1], Drew et al. noted that 50% of trained radiologists did not notice a gorilla image rendered into a lung CT scan when assessing lung nodules. One approach that does not require labeled images and can generalize to unseen pathological conditions is Out-of-Distribution detection, or anomaly detection (the two terms are used interchangeably in this context). Anomaly detection can recognize and outline conditions that have not been previously encountered during training; it thus circumvents the time-consuming labeling process and can quickly be adapted to new modalities. Additionally, by highlighting such abnormal regions, anomaly detection can guide the physicians' attention to otherwise overlooked abnormalities in a scan and potentially reduce the time required to inspect medical images.
However, while there is a lot of recent research on improving anomaly detection [8, 9, 10, 11, 12, 13, 14, 15, 16, 17], especially with a focus on the medical field [4, 5, 6, 7], a common dataset/benchmark for comparing different approaches is missing, so it is currently hard to compare proposed approaches fairly. While common datasets for natural data, such as defect detection [3] or abnormal traffic scene detection [2], were proposed in the last few months, we tried to tackle this issue for medical imaging with last year's challenge [25]. In a setting similar to last year's, we propose the medical out-of-distribution challenge as a standardized dataset and benchmark for anomaly detection. We propose two different tasks. First, a sample-wise (i.e., patient-wise) analysis, i.e., detecting out-of-distribution samples, for example those with a pathological condition or any other condition not seen in the training set. Such samples can pose a problem to classically supervised algorithms, and detecting them could further allow physicians to prioritize patients. Second, a voxel-wise analysis, i.e., giving a score for each voxel, highlighting abnormal conditions and potentially guiding the physician. However, there are a few aspects to consider when choosing an anomaly detection dataset. First, as in reality, the types of anomalies should not be known beforehand. This is a particular problem when a dataset tests only a single pathological condition, which is vulnerable to exploitation: even with an educated guess (based on the dataset) and a fully supervised segmentation approach trained on a separate, disallowed dataset, one could outperform other rightfully trained anomaly detection approaches. Furthermore, making the exact types of anomalies known can bias the evaluation.
Studies have shown that proposed anomaly detection algorithms tend to overfit on a given task when properties of the test set and the kinds of anomalies are known beforehand, which further hinders the comparability of different algorithms [6, 18, 19, 23]. Second, combining test sets from different sources with alternative conditions may also cause problems: by definition, the different sources already introduce a distribution shift relative to the training dataset, complicating a clean and meaningful evaluation. To address these issues we provide two datasets with more than 600 scans each, one brain MRI dataset and one abdominal CT dataset, to allow a comparison of the generalizability of the approaches. To prevent overfitting on the (types of) anomalies in our test set, the test set will be kept confidential at all times. The training set consists of hand-selected scans in which no anomalies were identified; the remaining scans are assigned to the test set. Thus some scans in the test set contain no anomalies, while others contain naturally occurring anomalies. In addition to the natural anomalies, we add synthetic anomalies. We choose different structured types of synthetic anomalies (e.g., a tumor or an image of a gorilla rendered into a brain scan [1]) to cover a broad var...
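The two evaluation granularities described above (sample-wise and voxel-wise) relate in a simple way: a per-scan score can be derived from a voxel-wise score map. A minimal sketch with an invented toy volume; this is an illustration, not the challenge's actual scoring:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for a voxel-wise anomaly score map of one scan
# (a real brain MRI or abdominal CT volume is far larger).
voxel_scores = rng.random((16, 16, 16))

# Voxel-wise task: the map itself highlights abnormal regions.
# Sample-wise task: reduce the map to one score per scan; taking a
# high quantile is one simple, commonly used reduction.
sample_score = float(np.quantile(voxel_scores, 0.99))
print(voxel_scores.shape)
```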

  4. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • s.cnmilf.com
    • +2 more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors, and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations, with only a subset of features available at any one location. Moving these petabytes of data to a single location can waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization, with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the Commercial Modular Aero-Propulsion System Simulation (CMAPSS).
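    The core idea of the abstract, centralizing only a small sample of vertically partitioned data, can be illustrated with a toy NumPy sketch. This is not the paper's actual algorithm; the partition sizes, sample size, and the simple 3-sigma outlier score are invented for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Toy stand-in: 1,000 rows of 6 features, vertically partitioned across
    # two "sites" (columns 0-2 at site A, columns 3-5 at site B).
    X = rng.standard_normal((1000, 6))
    site_a, site_b = X[:, :3], X[:, 3:]

    # Instead of shipping all rows, each site contributes the same small
    # random subset of row indices (the "very small sample" of the abstract).
    sample_idx = rng.choice(len(X), size=50, replace=False)
    central_sample = np.hstack([site_a[sample_idx], site_b[sample_idx]])

    # A simple centralized outlier score on the sample: distance from the
    # sample mean in units of per-feature standard deviation.
    mu, sigma = central_sample.mean(axis=0), central_sample.std(axis=0)
    scores = np.abs((central_sample - mu) / sigma).max(axis=1)
    outliers = sample_idx[scores > 3.0]
    print(central_sample.shape)
    ```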

  5. Visual Anomaly Detection - Thermostatic Valves

    • kaggle.com
    Updated Jun 5, 2024
    Cite
    Edge Impulse (2024). Visual Anomaly Detection - Thermostatic Valves [Dataset]. https://www.kaggle.com/datasets/edgeimpulse/visual-anomaly-detection-thermostatic-valves
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    Kaggle
    Authors
    Edge Impulse
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset has been collected by Edge Impulse to explain the FOMO-AD (visual anomaly detection) model architecture.


    The dataset is composed of 195 images:

    • Training set: 121 images without anomalies
    • Testing set: 49 images containing anomalies and 25 images without anomalies

    How to use this dataset

    To import this data into a new Edge Impulse project, either use:

    edge-impulse-uploader --clean --info-file info.labels
    

    Want to see the results on this dataset?

    Have a look at the Edge Impulse public project to see the results


    Understand the info.labels file

    The info.labels file (located either in each subdirectory or at the folder root) provides detailed information about the labels. The file follows a JSON format with the following structure:

    • version: Indicates the version of the label format.
    • files: A list of objects, where each object represents a supported file format and its associated labels.
      • path: The path or file name.
      • category: Indicates whether the image belongs to the training or testing set.
      • label (optional): Provides information about the labeled objects.
      • type: Specifies the type of label: unlabeled, label, or multi-label.
      • label (optional): The actual label or class name of the sample.
      • labels (optional): The labels in the multi-label format:
        • label: Label for the given period.
        • startIndex: Timestamp in milliseconds.
        • endIndex: Timestamp in milliseconds.
      • metadata (Optional): Additional metadata associated with the image, such as the site where it was collected, the timestamp, or other useful information.
      • boundingBoxes (Optional): A list of objects, where each object represents a bounding box for an object within the image.
      • label: The label or class name of the object within the bounding box.
      • x, y: The coordinates of the top-left corner of the bounding box.
      • width, height: The width and height of the bounding box.
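    A minimal sketch of reading the structure described above; the file paths, label values, and metadata here are made up for illustration:

    ```python
    import json

    # Illustrative info.labels content following the structure described
    # above (field names from the docs; the concrete values are invented).
    info_labels = json.loads("""
    {
      "version": 1,
      "files": [
        {
          "path": "valve_001.jpg",
          "category": "training",
          "label": {"type": "label", "label": "no anomaly"}
        },
        {
          "path": "valve_042.jpg",
          "category": "testing",
          "label": {"type": "label", "label": "anomaly"},
          "metadata": {"site": "line-3"}
        }
      ]
    }
    """)

    # Group file paths by their training/testing category.
    by_category = {}
    for f in info_labels["files"]:
        by_category.setdefault(f["category"], []).append(f["path"])
    print(by_category)
    ```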

    Additional resources

  6. DCASE 2025 Challenge Task 2 Additional Training Dataset

    • zenodo.org
    zip
    Updated May 15, 2025
    + more versions
    Cite
    Tomoya Nishida; Tomoya Nishida; Noboru Harada; Noboru Harada; Daisuke Niizumi; Daisuke Niizumi; Davide Albertini; Roberto Sannino; Simone Pradolini; Filippo Augusti; Keisuke Imoto; Keisuke Imoto; Kota Dohi; Kota Dohi; Harsh Purohit; Takashi Endo; Yohei Kawaguchi; Yohei Kawaguchi; Davide Albertini; Roberto Sannino; Simone Pradolini; Filippo Augusti; Harsh Purohit; Takashi Endo (2025). DCASE 2025 Challenge Task 2 Additional Training Dataset [Dataset]. http://doi.org/10.5281/zenodo.15392814
    Explore at:
    zip (available download formats)
    Dataset updated
    May 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tomoya Nishida; Tomoya Nishida; Noboru Harada; Noboru Harada; Daisuke Niizumi; Daisuke Niizumi; Davide Albertini; Roberto Sannino; Simone Pradolini; Filippo Augusti; Keisuke Imoto; Keisuke Imoto; Kota Dohi; Kota Dohi; Harsh Purohit; Takashi Endo; Yohei Kawaguchi; Yohei Kawaguchi; Davide Albertini; Roberto Sannino; Simone Pradolini; Filippo Augusti; Harsh Purohit; Takashi Endo
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description


    This dataset is the "additional training dataset" for the DCASE 2025 Challenge Task 2.

    The data consists of the normal/anomalous operating sounds of eight types of real/toy machines. Each recording is a single-channel, 10- or 12-second audio clip that includes both a machine's operating sound and environmental noise. The following eight types of real/toy machines are used in this task:

    • AutoTrash
    • HomeCamera
    • ToyPet
    • ToyRCCar
    • BandSealer
    • Polisher
    • ScrewFeeder
    • CoffeeGrinder

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is the follow-up to DCASE 2020 Task 2 through DCASE 2024 Task 2. The task this year is to develop an ASD system that meets the following five requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)
    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data, which is called UASD (unsupervised ASD). This is the same requirement as in the previous tasks.
    2. Detect anomalies regardless of domain shifts (domain generalization task)
    In real-world cases, the operational states of a machine or the environmental noise can change, causing domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques to handle these domain shifts. This requirement has been the same since DCASE 2022 Task 2.
    3. Train a model for a completely new machine type
    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should be able to train models without additional hyperparameter tuning. This requirement has been the same since DCASE 2023 Task 2.
    4. Train a model both with or without attribute information
    While additional attribute information can help enhance the detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.
    5. Train a model with additional clean machine data or noise-only data (optional)
    Although the primary training data consists of machine sounds recorded under noisy conditions, in some situations it may be possible to collect clean machine data when the factory is idle or gather noise recordings when the machine itself is not running. Participants are free to incorporate these additional data sources to enhance the accuracy of their models.

    The last, optional requirement is newly introduced in DCASE 2025 Task 2.

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    • "Machine type" indicates the type of machine, which in the additional training dataset is one of eight: auto trash, home camera, toy pet, toy RC car, band sealer, polisher, screw feeder, or coffee grinder.
    • A section is defined as a subset of the dataset for calculating performance metrics.
    • The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.
    • Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.

    Dataset

    This dataset consists of eight machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on a separate Zenodo page as the "evaluation dataset" for DCASE 2025 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips of supplementary sound data containing either clean normal machine sounds in the source domain or noise-only sounds. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute CSV files.

    File names and attribute csv files

    File names and attribute CSV files provide reference labels for each clip. The reference labels for each training/test clip include the machine type, section index, normal/anomaly information, and attributes describing conditions other than normal/anomaly. The machine type is given by the directory name, and the section index by the file name. For datasets other than the evaluation dataset, the normal/anomaly information and the attributes are also given by the file names. Note that for machine types whose attribute information is hidden, the attribute portion of each file name is labeled only as "noAttributes". Attribute CSV files offer easy access to the attributes that cause domain shifts. In these files, the file names, the names of parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...

    For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
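    The row format above pairs a filename with a variable number of (dp, dv) columns. A small parsing sketch; the file names, parameter names, and values below are invented for illustration:

    ```python
    import csv
    import io

    # Illustrative attribute CSV following the row format described above.
    csv_text = """\
    section_00_source_train_normal_0001_speed_28V.wav,speed,28,load,low
    section_00_target_train_normal_0001_noAttributes.wav,,,
    """

    attributes = {}
    for row in csv.reader(io.StringIO(csv_text)):
        filename, rest = row[0].strip(), row[1:]
        # Pair up (domain-shift parameter, value) columns; machine types
        # with hidden attributes leave everything after the filename blank.
        pairs = {rest[i]: rest[i + 1]
                 for i in range(0, len(rest) - 1, 2) if rest[i]}
        attributes[filename] = pairs
    print(attributes)
    ```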

    Recording procedure

    Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed target machine sounds with environmental noise, and only the noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers explaining the details of the recording procedure by the submission deadline.

    Directory structure

    - /eval_data

    - /raw
    - /AutoTrash
    - /train (only normal clips)
    - /section_00_source_train_normal_0001_

    Baseline system

    The baseline system is available in the GitHub repository https://github.com/nttcslab/dcase2023_task2_baseline_ae. It provides a simple entry-level approach that gives reasonable performance on the Task 2 dataset, and it is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Condition of use

    This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    Contact

    If there is any problem, please contact us:

  7. Street Scene Video Anomaly Detection Dataset

    • zenodo.org
    zip
    Updated Mar 26, 2024
    Cite
    Michael Jones; Michael Jones; Bharathkumar Ramachandra; Bharathkumar Ramachandra (2024). Street Scene Video Anomaly Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.10870472
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Jones; Michael Jones; Bharathkumar Ramachandra; Bharathkumar Ramachandra
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Introduction

    The Street Scene dataset consists of 46 training video sequences and 35 testing video sequences taken from a static USB camera looking down on a scene of a two-lane street with bike lanes and pedestrian sidewalks. Videos were collected from the camera at various times during two consecutive summers. All of the videos were taken during the daytime. The dataset is challenging because of the variety of activities taking place such as cars driving, turning, stopping and parking; pedestrians walking, jogging and pushing strollers; and bikers riding in bike lanes. In addition, the videos contain changing shadows, and moving background such as a flag and trees blowing in the wind.

    There are a total of 202,545 color video frames (56,135 for training and 146,410 for testing) each of size 1280 x 720 pixels. The frames were extracted from the original videos at 15 frames per second.

    The 35 testing sequences have a total of 205 anomalous events consisting of 17 different anomaly types. A complete list of anomaly types and the number of each in the test set can be found in our paper.

    Ground truth annotations are provided for each testing video in the form of bounding boxes around each anomalous event in each frame. Each bounding box is also labeled with a track number, meaning each anomalous event is labeled as a track of bounding boxes. Track lengths vary from tens of frames to 5,200 frames, the length of the longest testing sequence. A single frame can have more than one anomaly labeled.
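Evaluation protocols often need these box-level tracks collapsed to coarser labels. A minimal sketch, assuming a hypothetical `(track_id, frame, x, y, w, h)` record layout (the real annotation format is specified in the dataset's README.md):

```python
# Hypothetical annotation records: (track_id, frame_index, x, y, w, h).
# The actual format is documented in the dataset's README.md.
def frame_level_flags(boxes, n_frames):
    """Collapse track-labeled bounding boxes into per-frame anomaly flags."""
    flags = [False] * n_frames
    for track_id, frame, x, y, w, h in boxes:
        flags[frame] = True
    return flags
```

Because a frame can carry several boxes from different tracks, the flag is simply the union over all annotations touching that frame.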

    NOTE: This version of the dataset differs slightly from the original made available in 2020. Anomalies were found in a few of the normal training sequences, so those training frames were deleted from the dataset. Specifically, the following frames were removed:

    Train026: frames 1-184 (car taking a u-turn)

    Train027: frames 1-229 (jaywalkers)

    Train031: frames 1-299 (jaywalkers, illegally parked car)

    At a Glance

    • The size of the unzipped dataset is ~46GB
    • The dataset consists of Train sequences (containing only videos with normal activity), Test sequences (containing some anomalous activity) along with ground truth annotations, and a README.md file describing the data organization and ground truth annotation format.
    • The zip file contains a Train directory, a Test directory and a README.md file.

    Other Resources

    None

    Citation

    If you use the Street Scene dataset in your research, please cite our contribution:

    @inproceedings{ramachandra2020street,
     title={Street Scene: A new dataset and evaluation protocol for video anomaly detection},
     author={Ramachandra, Bharathkumar and Jones, Michael},
     booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
     pages={2569--2578},
     year={2020}
    }
    

    License

    The Street Scene dataset is released under the CC BY-SA 4.0 license.

    All data:

    Created by Mitsubishi Electric Research Laboratories (MERL), 2023
     
    SPDX-License-Identifier: CC-BY-SA-4.0
    
  8. Data from: Anomaly Detection in a Fleet of Systems

    • catalog.data.gov
    • datasets.ai
    • +5more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Anomaly Detection in a Fleet of Systems [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-in-a-fleet-of-systems
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    A fleet is a group of systems (e.g., cars, aircraft) that are designed and manufactured the same way and are intended to be used the same way. For example, a fleet of delivery trucks may consist of one hundred instances of a particular model of truck, each of which is intended for the same type of service: almost the same amount of time and distance driven every day, approximately the same total weight carried, etc. For this reason, one may imagine that data mining for fleet monitoring merely involves collecting operating data from the multiple systems in the fleet and developing some sort of model, such as a model of normal operation that can be used for anomaly detection. However, each member of the fleet will be unique in some ways; there will be minor variations in manufacturing, quality of parts, and usage. For this reason, the typical machine learning and statistics assumption that all the data are independent and identically distributed is not correct. Data from each system in the fleet must be treated as unique so that one can notice significant changes in the operation of that system.
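The per-system point above can be made concrete: fit each fleet member's own mean and spread, and flag readings only relative to that member, so a unit that always runs hot is not flagged merely for differing from its siblings. A minimal sketch (the z-score threshold and data are illustrative, not from the dataset):

```python
import numpy as np

def per_system_outliers(readings, z_thresh: float = 3.0):
    """Flag observations unusual *for that system*, not for the pooled fleet.

    `readings` maps a system id to a 1-D array of sensor values. Each system
    gets its own baseline, so fleet-wide variation is not mistaken for anomaly.
    """
    flagged = {}
    for system_id, x in readings.items():
        z = (x - x.mean()) / (x.std() + 1e-12)  # guard against zero spread
        flagged[system_id] = np.abs(z) > z_thresh
    return flagged
```

A pooled detector on the same data would flag every reading from the steadily "hot" system; the per-system version flags only the genuine spike.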

  9. MAR Address Anomalies

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • datasets.ai
    • +4more
    Updated Feb 4, 2025
    Cite
    City of Washington, DC (2025). MAR Address Anomalies [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/mar-address-anomalies
    Explore at:
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    City of Washington, DC
    Description

    An address anomaly is an address whose location is illogical. In other words, the address does not follow the normal rules of the District of Columbia’s (DC) addressing grid system. There are different types of anomalies.

  10. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv
    Updated Feb 26, 2025
    Cite
    Josef Koumar; Josef Koumar; Karel Hynek; Karel Hynek; Tomáš Čejka; Tomáš Čejka; Pavel Šiška; Pavel Šiška (2025). CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting [Dataset]. http://doi.org/10.5281/zenodo.13382427
    Explore at:
    Available download formats: csv, application/gzip
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Josef Koumar; Josef Koumar; Karel Hynek; Karel Hynek; Tomáš Čejka; Tomáš Čejka; Pavel Šiška; Pavel Šiška
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CESNET-TimeSeries24: The dataset for network traffic forecasting and anomaly detection

    The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is also rich in network anomaly types since it contains all types of anomalies, ensuring a comprehensive evaluation of anomaly detection methods.

    Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series at the institutional and IP-subnet levels to cover all possible anomaly detection or forecasting scopes. Overall, the dataset was created from 66 billion IP flows comprising 4 trillion packets that carry approximately 3.7 petabytes of data. CESNET-TimeSeries24 is a complex real-world dataset that brings insight into the evaluation of forecasting models in real-world environments.

    Please cite the usage of our dataset as:

    Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x

    @Article{cesnettimeseries24,
    author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
    title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
    journal={Scientific Data},
    year={2025},
    month={Feb},
    day={26},
    volume={12},
    number={1},
    pages={338},
    issn={2052-4463},
    doi={10.1038/s41597-025-04603-x},
    url={https://doi.org/10.1038/s41597-025-04603-x}
    }

    Time series

    We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.

    Datapoints created by the aggregation of IP flows contain the following time-series metrics:

    • Simple volumetric metrics: the number of IP flows, the number of packets, and the transmitted data size (i.e. number of bytes)
    • Unique volumetric metrics: the number of unique destination IP addresses, the number of unique destination Autonomous System Numbers (ASNs), and the number of unique destination transport layer ports. The aggregation of unique volumetric metrics is memory intensive, since all unique values must be stored in an array. We used a server with 41 GB of RAM, which was enough for 10-minute aggregation on the ISP network.
    • Ratios metrics: the ratio of UDP/TCP packets, the ratio of UDP/TCP transmitted data size, the direction ratio of packets, and the direction ratio of transmitted data size
    • Average metrics: the average flow duration, and the average Time To Live (TTL)
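Putting the volumetric metrics together, one 10-minute window of flow records for a single IP address collapses into a datapoint like this (the flow-record field names are hypothetical stand-ins, not the dataset's export schema):

```python
# Illustrative IP-flow records for one 10-minute window of a single IP
# address (field names here are hypothetical, not the dataset's own schema).
flows = [
    {"dst_ip": "192.0.2.1",    "dst_asn": 64500, "dst_port": 443, "packets": 12, "bytes": 9000},
    {"dst_ip": "192.0.2.1",    "dst_asn": 64500, "dst_port": 80,  "packets": 3,  "bytes": 400},
    {"dst_ip": "198.51.100.7", "dst_asn": 64501, "dst_port": 443, "packets": 7,  "bytes": 5200},
]

datapoint = {
    # Simple volumetric metrics
    "n_flows": len(flows),
    "n_packets": sum(f["packets"] for f in flows),
    "n_bytes": sum(f["bytes"] for f in flows),
    # Unique volumetric metrics (every distinct value must be held in memory,
    # which is why this aggregation is the memory-intensive one)
    "n_dest_ip": len({f["dst_ip"] for f in flows}),
    "n_dest_asn": len({f["dst_asn"] for f in flows}),
    "n_dest_port": len({f["dst_port"] for f in flows}),
}
```

The ratio and average metrics are computed over the same window in the same pass.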

    Multiple time aggregation: The original datapoints in the dataset are aggregated over 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the dataset: 1 hour and 1 day.

    Time series of institutions: We identified 283 institutions inside the CESNET3 network. Aggregating the time series per institution ID provides a view of each institution's data.

    Time series of institutional subnets: We identified 548 institution subnets inside the CESNET3 network. Aggregating the time series per subnet provides a view of each subnet's data.

    Data Records

    The file hierarchy is described below:

    cesnet-timeseries24/
    |- institution_subnets/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- institutions/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- ip_addresses_full/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- ip_addresses_sample/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- times/
    |  |- times_10_minutes.csv
    |  |- times_1_hour.csv
    |  |- times_1_day.csv
    |- ids_relationship.csv
    |- weekends_and_holidays.csv

    The following list describes time series data fields in CSV files:

    • id_time: Unique identifier for each aggregation interval within the time series, used to segment the dataset into specific time periods for analysis.
    • n_flows: Total number of flows observed in the aggregation interval, indicating the volume of distinct sessions or connections for the IP address.
    • n_packets: Total number of packets transmitted during the aggregation interval, reflecting the packet-level traffic volume for the IP address.
    • n_bytes: Total number of bytes transmitted during the aggregation interval, representing the data volume for the IP address.
    • n_dest_ip: Number of unique destination IP addresses contacted by the IP address during the aggregation interval, showing the diversity of endpoints reached.
    • n_dest_asn: Number of unique destination Autonomous System Numbers (ASNs) contacted by the IP address during the aggregation interval, indicating the diversity of networks reached.
    • n_dest_port: Number of unique destination transport layer ports contacted by the IP address during the aggregation interval, representing the variety of services accessed.
    • tcp_udp_ratio_packets: Ratio of packets sent using TCP versus UDP by the IP address during the aggregation interval, providing insight into the transport protocol usage pattern. This metric belongs to the interval <0, 1> where 1 is when all packets are sent over TCP, and 0 is when all packets are sent over UDP.
    • tcp_udp_ratio_bytes: Ratio of bytes sent using TCP versus UDP by the IP address during the aggregation interval, highlighting the data volume distribution between protocols. This metric belongs to the interval <0, 1> with same rule as tcp_udp_ratio_packets.
    • dir_ratio_packets: Ratio of packet directions (inbound versus outbound) for the IP address during the aggregation interval, indicating the balance of traffic flow directions. This metric belongs to the interval <0, 1>, where 1 is when all packets are sent in the outgoing direction from the monitored IP address, and 0 is when all packets are sent in the incoming direction to the monitored IP address.
    • dir_ratio_bytes: Ratio of byte directions (inbound versus outbound) for the IP address during the aggregation interval, showing the data volume distribution in traffic flows. This metric belongs to the interval <0, 1> with the same rule as dir_ratio_packets.
    • avg_duration: Average duration of IP flows for the IP address during the aggregation interval, measuring the typical session length.
    • avg_ttl: Average Time To Live (TTL) of IP flows for the IP address during the aggregation interval, providing insight into the lifespan of packets.

    Moreover, the time series created by re-aggregation contain the following time-series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:

    • sum_n_dest_ip: Sum of numbers of unique destination IP addresses.
    • avg_n_dest_ip: The average number of unique destination IP addresses.
    • std_n_dest_ip: Standard deviation of numbers of unique destination IP addresses.
    • sum_n_dest_asn: Sum of numbers of unique destination ASNs.
    • avg_n_dest_asn: The average number of unique destination ASNs.
    • std_n_dest_asn: Standard deviation of numbers of unique destination ASNs.
    • sum_n_dest_port: Sum of numbers of unique destination transport layer ports.
    • avg_n_dest_port: The average number of unique destination transport layer ports.
    • std_n_dest_port: Standard deviation of numbers of unique destination transport layer ports.
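Unique counts cannot simply be summed into a true hourly unique count, which is why the re-aggregated series keeps three summary statistics instead. A sketch for one hour of `n_dest_ip` values (the numbers are illustrative, and whether the dataset uses the population or sample standard deviation is not stated; population is assumed here):

```python
import statistics

# Six 10-minute n_dest_ip values covering one hour (illustrative numbers)
n_dest_ip_10min = [4, 7, 5, 6, 4, 4]

hourly = {
    "sum_n_dest_ip": sum(n_dest_ip_10min),
    "avg_n_dest_ip": statistics.mean(n_dest_ip_10min),
    # Population standard deviation; swap in statistics.stdev for the
    # sample version if that matches the dataset's convention.
    "std_n_dest_ip": statistics.pstdev(n_dest_ip_10min),
}
```

The same three statistics are produced for `n_dest_asn` and `n_dest_port` at both the 1-hour and 1-day aggregation levels.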

  11. MVTec AD Dataset

    • datasetninja.com
    • kaggle.com
    Updated Jun 20, 2019
    Cite
    Paul Bergmann; Kilian Batzner; Michael Fauser (2019). MVTec AD Dataset [Dataset]. https://datasetninja.com/mvtec-ad
    Explore at:
    Dataset updated
    Jun 20, 2019
    Dataset provided by
    Dataset Ninja
    Authors
    Paul Bergmann; Kilian Batzner; Michael Fauser
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The authors of MVTec AD (the MVTec Anomaly Detection dataset) addressed the critical task of detecting anomalous structures within natural image data, a crucial aspect of computer vision applications. To facilitate the development of methods for unsupervised anomaly detection, they introduced the MVTec AD dataset, comprising 5,354 high-resolution color images spanning various object and texture categories. The dataset contains both normal images, intended for training, and anomalous images, designed for testing. The anomalies manifest in over 70 distinct types of defects, including scratches, dents, contaminations, and structural alterations. The authors also provide pixel-precise ground-truth annotations for all anomalies.

  12. Satellite Anomalies Due to Environment

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • s.cnmilf.com
    • +5more
    Updated Apr 26, 2025
    Cite
    DOC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce (Point of Contact) (2025). Satellite Anomalies Due to Environment [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/satellite-anomalies-due-to-environment1
    Explore at:
    Dataset updated
    Apr 26, 2025
    Dataset provided by
    National Centers for Environmental Information (https://www.ncei.noaa.gov/)
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    United States Department of Commerce (http://www.commerce.gov/)
    National Environmental Satellite, Data, and Information Service
    Description

    These events range from minor operational problems to permanent spacecraft failures. Australia, Canada, Germany, India, Japan, the United Kingdom, and the United States have contributed data. This database of known satellite anomalies is used to study and identify trends in the anomalous behavior of different families of satellites. The trends include seasonal groupings, diurnal groupings, and anomaly types indicative of certain satellite types and manufacturers. Correlations are computed against several solar-terrestrial data sets. Specifically, geomagnetic activity has been found to have significant effects on satellite behavior. Solar activity and cosmic rays have also proven important in the anomalous behavior of satellites. Information provided by this program can be used in the design phase of spacecraft to prevent the propagation of problems from one spacecraft to the next. It can also be used by operations personnel to anticipate periods of anomalous behavior based on the proven response of an existing craft to environmental conditions.

  13. Table_1_AMAnD: an automated metagenome anomaly detection methodology...

    • frontiersin.figshare.com
    docx
    Updated Jul 11, 2023
    Cite
    Colin Price; Joseph A. Russell (2023). Table_1_AMAnD: an automated metagenome anomaly detection methodology utilizing DeepSVDD neural networks.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2023.1181911.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jul 11, 2023
    Dataset provided by
    Frontiers
    Authors
    Colin Price; Joseph A. Russell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models; one trained on taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on kmer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample, ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away. This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality would make AMAnD well-suited to a wide array of applied metagenomics biosurveillance use-cases, from environmental to clinical utility.
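The kmer-frequency feature space mentioned above can be illustrated with a minimal counter. This is a sketch of the general technique, not AMAnD's own feature extractor:

```python
from collections import Counter

def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of each length-k substring in a DNA sequence."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}
```

Stacking these frequency vectors across samples yields a fixed-width feature matrix that a one-class model such as DeepSVDD can be trained on.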

  14. BigDataAD Benchmark Dataset

    • figshare.com
    zip
    Updated Sep 29, 2023
    Cite
    Kingsley Pattinson (2023). BigDataAD Benchmark Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24040563.v8
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    figshare
    Authors
    Kingsley Pattinson
    License

    Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    The largest real-world dataset for multivariate time series anomaly detection (MTSAD), drawn from the AIOps system of a Real-Time Data Warehouse (RTDW) at a top cloud computing company. All the metrics and labels in our dataset are derived from real-world scenarios. All metrics were obtained from the RTDW instance monitoring system and cover a rich variety of metric types, including CPU usage, queries per second (QPS), and latency, which relate to many important modules within RTDW. We obtain labels from the ticket system, which integrates three main sources of instance anomalies: user service requests, instance unavailability, and fault simulations. User service requests refer to tickets submitted directly by users, whereas instance unavailability is typically detected through existing monitoring tools or discovered by Site Reliability Engineers (SREs). Since the system is usually very stable, we augment the anomaly samples by conducting fault simulations. A fault simulation is a special type of anomaly, planned beforehand, introduced to the system to test its performance under extreme conditions. All records in the ticket system are subject to follow-up processing by engineers, who meticulously mark the start and end times of each ticket. This rigorous approach ensures the accuracy of the labels in our dataset.

  15. Spacecraft Anomaly Data

    • kaggle.com
    Updated Jan 30, 2019
    Cite
    J Harris (2019). Spacecraft Anomaly Data [Dataset]. https://www.kaggle.com/datasets/usaf091847/spacecraft-anomaly-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 30, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    J Harris
    Description

    Context

    Spacecraft Anomaly Data (edited version of the text file provided in the data section)

    This database of spacecraft anomalies was developed by the Solar-Terrestrial Physics Division of the National Geophysical Data Center (NGDC) in 1984. It included the date, time, location, and other pertinent information about incidents of spacecraft operational irregularity due to the natural environment. The ability to attribute anomalies to the natural environment helps rule out hardware or software engineering causes or hostile activity.

    Anomaly events range from minor operational problems which can be easily corrected to permanent spacecraft failures. The database includes spacecraft anomalies in interplanetary space and in near-earth orbit; the majority of the data comes from geostationary spacecraft.

    Many spacecraft are identified by aliases in order to preserve confidentiality; industry has proprietary concerns regarding design (and insurance issues). These aliases are prefaced with the "@" character.

    Due to proprietary and operational security concerns, anomaly contributions slowed to a trickle and ceased in the early 1990s. The original database files have been converted to Excel Spreadsheets.

    Content

    The data in the "anom5j.xls" file are sorted by satellite. Essential information includes the satellite name (BIRD), anomaly date (ADATE), satellite time in UTC (STIMEU) for space-environment context and in local time (STIMEL) for satellite location relative to the sun and earth, orbit, anomaly type (ATYPE), and anomaly diagnosis (ADIAG). Other factors may also play into the anomalies, though less importantly at first.
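In practice the spreadsheet would be loaded with something like `pandas.read_excel("anom5j.xls")`; the grouping idea behind the stated goal (anomaly types per satellite and orbit) can be sketched with plain Python on a few rows. The satellite names, dates, and anomaly-type codes below are invented for illustration:

```python
from collections import defaultdict

# Illustrative rows mirroring the columns described above (values hypothetical).
rows = [
    {"BIRD": "@SAT1",  "ADATE": "1989-03-13", "STIMEL": "02:15", "ATYPE": "ESD"},
    {"BIRD": "@SAT1",  "ADATE": "1989-03-14", "STIMEL": "03:40", "ATYPE": "ESD"},
    {"BIRD": "GOES-7", "ADATE": "1989-09-29", "STIMEL": "14:05", "ATYPE": "SEU"},
]

# Group anomaly types per satellite, the first step toward correlating
# anomaly type with orbit and space-environment conditions.
by_bird = defaultdict(list)
for row in rows:
    by_bird[row["BIRD"]].append(row["ATYPE"])
```

From here, joining ADATE/STIMEU against geomagnetic or solar indices would support the environment-correlation analysis the author describes.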

    Acknowledgements

    These data are courtesy of NGDC (now National Centers for Environmental Information, NCEI). https://www.ngdc.noaa.gov/stp/satellite/anomaly/satelliteanomaly.html

    Motivation

    My goal is to search for correlations between the types of anomalies associated with particular space environment phenomena and spacecraft orbit. When is the environment conducive to certain types of anomalies in particular orbits?

    License

    No known license restrictions. From the copyright notice: As required by 17 U.S.C. 403, third parties producing copyrighted works consisting predominantly of the material produced by U.S. government agencies must provide notice with such work(s) identifying the U.S. Government material incorporated and stating that such material is not subject to copyright protection within the United States. The information on government web pages is in the public domain and not subject to copyright protection within the United States unless specifically annotated otherwise (copyright may be held elsewhere). Foreign copyrights may apply.

  16. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis...

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
    Explore at:
    Available download formats: html
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Anomaly detection is a process of identifying items, events, or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets. A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:

    1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
    2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private cloud computing cluster to a public cloud provider like Amazon Cloud services).
    3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

    We will use a hybrid cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

    OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event. The OceanXtremes web portal will allow users to define their own anomaly or feature types, for which continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data-mining algorithm (i.e., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts, and other relevant metadata to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
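The core data-mining operation described ("differences from climatology or gradients above a specified threshold") reduces to a simple array computation. A minimal NumPy sketch, not OceanXtremes' actual implementation:

```python
import numpy as np

def anomaly_mask(obs: np.ndarray, climatology: np.ndarray, threshold: float) -> np.ndarray:
    """Flag grid cells whose departure from climatology exceeds a threshold."""
    return np.abs(obs - climatology) > threshold

# In the full system the climatology itself would be a multi-year mean per
# day of year over the gridded archive, e.g. (sketch, shapes assumed):
#   climatology = sst.reshape(n_years, 365, ny, nx).mean(axis=0)
```

The MapReduce contribution is about computing `climatology` over decades of daily NetCDF/HDF files in parallel; the per-grid-cell thresholding step itself stays this simple.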

  17. Anomaly Detection Industry Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Mar 4, 2025
    Data Insights Market (2025). Anomaly Detection Industry Report [Dataset]. https://www.datainsightsmarket.com/reports/anomaly-detection-industry-14721
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Mar 4, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The anomaly detection market is experiencing robust growth, fueled by the increasing volume and complexity of data generated across various industries. A compound annual growth rate (CAGR) of 16.22% from 2019 to 2024 suggests a significant market expansion, driven by the imperative for businesses to enhance cybersecurity, improve operational efficiency, and gain valuable insights from their data. Key drivers include the rising adoption of cloud computing, the proliferation of IoT devices generating massive datasets, and the growing need for real-time fraud detection and prevention, particularly within the BFSI (Banking, Financial Services, and Insurance) sector. The market is segmented by solution type (software, services), end-user industry (BFSI, manufacturing, healthcare, IT and telecommunications, others), and deployment (on-premise, cloud). The cloud deployment segment is anticipated to witness faster growth due to its scalability, cost-effectiveness, and ease of implementation.

    The increasing sophistication of cyberattacks and the need for proactive security measures are further bolstering demand for advanced anomaly detection solutions. While data privacy concerns and the complexity of integrating these solutions into existing IT infrastructure represent potential restraints, the overall market trajectory indicates a sustained period of expansion. Companies like SAS Institute, IBM, and Microsoft are actively shaping this market with their comprehensive offerings. The significant growth trajectory is expected to continue through 2033. Substantial investments in research and development by major players, together with growing adoption across diverse sectors, including healthcare for predictive maintenance and anomaly detection in medical imaging, will continue to fuel the expansion.

    The competitive landscape is characterized by both established players offering comprehensive solutions and emerging niche players focusing on specific industry needs. This competitive dynamism fosters innovation and drives the development of more efficient and sophisticated anomaly detection technologies. While regional variations exist, North America and Europe currently hold a significant market share, with Asia-Pacific poised for rapid expansion due to increasing digitalization and investment in advanced technologies. This report provides a detailed analysis of the global anomaly detection market, projecting robust growth from $XXX million in 2025 to $YYY million by 2033. The study covers the historical period (2019-2024), base year (2025), and forecast period (2025-2033), offering invaluable insights for businesses navigating this rapidly evolving landscape. Keywords: anomaly detection, machine learning, AI, cybersecurity, fraud detection, predictive analytics, data mining, big data analytics, real-time analytics.

    Recent developments include: June 2023: Wipro launched a new suite of banking financial services built on Microsoft Cloud; the partnership combines Microsoft Cloud capabilities with Wipro FullStride Cloud, leveraging Wipro's and Capco's deep domain expertise in financial services to develop new solutions that help financial services clients accelerate growth and deepen client relationships. June 2023: Cisco announced it is delivering on its promise of the AI-driven Cisco Security Cloud to simplify cybersecurity and empower people to do their best work from anywhere, regardless of the increasingly sophisticated threat landscape; Cisco invests in cutting-edge artificial intelligence and machine learning innovations that will empower security teams by simplifying operations and increasing efficacy. Key drivers for this market are: the increasing number of cyber crimes and the increasing adoption of anomaly detection solutions in software testing. Potential restraints include: open-source alternatives pose a threat. Notable trends are: BFSI is expected to hold a significant part of the market share.

  18. DataSheet_1_Anomaly detection in feature space for detecting changes in phytoplankton populations

    • frontiersin.figshare.com
    pdf
    Updated Jan 5, 2024
    Massimiliano Ciranni; Francesca Odone; Vito Paolo Pastore (2024). DataSheet_1_Anomaly detection in feature space for detecting changes in phytoplankton populations.pdf [Dataset]. http://doi.org/10.3389/fmars.2023.1283265.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Frontiers
    Authors
    Massimiliano Ciranni; Francesca Odone; Vito Paolo Pastore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Plankton organisms are fundamental components of the earth's ecosystem. Zooplankton feeds on phytoplankton and is predated by fish and other aquatic animals, placing it at the core of the aquatic food chain. Phytoplankton, in turn, plays a crucial role in climate regulation: it has produced almost 50% of the total oxygen in the atmosphere and is responsible for fixing around a quarter of the earth's carbon dioxide. Importantly, plankton can be regarded as a good indicator of environmental perturbations, as it reacts to even slight environmental changes with corresponding modifications in morphology and behavior. At a population level, the biodiversity and the concentration of individuals of specific species may shift dramatically due to environmental changes.

    Thus, in this paper, we propose an anomaly detection-based framework to recognize heavy morphological changes in phytoplankton at a population level, starting from images acquired in situ. Given that an initial annotated dataset is available, we propose to build a parallel architecture, training one anomaly detection algorithm for each available class on top of deep features extracted by a pre-trained Vision Transformer and further reduced in dimensionality with PCA. We then define global anomalies as samples rejected by all the trained detectors, and propose to empirically identify a threshold on the global anomaly count over time as an indicator that field experts and institutions can use to investigate potential environmental perturbations.

    We use two publicly available datasets (WHOI22 and WHOI40) of grayscale microscopic images of phytoplankton collected with the Imaging FlowCytobot acquisition system to test the proposed approach, obtaining high performance in detecting both in-class and out-of-class samples. Finally, we build a dataset of 15 classes acquired by the WHOI across four years, showing that the proposed approach's ability to identify anomalies is preserved when tested on images of the same classes acquired across a timespan of years.
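The per-class detection scheme described above can be sketched in miniature: random vectors stand in for the Vision Transformer deep features, PCA is computed via SVD, and a simple distance-to-centroid rule stands in for each per-class anomaly detector (the paper's actual detector choice and parameters are not reproduced here). A sample rejected by every per-class detector counts as a global anomaly.

```python
import numpy as np

# Toy stand-in for the paper's pipeline: per-class anomaly detectors on
# PCA-reduced "deep features". The detector (centroid + radius) and all
# data below are illustrative, not the paper's implementation.

rng = np.random.default_rng(42)

def fit_pca(X, k):
    """Mean and top-k principal directions via SVD of the centered data."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k].T

def project(X, mu, W):
    return (X - mu) @ W

# "Features" for 3 phytoplankton classes, 256-dim, one cluster per class.
classes = [rng.normal(loc=c, scale=0.5, size=(200, 256)) for c in (0.0, 3.0, 6.0)]
mu, W = fit_pca(np.vstack(classes), k=16)

# One detector per class: centroid plus a radius covering its training samples.
detectors = []
for Xc in classes:
    Z = project(Xc, mu, W)
    center = Z.mean(axis=0)
    radius = np.linalg.norm(Z - center, axis=1).max() * 1.1
    detectors.append((center, radius))

def is_global_anomaly(x):
    """Sample rejected by every per-class detector -> global anomaly."""
    z = project(x[None, :], mu, W)[0]
    return all(np.linalg.norm(z - c) > r for c, r in detectors)

in_class = classes[1][0]                                   # a known class-1 sample
out_of_class = rng.normal(loc=20.0, scale=0.5, size=256)   # far from all classes
print(is_global_anomaly(in_class), is_global_anomaly(out_of_class))
```

Counting `is_global_anomaly` hits over time then gives the global anomaly count the paper thresholds as an environmental-perturbation indicator.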

  19. Dataset of "Anomaly Detection in Industrial Networks: Current State, Classification, and Key Challenges"

    • data.niaid.nih.gov
    Updated Jan 27, 2025
    Carda, Michal (2025). Dataset of "Anomaly Detection in Industrial Networks: Current State, Classification, and Key Challenges" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13332269
    Explore at:
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Carda, Michal
    Bouzek, Karel
    Mlýnek, Petr
    Fujdiak, Radek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Industrial networks are adapted to their specific requirements, especially in terms of industrial processes. To ensure sufficient security in these networks, it is necessary to set and use security policies that complement government regulations, recommendations, and relevant security standards. This paper aims to provide an in-depth analysis of the anomalies occurring within these networks and to propose a structure for collecting valuable data from the experimental site, based on dividing anomalies into three main categories: security, operational, and service anomalies (plus regular traffic recognition). We present a proof-of-concept solution/design for aggregating data in industrial networks for advanced anomaly classification. Multiple data sources are used, such as industrial communication, sensor data (additional sensors controlling device behavior), and HW status data. A total of three scenarios (using a physical testbed) were implemented, in which we achieved an accuracy of 0.8540/0.9972 in advanced anomaly classification.
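A minimal sketch of the aggregation-and-classification idea: one record combines the three data sources (industrial communication, additional sensors, HW status), and a toy rule-based classifier maps it to the three anomaly categories plus regular traffic. Field names and rules are hypothetical, not the paper's actual feature set or model.

```python
from dataclasses import dataclass

# Hypothetical aggregated record combining the three data sources named in
# the description; thresholds and fields are illustrative only.

@dataclass
class AggregatedRecord:
    packet_rate: float       # industrial communication statistics
    sensor_deviation: float  # disagreement between process state and commands
    cpu_load: float          # HW status of the device

def classify(rec: AggregatedRecord) -> str:
    if rec.packet_rate > 1000:       # e.g. flooding / scanning traffic
        return "security"
    if rec.sensor_deviation > 0.2:   # process drifts from commanded state
        return "operational"
    if rec.cpu_load > 0.9:           # device degradation
        return "service"
    return "regular"

print(classify(AggregatedRecord(packet_rate=50, sensor_deviation=0.01, cpu_load=0.3)))
print(classify(AggregatedRecord(packet_rate=5000, sensor_deviation=0.0, cpu_load=0.2)))
```

In the paper this classification is learned from testbed data rather than hand-written rules; the sketch only shows how multi-source records map onto the category scheme.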

  20. Brain3-Anomaly-Classification

    • huggingface.co
    Updated Jul 4, 2025
    Prithiv Sakthi (2025). Brain3-Anomaly-Classification [Dataset]. https://huggingface.co/datasets/prithivMLmods/Brain3-Anomaly-Classification
    Explore at:
    Dataset updated
    Jul 4, 2025
    Authors
    Prithiv Sakthi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Brain3-Anomaly-Classification

    The Brain3-Anomaly-Classification dataset is a curated collection of brain MRI scans categorized into three types of brain anomalies. It is designed for use in machine learning applications related to medical imaging, especially in the detection and classification of brain tumors.

      Dataset Summary
    

    • Total Samples: 6,000
    • Image Size: 512 x 512 pixels (grayscale)
    • Number of Classes: 3
    • Data Split: only a train split is provided

    Each image in the… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/Brain3-Anomaly-Classification.

astro_pat (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/controlled-anomalies-time-series-dataset/discussion

Controlled Anomalies Time Series (CATS) Dataset

Awesome Dataset to benchmark Anomaly Detection in Multivariate Time Series

Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 14, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
astro_pat
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

  • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
    • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
    • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
    • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
  • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
    • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
    • 4 million observations that include **both nominal and anomalous segments**. This is suitable to evaluate both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
  • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Contamination level of 0.038. This means about 3.8% of the observations (rows) are anomalous.
  • Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
  • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
  • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithms in tracing anomalies back to the right root cause channel.
  • Affected channels. In addition to the root cause channel in which the anomaly first developed, we provide information on the channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).
  • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect for human eyes (i.e., there are very large spikes or oscillations), hence also detectable for most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
  • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
  • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type and amplitude of noise on top of the provided series. This makes it well suited to testing how sensitive and robust detection algorithms are against various levels of noise.
  • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
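The evaluation protocol these properties enable (fit on the nominal prefix, detect on the remainder, optionally inject noise) can be sketched on synthetic stand-in data; the generator, sizes, and thresholds below are illustrative and much smaller than the real 5-million-point dataset:

```python
import numpy as np

# Semi-supervised protocol in miniature: learn "normal" statistics on the
# nominal prefix, then flag extreme z-scores in the remainder. Sizes are
# scaled down from CATS (1M nominal + 4M mixed); data are synthetic.
# Note the stated contamination: 0.038 * 5_000_000 = 190_000 anomalous rows.

rng = np.random.default_rng(7)
n_train, n_test = 10_000, 40_000
series = rng.normal(0.0, 1.0, n_train + n_test)
series[n_train + 500 : n_train + 520] += 12.0  # one obvious anomalous segment

train, test = series[:n_train], series[n_train:]
mu, sigma = train.mean(), train.std()

def flag(x, mu, sigma, z=6.0):
    """Points further than z standard deviations from the nominal mean."""
    return np.abs(x - mu) > z * sigma

# Robustness-to-noise check enabled by the noise-free signals: add noise of
# a chosen amplitude on top of the clean series and re-run the detector.
noisy_test = test + rng.normal(0.0, 0.5, test.size)

clean_hits = int(flag(test, mu, sigma).sum())
noisy_hits = int(flag(noisy_test, mu, sigma).sum())
print(clean_hits, noisy_hits)
```

The same structure extends to dropping values before detection to study the impact of missing data against the clean baseline, as the last bullet suggests.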

[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

About Solenix

The dataset provider, Solenix, is an international company providing software e...
