Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Despite overwhelming successes in recent years, progress in the field of biomedical image computing still largely depends on the availability of annotated training examples. This annotation process is often prohibitively expensive because it requires the valuable time of domain experts. Additionally, this approach simply does not scale well: whenever a new imaging modality is introduced, acquisition parameters change, or even something as basic as the target demographic shifts, new annotated cases have to be created to allow methods to cope with the resulting images. Image labeling is thus bound to become the major bottleneck in the coming years. Furthermore, it has been shown that many algorithms used in image analysis are vulnerable to out-of-distribution samples, resulting in wrong and overconfident decisions [20, 21, 22, 23]. In addition, physicians can overlook unexpected conditions in medical images, often termed "inattentional blindness". In [1], Drew et al. noted that 50% of trained radiologists did not notice a gorilla image rendered into a lung CT scan when assessing lung nodules. One approach that does not require labeled images and can generalize to unseen pathological conditions is out-of-distribution detection, or anomaly detection (the terms are used interchangeably in this context). Anomaly detection can recognize and outline conditions that have not been previously encountered during training; it thus circumvents the time-consuming labeling process and can quickly be adapted to new modalities. Additionally, by highlighting such abnormal regions, anomaly detection can guide the physicians' attention to otherwise overlooked abnormalities in a scan and potentially reduce the time required to inspect medical images.
However, while there is a lot of recent research on improving anomaly detection [8, 9, 10, 11, 12, 13, 14, 15, 16, 17], especially with a focus on the medical field [4, 5, 6, 7], a common dataset/benchmark to compare different approaches is missing. Thus, it is currently hard to compare different proposed approaches fairly. While common datasets for natural data, such as defect detection [3] or abnormal traffic scene detection [2], were proposed in the last few months, we tried to tackle this issue for medical imaging with last year's challenge [25]. In a similar setting to previous years, we propose the medical out-of-distribution challenge as a standardized dataset and benchmark for anomaly detection. We propose two different tasks. The first is a sample-wise (i.e., patient-wise) analysis: detecting out-of-distribution samples, for example those with a pathological condition or any other condition not seen in the training set. Such samples can pose a problem to classically supervised algorithms, and detecting them could further allow physicians to prioritize different patients. The second is a voxel-wise analysis, i.e., giving a score for each voxel, highlighting abnormal conditions and potentially guiding the physician. However, there are a few aspects to consider when choosing an anomaly detection dataset. First, as in reality, the types of anomalies should not be known beforehand. This is a particular problem when a chosen dataset tests only a single pathological condition, which is vulnerable to exploitation: even with an educated guess (based on the dataset) and a fully supervised segmentation approach trained on a disallowed separate dataset, one could outperform other rightfully trained anomaly detection approaches. Furthermore, making the exact types of anomalies known can bias the evaluation.
Studies have shown that proposed anomaly detection algorithms tend to overfit on a given task when the properties of the test set and the kinds of anomalies are known beforehand. This further hinders the comparability of different algorithms [6, 18, 19, 23]. Second, combining test sets from different sources with alternative conditions may also cause problems: by definition, the different sources already introduce a distribution shift relative to the training dataset, complicating a clean and meaningful evaluation. To solve these issues, we provide two datasets with more than 600 scans each, one brain MRI dataset and one abdominal CT dataset, to allow a comparison of the generalizability of the approaches. To prevent overfitting on the (types of) anomalies existing in our test set, the test set will be kept confidential at all times. The training set consists of hand-selected scans in which no anomalies were identified. The remaining scans are assigned to the test set. Thus some scans in the test set contain no anomalies, while others contain naturally occurring anomalies. In addition to the natural anomalies, we will add synthetic anomalies. We choose different structured types of synthetic anomalies (e.g. a tumor or an image of a gorilla rendered into a brain scan [1]) to cover a broad var...
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors, and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only because of the massive volume of data but also because these datasets are physically stored at different geographical locations, with only a subset of features available at any one location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS).
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset has been collected by Edge Impulse to explain the FOMO-AD (visual anomaly detection) model architecture.
The dataset is composed of 195 images:
- Training set: 121 images without anomalies
- Testing set: 49 images containing anomalies and 25 images without anomalies
To import this data into a new Edge Impulse project, either use:
edge-impulse-uploader --clean --info-file info.labels
Have a look at the Edge Impulse public project to see the results
The info.labels file (located in each subdirectory or at the folder root) provides detailed information about the labels. The file follows a JSON format, with the following structure:
- version: Indicates the version of the label format.
- files: A list of objects, where each object represents a supported file format and its associated labels.
  - path: The path or file name.
  - category: Indicates whether the image belongs to the training or testing set.
  - label (optional): Provides information about the labeled objects.
    - type: Specifies the type of label - unlabeled, label, or multi-label.
    - label (optional): The actual label or class name of the sample.
    - labels (optional): The labels in the multi-label format:
      - label: Label for the given period.
      - startIndex: Timestamp in milliseconds.
      - endIndex: Timestamp in milliseconds.
  - metadata (optional): Additional metadata associated with the image, such as the site where it was collected, the timestamp, or any other useful information.
  - boundingBoxes (optional): A list of objects, where each object represents a bounding box for an object within the image.
    - label: The label or class name of the object within the bounding box.
    - x, y: The coordinates of the top-left corner of the bounding box.
    - width, height: The width and height of the bounding box.

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "additional training dataset" for the DCASE 2025 Challenge Task 2.
The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-sec or 12-sec audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:
Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.
This task is the follow-up from DCASE 2020 Task 2 to DCASE 2024 Task 2. The task this year is to develop an ASD system that meets the following five requirements.
1. Train a model using only normal sound (unsupervised learning scenario)
Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data, which is called UASD (unsupervised ASD). This is the same requirement as in the previous tasks.
2. Detect anomalies regardless of domain shifts (domain generalization task)
In real-world cases, the operational states of a machine or the environmental noise can change, causing domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement has been the same since DCASE 2022 Task 2.
3. Train a model for a completely new machine type
For a completely new machine type, the hyperparameters of the trained model cannot be tuned. Therefore, the system should be able to train models without additional hyperparameter tuning. This requirement has been the same since DCASE 2023 Task 2.
4. Train a model both with or without attribute information
While additional attribute information can help enhance the detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.
5. Train a model with additional clean machine data or noise-only data (optional)
Although the primary training data consists of machine sounds recorded under noisy conditions, in some situations it may be possible to collect clean machine data when the factory is idle or gather noise recordings when the machine itself is not running. Participants are free to incorporate these additional data sources to enhance the accuracy of their models.
The last, optional requirement is newly introduced in DCASE 2025 Task 2.
Definition
We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes".
Dataset
This dataset consists of eight machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on a separate Zenodo page as the "evaluation dataset" for the DCASE 2025 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips of supplementary sound data containing either clean normal machine sounds in the source domain or noise-only sounds. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute CSV files.
File names and attribute csv files
File names and attribute CSV files provide reference labels for each clip. The reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding conditions other than normal/anomaly. The machine type is given by the directory name, and the section index by the file name. For datasets other than the evaluation dataset, the normal/anomaly information and the attributes are also given by the file name. Note that for machine types whose attribute information is hidden, the attribute information in the file names is labeled only as "noAttributes". Attribute CSV files allow easy access to the attributes that cause domain shifts. In these files, the file names, the names of the parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
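As a sketch, rows in this format can be parsed into per-file dictionaries of domain-shift parameters with the standard csv module (the file name and parameter names below are hypothetical, not taken from the real dataset):

```python
import csv
import io

def parse_attribute_rows(csv_text):
    """Parse attribute CSV rows of the form
    filename, d1p, d1v, d2p, d2v, ...
    into a {filename: {parameter: value}} mapping.
    Blank parameter columns (hidden attributes) yield an empty dict."""
    attributes = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        filename = row[0]
        rest = [cell.strip() for cell in row[1:]]
        # Pair up alternating (parameter, value) columns, skipping blanks.
        attributes[filename] = {p: v for p, v in zip(rest[0::2], rest[1::2]) if p}
    return attributes

# Hypothetical rows: one with two domain-shift parameters, one with
# hidden attributes (all columns after the filename left blank).
print(parse_attribute_rows("clip_0001.wav,vel,6,noise,low"))
print(parse_attribute_rows("clip_0002.wav,,,,"))
```

The value column is kept as a string here, since the format allows int, float, or string values; callers can convert per parameter as needed.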
Recording procedure
Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed each target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset explaining the details of the recording procedure by the submission deadline.
Directory structure
- /eval_data
- /raw
- /AutoTrash
- /train (only normal clips)
- /section_00_source_train_normal_0001_
Baseline system
The baseline system is available on the GitHub repository https://github.com/nttcslab/dcase2023_task2_baseline_ae. The baseline systems provide a simple entry-level approach that gives reasonable performance on the Task 2 dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
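The idea behind an autoencoder baseline is to learn a compact model of normal sounds only and score a clip by its reconstruction error. A minimal sketch of that idea (not the actual baseline implementation; the feature vectors, bottleneck size, and the PCA-style linear "autoencoder" below are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for feature vectors (e.g., log-mel statistics) of normal
# training clips; in the real task these come from audio preprocessing.
train_feats = rng.normal(0.0, 1.0, size=(500, 64))

# A linear autoencoder via SVD: keep the top-k principal directions
# learned from normal data only (unsupervised scenario).
mean = train_feats.mean(axis=0)
_, _, vt = np.linalg.svd(train_feats - mean, full_matrices=False)
components = vt[:8]  # 8-dimensional bottleneck

def anomaly_score(x):
    """Mean squared reconstruction error of x under the normal-data model."""
    z = (x - mean) @ components.T          # encode
    recon = z @ components + mean          # decode
    return float(np.mean((x - recon) ** 2))

normal_clip = rng.normal(0.0, 1.0, size=64)
shifted_clip = rng.normal(3.0, 1.0, size=64)  # crude stand-in for an anomaly
print(anomaly_score(normal_clip), anomaly_score(shifted_clip))
```

Clips whose score exceeds a threshold calibrated on normal data are flagged as anomalous; unseen anomaly types raise the score simply because they reconstruct poorly.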
Condition of use
This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
Contact
If there is any problem, please contact us:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Introduction
The Street Scene dataset consists of 46 training video sequences and 35 testing video sequences taken from a static USB camera looking down on a scene of a two-lane street with bike lanes and pedestrian sidewalks. Videos were collected from the camera at various times during two consecutive summers. All of the videos were taken during the daytime. The dataset is challenging because of the variety of activities taking place such as cars driving, turning, stopping and parking; pedestrians walking, jogging and pushing strollers; and bikers riding in bike lanes. In addition, the videos contain changing shadows, and moving background such as a flag and trees blowing in the wind.
There are a total of 202,545 color video frames (56,135 for training and 146,410 for testing) each of size 1280 x 720 pixels. The frames were extracted from the original videos at 15 frames per second.
The 35 testing sequences have a total of 205 anomalous events consisting of 17 different anomaly types. A complete list of anomaly types and the number of each in the test set can be found in our paper.
Ground truth annotations are provided for each testing video in the form of bounding boxes around each anomalous event in each frame. Each bounding box is also labeled with a track number, meaning each anomalous event is labeled as a track of bounding boxes. Track lengths vary from tens of frames to 5,200 frames, the length of the longest testing sequence. A single frame can have more than one anomaly labeled.
NOTE: This version of the dataset differs slightly from the original made available in 2020. Anomalies were found in a few of the normal training sequences, and the affected training frames were deleted from the dataset. Specifically, the following frames were removed:
Train026: frames 1-184 (car taking a u-turn)
Train027: frames 1-229 (jay walkers)
Train031: frames 1-299 (jay walkers, illegally parked car)
At a Glance
Other Resources
None
Citation
If you use the Street Scene dataset in your research, please cite our contribution:
@inproceedings{ramachandra2020street,
title={Street Scene: A new dataset and evaluation protocol for video anomaly detection},
author={Ramachandra, Bharathkumar and Jones, Michael},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={2569--2578},
year={2020}
}
License
The Street Scene dataset is released under CC-BY-SA-4.0 license.
All data:
Created by Mitsubishi Electric Research Laboratories (MERL), 2023
SPDX-License-Identifier: CC-BY-SA-4.0
A fleet is a group of systems (e.g., cars, aircraft) that are designed and manufactured the same way and are intended to be used the same way. For example, a fleet of delivery trucks may consist of one hundred instances of a particular model of truck, each of which is intended for the same type of service: almost the same amount of time and distance driven every day, approximately the same total weight carried, etc. For this reason, one may imagine that data mining for fleet monitoring merely involves collecting operating data from the multiple systems in the fleet and developing some sort of model, such as a model of normal operation that can be used for anomaly detection. However, one then may realize that each member of the fleet will be unique in some ways: there will be minor variations in manufacturing, quality of parts, and usage. For this reason, the typical machine learning and statistics assumption that all the data are independent and identically distributed is not correct. Data from each system in the fleet must be treated as unique so that one can notice significant changes in the operation of that system.
An address anomaly is an address whose location is illogical. In other words, the address does not follow the normal rules of the District of Columbia’s (DC) addressing grid system. There are different types of anomalies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is rich in network anomaly types, containing all types of anomalies and thus ensuring a comprehensive evaluation of anomaly detection methods.
Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series at the institutional and IP-subnet levels to cover all possible anomaly detection or forecasting scopes. Overall, the time series dataset was created from 66 billion IP flows containing 4 trillion packets that carry approximately 3.7 petabytes of data. The CESNET-TimeSeries24 dataset is a complex real-world dataset that brings insights into the evaluation of forecasting models in real-world environments.
Please cite the usage of our dataset as:
Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x
@Article{cesnettimeseries24,
author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
journal={Scientific Data},
year={2025},
month={Feb},
day={26},
volume={12},
number={1},
pages={338},
issn={2052-4463},
doi={10.1038/s41597-025-04603-x},
url={https://doi.org/10.1038/s41597-025-04603-x}
}
We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.
Datapoints created by the aggregation of IP flows contain the following time-series metrics:
Multiple time aggregation: The original datapoints in the dataset are aggregated by 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets: 1 hour and 1 day.
Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series aggregated per each institution ID provide a view of the institution's data.
Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per subnet, provide a view of each subnet's data.
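As a sketch, re-aggregating the 10-minute datapoints into the coarser 1-hour windows can be done with pandas resampling over additive metrics (the column names and values below are illustrative, not the dataset's actual fields):

```python
import numpy as np
import pandas as pd

# Illustrative 10-minute datapoints for one IP address: 18 windows = 3 hours.
idx = pd.date_range("2024-01-01", periods=18, freq="10min")
ts = pd.DataFrame({
    "n_flows": np.arange(18),        # flow count per 10-minute window
    "n_packets": np.arange(18) * 10, # packet count per 10-minute window
}, index=idx)

# Re-aggregate additive metrics to 1-hour windows by summation.
hourly = ts.resample("1h").sum()
print(hourly)
```

Note that, as stated below, metrics based on distinct counts (such as n_dest_ip) cannot simply be summed across windows, which is why the re-aggregated series replace them with different metrics.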
The file hierarchy is described below:
cesnet-timeseries24/
|- institution_subnets/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- institutions/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_full/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_sample/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- times/
| |- times_10_minutes.csv
| |- times_1_hour.csv
| |- times_1_day.csv
|- ids_relationship.csv
|- weekends_and_holidays.csv
The following list describes time series data fields in CSV files:
Moreover, the time series created by re-aggregation contain the following time-series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The authors of MVTec AD (the MVTec Anomaly Detection dataset) addressed the critical task of detecting anomalous structures within natural image data, a crucial aspect of computer vision applications. To facilitate the development of methods for unsupervised anomaly detection, they introduced the MVTec AD dataset, comprising 5354 high-resolution color images covering various object and texture categories. The dataset comprises both normal images, intended for training, and images with anomalies, designed for testing. The anomalies manifest in over 70 distinct types of defects, including scratches, dents, contaminations, and structural alterations. The authors also provided pixel-precise ground truth annotations for all anomalies.
The recorded events range from minor operational problems to permanent spacecraft failures. Australia, Canada, Germany, India, Japan, the United Kingdom, and the United States have contributed data. This database of known satellite anomalies is used to study and identify trends in the anomalous behavior of different families of satellites. The trends include seasonal groupings, diurnal groupings, and anomaly types indicative of certain satellite types and manufacturers. Correlations are done with several solar-terrestrial data sets. Specifically, geomagnetic activity has been found to have significant effects on satellite behavior. Solar activity and cosmic rays have also proven to be important in the anomalous behavior of satellites. Information provided by this program can be used in the design phase of spacecraft to prevent the propagation of problems from one spacecraft to the next. This information can also be used by operations personnel to anticipate periods of anomalous behavior based on the proven response of an existing craft to environmental conditions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models; one trained on taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on kmer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away. 
This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality would make AMAnD well-suited to a wide array of applied metagenomics biosurveillance use-cases, from environmental to clinical utility.
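The one-class scoring idea behind Deep SVDD can be sketched as follows: embed a sample in a feature space and score it by its squared distance to a center learned from typical samples only. The sketch below uses random stand-in feature vectors and a simple mean center, not AMAnD's actual trained models or feature pipelines:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in feature vectors (e.g., k-mer frequency profiles) for typical
# control-class samples; AMAnD's real features come from PanGIA taxonomy
# output or k-mer counts.
control = rng.normal(0.0, 1.0, size=(200, 32))

# An SVDD-style model fixes a center c in feature space; here we use the
# mean of the control embeddings as that center.
center = control.mean(axis=0)

def svdd_score(x):
    """Squared distance to the center: higher means more anomalous."""
    return float(np.sum((x - center) ** 2))

typical = rng.normal(0.0, 1.0, size=32)
anomalous = rng.normal(2.0, 1.0, size=32)  # crude stand-in for a contaminant
print(svdd_score(typical), svdd_score(anomalous))
```

Because the score makes no assumption about what anomalies look like, any sample far from the learned center is flagged, including previously unseen anomaly types.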
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0.html
The largest real-world dataset for multivariate time series anomaly detection (MTSAD), drawn from the AIOps system of a Real-Time Data Warehouse (RTDW) at a top cloud computing company. All the metrics and labels in this dataset are derived from real-world scenarios. All metrics were obtained from the RTDW instance monitoring system and cover a rich variety of metric types, including CPU usage, queries per second (QPS), and latency, which relate to many important modules within the RTDW. We obtain labels from the ticket system, which integrates three main sources of instance anomalies: user service requests, instance unavailability, and fault simulations. User service requests refer to tickets that are submitted directly by users, whereas instance unavailability is typically detected through existing monitoring tools or discovered by Site Reliability Engineers (SREs). Since the system is usually very stable, we augment the anomaly samples by conducting fault simulations. A fault simulation is a special type of anomaly, planned beforehand, that is introduced to the system to test its performance under extreme conditions. All records in the ticket system are subject to follow-up processing by engineers, who meticulously mark the start and end times of each ticket. This rigorous approach ensures the accuracy of the labels in our dataset.
Spacecraft Anomaly Data (edited version of the text file provided in the data section)
This database of spacecraft anomalies was developed by the Solar-Terrestrial Physics Division of the National Geophysical Data Center (NGDC) in 1984. It included the date, time, location, and other pertinent information about incidents of spacecraft operational irregularity due to the natural environment. The ability to attribute anomalies to the natural environment helps rule out hardware or software engineering causes or hostile activity.
Anomaly events range from minor operational problems which can be easily corrected to permanent spacecraft failures. The database includes spacecraft anomalies in interplanetary space and in near-earth orbit; the majority of the data comes from geostationary spacecraft.
Many spacecraft are identified by aliases to preserve confidentiality, since industry has proprietary concerns regarding design (and insurance issues); aliased names are prefaced with the "@" character.
Due to proprietary and operational security concerns, anomaly contributions slowed to a trickle and ceased in the early 1990s. The original database files have been converted to Excel Spreadsheets.
The data in the "anom5j.xls" file are sorted by satellite. Essential information includes the satellite name (BIRD), the anomaly date (ADATE), the satellite time in UTC (STIMEU), which provides space-environment context, and the local satellite time (STIMEL), which indicates the satellite's position relative to the Sun and Earth, plus the orbit, anomaly type (ATYPE), and anomaly diagnosis (ADIAG). Other recorded factors may also play into the anomalies, though they are less important for an initial analysis.
These data are courtesy of NGDC (now National Centers for Environmental Information, NCEI). https://www.ngdc.noaa.gov/stp/satellite/anomaly/satelliteanomaly.html
My goal is to search for correlations between the types of anomalies associated with particular space environment phenomena and spacecraft orbit. When is the environment conducive to certain types of anomalies in particular orbits?
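Given the column names above (BIRD, ATYPE, and so on), one simple way to start that correlation search is a cross-tabulation of anomaly type against orbit class. The frame below uses made-up rows, not the actual anom5j.xls contents:

```python
import pandas as pd

# Hypothetical rows mimicking the anom5j.xls columns described above
# (BIRD, ATYPE, plus an orbit field); all values are illustrative only.
df = pd.DataFrame({
    "BIRD":  ["@SAT1", "@SAT1", "GOES-5", "GOES-5", "@SAT2"],
    "ORBIT": ["GEO", "GEO", "GEO", "GEO", "LEO"],
    "ATYPE": ["ESD", "SEU", "ESD", "ESD", "SEU"],
})

# Count anomaly types per orbit class to look for environment/orbit patterns.
table = pd.crosstab(df["ORBIT"], df["ATYPE"])
print(table)
```

In practice the real spreadsheet would be read with `pd.read_excel("anom5j.xls")` and the same cross-tab applied, optionally joined against space-weather indices on ADATE.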
No known license restrictions. From the copyright notice: As required by 17 U.S.C. 403, third parties producing copyrighted works consisting predominantly of the material produced by U.S. government agencies must provide notice with such work(s) identifying the U.S. Government material incorporated and stating that such material is not subject to copyright protection within the United States. The information on government web pages is in the public domain and not subject to copyright protection within the United States unless specifically annotated otherwise (copyright may be held elsewhere). Foreign copyrights may apply.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Anomaly detection is a process of identifying items, events, or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets. A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:
1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider like Amazon Web Services).
3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.
We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend.
The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database. OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download the relevant variables, by simply clicking on a link, to begin further analyzing the event. The OceanXtremes web portal will allow users to define their own anomaly or feature types, for which continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data-mining algorithm (i.e., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts, and other relevant metadata to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
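The "differences from climatology" mining algorithm mentioned above can be sketched in a few lines. This toy example uses a small synthetic monthly series rather than real NetCDF archives, and the 3-sigma threshold is an illustrative choice, not a value from the proposal:

```python
import numpy as np

# Toy sea-surface-temperature-like series: 10 years of monthly values
# (an illustrative stand-in for the multi-decade archives described above).
years, months = 10, 12
rng = np.random.default_rng(0)
seasonal = 20 + 5 * np.sin(2 * np.pi * np.arange(months) / 12)
series = np.tile(seasonal, years) + rng.normal(0, 0.5, years * months)
series[65] += 4.0  # inject one warm anomaly for the detector to find

# Climatology = per-calendar-month mean; anomaly = departure from it.
clim = series.reshape(years, months).mean(axis=0)
anom = series - np.tile(clim, years)

# Flag departures above a chosen threshold (here 3 standard deviations).
flags = np.abs(anom) > 3 * anom.std()
print(np.flatnonzero(flags))
```

A production version would run this per grid cell over NetCDF/HDF arrays inside the MapReduce engine, with the climatology cached in the parallel database.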
https://www.datainsightsmarket.com/privacy-policy
The anomaly detection market is experiencing robust growth, fueled by the increasing volume and complexity of data generated across various industries. A compound annual growth rate (CAGR) of 16.22% from 2019 to 2024 suggests a significant market expansion, driven by the imperative for businesses to enhance cybersecurity, improve operational efficiency, and gain valuable insights from their data. Key drivers include the rising adoption of cloud computing, the proliferation of IoT devices generating massive datasets, and the growing need for real-time fraud detection and prevention, particularly within the BFSI (Banking, Financial Services, and Insurance) sector. The market is segmented by solution type (software, services), end-user industry (BFSI, manufacturing, healthcare, IT and telecommunications, others), and deployment (on-premise, cloud). The cloud deployment segment is anticipated to witness faster growth due to its scalability, cost-effectiveness, and ease of implementation. The increasing sophistication of cyberattacks and the need for proactive security measures are further bolstering demand for advanced anomaly detection solutions. While data privacy concerns and the complexity of integrating these solutions into existing IT infrastructure represent potential restraints, the overall market trajectory indicates a sustained period of expansion. Companies like SAS Institute, IBM, and Microsoft are actively shaping this market with their comprehensive offerings. The significant growth trajectory is expected to continue through 2033. The substantial investments in research and development by major players and the growing adoption across diverse sectors, including healthcare for predictive maintenance and anomaly detection in medical imaging, will continue to fuel the expansion. The competitive landscape is characterized by both established players offering comprehensive solutions and emerging niche players focusing on specific industry needs. 
This competitive dynamism fosters innovation and drives the development of more efficient and sophisticated anomaly detection technologies. While regional variations exist, North America and Europe currently hold a significant market share, with Asia-Pacific poised for rapid expansion due to increasing digitalization and investment in advanced technologies. This report provides a detailed analysis of the global anomaly detection market, projecting robust growth from $XXX million in 2025 to $YYY million by 2033. The study covers the historical period (2019-2024), base year (2025), and forecast period (2025-2033), offering invaluable insights for businesses navigating this rapidly evolving landscape. Keywords: anomaly detection, machine learning, AI, cybersecurity, fraud detection, predictive analytics, data mining, big data analytics, real-time analytics. Recent developments include: June 2023: Wipro launched a new suite of banking financial services built on Microsoft Cloud; the partnership combines Microsoft Cloud capabilities with Wipro FullStride Cloud and leverages Wipro's and Capco's deep domain expertise in financial services to develop new solutions that help financial services clients accelerate growth and deepen client relationships. June 2023: Cisco announced it is delivering on its promise of the AI-driven Cisco Security Cloud to simplify cybersecurity and empower people to do their best work from anywhere, regardless of the increasingly sophisticated threat landscape; Cisco invests in cutting-edge artificial intelligence and machine learning innovations that will empower security teams by simplifying operations and increasing efficacy. Key drivers for this market are: the increasing number of cyber crimes and the increasing adoption of anomaly detection solutions in software testing. Potential restraints include: open source alternatives pose a threat. Notable trends are: BFSI is expected to hold a significant part of the market share.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Plankton organisms are fundamental components of the Earth's ecosystem. Zooplankton feeds on phytoplankton and is predated by fish and other aquatic animals, placing it at the core of the aquatic food chain. Phytoplankton, on the other hand, plays a crucial role in climate regulation: it has produced almost 50% of the total oxygen in the atmosphere and is responsible for fixing around a quarter of the Earth's total carbon dioxide. Importantly, plankton can be regarded as a good indicator of environmental perturbations, as it can react to even slight environmental changes with corresponding modifications in morphology and behavior. At a population level, the biodiversity and the concentration of individuals of specific species may shift dramatically due to environmental changes. Thus, in this paper, we propose an anomaly detection-based framework to recognize heavy morphological changes in phytoplankton at a population level, starting from images acquired in situ. Given that an initial annotated dataset is available, we propose to build a parallel architecture, training one anomaly detection algorithm for each available class on top of deep features extracted by a pre-trained Vision Transformer and further reduced in dimensionality with PCA. We then define global anomalies as samples rejected by all the trained detectors, and propose to empirically identify a threshold on the global anomaly count over time as an indicator that field experts and institutions can use to investigate potential environmental perturbations. We use two publicly available datasets (WHOI22 and WHOI40) of grayscale microscopic images of phytoplankton collected with the Imaging FlowCytobot acquisition system to test the proposed approach, obtaining high performance in detecting both in-class and out-of-class samples.
Finally, we build a dataset of 15 classes acquired by the WHOI across four years, showing that the proposed approach’s ability to identify anomalies is preserved when tested on images of the same classes acquired across a timespan of years.
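The parallel per-class architecture described above (one anomaly detector per class, with a global anomaly defined as a sample rejected by every detector) can be illustrated with a deliberately simplified stand-in: per-class centroid-distance detectors on toy 2-D features, rather than the paper's PCA-reduced Vision Transformer embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for deep features of two plankton classes; the paper uses
# ViT features reduced with PCA, here replaced by toy 2-D Gaussians.
train = {
    "class_a": rng.normal([0.0, 0.0], 0.3, (50, 2)),
    "class_b": rng.normal([5.0, 5.0], 0.3, (50, 2)),
}

# One detector per class: accept a sample if it lies within a per-class
# distance threshold of that class's training centroid.
detectors = {}
for name, X in train.items():
    mu = X.mean(axis=0)
    thresh = np.linalg.norm(X - mu, axis=1).max() * 1.1  # small margin
    detectors[name] = (mu, thresh)

def is_global_anomaly(x):
    """A sample rejected by *all* per-class detectors is a global anomaly."""
    return all(np.linalg.norm(x - mu) > t for mu, t in detectors.values())

print(is_global_anomaly(np.array([10.0, -10.0])))  # far from every class
print(is_global_anomaly(np.array([0.1, 0.0])))     # near class_a
```

Counting `is_global_anomaly` hits over a time window then gives the global-anomaly-count indicator the paper thresholds empirically.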
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Industrial networks are adapted to their specific requirements, especially in terms of industrial processes. To ensure sufficient security in these networks, it is necessary to set and use security policies that complement government regulations, recommendations, and relevant security standards. This paper provides an in-depth analysis of the anomalies occurring within such networks and proposes a structure for collecting valuable data from the experimental site, based on dividing anomalies into three main categories: security, operational, and service anomalies (alongside regular traffic recognition). We present a proof-of-concept solution/design for aggregating data in industrial networks for advanced anomaly classification. Multiple data sources are used, such as industrial communication, sensor data (additional sensors controlling device behavior), and HW status data. A total of three scenarios (using a physical testbed) were implemented, in which we achieved an accuracy of 0.8540/0.9972 in advanced anomaly classification.
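As a rough illustration of the aggregation-plus-classification idea (not the paper's actual classifier), the sketch below concatenates hypothetical communication, sensor, and HW-status features into one sample and assigns one of the four categories by nearest centroid; every number is invented:

```python
import numpy as np

# Hypothetical centroids of aggregated feature vectors
# [packets/s, device temperature, HW fault score] for the four
# categories distinguished above; all values are illustrative only.
centroids = {
    "regular":     np.array([10.0, 70.0, 0.2]),
    "security":    np.array([500.0, 70.0, 0.2]),  # e.g. a traffic flood
    "operational": np.array([10.0, 95.0, 0.2]),   # e.g. an overheating device
    "service":     np.array([10.0, 70.0, 0.9]),   # e.g. degraded HW status
}

def classify(sample):
    """Toy nearest-centroid stand-in for the advanced anomaly classifier."""
    return min(centroids, key=lambda c: np.linalg.norm(sample - centroids[c]))

# An aggregated sample: communication stats + sensor reading + HW status.
print(classify(np.array([480.0, 71.0, 0.25])))
```

The real system would learn such decision boundaries from the labeled testbed scenarios rather than fixing centroids by hand.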
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Brain3-Anomaly-Classification
The Brain3-Anomaly-Classification dataset is a curated collection of brain MRI scans categorized into three types of brain anomalies. It is designed for use in machine learning applications related to medical imaging, especially in the detection and classification of brain tumors.
Dataset Summary
Total Samples: 6,000
Image Size: 512 x 512 pixels (grayscale)
Number of Classes: 3
Data Split: only the train split is provided
Each image in the… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/Brain3-Anomaly-Classification.
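A typical preprocessing step for 512 x 512 grayscale scans like these (an assumption about downstream model use, not something specified by the dataset) is intensity scaling plus a channel axis:

```python
import numpy as np

def preprocess(img_u8: np.ndarray) -> np.ndarray:
    """Scale an 8-bit 512x512 grayscale scan to [0, 1] float32 and add a
    leading channel axis, the layout many channel-first models expect."""
    assert img_u8.shape == (512, 512)
    x = img_u8.astype(np.float32) / 255.0
    return x[None, :, :]  # shape (1, 512, 512)

# Hypothetical blank scan standing in for a real image from the dataset.
scan = np.zeros((512, 512), dtype=np.uint8)
out = preprocess(scan)
print(out.shape, out.dtype)
```

Real images would come from the dataset page above (e.g. via the Hugging Face `datasets` loader) and pass through the same transform before training a 3-class classifier.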