Several different unsupervised anomaly detection algorithms have been applied to Space Shuttle Main Engine (SSME) data as part of developing a comprehensive suite of Integrated Systems Health Management (ISHM) tools. Because the theoretical bases for these methods vary considerably, it is reasonable to conjecture that the anomalies they detect may differ significantly as well, so a common metric for comparing the results would be useful. However, for such a quantitative analysis to be statistically significant, a sufficient number of examples of both nominally categorized and anomalous data must be available. Because too few examples of anomalous data exist, any statistic that relies on a statistically significant sample of anomalous data is infeasible. Therefore, the main focus of this paper is to compare actual examples of anomalies detected by the algorithms via the sensors in which they appear, as well as the times at which they appear. We find that there is enough overlap in the anomalies detected by the different algorithms tested for them to corroborate the severity of these anomalies. In certain cases, the severity of these anomalies is supported by their categorization as failures by experts, with realistic physical explanations. For anomalies that cannot be corroborated by at least one other method, this overlap says less about the severity of the anomaly and more about the algorithms' technical nuances, which will also be discussed.
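The comparison described above can be made concrete with a small sketch (illustrative only, not the paper's code): represent each algorithm's output as a set of (sensor, time-window) detections, then measure pairwise overlap and collect the detections corroborated by more than one algorithm. The algorithm names and detections below are made up.

# Illustrative detections: (sensor, time-window index) pairs per algorithm.
detections = {
    "alg_A": {("sensor_3", 12), ("sensor_7", 40), ("sensor_7", 41)},
    "alg_B": {("sensor_3", 12), ("sensor_7", 41)},
    "alg_C": {("sensor_1", 5)},
}

def jaccard(a, b):
    # 1.0 means identical detection sets, 0.0 means no shared detections.
    return len(a & b) / len(a | b) if (a or b) else 1.0

pairwise_overlap = {
    (p, q): jaccard(detections[p], detections[q])
    for p in detections for q in detections if p < q
}

# Detections reported by at least two algorithms corroborate each other.
corroborated = {d for p in detections for q in detections if p != q
                for d in detections[p] & detections[q]}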
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
We present a large-scale anomaly detection dataset collected from IBM Cloud's Console over approximately 4.5 months. This high-dimensional dataset captures telemetry data from multiple data centers and is specifically designed to help researchers develop and benchmark anomaly detection methods in large-scale cloud environments. It contains 39,365 entries, each representing a 5-minute interval, with 117,448 features/attributes (interval_start is used as the index). The dataset includes detailed information on request counts, HTTP response codes, and various aggregated statistics. It also includes labeled anomaly events identified through IBM's internal monitoring tools, providing a comprehensive resource for real-world anomaly detection research and evaluation.
File Descriptions
- location_downtime.csv: Details planned and unplanned downtimes for IBM Cloud data centers, including start and end times in ISO 8601 format.
- unpivoted_data.parquet: Contains raw telemetry data with 413 million+ rows, covering details like location, HTTP status codes, request types, and aggregated statistics (min, max, median response times).
- anomaly_windows.csv: Ground truth for anomalies, listing start and end times of recorded anomalies, categorized by source (Issue Tracker, Instant Messenger, Test Log).
- pivoted_data_all.parquet: Pivoted version of the telemetry dataset with 39,365 rows and 117,449 columns, including aggregated statistics across multiple metrics and intervals.
- demo/demo.[ipynb|html]: Provides examples of how to access data in the Parquet files, available in Jupyter Notebook (.ipynb) and HTML (.html) formats.
Further details of the dataset can be found in Appendix B: Dataset Characteristics of the paper titled "Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset." Sample code for training anomaly detectors using this data is provided in this package.
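Independent of the demo notebook, a minimal sketch of loading the data with pandas (assuming pyarrow or fastparquet is installed; the anomaly_windows.csv column names "start", "end", and "source" are assumptions for illustration, not confirmed field names):

import pandas as pd

# Pivoted telemetry: 39,365 five-minute intervals, with interval_start as the index.
pivoted = pd.read_parquet("pivoted_data_all.parquet")

# Labeled anomaly windows (column names assumed, see above).
windows = pd.read_csv("anomaly_windows.csv", parse_dates=["start", "end"])

# Mark every interval that falls inside any recorded anomaly window.
idx = pd.to_datetime(pivoted.index)
labels = pd.Series(False, index=pivoted.index)
for _, w in windows.iterrows():
    labels |= (idx >= w["start"]) & (idx <= w["end"])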
When using the dataset, please cite it as follows:
@misc{islam2024anomaly,
title={Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset},
author={Mohammad Saiful Islam and Mohamed Sami Rakha and William Pourmajidi and Janakan Sivaloganathan and John Steinbacher and Andriy Miranskyy},
year={2024},
eprint={2411.09047},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2411.09047}
}
Anomaly Detection Market Size 2025-2029
The anomaly detection market size is forecast to increase by USD 4.44 billion at a CAGR of 14.4% between 2024 and 2029.
The market is experiencing significant growth, particularly in the BFSI sector, as organizations increasingly prioritize identifying and addressing unusual patterns or deviations from normal business operations. The rising incidence of internal threats and cyber frauds necessitates the implementation of advanced anomaly detection tools to mitigate potential risks and maintain security. However, implementing these solutions comes with challenges, primarily infrastructural requirements. Ensuring compatibility with existing systems, integrating new technologies, and training staff to effectively utilize these tools pose significant hurdles for organizations.
Despite these challenges, the potential benefits of anomaly detection, such as improved risk management, enhanced operational efficiency, and increased security, make it an essential investment for businesses seeking to stay competitive and agile in today's complex and evolving threat landscape. Companies looking to capitalize on this market opportunity must carefully consider these challenges and develop strategies to address them effectively. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.
What will be the Size of the Anomaly Detection Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
In the dynamic and evolving market, advanced technologies such as resource allocation, linear regression, pattern recognition, and support vector machines are increasingly being adopted for automated decision making. Businesses are leveraging these techniques to enhance customer experience through behavioral analytics, object detection, and sentiment analysis. Machine learning algorithms, including random forests, naive Bayes, decision trees, clustering algorithms, and k-nearest neighbors, are essential tools for risk management and compliance monitoring. AI-powered analytics, time series forecasting, and predictive modeling are revolutionizing business intelligence, while process optimization is achieved through the application of decision support systems, natural language processing, and predictive analytics.
Computer vision, image recognition, and logistic regression are key areas where principal component analysis and artificial neural networks contribute significantly. Speech recognition also benefits from these advanced technologies, enabling businesses to streamline processes and improve operational efficiency and overall performance.
How is this Anomaly Detection Industry segmented?
The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment: Cloud, On-premises
Component: Solution, Services
End-user: BFSI, IT and telecom, Retail and e-commerce, Manufacturing, Others
Technology: Big data analytics, AI and ML, Data mining and business intelligence
Geography: North America (US, Canada, Mexico), Europe (France, Germany, Spain, UK), APAC (China, India, Japan), Rest of World (ROW)
By Deployment Insights
The cloud segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing adoption of advanced technologies such as machine learning models, statistical methods, and real-time monitoring. These technologies enable the identification of anomalous behavior in real-time, thereby enhancing network security and data privacy. Anomaly detection algorithms, including unsupervised learning, reinforcement learning, and deep learning networks, are used to identify outliers and intrusions in large datasets. Data security is a major concern, leading to the adoption of data masking, data pseudonymization, data de-identification, and differential privacy.
Data leakage prevention and incident response are critical components of an effective anomaly detection system. False positive and false negative rates are essential metrics for evaluating the performance of these systems. Time series analysis and handling of concept drift are important techniques in anomaly detection. Data obfuscation, data suppression, and data aggregation are other strategies employed to maintain data privacy. Companies such as Anodot, Cisco Systems Inc, IBM Corp, and SAS Institute Inc offer both cloud-based and on-premises anomaly detection solutions.
Global Anomaly Detection Solution Market size was valued at USD 6.18 Billion in 2024 and is projected to reach USD 19.99 Billion by 2032, growing at a CAGR of 15.80% from 2026 to 2032.
Global Anomaly Detection Solution Market Dynamics
The key market dynamics shaping the global Anomaly Detection Solution Market include the following.
Key Market Drivers:
Increasing Cybersecurity Threats: The surge in sophisticated cyberattacks and data breaches is a key driver of the Anomaly Detection Solution Market. Cybercriminals increasingly target organizations with innovative tactics for breaching security systems. Anomaly detection solutions are critical for detecting unexpected patterns or behaviors that could indicate a threat, such as unauthorized access or insider threats.
Growing Volume of Data: The exponential rise of data generated by businesses, fueled by digital transformation and IoT devices, necessitates effective anomaly detection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).
We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g., digit 1) and we sample with uniform probability from the remaining images as outliers, such that their number equals 10% of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.
Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column indicating the original image class (0-9).
Statistics of each dataset (Name | Instances | Dimensions | Number of Outliers in %):
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
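The generation protocol above is straightforward to reproduce; a minimal sketch (assuming scikit-learn for the MNIST download, not the authors' own script) for one inlier digit:

import numpy as np
from sklearn.datasets import fetch_openml

# Download MNIST: 70,000 images, already flattened to 784-dimensional vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

def build_mnist4od(inlier_digit, seed=0):
    # Inliers: all images of one digit. Outliers: a uniform sample of the remaining
    # images, sized at 10% of the inlier count, as described above.
    rng = np.random.default_rng(seed)
    inliers = X[y == str(inlier_digit)]
    pool = X[y != str(inlier_digit)]
    n_out = int(0.1 * len(inliers))
    outliers = pool[rng.choice(len(pool), size=n_out, replace=False)]
    data = np.vstack([inliers, outliers])
    labels = np.array(["no"] * len(inliers) + ["yes"] * n_out)  # outlier label
    return data, labels

data_1, labels_1 = build_mnist4od(1)  # roughly matches the MNIST_1 row in the table above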
A novel general framework for distributed anomaly detection with theoretical performance guarantees is proposed. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. Under a Gaussian assumption, our distributed algorithm is guaranteed to perform as well as its centralized counterpart, a condition we call 'zero information loss'. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach.
We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
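As a rough, generic sketch of the two detection styles being compared (not the ISHM system or data from the paper; it assumes the hmmlearn package), one can train a Gaussian HMM on nominal sensor sequences, score test sequences by log-likelihood, and contrast that with a simple per-sensor envelope check:

import numpy as np
from hmmlearn.hmm import GaussianHMM

# Synthetic stand-ins for per-flight sensor sequences, each of shape (n_samples, n_sensors).
rng = np.random.default_rng(0)
nominal_seqs = [rng.normal(size=(200, 3)) for _ in range(20)]
test_seqs = [rng.normal(size=(200, 3)) for _ in range(5)]

# Fit an HMM to nominal behavior only.
model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
model.fit(np.vstack(nominal_seqs), lengths=[len(s) for s in nominal_seqs])

# Score each test sequence by its average log-likelihood; unusually low scores can expose
# hidden anomalies that fixed per-sensor limits never trip.
scores = np.array([model.score(s) / len(s) for s in test_seqs])
hmm_flags = scores < scores.mean() - 3 * scores.std()

# Envelope baseline: flag a sequence if any sensor leaves its nominal min/max range.
lo, hi = np.vstack(nominal_seqs).min(axis=0), np.vstack(nominal_seqs).max(axis=0)
envelope_flags = [bool(((s < lo) | (s > hi)).any()) for s in test_seqs]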
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: that is, network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, particularly, two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is also rich in network anomaly types since it contains all types of anomalies, ensuring a comprehensive evaluation of anomaly detection methods.
Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series on institutional and IP subnet levels to cover all possible anomaly detection or forecasting scopes. Overall, the time series dataset was created from the 66 billion IP flows that contain 4 trillion packets that carry approximately 3.7 petabytes of data. The CESNET-TimeSeries24 dataset is a complex real-world dataset that will finally bring insights into the evaluation of forecasting models in real-world environments.
Please cite the usage of our dataset as:
Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x
@Article{cesnettimeseries24,
author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
journal={Scientific Data},
year={2025},
month={Feb},
day={26},
volume={12},
number={1},
pages={338},
issn={2052-4463},
doi={10.1038/s41597-025-04603-x},
url={https://doi.org/10.1038/s41597-025-04603-x}
}
We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.
Datapoints created by the aggregation of IP flows contain the following time-series metrics:
Multiple time aggregation: The original datapoints in the dataset are aggregated by 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets--1 hour and 1 day.
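A minimal re-aggregation sketch with pandas (the path and the "time" column name below are placeholders; the actual file names and fields are documented with the dataset):

import pandas as pd

# One 10-minute series, e.g. for a single IP address (placeholder path and column names).
ts = pd.read_csv("cesnet-timeseries24/ip_addresses_sample/agg_10_minutes/example.csv",
                 parse_dates=["time"]).set_index("time")

# Re-aggregate the 10-minute datapoints into the coarser 1-hour and 1-day intervals.
# Count-like metrics are summed; average- or rate-style metrics would need a mean instead.
hourly = ts.resample("1h").sum()
daily = ts.resample("1D").sum()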
Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series, aggregated per institution ID, provide a view of each institution's data.
Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per subnet, provide a view of each institution subnet's data.
The file hierarchy is described below:
cesnet-timeseries24/
|- institution_subnets/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- institutions/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_full/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_sample/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- times/
| |- times_10_minutes.csv
| |- times_1_hour.csv
| |- times_1_day.csv
|- ids_relationship.csv
|- weekends_and_holidays.csv
The following list describes time series data fields in CSV files:
Moreover, the time series created by re-aggregation contain the following time-series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:
Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data-providing participants' incentive to cooperate. This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach.
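The core idea of combining local sufficient statistics into exact global statistics can be illustrated with a small sketch (a Gaussian toy example, not the paper's algorithm): each site reports only (count, sum, sum of squares), the coordinator reconstructs the global mean and variance exactly, and points are flagged by their global z-score.

import numpy as np

def local_sufficient_stats(x):
    # Each site shares only three numbers, not its raw data.
    return len(x), float(np.sum(x)), float(np.sum(x ** 2))

def global_gaussian_params(stats):
    # Combining local sufficient statistics reproduces the centralized mean and variance
    # exactly, which is the flavor of "zero information loss" under a Gaussian model.
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean ** 2
    return mean, var

# Toy usage: three sites flag points whose global z-score exceeds 3.
rng = np.random.default_rng(0)
sites = [rng.normal(0.0, 1.0, size=1000) for _ in range(3)]
mean, var = global_gaussian_params([local_sufficient_stats(x) for x in sites])
anomalies = [np.where(np.abs(x - mean) / np.sqrt(var) > 3.0)[0] for x in sites]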
Abstract: In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems. The major contributions have been materialized in the form of novel algorithms. Typically, researchers took on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms. Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms with better performance. Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research. Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment and statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.
General Information: This repository contains the simple scripts for data statistics and a link to the multi-source distributed system dataset. You may find details of this dataset in the original paper: Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics". If you use the data, implementation, or any details of the paper, please cite!
BIBTEX:
@inproceedings{nedelkoski2020multi,
title={Multi-source Distributed System Data for AI-Powered Analytics},
author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
booktitle={European Conference on Service-Oriented and Cloud Computing},
pages={161--176},
year={2020},
organization={Springer}
}
The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed. The sequential_data is generated by executing a workload of sequential user requests; the concurrent_data is generated by executing a workload of concurrent user requests. The raw logs in both datasets contain the same files. If users want the logs filtered by time with respect to the two datasets, they should refer to the timestamps in the metrics (they provide the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.
Important: The logs and the metrics are synchronized with respect to time and are both recorded in CEST (Central European Summer Time). The traces are recorded in UTC (Coordinated Universal Time, two hours behind CEST). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This archive contains 1000 synthetic datasets for benchmarking the DINAMO framework (https://arxiv.org/abs/2501.19237), an automated anomaly detection solution featuring both a generalized EWMA-based statistical method and a transformer encoder-based ML approach for Data Quality Monitoring (DQM) in particle physics experiments.
The datasets overview:
These datasets enable systematic evaluation of anomaly detection algorithms in time-dependent settings for the DQM problem. More details can be found in the paper and in the GitHub repository at https://github.com/ArseniiGav/DINAMO/
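For orientation, a generic EWMA-style monitor of the kind DINAMO's statistical baseline generalizes (a hedged sketch, not DINAMO's actual method) tracks an exponentially weighted mean and variance of a per-run summary statistic and flags k-sigma deviations:

import numpy as np

def ewma_flags(x, alpha=0.1, k=3.0):
    # Flag points that deviate from the running EW mean by more than k sigma,
    # updating the mean/variance only after each point has been scored.
    mean, var = float(x[0]), 0.0
    flags = np.zeros(len(x), dtype=bool)
    for i in range(1, len(x)):
        resid = x[i] - mean
        flags[i] = var > 0 and abs(resid) > k * np.sqrt(var)
        mean += alpha * resid
        var = (1 - alpha) * (var + alpha * resid ** 2)
    return flags

# Toy usage: a slowly drifting run-summary statistic with one injected anomaly.
rng = np.random.default_rng(0)
runs = rng.normal(0.0, 1.0, 300) + np.linspace(0, 1, 300)
runs[200] += 8.0
print(np.where(ewma_flags(runs))[0])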
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
Dataset Statistics
# Nodes | %Fraud Nodes (Class=1) |
---|---|
11,944 | 9.5 |
Relation | # Edges |
---|---|
U-P-U | |
U-S-U | |
U-V-U | 1,036,737 |
All |
Graph Construction
The Amazon dataset includes product reviews under the Musical Instruments category. Similar to this paper, we label users with more than 80% helpful votes as benign entities and users with less than 20% helpful votes as fraudulent entities. We conduct a fraudulent user detection task on the Amazon-Fraud dataset, which is a binary classification task. We take 25 handcrafted features from this paper as the raw node features for Amazon-Fraud. We take users as nodes in the graph and design three relations: 1) U-P-U: it connects users reviewing at least one same product; 2) U-S-U: it connects users having at least one same star rating within one week; 3) U-V-U: it connects users with top 5% mutual review text similarities (measured by TF-IDF) among all users.
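A sketch of how such user-user relations can be materialized from a review table with pandas (the table and column names below are illustrative, not the dataset's schema; the U-V-U text-similarity relation is omitted):

import pandas as pd

# Illustrative review table.
reviews = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "product": ["p1", "p1", "p2", "p2"],
    "stars": [5, 5, 1, 1],
    "week": [10, 10, 11, 11],
})

def user_user_edges(df, keys):
    # Connect every pair of distinct users that agree on the given key columns.
    merged = df.merge(df, on=keys, suffixes=("_a", "_b"))
    pairs = merged[merged["user_a"] < merged["user_b"]][["user_a", "user_b"]]
    return set(map(tuple, pairs.drop_duplicates().values))

upu = user_user_edges(reviews, ["product"])          # U-P-U: reviewed the same product
usu = user_user_edges(reviews, ["stars", "week"])    # U-S-U (approx.): same star rating in the same week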
To download the dataset, please visit this Github repo. For any other questions, please email ytongdou(AT)gmail.com for inquiry.
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection.
Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sentiment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Over the past several months, the outbreak of COVID-19 has been expanding over the world. A reliable and accurate dataset of the cases is vital for scientists to conduct related research and for policy-makers to make better decisions. We collect the COVID-19 daily reported data from four open sources: the New York Times, the COVID-19 Data Repository by Johns Hopkins University, the COVID Tracking Project at the Atlantic, and the USAFacts, and compare the similarities and differences among them. In addition, we examine the following problems which occur frequently: (1) the order dependencies violation, (2) abnormal data point and/or period, and (3) the delay-reported issue on weekends and/or holidays. We also integrate the COVID-19 reported cases with the county-level auxiliary information of the local features from official sources, such as health infrastructure, demographic, socioeconomic, and environment information, which are essential for understanding the spread of the virus.
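The order-dependency and abnormal-point checks described above are straightforward to script; a minimal sketch with pandas (the column names and numbers are illustrative):

import pandas as pd

# Illustrative cumulative case series for one county.
df = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=6, freq="D"),
    "cum_cases": [10, 12, 15, 14, 20, 26],  # the drop on day 4 violates the order dependency
})

# Order dependency: cumulative counts should never decrease over time.
violations = df[df["cum_cases"].diff() < 0]

# Crude screen for abnormal daily jumps via z-scores on the increments
# (delay-reporting on weekends/holidays typically shows up as a dip followed by a spike).
daily = df["cum_cases"].diff().dropna()
z = (daily - daily.mean()) / daily.std()
suspicious = daily[z.abs() > 2]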
The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes' data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo --- a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R2 statistic). When the nodes collectively determine that R2 has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
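The monitored quantity is easy to assemble from node-local summaries; a small sketch of the R2 bookkeeping (not DReMo's communication-efficient protocol itself, just the statistic it watches):

import numpy as np

def local_r2_stats(X, y, beta):
    # Each node reports only what is needed for the global R2:
    # residual sum of squares, sum of y, sum of y^2, and the sample count.
    resid = y - X @ beta
    return float(resid @ resid), float(y.sum()), float((y ** 2).sum()), len(y)

def global_r2(stats):
    sse = sum(s[0] for s in stats)
    sy = sum(s[1] for s in stats)
    syy = sum(s[2] for s in stats)
    n = sum(s[3] for s in stats)
    sst = syy - sy ** 2 / n
    return 1.0 - sse / sst

# Toy usage: three nodes sharing one regression model with coefficients beta.
rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0])
nodes = []
for _ in range(3):
    X = rng.normal(size=(500, 2))
    nodes.append((X, X @ beta + rng.normal(scale=0.5, size=500)))
r2 = global_r2([local_r2_stats(X, y, beta) for X, y in nodes])
# A monitor in the spirit of DReMo would trigger a network-wide model recomputation
# whenever this global R2 drops below a fixed threshold.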
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In the recent years, there has been a growing interest in identifying anomalous structure within multivariate data sequences. We consider the problem of detecting collective anomalies, corresponding to intervals where one, or more, of the data sequences behaves anomalously. We first develop a test for a single collective anomaly that has power to simultaneously detect anomalies that are either rare, that is affecting few data sequences, or common. We then show how to detect multiple anomalies in a way that is computationally efficient but avoids the approximations inherent in binary segmentation-like approaches. This approach is shown to consistently estimate the number and location of the collective anomalies—a property that has not previously been shown for competing methods. Our approach can be made robust to point anomalies and can allow for the anomalies to be imperfectly aligned. We show the practical usefulness of allowing for imperfect alignments through a resulting increase in power to detect regions of copy number variation. Supplemental files for this article are available online.
Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
Dataset Statistics
# Nodes | %Fraud Nodes (Class=1) |
---|---|
45,954 | 14.5 |
Relation | # Edges |
---|---|
R-U-R | |
R-T-R | |
R-S-R | 3,402,743 |
All |
Graph Construction
The Yelp spam review dataset includes hotel and restaurant reviews filtered (spam) and recommended (legitimate) by Yelp. We conduct a spam review detection task on the Yelp-Fraud dataset, which is a binary classification task. We take 32 handcrafted features from the SpEagle paper as the raw node features for Yelp-Fraud. Based on previous studies showing that opinion fraudsters have connections in user, product, review text, and time, we take reviews as nodes in the graph and design three relations: 1) R-U-R: it connects reviews posted by the same user; 2) R-S-R: it connects reviews under the same product with the same star rating (1-5 stars); 3) R-T-R: it connects two reviews under the same product posted in the same month.
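A sketch of how these review-review relations can be built as a labeled multigraph (assuming networkx; the table and column names are illustrative, not the dataset's schema):

import itertools
import pandas as pd
import networkx as nx

# Illustrative review table.
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "user": ["u1", "u1", "u2", "u3"],
    "product": ["p1", "p1", "p1", "p2"],
    "stars": [5, 5, 1, 3],
    "month": ["2015-01", "2015-02", "2015-01", "2015-01"],
})

G = nx.MultiGraph()
G.add_nodes_from(reviews["review_id"])

def add_relation(df, keys, relation):
    # Connect every pair of reviews that agree on the given key columns,
    # labeling each edge with its relation name.
    for _, group in df.groupby(keys):
        for a, b in itertools.combinations(group["review_id"], 2):
            G.add_edge(a, b, relation=relation)

add_relation(reviews, ["user"], "R-U-R")               # same user
add_relation(reviews, ["product", "stars"], "R-S-R")   # same product, same star rating
add_relation(reviews, ["product", "month"], "R-T-R")   # same product, same month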
To download the dataset, please visit this Github repo. For any other questions, please email ytongdou(AT)gmail.com for inquiry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Bad data is required to be detected and removed from the microgrid data stream because it misleads the decision-making of the Energy Management Systems (EMS) and puts the microgrid at risk of instability. In this paper, the authors propose a sequential detection method that combines three data mining algorithms, that is the Online Sequential Extreme Learning Machine (OSELM), statistical analysis within a sliding time window, and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN). After sequential data training, OSELM is used to construct an online updated error-filtering map to extract the electrical feature of the microgrid data sequence. Meanwhile, the statistical features, i.e. the surge of the variance and the corresponding correlation coefficients under a sliding time window are first proposed as another two complementary feature dimensions. The three-dimensional features are finally analyzed by DBSCAN to discriminate the bad data. The detection performance of this approach is verified by the data sequence collected from a four-terminal ring-shaped DC microgrid prototype. Compared with bad data detection using a single electrical feature or only statistical features, this approach shows the best performance. Moreover, it can be further applied to the online detection of microgrid bad data in the future.
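A partial sketch of the statistical side of this pipeline (the OSELM error-filtering feature is omitted, so this covers only the sliding-window variance/correlation features plus DBSCAN; it assumes scikit-learn, and all signals and thresholds are illustrative):

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Toy two-channel measurement stream (e.g. a voltage and a correlated current).
rng = np.random.default_rng(1)
n = 2000
v = pd.Series(rng.normal(400.0, 1.0, n))
c = pd.Series(0.5 * v + rng.normal(0.0, 1.0, n))
v.iloc[1000:1005] += 40.0  # inject a bad-data burst

# Sliding-window statistical features: variance surge and cross-correlation.
win = 30
features = pd.DataFrame({
    "var_surge": v.rolling(win).var(),
    "corr": v.rolling(win).corr(c),
}).dropna()

# DBSCAN marks sparse regions of the (standardized) feature space as noise (-1),
# which we treat as candidate bad data.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(
    (features - features.mean()) / features.std()
)
bad_idx = features.index[labels == -1]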
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This work describes real-time series datasets collected from the high voltage converter modulators (HVCM) of the Spallation Neutron Source facility. HVCMs are used to power the linear accelerator klystrons, which in turn produce the high-power radio frequency to accelerate the negative hydrogen ions (H−). Waveform signals have been collected from the operation of more than 15 HVCM systems categorized into four major subsystems during the years 2020-2022. The data collection process occurred in the Spallation Neutron Source facility of Oak Ridge, Tennessee in the United States. For each of the four subsystems, there are two datasets. The first one contains the waveform signals, while the second contains the label of the waveform, whether it has a normal or faulty signal. A variety of waveforms are included in the datasets including insulated-gate bipolar transistor (IGBT) currents in three phases, magnetic flux in the three phases, modulator current and voltage, cap bank current and voltage, and time derivative change of the modulator voltage. The datasets provided are useful to test and develop machine learning and statistical algorithms for applications related to anomaly detection, system fault detection and classification, and signal processing.