31 datasets found
  1. Data from: Anomaly Detection in a Fleet of Systems

    • datasets.ai
    • data.nasa.gov
    • +2more
    Updated Sep 10, 2024
    + more versions
    Cite
    National Aeronautics and Space Administration (2024). Anomaly Detection in a Fleet of Systems [Dataset]. https://datasets.ai/datasets/anomaly-detection-in-a-fleet-of-systems
    Explore at:
    Available download formats
    Dataset updated
    Sep 10, 2024
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    A fleet is a group of systems (e.g., cars, aircraft) that are designed and manufactured the same way and are intended to be used the same way. For example, a fleet of delivery trucks may consist of one hundred instances of a particular model of truck, each of which is intended for the same type of service—almost the same amount of time and distance driven every day, approximately the same total weight carried, etc. For this reason, one may imagine that data mining for fleet monitoring may merely involve collecting operating data from the multiple systems in the fleet and developing some sort of model, such as a model of normal operation that can be used for anomaly detection. However, one then may realize that each member of the fleet will be unique in some ways—there will be minor variations in manufacturing, quality of parts, and usage. For this reason, the typical machine learning and statistics algorithm's assumption that all the data are independent and identically distributed is not correct. One may realize that data from each system in the fleet must be treated as unique so that one can notice significant changes in the operation of that system.
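    The description above implies that each system should be scored against its own operating history rather than against a pooled fleet model. A minimal illustrative sketch of that idea (not the dataset's own method; the data and column names below are synthetic assumptions):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # Simulate a small fleet in which each system has a slightly different "normal" level.
    frames = []
    for system_id in range(5):
        readings = rng.normal(loc=100 + 2.0 * system_id, scale=1.0, size=500)
        frames.append(pd.DataFrame({"system_id": system_id, "sensor": readings}))
    fleet = pd.concat(frames, ignore_index=True)

    # Per-system baselines: each system is scored against its own mean and spread,
    # instead of assuming all fleet data are independent and identically distributed.
    stats = fleet.groupby("system_id")["sensor"].agg(["mean", "std"])
    fleet = fleet.join(stats, on="system_id")
    fleet["z"] = (fleet["sensor"] - fleet["mean"]) / fleet["std"]
    fleet["anomaly"] = fleet["z"].abs() > 4.0
    print(fleet.groupby("system_id")["anomaly"].sum())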

  2. Anomaly Detection Market Analysis North America, Europe, APAC, South...

    • technavio.com
    Cite
    Technavio, Anomaly Detection Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Germany, UK, China, Japan - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United Kingdom, United States, Global
    Description


    Anomaly Detection Market Size 2024-2028

    The anomaly detection market size is forecast to increase by USD 3.71 billion at a CAGR of 13.63% between 2023 and 2028. Anomaly detection is a critical aspect of cybersecurity, particularly in sectors like healthcare where abnormal patient conditions or unusual network activity can have significant consequences. The market for anomaly detection solutions is experiencing significant growth due to several factors. Firstly, the increasing incidence of internal threats and cyber frauds has led organizations to invest in advanced tools for detecting and responding to anomalous behavior. Secondly, the infrastructural requirements for implementing these solutions are becoming more accessible, making them a viable option for businesses of all sizes. Data science and machine learning algorithms play a crucial role in anomaly detection, enabling accurate identification of anomalies and minimizing the risk of incorrect or misleading conclusions.

    However, data quality is a significant challenge in this field, as poor quality data can lead to false positives or false negatives, undermining the effectiveness of the solution. Overall, the market for anomaly detection solutions is expected to grow steadily in the coming years, driven by the need for enhanced cybersecurity and the increasing availability of advanced technologies.

    What will be the Anomaly Detection Market Size During the Forecast Period?


    Anomaly detection, also known as outlier detection, is a critical data analysis technique used to identify observations or events that deviate significantly from the normal behavior or expected patterns in data. These deviations, referred to as anomalies or outliers, can indicate infrastructure failures, breaking changes, manufacturing defects, equipment malfunctions, or unusual network activity. In various industries, including manufacturing, cybersecurity, healthcare, and data science, anomaly detection plays a crucial role in preventing incorrect or misleading conclusions. Artificial intelligence and machine learning algorithms, such as statistical tests (Grubbs test, Kolmogorov-Smirnov test), decision trees, isolation forest, naive Bayesian, autoencoders, local outlier factor, and k-means clustering, are commonly used for anomaly detection.

    Furthermore, these techniques help identify anomalies by analyzing data points and their statistical properties using charts, visualization, and ML models. For instance, in manufacturing, anomaly detection can help identify defective products, while in cybersecurity, it can detect unusual network activity. In healthcare, it can be used to identify abnormal patient conditions. By applying anomaly detection techniques, organizations can proactively address potential issues and mitigate risks, ensuring optimal performance and security.
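    As a generic illustration of two of the algorithms named above (Isolation Forest and Local Outlier Factor), the following scikit-learn sketch flags injected outliers in synthetic two-dimensional data; it is not tied to any dataset or vendor tool in this listing:

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(42)
    normal = rng.normal(0, 1, size=(500, 2))        # nominal observations
    outliers = rng.uniform(-6, 6, size=(10, 2))     # injected anomalies
    X = np.vstack([normal, outliers])

    iso = IsolationForest(contamination=0.02, random_state=0).fit(normal)
    iso_labels = iso.predict(X)                     # -1 = anomaly, 1 = normal

    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
    lof_labels = lof.fit_predict(X)                 # -1 = anomaly, 1 = normal

    print("Isolation Forest flagged:", int((iso_labels == -1).sum()))
    print("Local Outlier Factor flagged:", int((lof_labels == -1).sum()))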

    Market Segmentation

    The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Deployment

      Cloud
      On-premise

    Geography

      North America
        US
      Europe
        Germany
        UK
      APAC
        China
        Japan
      South America
      Middle East and Africa

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period. The market is witnessing a notable shift towards cloud-based solutions due to their numerous advantages over traditional on-premises systems. Cloud-based anomaly detection offers benefits such as quicker deployment, enhanced flexibility and scalability, real-time data visibility, and customization capabilities. These features are provided by service providers with flexible payment models such as monthly subscriptions and pay-as-you-go, making cloud-based software a cost-effective and economical choice. Anodot Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc are some prominent companies offering cloud-based anomaly detection solutions in addition to on-premise alternatives. In contexts spanning security threats, architectural optimization, marketing strategies, finance, fraud detection, manufacturing defects, and equipment malfunctions, cloud-based anomaly detection is becoming increasingly popular due to its ability to provide real-time insights and a swift response to anomalies.


    The cloud segment accounted for USD 1.59 billion in 2018 and showed a gradual increase during the forecast period.

    Regional Insights

    When it comes to Anomaly Detection Market growth, North America is estimated to contribute 37% to the global market during the forecast period. Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.

  3. Data from: Automated, Unsupervised, and Auto-parameterized Inference of Data...

    • zenodo.org
    pdf, zip
    Updated Jan 22, 2025
    + more versions
    Cite
    Qiaolin Qin; Qiaolin Qin (2025). Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection [Dataset]. http://doi.org/10.48550/arxiv.2412.05240
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Qiaolin Qin; Qiaolin Qin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the replication package for the paper "Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection".

    "Discover-Data-Quality-With-RIOLU-A-Replication-Package" Folder structure:

    ├── ablation_study
    ├── 20_subsampling.py
    ├── no_selection.py
    ├── static_rEM_1.py
    ├── static_rcov_95.py
    ├── static_selection_threshold.py
    └── readme.md
    ├── ground_truth_anomaly_detection (Data ground truths)
    ├── images
    ├── java_repo_exploration
    ├── java_names
    ├── java_naming_anomalies
    └── readme.md
    ├── sensitivity_analysis
    ├── Auto_RIOLU_alt_inircov.py
    ├── Auto_RIOLU_alt_nsubset.py
    └── readme.md
    ├── test_anomaly_detection
    ├── chatgpt_sampled (Data sampled for ChatGPT & the extracted regexes)
    ├── flights
    ├── hosp_1k
    ├── hosp_10k
    ├── hosp_100k
    ├── movies
    └── readme.md
    ├── test_data_profiling
    ├── hetero
    ├── homo.simple
    ├── homo
    ├── GPT_responses.csv (ChatGPT profiling responses & the extracted regexes)
    └── readme.md
    ├── Auto-RIOLU.py (Auto-RIOLU for anomaly detection)
    ├── Guided-RIOLU.py (Guided-RIOLU for anomaly detection)
    ├── pattern_generator.py
    ├── pattern_selector.py
    ├── pattern_summarizer.py
    ├── test_profiling.py (RIOLU for data profiling)
    ├── utils.py
    ├── LICENSE
    └── readme.md

  4. Solving a prisoner's dilemma in distributed anomaly detection

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • datasets.ai
    • +5more
    Updated Feb 18, 2025
    Cite
    nasa.gov (2025). Solving a prisoner's dilemma in distributed anomaly detection [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/solving-a-prisoners-dilemma-in-distributed-anomaly-detection
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data-providing participants' incentive to cooperate. This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.

  5. pyhydroqc Sensor Data QC: Single Site Example

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Cite
    Amber Spackman Jones (2023). pyhydroqc Sensor Data QC: Single Site Example [Dataset]. http://doi.org/10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Amber Spackman Jones
    Time period covered
    Jan 1, 2017 - Dec 31, 2017
    Description

    This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.

    This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.

    Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when they exceed a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction:
    - Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable.
    - Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables.
    - Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest.
    - Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.

    The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.

    The anomaly detection and correction workflow involves the following steps (a sketch of steps 4-6 follows below):
    1. Retrieving data
    2. Applying rules-based detection to screen data and apply initial corrections
    3. Identifying and correcting sensor drift and calibration (if applicable)
    4. Developing a model (i.e., ARIMA or LSTM)
    5. Applying the model to make time series predictions
    6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
    7. Widening the window over which an anomaly is identified
    8. Aggregating detections resulting from multiple models
    9. Making corrections for anomalous events
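    A minimal sketch of the model-based portion of this workflow (steps 4-6), using statsmodels' ARIMA directly on synthetic data; pyhydroqc wraps these steps in its own functions, so refer to its documentation for the actual API:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(1)
    series = pd.Series(np.sin(np.linspace(0, 20, 400)) + rng.normal(0, 0.1, 400))
    series.iloc[250] += 2.5                       # inject a single anomalous spike

    model = ARIMA(series, order=(2, 0, 2)).fit()  # step 4: develop a model
    estimates = model.predict(start=0, end=len(series) - 1)  # step 5: predictions
    residuals = series - estimates

    threshold = 4 * residuals.std()               # step 6: threshold on residuals
    anomalies = residuals.abs() > threshold
    print("Anomalous indices:", list(residuals.index[anomalies]))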

    Instructions to run the notebook through the CUAHSI JupyterHub:
    1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into the CUAHSI JupyterHub using your HydroShare credentials.
    2. Select 'Python 3.8 - Scientific' as the server and click Start.
    3. From your JupyterHub directory, click on the ExampleNotebook.ipynb file.
    4. Execute each cell in the code by clicking the Run button.

  6. Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • datadiscoverystudio.org
    • +6more
    Updated Feb 18, 2025
    Cite
    nasa.gov (2025). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.

  7. OPSSAT-AD - anomaly detection dataset for satellite telemetry

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jul 9, 2024
    Cite
    Ruszczak Bogdan; Ruszczak Bogdan (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Dataset]. http://doi.org/10.5281/zenodo.12588359
    Explore at:
    Available download formats: csv
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Ruszczak
    Authors
    Ruszczak Bogdan; Ruszczak Bogdan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the AI-ready benchmark dataset (OPSSAT-AD) containing the telemetry data acquired on board OPS-SAT, a CubeSat mission that has been operated by the European Space Agency.

    It is accompanied by a paper with baseline results obtained using 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test dataset split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when confronting new anomaly detection algorithms with OPSSAT-AD. We believe that this work may become an important step toward building a fair, reproducible, and objective validation procedure that can be used to quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.

    The two included files are:

    • segments.csv with the telemetry signals acquired from the ESA OPS-SAT spacecraft,
    • dataset.csv with the extracted synthetic features computed for each manually split and labeled telemetry segment.
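    A hedged loading sketch for the two files listed above; only the file names come from the listing, and no column names are assumed:

    import pandas as pd

    segments = pd.read_csv("segments.csv")   # raw telemetry segments
    features = pd.read_csv("dataset.csv")    # per-segment features and labels

    print(segments.shape, features.shape)
    print(features.columns.tolist())         # inspect the provided columns before modeling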

    Please have a look at our two papers commenting on this dataset:

    • The benchmark paper with results of 30 supervised and unsupervised anomaly detection models for this collection:
      Ruszczak, B., Kotowski, K., Nalepa, J., Evans, D.: The OPS-SAT benchmark for detecting anomalies in satellite telemetry, 2024, preprint arXiv:2407.04730,
    • the conference paper in which we presented some preliminary results for this dataset:
      Ruszczak, B., Kotowski, K., Andrzejewski, J., et al.: (2023). Machine Learning Detects Anomalies in OPS-SAT Telemetry. Computational Science – ICCS 2023. LNCS, vol 14073. Springer, Cham, DOI:10.1007/978-3-031-35995-8_21.
  8. Supporting data and tools for "Toward automating post processing of aquatic...

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Cite
    Amber Spackman Jones; Tanner Jones; Jeffery S. Horsburgh (2023). Supporting data and tools for "Toward automating post processing of aquatic sensor data" [Dataset]. https://search.dataone.org/view/sha256%3A5ebbad6b4db49f3718e386c4bfc53a14c5d5f5b6b1d030f7b518320cc5311a4c
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Amber Spackman Jones; Tanner Jones; Jeffery S. Horsburgh
    Time period covered
    Jan 1, 2013 - Dec 31, 2019
    Area covered
    Description

    This resource contains the supporting data and code files for the analyses presented in "Toward automating post processing of aquatic sensor data," an article published in the journal Environmental Modelling and Software. This paper describes pyhydroqc, a Python package developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information on pyhydroqc, see the code repository (https://github.com/AmberSJones/pyhydroqc) and the documentation (https://ambersjones.github.io/pyhydroqc/). The package may be installed from the Python Package Index (more info: https://packaging.python.org/tutorials/installing-packages/).

    Included in this resource are input data, Python scripts to run the package on the input data (anomaly detection and correction), results from running the algorithm, and Python scripts for generating the figures in the manuscript. The organization and structure of the files are described in detail in the readme file. The input data were collected as part of the Logan River Observatory (LRO). The data in this resource represent a subset of data available for the LRO and were compiled by querying the LRO’s operational database. All available data for the LRO can be sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.

    There are two sets of scripts in this resource: 1.) Scripts that reproduce plots for the paper using saved results, and 2.) Code used to generate the complete results for the series in the case study. While all figures can be reproduced, there are challenges to running the code for the complete results (it is computationally intensive, different results will be generated due to the stochastic nature of the models, and the code was developed with an early version of the package), which is why the saved results are included in this resource. For a simple example of running pyhydroqc functions for anomaly detection and correction on a subset of data, see this resource: https://www.hydroshare.org/resource/92f393cbd06b47c398bdd2bbb86887ac/.

  9. GECCO Industrial Challenge 2017 Dataset: A water quality dataset for the...

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, zip
    Updated Jul 19, 2024
    Cite
    Steffen Moritz; Steffen Moritz; Martina Friese; Jörg Stork; Margarita Rebolledo; Andreas Fischbach; Thomas Bartz-Beielstein; Thomas Bartz-Beielstein; Martina Friese; Jörg Stork; Margarita Rebolledo; Andreas Fischbach (2024). GECCO Industrial Challenge 2017 Dataset: A water quality dataset for the 'Monitoring of drinking-water quality' competition at the Genetic and Evolutionary Computation Conference 2017, Berlin, Germany. [Dataset]. http://doi.org/10.5281/zenodo.3884465
    Explore at:
    Available download formats: pdf, csv, zip
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steffen Moritz; Steffen Moritz; Martina Friese; Jörg Stork; Margarita Rebolledo; Andreas Fischbach; Thomas Bartz-Beielstein; Thomas Bartz-Beielstein; Martina Friese; Jörg Stork; Margarita Rebolledo; Andreas Fischbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Berlin
    Description

    Dataset of the 'Industrial Challenge: Monitoring of drinking-water quality' competition hosted at The Genetic and Evolutionary Computation Conference (GECCO) July 15th-19th 2017, Berlin, Germany

    The task of the competition was to develop an anomaly detection algorithm for a water- and environmental data set.

    Included in zenodo:

    - dataset of water quality data

    - additional material and descriptions provided for the competition

    The competition was organized by:

    M. Friese, J. Stork, A. Fischbach, M. Rebolledo, T. Bartz-Beielstein (TH Köln)

    The dataset was provided and prepared by:

    Thüringer Fernwasserversorgung,

    IMProvT research project (S. Moritz)


    Industrial Challenge: Monitoring of drinking-water quality

    Description:

    Water covers 71% of the Earth's surface and is vital to all known forms of life. The provision of safe and clean drinking water to protect public health is a natural aim. Performing regular monitoring of the water-quality is essential to achieve this aim.

    The goal of the GECCO 2017 Industrial Challenge is to analyze drinking-water data and to develop a highly efficient algorithm that most accurately recognizes diverse kinds of changes in the quality of our drinking-water.

    Submission deadline:

    June 30, 2017

    Official webpage:

    http://www.spotseven.de/gecco-challenge/gecco-challenge-2017/

  10. Data from: Fleet Level Anomaly Detection of Aviation Safety Data

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Fleet Level Anomaly Detection of Aviation Safety Data [Dataset]. https://data.nasa.gov/w/n7yu-ua2x/_variation_?cur=kQyfdqGHa1-&from=root
    Explore at:
    Available download formats: application/rdfxml, application/rssxml, xml, csv, json, tsv
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    For the purposes of this paper, the National Airspace System (NAS) encompasses the operations of all aircraft which are subject to air traffic control procedures. The NAS is a highly complex dynamic system that is sensitive to aeronautical decision-making and risk management skills. In order to ensure a healthy system with safe flights, a systematic approach to anomaly detection is very important when evaluating a given set of circumstances and for determination of the best possible course of action. Given the fact that the NAS is a vast and loosely integrated network of systems, it requires improved safety assurance capabilities to maintain an extremely low accident rate under increasingly dense operating conditions. Data mining-based tools and techniques are required to support and aid operators' (such as pilots, management, or policy makers) overall decision-making capacity. Within the NAS, the ability to analyze fleetwide aircraft data autonomously is still considered a significantly challenging task. For our purposes, a fleet is defined as a group of aircraft sharing generally compatible parameter lists. Here, in this effort, we aim at developing a system level analysis scheme. In this paper we address the capability for detection of fleetwide anomalies as they occur, which itself is an important initiative toward the safety of real-world flight operations. The flight data recorders archive millions of data points with valuable information on flights every day. The operational parameters consist of both continuous and discrete (binary & categorical) data from several critical subsystems and numerous complex procedures. In this paper, we discuss a system level anomaly detection approach based on the theory of kernel learning to detect potential safety anomalies in a very large database of commercial aircraft. We also demonstrate that the proposed approach uncovers some operationally significant events due to environmental, mechanical, and human factors issues in high dimensional, multivariate Flight Operations Quality Assurance (FOQA) data. We present the results of our detection algorithms on real FOQA data from a regional carrier.

  11. Data from: Multi-Source Distributed System Data for AI-powered Analytics

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 10, 2022
    Cite
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems. The major contributions have been materialized in the form of novel algorithms. Typically, researchers took on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms. Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance. Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research. Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

    General Information:

    This repository contains simple scripts for data statistics and a link to the multi-source distributed system dataset.

    You may find details of this dataset from the original paper:

    Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

    If you use the data, implementation, or any details of the paper, please cite!

    BIBTEX:


    @inproceedings{nedelkoski2020multi,
     title={Multi-source Distributed System Data for AI-Powered Analytics},
     author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
     booktitle={European Conference on Service-Oriented and Cloud Computing},
     pages={161--176},
     year={2020},
     organization={Springer}
    }
    


    The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

    The raw logs in both datasets contain the same files. If the user wants the logs filtered by time with respect to the two datasets, they should refer to the timestamps in the metrics (they provide the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.

    Important: The logs and the metrics are synchronized with respect to time, and both are recorded in CEST (Central European Summer Time, UTC+2). The traces are in UTC (Coordinated Universal Time, i.e., CEST minus 2 hours). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
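    A sketch of the timezone alignment described above, using hypothetical file and column names: logs/metrics recorded in CEST are converted to UTC so they can be joined with the traces.

    import pandas as pd

    metrics = pd.read_csv("metrics.csv", parse_dates=["timestamp"])
    traces = pd.read_csv("traces.csv", parse_dates=["timestamp"])

    # Localize metrics to Central European time (Europe/Berlin handles CEST) and convert to UTC.
    metrics["timestamp"] = (
        metrics["timestamp"].dt.tz_localize("Europe/Berlin").dt.tz_convert("UTC")
    )
    traces["timestamp"] = traces["timestamp"].dt.tz_localize("UTC")

    # Align the two sources on the common UTC timeline.
    aligned = pd.merge_asof(
        metrics.sort_values("timestamp"),
        traces.sort_values("timestamp"),
        on="timestamp",
        tolerance=pd.Timedelta("1s"),
    )
    print(aligned.head())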

    Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/

  12. Numbers of palindromes as a function of Mason_variator iterations.

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    Alan F. Karr; Jason Hauzel; Adam A. Porter; Marcel Schaefer (2023). Numbers of palindromes as a function of Mason_variator iterations. [Dataset]. http://doi.org/10.1371/journal.pone.0271970.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alan F. Karr; Jason Hauzel; Adam A. Porter; Marcel Schaefer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The genome is E. coli. Half lengths of 6, 8, 10, 12, 14 and 16 are columns and Mason_variator iterations are rows.

  13. GECCO Industrial Challenge 2018 Dataset: A water quality dataset for the...

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, zip
    Updated Jul 19, 2024
    Cite
    Steffen Moritz; Steffen Moritz; Frederik Rehbach; Sowmya Chandrasekaran; Margarita Rebolledo; Thomas Bartz-Beielstein; Thomas Bartz-Beielstein; Frederik Rehbach; Sowmya Chandrasekaran; Margarita Rebolledo (2024). GECCO Industrial Challenge 2018 Dataset: A water quality dataset for the 'Internet of Things: Online Anomaly Detection for Drinking Water Quality' competition at the Genetic and Evolutionary Computation Conference 2018, Kyoto, Japan. [Dataset]. http://doi.org/10.5281/zenodo.3884398
    Explore at:
    Available download formats: zip, csv, pdf
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steffen Moritz; Steffen Moritz; Frederik Rehbach; Sowmya Chandrasekaran; Margarita Rebolledo; Thomas Bartz-Beielstein; Thomas Bartz-Beielstein; Frederik Rehbach; Sowmya Chandrasekaran; Margarita Rebolledo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of the 'Internet of Things: Online Anomaly Detection for Drinking Water Quality' competition hosted at The Genetic and Evolutionary Computation Conference (GECCO) July 15th-19th 2018, Kyoto, Japan

    The task of the competition was to develop an anomaly detection algorithm for a water- and environmental data set.

    Included in zenodo:

    - dataset of water quality data

    - additional material and descriptions provided for the competition

    The competition was organized by:

    F. Rehbach, M. Rebolledo, S. Moritz, S. Chandrasekaran, T. Bartz-Beielstein (TH Köln)

    The dataset was provided by:

    Thüringer Fernwasserversorgung and IMProvT research project

    GECCO Industrial Challenge: 'Internet of Things: Online Anomaly Detection for Drinking Water Quality'

    Description:

    For the 7th time in GECCO history, the SPOTSeven Lab is hosting an industrial challenge in cooperation with various industry partners. This year's challenge, based on the 2017 challenge, is held in cooperation with "Thüringer Fernwasserversorgung", which provides their real-world data set. The task of this year's competition is to develop an anomaly detection algorithm for the water and environmental data set. Early identification of anomalies in water quality data is a challenging task. It is important to identify true undesirable variations in the water quality. At the same time, false alarm rates have to be very low.
    In addition to the competition, for the first time in GECCO history we are able to provide the opportunity for all participants to submit 2-page algorithm descriptions for the GECCO Companion. Thus, it is now possible to create publications in a procedure similar to the Late Breaking Abstracts (LBAs) directly through competition participation!

    Accepted Competition Entry Abstracts
    - Online Anomaly Detection for Drinking Water Quality Using a Multi-objective Machine Learning Approach (Victor Henrique Alves Ribeiro and Gilberto Reynoso Meza from the Pontifical Catholic University of Parana)
    - Anomaly Detection for Drinking Water Quality via Deep BiLSTM Ensemble (Xingguo Chen, Fan Feng, Jikai Wu, and Wenyu Liu from the Nanjing University of Posts and Telecommunications and Nanjing University)
    - Automatic vs. Manual Feature Engineering for Anomaly Detection of Drinking-Water Quality (Valerie Aenne Nicola Fehst from idatase GmbH)

    Official webpage:

    http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2018/

  14. GECCO Industrial Challenge 2019 Dataset: A water quality dataset for the...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Rehbach, Frederik (2024). GECCO Industrial Challenge 2019 Dataset: A water quality dataset for the 'Internet of Things: Online Event Detection for Drinking Water Quality Control' competition at the Genetic and Evolutionary Computation Conference 2019, Prague, Czech Republic. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3884443
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Moritz, Steffen
    Rehbach, Frederik
    Bartz-Beielstein, Thomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Czechia, Prague
    Description

    Dataset of the 'Internet of Things: Online Event Detection for Drinking Water Quality Control' competition hosted at The Genetic and Evolutionary Computation Conference (GECCO) July 13th-17th 2019, Prague, Czech Republic

    The task of the competition was to develop an anomaly detection algorithm for a water- and environmental data set.

    Included in zenodo:

    1. Original train dataset of water quality data provided to participants (identical to gecco2019_train_water_quality.csv)

    2. Call for Participation

    3. Rules and Description of the Challenge

    4. Resource Package provided to participants

    5. The complete dataset, consisting of train, test and validation merged together (gecco2019_all_water_quality.csv)

    6. The test dataset, which was used for creating the leaderboard on the server (gecco2019_test_water_quality.csv)

    7. The train dataset, which participants had available for training their models (gecco2019_train_water_quality.csv)

    8. The validation dataset, which was used for the end results for the challenge (gecco2019_valid_water_quality.csv)

    The challenge required the participants to submit a program for event detection. A training dataset was available to the participants (gecco2019_train_water_quality.csv). During the challenge, the participants were able to upload a version of their program to our online platform, where this version was scored against the testing dataset (gecco2019_test_water_quality.csv); thus an intermediate leaderboard was available. To avoid overfitting against this dataset, at the end of the challenge the end result was created by scoring against the validation dataset (gecco2019_valid_water_quality.csv).

    The train, test, and validation datasets are from the same measuring station and are in chronological order, so the timestamps of the test dataset begin directly after the train timestamps, while the validation timestamps begin directly after the test timestamps.
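    A hedged sketch that loads the three provided splits and checks the chronological ordering described above; the timestamp column name ("Time") is an assumption, not taken from the files:

    import pandas as pd

    train = pd.read_csv("gecco2019_train_water_quality.csv", parse_dates=["Time"])
    test = pd.read_csv("gecco2019_test_water_quality.csv", parse_dates=["Time"])
    valid = pd.read_csv("gecco2019_valid_water_quality.csv", parse_dates=["Time"])

    # Train precedes test, which precedes validation, as stated in the description.
    assert train["Time"].max() < test["Time"].min()
    assert test["Time"].max() < valid["Time"].min()
    print(len(train), len(test), len(valid))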

    The competition was organized by:

    F. Rehbach, S. Moritz, T. Bartz-Beielstein (TH Köln)

    The dataset was provided by:

    Thüringer Fernwasserversorgung and IMProvT research project

    Internet of Things: Online Event Detection for Drinking Water Quality Control

    Description:

    For the 8th time in GECCO history, the SPOTSeven Lab is hosting an industrial challenge in cooperation with various industry partners. This year's challenge, based on the 2018 challenge, is held in cooperation with "Thüringer Fernwasserversorgung", which provides their real-world data set. The task of this year's competition is to develop an anomaly detection algorithm for the water and environmental data set. Early identification of anomalies in water quality data is a challenging task. It is important to identify true undesirable variations in the water quality. At the same time, false alarm rates have to be very low.

    Competition Opens: End of January / Start of February 2019
    Final Submission: 30 June 2019

    Official webpage:

    https://www.th-koeln.de/informatik-und-ingenieurwissenschaften/gecco-challenge-2019_63244.php

  15. Developing Standardized Testing Datasets for Benchmarking Automated QC...

    • search.dataone.org
    Updated Mar 15, 2025
    + more versions
    Cite
    Ehsan Kahrizi (2025). Developing Standardized Testing Datasets for Benchmarking Automated QC Algorithm Performance [Dataset]. https://search.dataone.org/view/sha256%3A6e32c460ddc4bc00c48ad1582047e7509bfb09f11efa4171d5a144e48ac8c893
    Explore at:
    Dataset updated
    Mar 15, 2025
    Dataset provided by
    Hydroshare
    Authors
    Ehsan Kahrizi
    Description

    Diagnose Aquatic Sensor Data for Temperature and Water Quality Events

    Overview

    This project is designed to diagnose and flag events in aquatic sensor data based on various conditions and thresholds. It processes raw data from aquatic sites and applies thresholds and logical conditions to identify different types of anomalies. The primary focus is to flag events that may indicate sensor anomalies, environmental conditions (e.g., frozen water), or technician site visits.

    Key Features

    1. Event Detection: Detects and flags various event types, such as MNT (maintenance), LWT (low water table), ICE (frozen water), SLM (sensor logger malfunction), PF (power failure), and VIN (visual inspection).
    2. Data Quality Control: Uses thresholds to validate sensor readings, ensuring accurate representation of water conditions.
    3. Automated Labelling: Automatically labels events using a set of predefined indicators for anomaly detection.

    Workflow of the model: https://ibb.co/8BDFjsv
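    A hedged sketch of the kind of threshold/condition-based flagging described above; the flag names follow the listing, while the file name, column names, and threshold values are illustrative assumptions rather than the project's actual configuration:

    import pandas as pd

    data = pd.read_csv("site_raw_data.csv", parse_dates=["datetime"])

    flags = pd.DataFrame(index=data.index)
    # ICE: water temperature at or below freezing suggests frozen conditions.
    flags["ICE"] = data["water_temp_C"] <= 0.0
    # SLM: physically implausible readings suggest a sensor/logger malfunction.
    flags["SLM"] = (data["water_temp_C"] < -5.0) | (data["water_temp_C"] > 40.0)
    # PF: gaps longer than the expected logging interval suggest a power failure.
    flags["PF"] = data["datetime"].diff() > pd.Timedelta("30min")

    data = data.join(flags.add_prefix("flag_"))
    print(data.filter(like="flag_").sum())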

  16. Techniques for Increased Automation of Aquatic Sensor Data Post Processing...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Sep 7, 2021
    Cite
    Amber Spackman Jones; Jeffery S. Horsburgh; Tanner Jones (2021). Techniques for Increased Automation of Aquatic Sensor Data Post Processing in Python: Video Presentation [Dataset]. https://beta.hydroshare.org/resource/bc5c616426214b60b068352ae028d963/
    Explore at:
    Available download formats: zip (351.0 MB)
    Dataset updated
    Sep 7, 2021
    Dataset provided by
    HydroShare
    Authors
    Amber Spackman Jones; Jeffery S. Horsburgh; Tanner Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This resource contains a video recording for a presentation given as part of the National Water Quality Monitoring Council conference in April 2021. The presentation covers the motivation for performing quality control for sensor data, the development of PyHydroQC, a Python package with functions for automating sensor quality control including anomaly detection and correction, and the performance of the algorithms applied to data from multiple sites in the Logan River Observatory.

    The initial abstract for the presentation: Water quality sensors deployed to aquatic environments make measurements at high frequency and commonly include artifacts that do not represent the environmental phenomena targeted by the sensor. Sensors are subject to fouling from environmental conditions, often exhibit drift and calibration shifts, and report anomalies and erroneous readings due to issues with datalogging, transmission, and other unknown causes. The suitability of data for analyses and decision making often depends on subjective and time-consuming quality control processes consisting of manual review and adjustment of data. Data-driven and machine learning techniques have the potential to automate identification and correction of anomalous data, streamlining the quality control process. We explored documented approaches and selected several for implementation in a reusable, extensible Python package designed for anomaly detection for aquatic sensor data. Implemented techniques include regression approaches that estimate values in a time series, flag a point as anomalous if the difference between the sensor measurement and the model estimate exceeds a threshold, and offer replacement values for correcting anomalies. Additional algorithms that scaffold the central regression approaches include rules-based preprocessing, thresholds for determining anomalies that adjust with data variability, and the ability to detect and correct anomalies using forecasted and backcasted estimation. The techniques were developed and tested based on several years of data from aquatic sensors deployed at multiple sites in the Logan River Observatory in northern Utah, USA. Performance was assessed based on labels and corrections applied previously by trained technicians. In this presentation, we describe the techniques for detection and correction, report their performance, illustrate the workflow for applying them to high frequency aquatic sensor data, and demonstrate the possibility for additional approaches to help increase automation of aquatic sensor data post processing.

  17. A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023):...

    • figshare.com
    csv
    Updated Feb 23, 2025
    Cite
    Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin (2025). A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023): 2.82Million Record Resource for Empirical and ML-Based Research [Dataset]. http://doi.org/10.6084/m9.figshare.27800394.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    figshare
    Authors
    Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description
    Water Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.
    Countries/Regions: United States, Canada, Ireland, England, China.
    Years Covered: 1940-2023.
    Data Records: 2.82 million.

    Definition of Columns
    Country: Name of the water-body region.
    Area: Name of the area in the region.
    Waterbody Type: Type of the water-body source.
    Date: Date of the sample collection (dd-mm-yyyy).
    Ammonia (mg/l): Ammonia concentration.
    Biochemical Oxygen Demand (BOD) (mg/l): Oxygen demand measurement.
    Dissolved Oxygen (DO) (mg/l): Concentration of dissolved oxygen.
    Orthophosphate (mg/l): Orthophosphate concentration.
    pH (pH units): pH level of water.
    Temperature (°C): Temperature in Celsius.
    Nitrogen (mg/l): Total nitrogen concentration.
    Nitrate (mg/l): Nitrate concentration.
    CCME_Values: Calculated water quality index values using the CCME WQI model.
    CCME_WQI: Water Quality Index classification based on CCME_Values.

    Data Directory Description

    Category 1: Dataset
    Combined Data: This folder contains two files, Combined_dataset.csv and Summary.xlsx. The Combined_dataset.csv file includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps like missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. The Summary.xlsx provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).
    - Combined_dataset.csv
    - Summary.xlsx
    Country-wise Data: This folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. The Summary_country.xlsx file presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).
    - England_dataset.csv
    - Canada_dataset.csv
    - USA_dataset.csv
    - Ireland_dataset.csv
    - China_dataset.csv
    - Summary_country.xlsx

    Category 2: Code
    Data processing and harmonization code (e.g., language conversion, date conversion, parameter naming and unit conversion, missing value handling, WQI measurement and classification):
    - Data_Processing_Harmonnization.ipynb
    Code used for technical validation (e.g., assessing the data distribution, outlier detection, water quality trend analysis, and verifying the application of the dataset for the ML models):
    - Technical_Validation.ipynb

    Category 3: Data Collection Sources
    Links to the selected dataset sources, which were used to create the dataset and are provided for further reconstruction or data formation:
    - DataCollectionSources.xlsx

    Original Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted Research

    Abstract
    Assessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is a known phenomenon, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting. The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five prominent countries: the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters that span 1940-2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based data- and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at a cross-regional to global scale.

    Note: Cite this repository and the original paper when using this dataset.
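    For orientation, a simplified sketch of the CCME WQI calculation behind the CCME_Values column, assuming upper-limit objectives only (the full CCME model also handles lower limits); the objective values below are illustrative, not regulatory thresholds:

    import numpy as np
    import pandas as pd

    def ccme_wqi(samples: pd.DataFrame, objectives: dict) -> float:
        tests = samples[list(objectives)]
        limits = pd.Series(objectives)
        failed = tests.gt(limits, axis=1)

        f1 = 100.0 * failed.any().sum() / len(objectives)   # scope: share of failed variables
        f2 = 100.0 * failed.values.sum() / failed.size      # frequency: share of failed tests
        excursions = (tests.div(limits, axis=1) - 1.0)[failed].fillna(0.0)
        nse = excursions.values.sum() / failed.size         # normalized sum of excursions
        f3 = nse / (0.01 * nse + 0.01)                      # amplitude
        return 100.0 - np.sqrt(f1**2 + f2**2 + f3**2) / 1.732

    sample = pd.DataFrame({"Ammonia (mg/l)": [0.2, 1.5], "Nitrate (mg/l)": [8.0, 60.0]})
    print(ccme_wqi(sample, {"Ammonia (mg/l)": 0.5, "Nitrate (mg/l)": 50.0}))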

  18. Data from: Analysis of Virtual Sensors for Predicting Aircraft Fuel...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    • +1more
    Updated Feb 18, 2025
    + more versions
    Cite
    nasa.gov (2025). Analysis of Virtual Sensors for Predicting Aircraft Fuel Consumption [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/analysis-of-virtual-sensors-for-predicting-aircraft-fuel-consumption
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Previous research described the use of machine learning algorithms to predict aircraft fuel consumption. This technique, known as Virtual Sensors, models fuel consumption as a function of aircraft Flight Operations Quality Assurance (FOQA) data. FOQA data consist of a large number of measurements that are already recorded by many commercial airlines. The predictive model is used for anomaly detection in the fuel consumption history by noting when measured fuel consumption exceeds an expected value. This exceedance may indicate overconsumption of fuel, the source of which may be identified and corrected by the aircraft operator. This would reduce both fuel emissions and operational costs. This paper gives a brief overview of the modeling approach and describes efforts to validate and analyze the initial results of this project. We examine the typical error in modeling, and compare modeling accuracy against both complex and simplistic regression approaches. We also estimate a ranking of the importance of each FOQA variable used as input, and demonstrate that FOQA variables can reliably be used to identify different modes of fuel consumption, which may be useful in future work. Analysis indicates that fuel consumption is accurately predicted while remaining theoretically sensitive to sub-nominal pilot inputs and maintenance-related issues.
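    A hedged sketch of the Virtual Sensors idea as described: learn expected fuel consumption from other recorded parameters, then flag cases where measured consumption exceeds the prediction by a margin. The features here are synthetic stand-ins, not the FOQA variables used in the study:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(7)
    X = rng.normal(size=(1000, 4))       # stand-ins for FOQA parameters (e.g., altitude, speed)
    fuel = 2.0 + X @ np.array([0.5, 1.2, 0.3, 0.8]) + rng.normal(0, 0.1, 1000)

    # Train the "virtual sensor" on an assumed-nominal history...
    model = GradientBoostingRegressor().fit(X[:800], fuel[:800])

    # ...then monitor new cases, a few of which overconsume fuel.
    fuel_new = fuel[800:].copy()
    fuel_new[::40] += 1.0                # inject overconsumption cases
    residual = fuel_new - model.predict(X[800:])

    threshold = 3 * (fuel[:800] - model.predict(X[:800])).std()
    overconsumption = residual > threshold   # flag only exceedances above expectation
    print("Cases flagged:", int(overconsumption.sum()))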

  19. MKAD (Open Sourced Code)

    • catalog.data.gov
    • data.nasa.gov
    • +1more
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). MKAD (Open Sourced Code) [Dataset]. https://catalog.data.gov/dataset/mkad-open-sourced-code
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Area covered
    MKAD
    Description

    The Multiple Kernel Anomaly Detection (MKAD) algorithm is designed for anomaly detection over a set of files. It combines multiple kernels into a single optimization function using the One Class Support Vector Machine (OCSVM) framework. Any kernel function can be combined in the algorithm as long as it meets the Mercer conditions, however for the purposes of this code the data preformatting and kernel type is specific to the Flight Operations Quality Assurance (FOQA) data and has been integrated into the coding steps. For this domain, discrete binary switch sequences are used in the discrete kernel, and discretized continuous parameter features are used to form the continuous kernel. The OCSVM uses a training set of nominal examples (in this case flights) and evaluates test examples for anomaly detection to determine whether they are anomalous or not. After completing this analysis the algorithm reports the anomalous examples and determines whether there is a contribution from either or both continuous and discrete elements.
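    A sketch of the core idea (not the released MKAD code): build one kernel over binary switch features and one over discretized continuous features, combine them with a convex weighting (which preserves the Mercer property), and train a One-Class SVM on the precomputed kernel. All data here is synthetic:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(3)
    n_flights = 200
    continuous = rng.normal(size=(n_flights, 10))          # discretized continuous parameters
    discrete = rng.integers(0, 2, size=(n_flights, 25))    # binary switch indicators

    K_cont = rbf_kernel(continuous, gamma=0.1)
    K_disc = (discrete @ discrete.T) / discrete.shape[1]   # simple normalized overlap kernel
    K = 0.5 * K_cont + 0.5 * K_disc                        # weighted combination of the two kernels

    ocsvm = OneClassSVM(kernel="precomputed", nu=0.05).fit(K)
    labels = ocsvm.predict(K)                              # -1 marks anomalous flights
    print("Anomalous flights:", int((labels == -1).sum()))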

  20. Wind Turbine SCADA Data For Early Fault Detection

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 6, 2025
    Cite
    Christian Gück; Christian Gück; Cyriana Roelofs; Cyriana Roelofs (2025). Wind Turbine SCADA Data For Early Fault Detection [Dataset]. http://doi.org/10.5281/zenodo.10958775
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christian Gück; Christian Gück; Cyriana Roelofs; Cyriana Roelofs
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Note: Please check out Version 5 of this dataset since some labels have been corrected.

    This dataset is published together with the paper "CARE to Compare: A real-world dataset for anomaly detection in wind turbine data", which explains the dataset in detail and defines the CARE score that can be used to evaluate anomaly detection algorithms on this dataset. When referring to this dataset, please cite the paper mentioned in the related work section.

    The data consists of 95 datasets, containing 89 years of SCADA time series distributed across 36 different wind turbines from the three wind farms A, B, and C. The number of features depends on the wind farm: wind farm A has 86 features, wind farm B has 257 features, and wind farm C has 957 features.

    The overall dataset is balanced: 44 of the 95 datasets contain a labeled anomaly event that leads up to a turbine fault, and the other 51 datasets represent normal behavior. Additionally, the quality of the training data is ensured by turbine-status-based labels for each data point, and further information about some of the given turbine faults is included.

    The data for wind farm A is based on data from the EDP open data platform (https://www.edp.com/en/innovation/open-data/data) and consists of 5 wind turbines of an onshore wind farm in Portugal. It contains SCADA data and information derived from a given fault logbook which defines start timestamps for specified faults. From this data, 22 datasets were selected to be included in this data collection. The other two wind farms are offshore wind farms located in Germany. All three datasets were anonymized due to confidentiality reasons for wind farms B and C. Each dataset is provided in the form of a CSV file with columns defining the features and rows representing the data points of the time series.

    More detailed information can be found in the included README-file and in the publication corresponding to this dataset.
