90 datasets found
  1. Comparison of Unsupervised Anomaly Detection Methods

    • catalog.data.gov
    • +2 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Comparison of Unsupervised Anomaly Detection Methods [Dataset]. https://catalog.data.gov/dataset/comparison-of-unsupervised-anomaly-detection-methods
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Several different unsupervised anomaly detection algorithms have been applied to Space Shuttle Main Engine (SSME) data to serve the purpose of developing a comprehensive suite of Integrated Systems Health Management (ISHM) tools. As the theoretical bases for these methods vary considerably, it is reasonable to conjecture that the anomalies they detect may differ significantly as well. As such, it would be useful to apply a common metric with which to compare the results. However, for such a quantitative analysis to be statistically significant, a sufficient number of examples of both nominally categorized and anomalous data must be available. Due to the lack of sufficient examples of anomalous data, use of any statistics that rely upon a statistically significant sample of anomalous data is infeasible. Therefore, the main focus of this paper is to compare actual examples of anomalies detected by the algorithms via the sensors in which they appear, as well as the times at which they appear. We find that there is enough overlap in the anomalies detected by the different algorithms tested for them to corroborate the severity of these anomalies. In certain cases, the severity of these anomalies is supported by their categorization as failures by experts, with realistic physical explanations. For those anomalies that cannot be corroborated by at least one other method, this overlap says less about the severity of the anomaly and more about the methods' technical nuances, which will also be discussed.

  2. Controlled Anomalies Time Series (CATS) Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7646896
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset authored and provided by
    Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:

    4 Deliberate Actuations / Control Commands sent by a simulated operator/controller, for instance, commands to turn some equipment ON or OFF.

    3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

    10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, position, temperature, pressure, voltage, current, humidity, velocity, acceleration, etc.

    5 million timestamps. Sensor readings are sampled at 1 Hz.

    1 million nominal observations (the first 1 million datapoints). This is suitable for learning the "normal" behaviour.

    4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).

    200 anomalous segments. One anomalous segment may contain several successive anomalous observations/timestamps. Only the last 4 million observations contain anomalous segments.

    Different types of anomalies, to understand which anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.

    Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.

    Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithms in tracing anomalies back to the right root cause channel.

    Affected channels. In addition to the root cause channel in which the anomaly first developed, we provide information about the channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).

    Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect these obvious anomalies). However, during our initial experiments the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable for regular benchmark studies as well.

    Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being on or off is nominal, as is either switch position, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli to test the usefulness of context for detecting anomalies in this simulation.

    Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users can add any type and amplitude of noise on top of the provided series. This makes the dataset well suited to testing how sensitive and robust detection algorithms are to various levels of noise.

    No missing data. You can drop whatever data you want to assess the impact of missing values on your detector relative to a clean baseline.

    Change Log

    Version 2

    Metadata: we include a metadata.csv with information about:

    Anomaly categories

    Root cause channel (signal in which the anomaly is first visible)

    Affected channel (signal into which the anomaly might propagate through coupled system dynamics)

    Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.

    Two data files: CSV and parquet for convenience.
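
    A minimal loading sketch in Python, assuming pandas with a parquet engine is installed; the file names data.parquet and metadata.csv are hypothetical and the actual archive layout may differ:

    import pandas as pd

    # Hypothetical file names; check the archive for the actual ones.
    df = pd.read_parquet("data.parquet")
    meta = pd.read_csv("metadata.csv")

    # Per the description: the first 1 million rows are nominal, the
    # remaining 4 million contain the 200 anomalous segments.
    train = df.iloc[:1_000_000]   # learn the "normal" behaviour here
    test = df.iloc[1_000_000:]    # evaluate detection here
    print(train.shape, test.shape)
    print(meta.head())            # anomaly category, root cause and affected channels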

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  3. Comparison of Unsupervised Anomaly Detection Methods - Dataset - NASA Open...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Comparison of Unsupervised Anomaly Detection Methods - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/comparison-of-unsupervised-anomaly-detection-methods
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Several different unsupervised anomaly detection algorithms have been applied to Space Shuttle Main Engine (SSME) data to serve the purpose of developing a comprehensive suite of Integrated Systems Health Management (ISHM) tools. As the theoretical bases for these methods vary considerably, it is reasonable to conjecture that the anomalies they detect may differ significantly as well. As such, it would be useful to apply a common metric with which to compare the results. However, for such a quantitative analysis to be statistically significant, a sufficient number of examples of both nominally categorized and anomalous data must be available. Due to the lack of sufficient examples of anomalous data, use of any statistics that rely upon a statistically significant sample of anomalous data is infeasible. Therefore, the main focus of this paper is to compare actual examples of anomalies detected by the algorithms via the sensors in which they appear, as well as the times at which they appear. We find that there is enough overlap in the anomalies detected by the different algorithms tested for them to corroborate the severity of these anomalies. In certain cases, the severity of these anomalies is supported by their categorization as failures by experts, with realistic physical explanations. For those anomalies that cannot be corroborated by at least one other method, this overlap says less about the severity of the anomaly and more about the methods' technical nuances, which will also be discussed.

  4. Data from: Anomaly Detection in a Fleet of Systems

    • catalog.data.gov
    • s.cnmilf.com
    • +4 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Anomaly Detection in a Fleet of Systems [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-in-a-fleet-of-systems
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    A fleet is a group of systems (e.g., cars, aircraft) that are designed and manufactured the same way and are intended to be used the same way. For example, a fleet of delivery trucks may consist of one hundred instances of a particular model of truck, each of which is intended for the same type of service—almost the same amount of time and distance driven every day, approximately the same total weight carried, etc. For this reason, one may imagine that data mining for fleet monitoring may merely involve collecting operating data from the multiple systems in the fleet and developing some sort of model, such as a model of normal operation that can be used for anomaly detection. However, one then may realize that each member of the fleet will be unique in some ways—there will be minor variations in manufacturing, quality of parts, and usage. For this reason, the typical machine learning and statistics algorithm's assumption that all the data are independent and identically distributed is not correct. One may realize that data from each system in the fleet must be treated as unique so that one can notice significant changes in the operation of that system.
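
    To make the i.i.d. point concrete, here is a toy sketch (synthetic data, all values invented) contrasting a pooled fleet baseline with per-system baselines:

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy fleet: 5 systems, each with its own slightly different "normal" level.
    baselines = rng.normal(100.0, 5.0, size=5)
    data = baselines[:, None] + rng.normal(0.0, 1.0, size=(5, 1000))
    data[2, -1] += 4.0  # a genuine shift in system 2's behaviour

    # A pooled model mixes between-system variation into its spread...
    pooled_z = (data[2, -1] - data.mean()) / data.std()
    # ...while a per-system baseline isolates the change.
    per_system_z = (data[2, -1] - data[2, :-1].mean()) / data[2, :-1].std()
    print(f"pooled z: {pooled_z:.1f}, per-system z: {per_system_z:.1f}")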

  5. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Authors
    Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator/controller, for instance, commands to turn some equipment ON or OFF.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, position, temperature, pressure, voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are sampled at 1 Hz.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable for learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations/timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies, to understand which anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect these obvious anomalies). However, during our initial experiments the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable for regular benchmark studies as well.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being on or off is nominal, as is either switch position, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli to test the usefulness of context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users can add any type and amplitude of noise on top of the provided series. This makes the dataset well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector relative to a clean baseline.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  6. Data set for anomaly detection on a HPC system

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 19, 2023
    Cite
    Andrea Borghesi (2023). Data set for anomaly detection on a HPC system [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3251872
    Explore at:
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    Andrea Borghesi
    Andrea Bartolini
    Francesco Beneventi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bologna, Bologna, Italy) in the period March-May 2018.

    The data set has been used to train an autoencoder-based model to automatically detect anomalies in a semi-supervised fashion on a real HPC system.

    This work is described in:

    1) "Anomaly Detection using Autoencoders in High Performance Computing Systems", Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini, IAAI19 (proceedings in process) -- https://arxiv.org/abs/1902.08447

    2) "Online Anomaly Detection in HPC Systems", Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini, AICAS19 (proceedings in process) -- https://arxiv.org/abs/1811.05269

    See the git repository for usage examples & details --> https://github.com/AndreaBorghesi/anomaly_detection_HPC
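
    The semi-supervised idea (train an autoencoder on healthy data only, then flag inputs with high reconstruction error) can be sketched as follows. This is not the authors' model, just a minimal stand-in using scikit-learn and synthetic data:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(42)
    normal = rng.normal(0.0, 1.0, size=(2000, 16))  # stand-in for healthy node telemetry

    # "Autoencoder": learn to reproduce the input through a narrow hidden layer.
    ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=1000, random_state=0)
    ae.fit(normal, normal)

    def score(x):
        # Reconstruction error; large values suggest inputs unlike the training data.
        return np.mean((ae.predict(x) - x) ** 2, axis=1)

    threshold = np.percentile(score(normal), 99)
    anomalous = rng.normal(3.0, 1.0, size=(10, 16))
    print((score(anomalous) > threshold).mean())  # fraction flagged as anomalous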

  7. Controlled Anomalies Time Series (CATS) Dataset

    • kaggle.com
    Updated Sep 14, 2023
    + more versions
    Cite
    astro_pat (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/controlled-anomalies-time-series-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator/controller, for instance, commands to turn some equipment ON or OFF.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, position, temperature, pressure, voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are sampled at 1 Hz.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable for learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations/timestamps. Only the last 4 million observations contain anomalous segments.
      • Contamination level of 0.038. This means about 3.8% of the observations (rows) are anomalous.
    • Different types of anomalies, to understand which anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithms in tracing anomalies back to the right root cause channel.
    • Affected channels. In addition to the root cause channel in which the anomaly first developed, we provide information about the channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., eliminating algorithms that cannot detect these obvious anomalies). However, during our initial experiments the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable for regular benchmark studies as well.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example is a light and switch pair: the light being on or off is nominal, as is either switch position, but the switch being on with the light off should be considered anomalous. In the CATS dataset, users can choose whether to use the available context and external stimuli to test the usefulness of context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users can add any type and amplitude of noise on top of the provided series. This makes the dataset well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector relative to a clean baseline.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    The dataset provider, Solenix, is an international company providing software e...

  8. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any one location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
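
    The paper describes a distributed algorithm; as a rough, centralized illustration of its building block, here is a one-class SVM fit on a small random sample of the data (scikit-learn, synthetic data; this is not the paper's algorithm):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(1)
    data = rng.normal(0.0, 1.0, size=(50_000, 8))  # stand-in for one location's features

    # Mimic the "centralize only a small sample" idea with a random subsample.
    sample = data[rng.choice(len(data), size=1_000, replace=False)]
    clf = OneClassSVM(nu=0.01, kernel="rbf").fit(sample)

    outliers = rng.normal(5.0, 1.0, size=(10, 8))
    print(clf.predict(outliers))  # -1 marks predicted outliers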

  9. Labelled evaluation datasets of AIS Trajectories from Danish Waters for Abnormal Behavior Detection

    • data.dtu.dk
    Updated Jul 12, 2023
    + more versions
    Cite
    Kristoffer Vinther Olesen; Line Katrine Harder Clemmensen; Anders Nymark Christensen (2023). Labelled evaluation datasets of AIS Trajectories from Danish Waters for Abnormal Behavior Detection [Dataset]. http://doi.org/10.11583/DTU.21511815.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Kristoffer Vinther Olesen; Line Katrine Harder Clemmensen; Anders Nymark Christensen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This item is part of the collection "AIS Trajectories from Danish Waters for Abnormal Behavior Detection"

    DOI: https://doi.org/10.11583/DTU.c.6287841

    Using deep learning for detection of maritime abnormal behaviour in spatio-temporal trajectories is a relatively new and promising application. Open access to the Automatic Identification System (AIS) has made large amounts of maritime trajectories publicly available. However, these trajectories are unannotated when it comes to the detection of abnormal behaviour. The lack of annotated datasets for abnormality detection on maritime trajectories makes it difficult to evaluate and compare suggested models quantitatively. With this dataset, we attempt to provide a way for researchers to evaluate and compare performance.
    We have manually labelled trajectories which showcase abnormal behaviour following a collision accident. The annotated dataset consists of 521 data points with 25 abnormal trajectories. The abnormal trajectories include, among others: colliding vessels, vessels engaged in Search-and-Rescue activities, law enforcement, and commercial maritime traffic forced to deviate from its normal course.

    These datasets consist of labelled trajectories for the purpose of evaluating unsupervised models for detection of abnormal maritime behavior. For unlabelled datasets for training, please refer to the collection (link in Related publications).

    The dataset is an example of a SAR event and cannot be considered representative of a large population of all SAR events.

    The dataset consists of a total of 521 trajectories, of which 25 are labelled as abnormal. The data were captured on a single day in a specific region. The remaining normal traffic is representative of traffic during the winter season; normal traffic in the ROI has fairly high seasonality related to fishing and leisure sailing.

    The data is saved using the pickle format for Python. Each dataset is split into 2 files with naming convention:

    datasetInfo_XXX
    data_XXX

    Files named "data_XXX" contains the extracted trajectories serialized sequentially one at a time and must be read as such. Please refer to provided utility functions for examples. Files named "datasetInfo" contains Metadata related to the dataset and indecies at which trajectories begin in "data_XXX" files.

    The data are sequences of maritime trajectories defined by their timestamp, latitude/longitude position, speed, course, and unique ship identifier (MMSI). In addition, the dataset contains metadata related to creation parameters. The dataset has been limited to a specific time period, ship types, and moving AIS navigational statuses, and filtered within a region of interest (ROI). Trajectories were split if they exceeded an upper length limit, and short trajectories were discarded. All values are given as metadata in the dataset and used in the naming syntax.

    Naming syntax: data_AIS_Custom_STARTDATE_ENDDATE_SHIPTYPES_MINLENGTH_MAXLENGTH_RESAMPLEPERIOD.pkl

    See the datasheet for more detailed information, and refer to the provided utility functions for examples of how to read and plot the data.

  10. Data from: IMAD-DS: A Dataset for Industrial Multi-Sensor Anomaly Detection...

    • zenodo.org
    Updated Aug 28, 2024
    Cite
    Filippo Augusti; Davide Albertini; Kudret Esmer; Roberto Sannino; Alberto Bernardini (2024). IMAD-DS: A Dataset for Industrial Multi-Sensor Anomaly Detection Under Domain Shift Conditions [Dataset]. http://doi.org/10.5281/zenodo.12665499
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Filippo Augusti; Davide Albertini; Kudret Esmer; Roberto Sannino; Alberto Bernardini
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    IMAD-DS is a dataset developed for multi-rate, multi-sensor anomaly detection (AD) in industrial environments that considers varying operational and environmental conditions, known as domain shifts.

    Dataset Overview:

    This dataset includes data from two scaled industrial machines: a robotic arm and a brushless motor.

    It includes both normal and abnormal data recorded under various operating conditions to account for domain shifts.

    Robotic Arm: The robotic arm is a scaled version of a robotic arm used to move silicon wafers in a factory. Anomalies are created by removing bolts at the nodes of the arm, resulting in an imbalance in the machine.
    Brushless Motor: The brushless motor is a scaled representation of an industrial brushless motor. Two anomalies are introduced: first, a magnet is moved closer to the motor load, causing oscillations by interacting with two symmetrical magnets on the load; second, a belt that rotates in unison with the motor shaft is tightened, creating mechanical stress.

    The following domain shifts are included in the dataset:

    Operational Domain Shifts: Variations caused by changes in machine conditions (e.g., load changes for the robotic arm and speed changes for the brushless motor).

    Environmental Domain Shifts: Variations due to changes in background noise levels.

    Combinations of operating and environmental conditions divide each machine's dataset into two subsets: the source domain and the target domain. The source domain has a large number of training examples. The target domain, instead, has limited training data. This discrepancy highlights a common issue in the industry where sufficient training data is often unavailable for the target domain, as machine data is collected under controlled environments that do not fully represent the deployment environments.

    Data Collection and Processing:

    Data is collected using the STEVAL-STWINBX1 IoT Sensor Industrial Node. The sensors used to record the dataset are the following:

    · Analog Microphone (16 kHz)

    · 3-axis Accelerometer (6.7 kHz)

    · 3-axis Gyroscope (6.7 kHz)

    Recordings are conducted in an anechoic chamber to control acoustic conditions precisely.

    Data Format:
    Files are already divided into train and test sets. Inside each folder, each sensor's data is stored in a separate '.parquet' file.

    Sensor files related to the same segment of machine data share a unique ID. The mapping of each machine data segment to the sensor files is given in .csv files inside the train and test folders. Those .csv files also contain metadata denoting the operational and environmental conditions of a specific segment.
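
    A loading sketch under assumed file and column names (the actual archive layout, file names, and the name of the shared ID column may differ; pandas with a parquet engine is required):

    import pandas as pd

    meta = pd.read_csv("train/metadata.csv")              # segment IDs + conditions (assumed path)
    acc = pd.read_parquet("train/accelerometer.parquet")  # one sensor's data (assumed path)

    # Sensor files for the same machine-data segment share a unique ID, so
    # operational/environmental metadata can be joined onto each sensor table.
    acc = acc.merge(meta, on="segment_id")                # "segment_id" is an assumed column name
    print(acc.head())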

  11. Anomalous Action Detection Dataset

    • kaggle.com
    Updated Jun 18, 2024
    Cite
    sayantan roy 10121999 (2024). Anomalous Action Detection Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8717699
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sayantan roy 10121999
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    ### Ano-AAD Dataset: Comprehensive Anomalous Human Action Detection in Videos

    The Ano-AAD dataset is a groundbreaking resource designed to advance the field of anomaly detection in video surveillance. Compiled from an extensive array of sources, including popular social media platforms and various websites, this dataset captures a wide range of human behaviors, both normal and anomalous. By providing a rich and diverse set of video data, the Ano-AAD dataset is poised to significantly enhance the capabilities of surveillance systems and contribute to the development of more sophisticated safety protocols.

    #### Inception and Objective

    The primary objective behind the creation of the Ano-AAD dataset was to address the pressing need for a comprehensive, well-annotated collection of video footage that can be used to train and evaluate models for detecting anomalous human actions. Recognizing the limitations of existing datasets, which often lack diversity and sufficient examples of real-world scenarios, we embarked on a meticulous process to gather, annotate, and validate a diverse array of videos. Our goal was to ensure that the dataset encompasses a wide variety of environments and actions, thereby providing a robust foundation for the development of advanced anomaly detection algorithms.

    #### Data Collection Process

    The data collection process for the Ano-AAD dataset was both extensive and methodical. We identified and selected videos from various social media platforms, such as Facebook and YouTube, as well as other online sources. These videos were chosen to represent a broad spectrum of real-world scenarios, including both typical daily activities and less frequent, but critical, anomalous events. Each video was carefully reviewed to ensure it met our criteria for relevance, clarity, and authenticity.

    #### Categorization and Annotation

    A cornerstone of the Ano-AAD dataset is its detailed categorization and annotation of human actions. Each video clip was meticulously labeled to differentiate between normal activities—such as walking, sitting, and working—and anomalous behaviors, which include arrests, burglaries, explosions, fighting, fire raising, ill treatment, traffic irregularities, attacks, and other violent acts. This comprehensive annotation process was essential to creating a dataset that accurately reflects the complexities of real-world surveillance challenges. Our team of annotators underwent rigorous training to ensure consistency and reliability in the labeling process, and multiple rounds of validation were conducted to maintain high-quality annotations.

    #### Ethical Considerations

    Throughout the data collection and annotation process, we adhered to strict ethical guidelines and privacy regulations. All videos were sourced from publicly available content, and efforts were made to anonymize individuals to protect their privacy. We prioritized compliance with data protection principles, ensuring that our work not only advanced technological capabilities but also respected the rights and privacy of individuals depicted in the footage.

    #### Technical Specifications

    The Ano-AAD dataset comprises a total of 354 abnormal videos (8.7 GB, with a cumulative duration of 11 hours and 25 minutes) and 41 normal videos with a total duration of 41 minutes. Each video was processed to maintain a uniform format and resolution, typically standardized to MP4. This consistency in video quality ensures that the dataset can be seamlessly integrated into various machine learning models and computer vision algorithms, facilitating the development and testing of anomaly detection systems.

    #### Dataset Breakdown

    | Serial Number | Anomaly Class | Total Number of Videos | Size | Duration (HH:MM) |
    |---------------|------------------------|------------------------|----------|------------------|
    | 1 | Arrest | 49 | 1.7 GB | 2:10 |
    | 2 | Burglary | 48 | 948.7 MB | 1:26 |
    | 3 | Explosion | 49 | 773 MB | 1:01 |
    | 4 | Fighting | 50 | 2.0 GB | 2:23 |
    | 5 | Fire Raising | 49 | 999.4 MB | 1:20 |
    | 6 | Ill Treatment | 32 | 812.5 MB | 1:07 |
    | 7 | Traffic Irregularities | 13 | 79.3 MB | 0:05 |
    | 8 | Attack | 38 | 543.8 MB | 0:41 |
    | 9 | Violence | 26 | 836 MB | 1:08 |
    | Total | ...

  12. pyhydroqc Sensor Data QC: Single Site Example

    • search.dataone.org
    • hydroshare.org
    • +1 more
    Updated Dec 30, 2023
    Cite
    Amber Spackman Jones (2023). pyhydroqc Sensor Data QC: Single Site Example [Dataset]. http://doi.org/10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Amber Spackman Jones
    Time period covered
    Jan 1, 2017 - Dec 31, 2017
    Description

    This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.

    This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.

    Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when they exceed a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction:
    - Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable.
    - Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables.
    - Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest.
    - Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.

    The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.

    The anomaly detection and correction workflow involves the following steps:
    1. Retrieving data
    2. Applying rules-based detection to screen data and apply initial corrections
    3. Identifying and correcting sensor drift and calibration (if applicable)
    4. Developing a model (i.e., ARIMA or LSTM)
    5. Applying the model to make time series predictions
    6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
    7. Widening the window over which an anomaly is identified
    8. Aggregating detections resulting from multiple models
    9. Making corrections for anomalous events
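
    Steps 4 to 6 can be sketched generically with statsmodels (this is not the pyhydroqc API; synthetic data and an arbitrary threshold rule):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(7)
    obs = np.sin(np.linspace(0, 60, 600)) + rng.normal(0, 0.1, 600)
    obs[300] += 2.0  # injected anomalous spike

    fit = ARIMA(obs, order=(2, 0, 2)).fit()     # step 4: develop a model
    residuals = np.abs(obs - fit.predict())     # step 5: compare predictions to observations
    threshold = 4 * residuals.std()             # step 6: threshold the residuals
    print(np.where(residuals > threshold)[0])   # indices flagged as anomalous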

    Instructions to run the notebook through the CUAHSI JupyterHub:
    1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into the CUAHSI JupyterHub using your HydroShare credentials.
    2. Select 'Python 3.8 - Scientific' as the server and click Start.
    3. From your JupyterHub directory, click on the ExampleNotebook.ipynb file.
    4. Execute each cell in the code by clicking the Run button.

  13. Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Garske, Samuel (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Mao, Yiwei
    Garske, Samuel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

    They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

    How to Get Started

    All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, assuming it is in a folder called "data" next to the Python script:

    import numpy as np

    # Load image file
    hsi_array = np.load("data/beach_hsi.npy")
    n_pixels, n_lines, n_bands = hsi_array.shape
    print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

    # Load image mask
    mask_array = np.load("data/beach_mask.npy")
    m_pixels, m_lines = mask_array.shape
    print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

    Citing the Datasets

    If you use any of these datasets, please cite the following paper:

    @article{garske2024erx,
      title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
      author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
      journal={arXiv preprint arXiv:2408.14947},
      year={2024},
    }

    If you use the beach dataset please cite the following paper as well (original source):

    @article{mao2022openhsi,
      title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
      author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
      journal={Remote Sensing},
      volume={14},
      number={9},
      pages={2244},
      year={2022},
      publisher={MDPI}
    }

  14. Anomaly Detection in a Fleet of Systems - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Anomaly Detection in a Fleet of Systems - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/anomaly-detection-in-a-fleet-of-systems
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    A fleet is a group of systems (e.g., cars, aircraft) that are designed and manufactured the same way and are intended to be used the same way. For example, a fleet of delivery trucks may consist of one hundred instances of a particular model of truck, each of which is intended for the same type of service—almost the same amount of time and distance driven every day, approximately the same total weight carried, etc. For this reason, one may imagine that data mining for fleet monitoring may merely involve collecting operating data from the multiple systems in the fleet and developing some sort of model, such as a model of normal operation that can be used for anomaly detection. However, one then may realize that each member of the fleet will be unique in some ways—there will be minor variations in manufacturing, quality of parts, and usage. For this reason, the typical machine learning and statistics algorithm's assumption that all the data are independent and identically distributed is not correct. One may realize that data from each system in the fleet must be treated as unique so that one can notice significant changes in the operation of that system.

  15. Data from: Anomaly detection in the Zwicky Transient Facility DR3

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 15, 2022
    Cite
    Kornilov, Matwey (2022). Anomaly detection in the Zwicky Transient Facility DR3 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4318699
    Explore at:
    Dataset updated
    Aug 15, 2022
    Dataset provided by
    Aleo, Patrick
    Korolev, Vladimir
    Kornilov, Matwey
    Malanchev, Konstantin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This feature data set was extracted from ZTF DR3 light curves. It was used in Malanchev et al. 2020 to detect anomalous astrophysical sources in ZTF data.

    "feature_XXX.dat" files contain object-ordered light curve feature data, every object is built on 42 feature values, which are encoded as little endian single precision IEEE-754 float (32bit float) numbers. Feature code-names are the same for all three data sets and are listed in plain text files "feature_XXX.name", one code-name per line. "oid_XXX.dat" files contain ZTF DR object identifiers encoded as little endian 64-bit unsigned integer numbers. "oid_XXX.dat" and "feature_XXX.dat" have same object order, for example the first 8 bytes of "oid_m31.dat" files contain the OID of the ZTF DR3 light curve which feature are presented in the first 168 bytes of "feature_m31.dat" file. "m31", "deep" and "disk" denote different ZTF fields and contain 57 546, 406 611, 1 790 565 objects. Note that observations between 58194 ≤ MJD ≤ 58483 are used, see the paper for field and features details.

    The sample Python code to access the data as Numpy arrays:

    import numpy as np

    oid = np.memmap('oid_m31.dat', mode='r', dtype=np.uint64)
    with open('feature_m31.name') as f:
        names = f.read().split()
    dtype = [(name, np.float32) for name in names]
    feature = np.memmap('feature_m31.dat', mode='r', dtype=dtype, shape=oid.shape)

    idx = np.argmax(feature['amplitude'])
    print('Object {} has maximum amplitude {:.3f}'.format(oid[idx], feature['amplitude'][idx]))

  16. Official Datasets for LHC Olympics 2020 Anomaly Detection Challenge

    • zenodo.org
    Updated Feb 12, 2021
    Cite
    Gregor Kasieczka; Benjamin Nachman; David Shih (2021). Official Datasets for LHC Olympics 2020 Anomaly Detection Challenge [Dataset]. http://doi.org/10.5281/zenodo.4536624
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gregor Kasieczka; Benjamin Nachman; David Shih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the official datasets for the LHC Olympics 2020 Anomaly Detection Challenge. Each "black box" contains 1M events meant to be representative of actual LHC data. These events may include signal(s) and the challenge consists of finding these signals using the method of your choice. We have uploaded a total of THREE black boxes to be used for the challenge.

    In addition, we include a background sample of 1M events meant to aid in the challenge. The background sample consists of QCD dijet events simulated using Pythia8 and Delphes 3.4.1. Be warned that both the physics and the detector modeling for this simulation may not exactly reflect the "data" in the black boxes. For both background and black box data, events are selected using a single fat-jet (R=1) trigger with a pT threshold of 1.2 TeV.

    These events are stored as pandas dataframes saved to compressed h5 format. For each event, all reconstructed particles are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information such as particle charge is not included. Events are zero padded to constant size arrays of 700 particles. The array format is therefore (Nevents=1M, 2100).
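
    A reading sketch, assuming a file name like events_LHCO2020_BlackBox1.h5 (check the record for the exact names) and pandas with PyTables installed:

    import pandas as pd

    # Assumed file name; each official file holds a (1M, 2100) zero-padded array.
    events = pd.read_hdf("events_LHCO2020_BlackBox1.h5").to_numpy()

    # Each row is one event: 700 particles zero-padded, each as (pT, eta, phi).
    particles = events.reshape(len(events), 700, 3)
    pt = particles[:, :, 0]
    print(pt.shape, int((pt[0] > 0).sum()))  # non-padded particle count of the first event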

    For more information, including a complete description of the challenge and an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.

    UPDATE: November 23, 2020

    Now that the challenge is over, we have uploaded the solutions to Black Boxes 1 and 3. They are simple ASCII files (events_LHCO2020_BlackBox1.masterkey and events_LHCO2020_BlackBox3.masterkey) where each line is the truth label -- 0 for background and 1 (and 2 in the case of BB3) for signal -- of each event in the corresponding h5 files (same ordering). For more information about the solutions, please visit the LHCO2020 webpage.

    UPDATE: February 11, 2021

    We have uploaded the Delphes detector cards and Pythia command files used to produce the Black Box datasets.

  17. Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any one location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  18. MVTec AD Dataset

    • datasetninja.com
    • kaggle.com
    Updated Jun 20, 2019
    Cite
    Paul Bergmann; Kilian Batzner; Michael Fauser (2019). MVTec AD Dataset [Dataset]. https://datasetninja.com/mvtec-ad
    Explore at:
    Dataset updated
    Jun 20, 2019
    Dataset provided by
    Dataset Ninja
    Authors
    Paul Bergmann; Kilian Batzner; Michael Fauser
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The authors of MVTec AD, the MVTec Anomaly Detection dataset, addressed the critical task of detecting anomalous structures within natural image data, a crucial aspect of computer vision applications. To facilitate the development of methods for unsupervised anomaly detection, they introduced the MVTec AD dataset, comprising 5354 high-resolution color images across various object and texture categories. The dataset comprises both normal images, intended for training, and images with anomalies, designed for testing. The anomalies span over 70 distinct types of defects, including scratches, dents, contaminations, and structural alterations. The authors also provide pixel-precise ground truth annotations for all anomalies.
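
    A browsing sketch, assuming the dataset's commonly used per-category folder layout (train/good for normal images, test/<defect_type> for anomalous ones); the paths are illustrative:

    from pathlib import Path
    from PIL import Image

    root = Path("mvtec_ad/bottle")  # illustrative category path
    train = [Image.open(p) for p in sorted((root / "train" / "good").glob("*.png"))]
    defect_types = [d.name for d in (root / "test").iterdir() if d.is_dir()]
    print(len(train), defect_types)  # number of normal training images and defect types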

  19. Z

    Industrial screw driving dataset collection: Time series data for process...

    • data.niaid.nih.gov
    Updated Feb 18, 2025
    + more versions
    Cite
    West, Nikolai (2025). Industrial screw driving dataset collection: Time series data for process monitoring and anomaly detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14729547
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    West, Nikolai
    Deuse, Jochen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Industrial Screw Driving Datasets

    Overview

    This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.

    Scenario name | Number of work pieces | Repetitions (screw cycles) per workpiece | Individual screws per workpiece | Observations | Unique classes | Purpose
    s01_thread-degradation | 100 | 25 | 2 | 5,000 | 1 | Investigation of thread degradation through repeated fastening
    s02_surface-friction | 250 | 25 | 2 | 12,500 | 8 | Surface friction effects on screw driving operations
    s03_error-collection-1 | n/a | 1 | 2 | n/a | 20 | Various manipulations of the screw driving process
    s04_error-collection-2 | 2,500 | 1 | 2 | 5,000 | 25 | Various manipulations of the screw driving process
    s05_injection-molding-manipulations-upper-workpiece | 1,200 | 1 | 2 | 2,400 | 44 | Investigation of changes in the injection molding process of the workpieces

    Dataset Collection

    The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:

    Currently Available Datasets:

    1. s01_thread-degradation

    Focus: Investigation of thread degradation through repeated fastening

    Samples: 5,000 screw operations (4,089 normal, 911 faulty)

    Features: Natural degradation patterns, no artificial error induction

    Equipment: Delta PT 40x12 screws, thermoplastic components

    Process: 25 cycles per location, two locations per workpiece

    First published in: HICSS 2024 (West & Deuse, 2024)

    2. s02_surface-friction

    Focus: Surface friction effects on screw driving operations

    Samples: 12,500 screw operations (9,512 normal, 2,988 faulty)

    Features: Eight distinct surface conditions (baseline to mechanical damage)

    Equipment: Delta PT 40x12 screws, thermoplastic components, surface treatment materials

    Process: 25 cycles per location, two locations per workpiece

    First published in: CIE51 2024 (West & Deuse, 2024)

    3. s05_injection-molding-manipulations-upper-workpiece

    Focus: Manipulations of the injection molding process, with no changes during tightening

    Samples: 2,400 screw operations (2,397 normal, 3 faulty)

    Features: 44 classes in five distinct groups:

    Mold temperature

    Glass fiber content

    Recyclate content

    Switching point

    Injection velocity

    Equipment: Delta PT 40x12 screws, thermoplastic components

    Unpublished, work in progress

    Upcoming Datasets:

    1. s03_screw-error-collection-1 (recorded but unpublished)

    Focus: Various manipulations of the screw driving process

    Features: More than 20 different errors recorded

    First published in: Publication planned

    Status: In preparation

    2. s04_screw-error-collection-2 (recorded but unpublished)

    Focus: Various manipulations of the screw driving process

    Features: 25 distinct errors recorded over the course of a week

    First published in: Publication planned

    Status: In preparation

    3. s06_injection-molding-manipulations-lower-workpiece (recorded but unpublished)

    Focus: Manipulations of the injection molding process, with no changes during tightening

    Additional scenarios may be added to this collection as they become available.

    Data Format

    Each dataset follows a standardized structure:

    JSON files containing individual screw operation data

    CSV files with operation metadata and labels

    Comprehensive documentation in README files

    Example code for data loading and processing is available in the companion library PyScrew; a minimal loading sketch is shown below
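    By way of illustration, here is a minimal sketch of reading one extracted scenario with only the standard library and pandas. The per-class subdirectories under json/ and the label.csv file follow the layout noted in the change log below; the exact field names inside each JSON record are assumptions and should be checked against the scenario's README.

    ```python
    import json
    from pathlib import Path

    import pandas as pd

    # Placeholder path to one extracted scenario archive.
    scenario = Path("s01_thread-degradation")

    # Operation metadata and labels (default file name per the change log).
    labels = pd.read_csv(scenario / "label.csv")
    print(labels.head())

    # Raw data: one JSON file per screw operation, organized in
    # subdirectories per class under json/ (per the change log).
    first_file = next((scenario / "json").rglob("*.json"))
    with open(first_file) as f:
        run = json.load(f)

    # Inspect the available fields; torque and angle time series are
    # among the recorded parameters. Exact key names are not assumed here.
    print(sorted(run.keys()))
    ```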

    Research Applications

    These datasets are suitable for various research purposes:

    Machine learning model development and validation

    Process monitoring and control systems

    Quality assurance methodology development

    Manufacturing analytics research

    Anomaly detection algorithm benchmarking

    Usage Notes

    All datasets include both normal operations and process anomalies

    Complete time series data for torque, angle, and additional parameters available

    Detailed documentation of experimental conditions and setup

    Data collection procedures and equipment specifications available

    Access and Citation

    These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.

    Related Tools

    We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries, and common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.

    Contact and Support

    For questions, issues, or collaboration interests regarding these datasets, either:

    Open an issue in our GitHub repository PyScrew

    Contact us directly via email

    Acknowledgments

    These datasets were collected and prepared by:

    RIF Institute for Research and Transfer e.V.

    University of Kassel, Institute of Material Engineering

    Technical University Dortmund, Institute for Production Systems

    The preparation and provision of this research were supported by:

    German Ministry of Education and Research (BMBF)

    European Union's "NextGenerationEU" program

    The research is part of this funding program; more information regarding the research project is available here

    Change Log

    Version Date Features

    v1.1.3 18.02.2025

    • Upload of s05 with injection molding manipulations in 44 classes

    v1.1.2 12.02.2025

    • Change to default names label.csv and README.md in all scenarios

    v1.1.1 12.02.2025

    • Reupload of both s01 and s02 as zip (smaller size) and tar (faster extraction) files

    • Change to the data structure (now organized as subdirectories per class in json/)

    v1.1.0 30.01.2025

    • Initial upload of the second scenario s02_surface-friction

    v1.0.0 24.01.2025

    • Initial upload of the first scenario s01_thread-degradation
  20. Bank Transaction Dataset for Fraud Detection

    • kaggle.com
    Updated Nov 4, 2024
    Cite
    vala khorasani (2024). Bank Transaction Dataset for Fraud Detection [Dataset]. https://www.kaggle.com/datasets/valakhorasani/bank-transaction-dataset-for-fraud-detection
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    vala khorasani
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.

    Key Features:

    • TransactionID: Unique alphanumeric identifier for each transaction.
    • AccountID: Unique identifier for each account, with multiple transactions per account.
    • TransactionAmount: Monetary value of each transaction, ranging from small everyday expenses to larger purchases.
    • TransactionDate: Timestamp of each transaction, capturing date and time.
    • TransactionType: Categorical field indicating 'Credit' or 'Debit' transactions.
    • Location: Geographic location of the transaction, represented by U.S. city names.
    • DeviceID: Alphanumeric identifier for devices used to perform the transaction.
    • IP Address: IPv4 address associated with the transaction, with occasional changes for some accounts.
    • MerchantID: Unique identifier for merchants, showing preferred and outlier merchants for each account.
    • AccountBalance: Balance in the account post-transaction, with logical correlations based on transaction type and amount.
    • PreviousTransactionDate: Timestamp of the last transaction for the account, aiding in calculating transaction frequency.
    • Channel: Channel through which the transaction was performed (e.g., Online, ATM, Branch).
    • CustomerAge: Age of the account holder, with logical groupings based on occupation.
    • CustomerOccupation: Occupation of the account holder (e.g., Doctor, Engineer, Student, Retired), reflecting income patterns.
    • TransactionDuration: Duration of the transaction in seconds, varying by transaction type.
    • LoginAttempts: Number of login attempts before the transaction, with higher values indicating potential anomalies.

    This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
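    As a starting point, here is a minimal unsupervised scoring sketch under the assumption that the CSV export uses the column names listed above; the file name and the contamination rate are placeholders, not properties of the dataset.

    ```python
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Placeholder file name for the downloaded CSV export.
    df = pd.read_csv("bank_transactions.csv")

    # Numeric columns from the feature list above.
    features = df[["TransactionAmount", "TransactionDuration",
                   "LoginAttempts", "CustomerAge"]]

    # Isolation Forest flags isolated points as anomalies; the assumed
    # 2% contamination rate is an arbitrary starting value to tune.
    iso = IsolationForest(contamination=0.02, random_state=42)
    df["anomaly"] = iso.fit_predict(features)  # -1 marks potential anomalies

    print(df.loc[df["anomaly"] == -1,
                 ["TransactionID", "TransactionAmount", "LoginAttempts"]].head())
    ```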
