39 datasets found
  1. Satellite telemetry data anomaly prediction

    • kaggle.com
    Updated Apr 17, 2025
    + more versions
    Cite
    Orvile (2025). Satellite telemetry data anomaly prediction [Dataset]. https://www.kaggle.com/datasets/orvile/satellite-telemetry-data-anomaly-prediction
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Orvile
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OPSSAT-AD - anomaly detection dataset for satellite telemetry

    This is the AI-ready benchmark dataset (OPSSAT-AD) containing telemetry data acquired on board OPS-SAT, a CubeSat mission operated by the European Space Agency.

    It is accompanied by a paper reporting baseline results from 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test split introduced in that work, and the authors suggest a set of quality metrics that should always be reported when benchmarking new anomaly detection algorithms on OPSSAT-AD. The goal is a fair, reproducible, and objective validation procedure that quantifies the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.

    The included files are listed below; a minimal loading sketch follows the list.

    • segments.csv with the raw telemetry signals acquired from the ESA OPS-SAT spacecraft,
    • dataset.csv with the extracted features computed for each manually split and labeled telemetry segment,
    • code files for data processing and example modeling (dataset_generator.ipynb for data processing, modeling_examples.ipynb with simple examples, requirements.txt with details on the Python configuration, and the LICENSE file).
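
    A minimal loading sketch, assuming pandas is installed, the two CSV files are in the working directory, and no particular column names:

    import pandas as pd

    # Load the raw telemetry segments and the per-segment feature table.
    segments = pd.read_csv("segments.csv")
    features = pd.read_csv("dataset.csv")

    print(segments.shape, features.shape)
    print(features.head())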
    

    Citation: Ruszczak, B. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15108715

  2. Anomaly Detection in High-Dimensional Data

    • tandf.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles (2023). Anomaly Detection in High-Dimensional Data [Dataset]. http://doi.org/10.6084/m9.figshare.12844508.v2
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.

  3. syslrn: Learning What to Monitor for Efficient Anomaly Detection [Dataset]

    • zenodo.org
    bin, zip
    Updated Mar 24, 2022
    Cite
    Davide Sanvito; Giuseppe Siracusano; Sharan Santhanam; Roberto Gonzalez; Roberto Bifulco (2022). syslrn: Learning What to Monitor for Efficient Anomaly Detection [Dataset]. http://doi.org/10.5281/zenodo.6374398
    Dataset updated
    Mar 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Davide Sanvito; Giuseppe Siracusano; Sharan Santhanam; Roberto Gonzalez; Roberto Bifulco
    Description

    This repository includes the dataset for the paper:

    D. Sanvito, G. Siracusano, S. Santhanam, R. Gonzalez, R. Bifulco
    syslrn: Learning What to Monitor for Efficient Anomaly Detection
    ACM EuroMLSys 2022

    The dataset contains two directories at the root level:

    • raw_dataset
    • processed_dataset

    Each folder in the raw_dataset directory contains the raw monitoring data used to generate the graph associated with a single experiment, together with additional metadata files.
    Each folder in the processed_dataset directory contains the graph associated with a single experiment as a set of three CSV files: two for the graph edges (pid_childof_pid_df.csv and pid_speakswith_pid_df.csv) and one for the graph nodes (proc_df.csv).
    We provide below a code snippet to parse a graph from the processed_dataset directory.

    In both folders the name of each sub-folder is based on the following schema: [SCENARIO]_[W]wl/test_[TEST_ID] where:

    • [SCENARIO] reports the target component for the failure injection (cinder_failure, neutron_failure, nova_failure). ff indicates instead a failure-free execution
    • [W] reports the number of concurrent workloads
    • [TEST_ID] reports the ID of the specific failure scenario injected (same ID selected by the OpenStack failure injection framework [1] )

    Each experiment includes the following data in the raw_dataset sub-folders:

    • audit_raw_logs_[TEST_ID]/: raw audit monitoring data
    • bpf_tools_[TEST_ID]/: raw ebpf tools monitoring data
    • instance-[INSTANCE_ID]/: workload-specific metadata files, e.g. stdout/stderr (generated by the OpenStack failure injection framework [1] )
    • logs_workload_[TEST_ID]/: OpenStack application logs
    • perf_tools_[TEST_ID]/: raw perf tools monitoring data
    • audit_filtered_[TEST_ID].log: audit data pre-processed by ausearch (e.g. numerical entities are resolved to symbols)
    • failure_[TEST_ID].info: metadata information about the specific failure scenario (generated by the OpenStack failure injection framework [1] )
    • timestamps_[TEST_ID]: timing information

    [1] D. Cotroneo, L. De Simone, P. Liguori, R. Natella, N. Bidokhti - How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform [ACM ESEC/FSE 2019]

    Example: parsing a graph from processed_dataset directory

    import pandas as pd
    import networkx as nx
    
    def parse_csv(path):
      # path must end with a trailing slash, e.g. 'processed_dataset/ff_1wl/test_1/'.
      # Nodes: one row per process.
      processes_df = pd.read_csv('%sproc_df.csv' % path, index_col=0).reset_index(drop=True)
    
      # Edges: "speaksWith" (communication) and "childOf" (process hierarchy) relations,
      # tagged with their relation type and concatenated into a single edge table.
      speakswith_edges_df = pd.read_csv('%spid_speakswith_pid_df.csv' % path, index_col=0)
      speakswith_edges_df['type'] = 'speaksWith'
    
      childof_edges_df = pd.read_csv('%spid_childof_pid_df.csv' % path, index_col=0)
      childof_edges_df['type'] = 'childOf'
    
      return processes_df, pd.concat([speakswith_edges_df, childof_edges_df], ignore_index=True)
    
    def make_graph(nodes_df, edges_df):
      # Build an undirected multigraph: nodes keyed by pid (with all columns as attributes),
      # edges carrying their relation type.
      G = nx.MultiGraph()
    
      for _, node in nodes_df.iterrows():
        G.add_node(node.pid, **node)
    
      for _, edge in edges_df.iterrows():
        G.add_edge(edge.pid1, edge.pid2, type=edge.type)
    
      return G
    
    PATH = 'processed_dataset/ff_1wl/test_1/'
    nodes_df, edges_df = parse_csv(PATH)
    G = make_graph(nodes_df, edges_df)
    nx.draw_networkx(G, node_size=10, with_labels=False)  # requires matplotlib

    If you use this dataset for your research, please cite the following paper:

    @inproceedings{sanvito2022syslrn,
      title={syslrn: Learning What to Monitor for Efficient Anomaly Detection},
      author={Sanvito, Davide and Siracusano, Giuseppe and Santhanam, Sharan and Gonzalez, Roberto and Bifulco, Roberto},
      booktitle={2nd European Workshop on Machine Learning and Systems (EuroMLSys '22)},
      year={2022},
      address = {Rennes, France},
      publisher = {ACM},
      month = apr,
    }
    
  4. f

    Data from: Nonparametric Anomaly Detection on Time Series of Graphs

    • tandf.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Dorcas Ofori-Boateng; Yulia R. Gel; Ivor Cribben (2023). Nonparametric Anomaly Detection on Time Series of Graphs [Dataset]. http://doi.org/10.6084/m9.figshare.13180181.v3
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Dorcas Ofori-Boateng; Yulia R. Gel; Ivor Cribben
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: that is, network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, particularly, two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.

  5. Tennessee Eastman Process Simulation Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2020
    Cite
    Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
    Dataset updated
    Feb 9, 2020
    Authors
    Sergei Averkiev
    Description

    Intro

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Content

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.
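
    The files can also be read from Python; a minimal sketch using the third-party pyreadr package is shown below (the file name is illustrative, and the key follows the variable names listed above).

    import pyreadr

    # Read one .RData file; pyreadr returns a dict-like object keyed by R object name.
    result = pyreadr.read_r("TEP_FaultFree_Training.RData")  # illustrative file name
    fault_free_training = result["fault_free_training"]      # pandas DataFrame with 55 columns

    # Columns 1-3 are metadata; columns 4-55 are the TEP process variables.
    print(fault_free_training[["faultNumber", "simulationRun", "sample"]].head())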

    Acknowledgements

    This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

    User Agreement

    By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

    Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

    When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

    This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

  6. Data from: PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY SCORES IN ONE-CLASS BAYES LEARNING FOR ANOMALY DETECTION

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY SCORES IN ONE-CLASS BAYES LEARNING FOR ANOMALY DETECTION [Dataset]. https://catalog.data.gov/dataset/probability-calibration-by-the-minimum-and-maximum-probability-scores-in-one-class-bayes-l
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY SCORES IN ONE-CLASS BAYES LEARNING FOR ANOMALY DETECTION. GUICHONG LI, NATHALIE JAPKOWICZ, IAN HOFFMAN, R. KURT UNGAR. ABSTRACT. One-class Bayes learning, such as one-class Naïve Bayes and one-class Bayesian Networks, applies Bayes learning to build a classifier on the positive class only in order to discriminate between the positive and negative classes. It has been applied to anomaly detection for identifying abnormal behaviors that deviate from normal behaviors. Because one-class Bayes classifiers produce probability scores, which can be used to define anomaly scores for anomaly detection, they are preferable in many practical applications compared with other one-class learning techniques. However, previously proposed one-class Bayes classifiers may suffer from poor probability estimation when negative training examples are unavailable. In this paper, we propose a new method to improve the probability estimation. The improved one-class Bayes classifiers can exhibit high performance compared with previously proposed one-class Bayes classifiers according to our empirical results.

  7. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B. (2023). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B.
    Description

    User Agreement, Public Domain Dedication, and Disclaimer of Liability. By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms. The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission. In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights. Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law. When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work. This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

    Description: This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files. Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.

    Acknowledgments: This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

  8. Data from: Anomaly detection in the Zwicky Transient Facility DR3

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Aug 15, 2022
    Cite
    Konstantin Malanchev; Matwey Kornilov; Patrick Aleo; Vladimir Korolev (2022). Anomaly detection in the Zwicky Transient Facility DR3 [Dataset]. http://doi.org/10.5281/zenodo.4318700
    Dataset updated
    Aug 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Konstantin Malanchev; Matwey Kornilov; Patrick Aleo; Vladimir Korolev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The feature data set extracted from ZTF DR3 light curves. It was used in Malanchev et al. 2020 to detect anomalous astrophysical sources in ZTF data.

    "feature_XXX.dat" files contain object-ordered light curve feature data, every object is built on 42 feature values, which are encoded as little endian single precision IEEE-754 float (32bit float) numbers. Feature code-names are the same for all three data sets and are listed in plain text files "feature_XXX.name", one code-name per line. "oid_XXX.dat" files contain ZTF DR object identifiers encoded as little endian 64-bit unsigned integer numbers. "oid_XXX.dat" and "feature_XXX.dat" have same object order, for example the first 8 bytes of "oid_m31.dat" files contain the OID of the ZTF DR3 light curve which feature are presented in the first 168 bytes of "feature_m31.dat" file. "m31", "deep" and "disk" denote different ZTF fields and contain 57 546, 406 611, 1 790 565 objects. Note that observations between 58194 ≤ MJD ≤ 58483 are used, see the paper for field and features details.

    The sample Python code to access the data as Numpy arrays:

    import numpy as np
    
    # Object IDs: little-endian 64-bit unsigned integers, one per object.
    oid = np.memmap('oid_m31.dat', mode='r', dtype=np.uint64)
    
    # Feature code-names (one per line) define a structured dtype of 42 float32 fields.
    with open('feature_m31.name') as f:
      names = f.read().split()
    dtype = [(name, np.float32) for name in names]
    
    # Features share the object order of the OID file.
    feature = np.memmap('feature_m31.dat', mode='r', dtype=dtype, shape=oid.shape)
    
    idx = np.argmax(feature['amplitude'])
    print('Object {} has maximum amplitude {:.3f}'.format(oid[idx], feature['amplitude'][idx]))

  9. R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge

    • zenodo.org
    • explore.openaire.eu
    bin
    Updated Apr 17, 2022
    + more versions
    Cite
    Gregor Kasieczka; Ben Nachman; David Shih (2022). R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge [Dataset]. http://doi.org/10.5281/zenodo.4287694
    Dataset updated
    Apr 17, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gregor Kasieczka; Ben Nachman; David Shih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the first R&D dataset for the LHC Olympics 2020 Anomaly Detection Challenge. It consists of 1M QCD dijet events and 100k W'->XY events, with X->qq and Y->qq. The W', X, and Y masses are 3.5 TeV, 500 GeV and 100 GeV respectively. The events are produced using Pythia8 and Delphes 3.4.1, with no pileup or MPI included. They are selected using a single fat-jet (R=1) trigger with pT threshold of 1.2 TeV.

    The events are randomly shuffled together, but for the purposes of testing and development, we provide the user with a signal/background truth bit for each event. Obviously, the truth bit will not be included in the actual challenge.

    These events are stored as pandas dataframes saved to compressed h5 format. For each event, all Delphes reconstructed particles in the event are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information such as particle charge is not included. Events are zero padded to constant size arrays of 700 particles, with the truth bit appended at the end. The array format is therefore (Nevents=1.1M, 2101).
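
    As a hedged sketch of that layout (700 zero-padded particles times (pT, eta, phi), plus a trailing truth bit), the file can be read with pandas; the file name and the single-key assumption below are illustrative.

    import pandas as pd

    df = pd.read_hdf("events_anomalydetection_v2.h5")  # illustrative file name
    events = df.values                                 # shape (N_events, 2101)

    particles = events[:, :-1].reshape(-1, 700, 3)     # (pT, eta, phi) per zero-padded particle
    truth = events[:, -1]                              # 1 = signal, 0 = background
    print(particles.shape, truth.mean())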

    For more information, including an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.

    https://lhco2020.github.io/homepage/

    UPDATE May 18 2020

    We have uploaded a second signal dataset for R&D, consisting of 100k W'->XY with X,Y->qqq (i.e. 3-prong substructure). Everything else about this signal dataset (particle masses, trigger, Pythia configuration, detector simulation) is the same as the previous one described above.

    UPDATE November 23 2020

    We now include high-level feature files for the background and 2-prong signal (events_anomalydetection_v2.features.h5) and for the 3-prong signal (events_anomalydetection_Z_XY_qqq.features.h5). To produce the features, we have clustered every event into R=1 jets using the anti-kT algorithm. The features (calculated using fastjet plugins) are the 3-momenta, invariant masses, and n-jettiness variables tau1, tau2 and tau3 for the highest pT jet (j1) and the second highest pT jet (j2):

    'pxj1', 'pyj1', 'pzj1', 'mj1', 'tau1j1', 'tau2j1', 'tau3j1', 'pxj2', 'pyj2', 'pzj2', 'mj2', 'tau1j2', 'tau2j2', 'tau3j2'

    The rows (events) in each feature file should be ordered exactly the same as in their corresponding raw event file. For convenience, we have also included the label (1 for signal and 0 for background) as an additional column in the first feature file (events_anomalydetection_v2.features.h5).

  10. coldChainDataA

    • figshare.com
    bin
    Updated Mar 10, 2025
    + more versions
    Cite
    Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo (2025). coldChainDataA. [Dataset]. http://doi.org/10.1371/journal.pone.0315322.s001
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Anomaly detection is widely used in cold chain logistics (CCL). However, because of high costs and technical limitations, anomaly detection performance is often poor and anomalies cannot be detected in time, which affects the quality of goods. To address these problems, the paper presents a new anomaly detection scheme for CCL. First, the characteristics of the collected CCL data are analyzed, a mathematical model of the data flow is established, and the sliding window and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormal judgment conditions based on the correlation coefficient ρjk are derived. A measurement anomaly detection algorithm based on an improved isolation forest algorithm is proposed; subsampling and a cross factor are designed to overcome the shortcomings of the standard isolation forest (iForest). Experiments show that as the dimensionality of the data increases, the performance indicators of the new scheme, such as P (precision), R (recall), F1 score, and AUC (area under the curve), become increasingly superior to commonly used support vector machines (SVM), local outlier factor (LOF), and iForest. Its average P is 0.8784, average R is 0.8731, average F1 score is 0.8639, and average AUC is 0.9064. However, the execution time of the improved algorithm is slightly longer than that of iForest.
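
    For orientation, a minimal scikit-learn sketch of the standard isolation forest that the paper improves on is given below; it is not the authors' modified algorithm (the subsampling and cross-factor changes are not reproduced), and the data here is synthetic.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal = rng.normal(0, 1, size=(1000, 4))      # stand-in for normal cold-chain readings
    abnormal = rng.normal(6, 1, size=(20, 4))      # stand-in for anomalous readings
    X = np.vstack([normal, abnormal])

    clf = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)
    scores = -clf.score_samples(X)                 # higher score = more anomalous
    print("mean score, normal vs anomalous:", scores[:1000].mean(), scores[1000:].mean())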

  11. Spark Data containing logs and metrics (KPIs) for Hades

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 24, 2023
    Cite
    Cheryl Lee (2023). Spark Data containing logs and metrics (KPIs) for Hades [Dataset]. http://doi.org/10.5281/zenodo.7609780
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cheryl Lee
    Description

    Please make sure to cite our paper whenever you use the data in your research:

    @inproceedings{DBLP:conf/icse/LeeYCSYL23,
      author    = {Cheryl Lee and Tianyi Yang and Zhuangbin Chen and Yuxin Su and Yongqiang Yang and Michael R. Lyu},
      title     = {Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention},
      booktitle = {45th {IEEE/ACM} International Conference on Software Engineering, {ICSE} 2023, Melbourne, Australia, May 14-20, 2023},
      pages     = {1724--1736},
      publisher = {{IEEE}},
      year      = {2023},
      url       = {https://doi.org/10.1109/ICSE48619.2023.00148},
      doi       = {10.1109/ICSE48619.2023.00148},
      timestamp = {Wed, 19 Jul 2023 10:09:12 +0200},
      biburl    = {https://dblp.org/rec/conf/icse/LeeYCSYL23.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

  12. Data from: Election Forensics: Using Machine Learning and Synthetic Data for Possible Election Anomaly Detection

    • dataverse.harvard.edu
    tsv, txt +1
    Updated Oct 14, 2019
    Cite
    Harvard Dataverse (2019). Election Forensics: Using Machine Learning and Synthetic Data for Possible Election Anomaly Detection [Dataset]. http://doi.org/10.7910/DVN/YZRJWD
    Dataset updated
    Oct 14, 2019
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This replication package replicates the findings reported in Mali Zhang, R. Michael Alvarez, and Ines Levin, “Election Forensics: Using Machine Learning and Synthetic Data for Possible Election Anomaly Detection.” Forthcoming in PLOS ONE.

  13. Code underlying: Privacy-Preserving Membership Queries for Federated Anomaly Detection

    • data.4tu.nl
    zip
    Updated Oct 28, 2023
    Cite
    Jelle Vos; Sikha Pentyala; Steven Golob; Ricardo José Menezes Maia; Dean Kelley; Zekeriya Erkin; Martine De Cock; Anderson Nascimento (2023). Code underlying: Privacy-Preserving Membership Queries for Federated Anomaly Detection [Dataset]. http://doi.org/10.4121/4e1739c5-f743-47cc-aa01-df52481e3fb3.v1
    Dataset updated
    Oct 28, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Jelle Vos; Sikha Pentyala; Steven Golob; Ricardo José Menezes Maia; Dean Kelley; Zekeriya Erkin; Martine De Cock; Anderson Nascimento
    License

    Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    Privacy-Preserving Feature Extraction for Detection of Anomalous Financial Transactions

    This repository holds the code written by the PPMLHuskies for the 2nd place solution in the PETs Prize Challenge, Track A.

    Description

    The task is to predict probabilities for anomalous transactions from a synthetic database of international transactions and several synthetic databases of banking account information. We provide two solutions. One solution, our centralized approach, found in `solution_centralized.py`, uses the transactions database (PNS) and the banking database with no privacy protections. The second solution, which provides the robust privacy guarantees outlined in our report, follows a federated architecture, found in `solution_federated.py` and model.py. In this approach, PNS data resides in one client, banking data is divided up across other clients, and an aggregator handles all the communication between any clients. We have built in privacy protections so that clients and the aggregator learn minimal information about each other, while engaging in communication to detect anomalous transactions in PNS.

    The way in which we conduct training and inference in both the centralized and the federated architectures is fundamentally the same (other than the privacy protections in the latter). Several new features are engineered from the given PNS data. Then a model is trained on those features from PNS. Next, during inference, a check is made to determine if attributes from a PNS transaction match the banking data, or if the associated account in the banking data is flagged. If any of these attributes are amiss, we give it a value of 1, and a 0 otherwise. Lastly, we take the maximum of the inferred probability from the PNS model and the result of the banking data validation, which is used as our final prediction for the probability that the transaction is anomalous.

    The difference between the federated and centralized logic is that in the federated setup, where there are one or multiple partitions of the banking data across clients, the PNS client engages in a cryptographic protocol based on homomorphic encryption with the banking clients, routed through the aggregator, to perform feature extraction. This protocol, which ensures privacy and that PNS does not learn anything from the banks beyond the set membership of a select few features, is carried out over several rounds r, where r = 7 + n and n is the number of bank clients.
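
    The decision rule described above can be summarized in a short sketch; the names below are illustrative rather than taken from the repository.

    def final_anomaly_probability(pns_model_probability, banking_check_flag):
        # banking_check_flag is 1 if any checked attribute mismatches the banking
        # data or the associated account is flagged, and 0 otherwise.
        return max(pns_model_probability, float(banking_check_flag))

    print(final_anomaly_probability(0.12, 1))  # -> 1.0
    print(final_anomaly_probability(0.12, 0))  # -> 0.12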

  14. Test Sets for Jet Anomaly Detection at the LHC

    • zenodo.org
    bin
    Updated Mar 26, 2021
    + more versions
    Cite
    Taoli Cheng (2021). Test Sets for Jet Anomaly Detection at the LHC [Dataset]. http://doi.org/10.5281/zenodo.3901833
    Dataset updated
    Mar 26, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Taoli Cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description

    These datasets are generated as a series of test sets for anomalous jet tagging at the LHC. They include boosted W jets, Top jets, and Higgs jets. Jet transverse momentum is centered around 600 GeV and 1200 GeV (files with the prefix "pt1200_" in their names). Each file includes 100k original events from MadGraph, but may contain slightly fewer events in the final h5 files due to fat-jet pre-selection. Production processes include:

    • pp -> W' -> W (jj) Z(\( u u\)); \(m_{W} = 59, 80, 120, 174 ~GeV\)
    • pp -> Z' -> t t~; \(m_t=80, 174 ~GeV\)
    • pp -> HH -> (hh) (hh), (h -> bb); \(m_H=174~GeV\), \(m_h = 20, 80 ~GeV\)

    Data Generation

    Jet samples in this dataset are generated with MadGraph, Pythia8 and Delphes (no pile-up effects simulated). Particle-flow objects are clustered into jets with FastJet, using the anti-kt algorithm with cone size R=1.0.

    • Leading jet: \(p_T>450 \textrm{GeV}\); sub-leading jet: \(p_T>200 \textrm{GeV}\)

    Data Structure

    • To get jets: f['objects/jets']
    • For jets, there are two datasets: ['constituents', 'obs']. (jets information is stored with higher-pt jet first)
      • `obs[:, n_j - 1]`: jet four vectors and n-subjettiness for \(n_j\)-th jet (pt, eta, phi, m, tau1, tau2, tau3, tau4, tau5)
      • pt-sorted (highest first) jet constituents information are stored in variable length arrays for \(n_j\)-th jet `constituents[:, n_j - 1]`: \(\{ E_i, P_{xi}, P_{yi}, P_{zi}, \textrm{PID}_i\}\) (PID: PDG for tracks; [22] for photon; [0] for neutral hadron)

    Extra Notes

    • Since the dataset is structured as events, for W jet samples only the leading jet is available, while for Top and Higgs jets both the leading and sub-leading jets are valid. One might need to restrict the jet \(p_T\) range when using the samples.
    • e.g. to get leading jet constituents: `f["objects/jets/constituents"][:,0]` (see the h5py sketch below)
    • The file names are self-explanatory regarding the corresponding generation process.
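
    A hedged h5py sketch of the access pattern above; the file name is illustrative, and the obs array is assumed to have shape (events, jets, 9) with the axis order stated above.

    import h5py

    with h5py.File("pt1200_higgs.h5", "r") as f:                  # illustrative file name
        obs = f["objects/jets/obs"][:]                            # (pt, eta, phi, m, tau1..tau5) per jet
        lead_constituents = f["objects/jets/constituents"][:, 0]  # leading-jet constituents (variable length)

    lead_pt = obs[:, 0, 0]  # leading-jet pT under the assumed axis order
    print(obs.shape, lead_pt[:5])
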
  15. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI)

    • zenodo.org
    • elki-project.github.io
    • +1more
    application/gzip
    Updated May 2, 2024
    Cite
    Erich Schubert; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erich Schubert; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2022
    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
    Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
    In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek
    Evaluation of Multiple Clustering Solutions
    In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
    On Evaluation of Outlier Rankings and Outlier Scores
    In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    • Object number: Sparse 1000-dimensional vectors that give the true object assignment. Files: objs.arff.gz
    • RGB color histograms: Standard RGB color histograms (uniform binning). Files: aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz
    • HSV color histograms: Standard HSV/HSB color histograms in various binnings. Files: aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz
    • Color similarity: Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black). Files: aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
    • Haralick features: First 13 Haralick features (radius 1 pixel). Files: aloi-haralick-1.csv.gz
    • Front to back: Vectors representing front faces vs. back faces of individual objects. Files: front.arff.gz
    • Basic light: Vectors indicating basic light situations. Files: light.arff.gz
    • Manual annotations: Manually annotated object groups of semantically related objects such as cups. Files: manual1.arff.gz

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    • RGB histograms, downsampled to 100000 objects (553 outliers). Files: aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
    • RGB histograms, downsampled to 75000 objects (717 outliers). Files: aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
    • RGB histograms, downsampled to 50000 objects (1508 outliers). Files: aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz
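
    A minimal sketch for loading one of the .csv.gz feature files with pandas; the delimiter and column layout are not asserted here (sep=None lets pandas sniff the separator, header=None avoids assuming a header row), so check the result against the ELKI dataset page.

    import pandas as pd

    aloi = pd.read_csv("aloi-27d.csv.gz", sep=None, engine="python", header=None)
    print(aloi.shape)
    print(aloi.head())
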
  16. Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings

    • data.mendeley.com
    • ieee-dataport.org
    • +1more
    Updated Apr 22, 2021
    + more versions
    Cite
    Nahuel González (2021). Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings [Dataset]. http://doi.org/10.17632/94dwkbxf2d.1
    Dataset updated
    Apr 22, 2021
    Authors
    Nahuel González
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". Source data contains CSV files with dataset results summaries, false positives lists, the evaluated sentences, and their keystroke timings. Results data contains training and evaluation ARFF files for each user and sentence with the calculated Manhattan and euclidean distance, R metric, and the directionality index for each challenge instance. The source data comes from three free text keystroke dynamics datasets used in previous studies, by the authors (LSIA) and two other unrelated groups (KM, and PROSODY, subdivided in GAY, GUN, and REVIEW). Two different languages are represented, Spanish in LSIA and English in KM and PROSODY.

    The original dataset KM was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original dataset PROSODY was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.

    We proposed a method to determine, using only flight times (keydown/keydown), whether a medium-sized candidate list of possible texts includes the one to which the timings belong. Neither the text length nor the candidate text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased between two- and three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive with current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by resorting to general population data.
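
    A minimal sketch for loading one of the ARFF files from Python; the file name is a placeholder, and scipy's ARFF reader returns a structured array plus metadata that converts directly to a DataFrame.

    import pandas as pd
    from scipy.io import arff

    data, meta = arff.loadarff("example_user_sentence.arff")  # placeholder file name
    df = pd.DataFrame(data)
    print(meta.names())
    print(df.head())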

  17. The pseudocode of the length calculation

    • plos.figshare.com
    xls
    Updated Mar 10, 2025
    Cite
    Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo (2025). The pseudocode of the length calculation. [Dataset]. http://doi.org/10.1371/journal.pone.0315322.t003
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Anomaly detection is widely used in cold chain logistics (CCL). However, because of high costs and technical limitations, anomaly detection performance is often poor and anomalies cannot be detected in time, which affects the quality of goods. To address these problems, the paper presents a new anomaly detection scheme for CCL. First, the characteristics of the collected CCL data are analyzed, a mathematical model of the data flow is established, and the sliding window and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormal judgment conditions based on the correlation coefficient ρjk are derived. A measurement anomaly detection algorithm based on an improved isolation forest algorithm is proposed; subsampling and a cross factor are designed to overcome the shortcomings of the standard isolation forest (iForest). Experiments show that as the dimensionality of the data increases, the performance indicators of the new scheme, such as P (precision), R (recall), F1 score, and AUC (area under the curve), become increasingly superior to commonly used support vector machines (SVM), local outlier factor (LOF), and iForest. Its average P is 0.8784, average R is 0.8731, average F1 score is 0.8639, and average AUC is 0.9064. However, the execution time of the improved algorithm is slightly longer than that of iForest.

  18. Oil Palm Tree Detection for Anomaly Identification

    • data.mendeley.com
    Updated Mar 10, 2025
    + more versions
    Cite
    Anderson Dominguez Meza (2025). Oil Palm Tree Detection for Anomaly Identification [Dataset]. http://doi.org/10.17632/nh7d23dgnw.1
    Dataset updated
    Mar 10, 2025
    Authors
    Anderson Dominguez Meza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset supports an advanced artificial vision system for detecting anomalies in oil palm (Elaeis guineensis) crops. It consists of RGB images captured using a DJI Phantom 4 Multispectral UAV. The dataset is labeled into two main classes: 'PalmSan' (healthy palms) and 'PalmAnom' (anomalous palms). It was used to train and validate a Faster R-CNN with ResNet-50 FPN model, fine-tuned in PyTorch. The dataset plays a crucial role in high-accuracy classification for automated disease detection and stress assessment, contributing to scalable and sustainable precision agriculture solutions.
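
    For orientation, the standard torchvision fine-tuning pattern for a Faster R-CNN ResNet-50 FPN detector with these two classes is sketched below (recent torchvision assumed; dataset loading and the training loop are omitted).

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    num_classes = 3  # background + PalmSan + PalmAnom
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    model.train()  # ready to fine-tune on the labeled UAV images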

  19. PowerBench Dataset – Part 3: Cyber Attacks on EVCS

    • zenodo.org
    bin
    Updated May 14, 2025
    Cite
    Roshni Anna Jacob; Md. Joshem Uddin; Damilola R Olojede; Baris Coskunuzer; Jie Zhang (2025). PowerBench Dataset – Part 3: Cyber Attacks on EVCS [Dataset]. http://doi.org/10.5281/zenodo.15401290
    Dataset updated
    May 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Roshni Anna Jacob; Md. Joshem Uddin; Damilola R Olojede; Baris Coskunuzer; Jie Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PowerBench: EVCS Cyber Attack Datasets for Power Distribution Networks

    This dataset is part of the PowerBench benchmark suite designed to support machine learning research in resilient and secure power distribution networks. It includes one out of the three types of cyberattacks modeled on IEEE 34-bus, 123-bus, and 8500-node test feeders:

    EVCS Attacks

    • Adversarial manipulation of the charging behavior of grid-connected electric vehicle charging stations (EVCS).
    • Suitable for learning-based intrusion detection and localization of compromised EVCSs.

    Each attack dataset contains .pkl simulation files, .gml grid topology, and scenario metadata. All simulations were generated using OpenDSS via OpenDSSDirect.py.
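
    A minimal loading sketch (file names are placeholders based on the contents described above): the .pkl simulation files can be opened with pickle and the .gml topology with networkx.

    import pickle
    import networkx as nx

    with open("scenario_0001.pkl", "rb") as f:   # placeholder simulation file
        simulation = pickle.load(f)

    grid = nx.read_gml("ieee123_topology.gml")   # placeholder grid topology file
    print(type(simulation), grid.number_of_nodes())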

    Please refer to the included README.md for detailed task guidance and loading instructions.

  20. Replication data for: Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data

    • dataverse.harvard.edu
    Updated Nov 28, 2007
    Cite
    Walter R. Mebane; Jasjeet S. Sekhon (2007). Replication data for: Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data [Dataset]. http://doi.org/10.7910/DVN/RDXADE
    Dataset updated
    Nov 28, 2007
    Dataset provided by
    Harvard Dataverse
    Authors
    Walter R. Mebane; Jasjeet S. Sekhon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    1993 - 2000
    Description

    We develop a robust estimator, the hyperbolic tangent (tanh) estimator, for overdispersed multinomial regression models of count data. The tanh estimator provides accurate estimates and reliable inferences even when the specified model is not good for as much as half of the data. Seriously ill-fitted counts, or outliers, are identified as part of the estimation. A Monte Carlo sampling experiment shows that the tanh estimator produces good results at practical sample sizes even when ten percent of the data are generated by a significantly different process. The experiment shows that, with contaminated data, estimation fails using four other estimators: the non-robust maximum likelihood estimator, the additive logistic model and two SUR models. Using the tanh estimator to analyze data from Florida for the 2000 presidential election matches well-known features of the election that the other four estimators fail to capture. In an analysis of data from the 1993 Polish parliamentary election, the tanh estimator gives sharper inferences than does a previously proposed heteroskedastic SUR model.
