39 datasets found

Satellite telemetry data anomaly prediction
kaggle.com
Updated Apr 17, 2025
Cite
Orvile (2025). Satellite telemetry data anomaly prediction [Dataset]. https://www.kaggle.com/datasets/orvile/satellite-telemetry-data-anomaly-prediction
Dataset updated
Apr 17, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Orvile
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
OPSSAT-AD - anomaly detection dataset for satellite telemetry

This is the AI-ready benchmark dataset (OPSSAT-AD) containing the telemetry data acquired on board OPS-SAT, a CubeSat mission that has been operated by the European Space Agency.

It is accompanied by a paper with baseline results obtained using 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test dataset split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when comparing new anomaly detection algorithms on OPSSAT-AD. We believe that this work may become an important step toward building a fair, reproducible, and objective validation procedure that can be used to quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.

The included files are:

segments.csv, with the telemetry signals acquired from the ESA OPS-SAT spacecraft

dataset.csv, with the synthetic features extracted for each manually split and labeled telemetry segment

code files for data processing and example modeling: dataset_generator.ipynb (data processing), modeling_examples.ipynb (simple examples), requirements.txt (details on the Python configuration), and the LICENSE file

Citation Bogdan, R. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Ruszczak. https://doi.org/10.5281/zenodo.15108715
Anomaly Detection in High-Dimensional Data
tandf.figshare.com
txt
Updated May 30, 2023
Cite
Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles (2023). Anomaly Detection in High-Dimensional Data [Dataset]. http://doi.org/10.6084/m9.figshare.12844508.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12844508.v2
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.
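The scoring idea in the description above (a k-nearest-neighbor distance chosen at the maximum gap) can be illustrated with a small sketch. This is only one plausible reading of that sentence, not the stray package's implementation: the function name `max_gap_knn_scores` and the toy points are hypothetical, and the extreme-value-theory threshold step is omitted entirely.

```python
import math

def max_gap_knn_scores(points, k=3):
    """Score each point by the k-NN distance at which the largest gap occurs."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to its k nearest neighbours (excluding itself).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        # Gaps between successive neighbour distances, starting from zero.
        padded = [0.0] + dists
        gaps = [b - a for a, b in zip(padded, padded[1:])]
        # The score is the neighbour distance where the largest gap appears.
        scores.append(dists[gaps.index(max(gaps))])
    return scores

# Toy data (hypothetical): a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = max_gap_knn_scores(points, k=3)
outlier = scores.index(max(scores))
```

In this toy case the isolated point receives by far the largest score; a real analysis would still need the thresholding step described in the article to decide which scores count as anomalous.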
syslrn: Learning What to Monitor for Efficient Anomaly Detection [Dataset]
zenodo.org
bin, zip
Updated Mar 24, 2022
Cite
Davide Sanvito; Giuseppe Siracusano; Sharan Santhanam; Roberto Gonzalez; Roberto Bifulco (2022). syslrn: Learning What to Monitor for Efficient Anomaly Detection [Dataset]. http://doi.org/10.5281/zenodo.6374398
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.6374398
Dataset updated
Mar 24, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Davide Sanvito; Giuseppe Siracusano; Sharan Santhanam; Roberto Gonzalez; Roberto Bifulco
Description
This repository includes the dataset for the paper:

D. Sanvito, G. Siracusano, S. Santhanam, R. Gonzalez, R. Bifulco
syslrn: Learning What to Monitor for Efficient Anomaly Detection
ACM EuroMLSys 2022

The dataset contains two directories at the root level:

raw_dataset

processed_dataset

Each folder in the raw_dataset directory contains the raw monitoring data used to generate the graph associated with a single experiment, together with additional metadata files.
Each folder in the processed_dataset directory contains the graph associated with a single experiment as a set of three CSV files: two for the graph edges (pid_childof_pid_df.csv and pid_speakswith_pid_df.csv) and one for the graph nodes (proc_df.csv).
We provide below a code snippet to parse a graph from the processed_dataset directory.

In both folders the name of each sub-folder is based on the following schema: [SCENARIO]_[W]wl/test_[TEST_ID] where:

[SCENARIO] reports the target component for the failure injection (cinder_failure, neutron_failure, nova_failure). ff indicates instead a failure-free execution

[W] reports the number of concurrent workloads

[TEST_ID] reports the ID of the specific failure scenario injected (same ID selected by the OpenStack failure injection framework [1] )
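The naming schema above can be decoded with a short helper. This is a minimal sketch, assuming the sub-folder name follows the schema literally; the function name `parse_experiment_path` and the assumption that [TEST_ID] contains no slash are illustrative and not part of the dataset documentation.

```python
import re

# Pattern for "[SCENARIO]_[W]wl/test_[TEST_ID]" with the scenario names listed above.
# Assumption: TEST_ID is any run of characters without a slash.
SCHEMA = re.compile(
    r"^(?P<scenario>ff|cinder_failure|neutron_failure|nova_failure)"
    r"_(?P<workloads>\d+)wl/test_(?P<test_id>[^/]+)/?$"
)

def parse_experiment_path(relative_path):
    """Split a sub-folder name into scenario, workload count, and test ID."""
    match = SCHEMA.match(relative_path)
    if match is None:
        raise ValueError("path does not follow the schema: %r" % relative_path)
    return {
        "scenario": match["scenario"],
        "workloads": int(match["workloads"]),
        "test_id": match["test_id"],
        "failure_free": match["scenario"] == "ff",
    }

info = parse_experiment_path("ff_1wl/test_1/")
```

The path passed in is relative to raw_dataset or processed_dataset, so the same helper applies to both directories.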

Each experiment includes the following data in the raw_dataset sub-folders:

audit_raw_logs_[TEST_ID]/: raw audit monitoring data

bpf_tools_[TEST_ID]/: raw ebpf tools monitoring data

instance-[INSTANCE_ID]/: workload-specific metadata files, e.g. stdout/stderr (generated by the OpenStack failure injection framework [1] )

logs_workload_[TEST_ID]/: OpenStack application logs

perf_tools_[TEST_ID]/: raw perf tools monitoring data

audit_filtered_[TEST_ID].log: audit data pre-processed by ausearch (e.g. numerical entities are resolved to symbols)

failure_[TEST_ID].info: metadata information about the specific failure scenario (generated by the OpenStack failure injection framework [1] )

timestamps_[TEST_ID]: timing information

[1] D. Cotroneo, L. De Simone, P. Liguori, R. Natella, N. Bidokhti - How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform [ACM ESEC/FSE 2019]

Example: parsing a graph from the processed_dataset directory

import pandas as pd
import networkx as nx


def parse_csv(path):
    # Node table: one row per process.
    processes_df = pd.read_csv('%sproc_df.csv' % path, index_col=0).reset_index(drop=True)

    # Edge tables: tag each edge with its relation type before merging.
    speakswith_edges_df = pd.read_csv('%spid_speakswith_pid_df.csv' % path, index_col=0)
    speakswith_edges_df['type'] = 'speaksWith'

    childof_edges_df = pd.read_csv('%spid_childof_pid_df.csv' % path, index_col=0)
    childof_edges_df['type'] = 'childOf'

    return processes_df, pd.concat([speakswith_edges_df, childof_edges_df], ignore_index=True)


def make_graph(nodes_df, edges_df):
    # MultiGraph allows both edge types between the same pair of processes.
    G = nx.MultiGraph()

    for _, node in nodes_df.iterrows():
        G.add_node(node.pid, **node)

    for _, edge in edges_df.iterrows():
        G.add_edge(edge.pid1, edge.pid2, type=edge.type)

    return G


PATH = 'processed_dataset/ff_1wl/test_1/'
nodes_df, edges_df = parse_csv(PATH)
G = make_graph(nodes_df, edges_df)
nx.draw_networkx(G, node_size=10, with_labels=False)

If you use this dataset for your research, please cite the following paper:

@inproceedings{sanvito2022syslrn,
  title={syslrn: Learning What to Monitor for Efficient Anomaly Detection},
  author={Sanvito, Davide and Siracusano, Giuseppe and Santhanam, Sharan and Gonzalez, Roberto and Bifulco, Roberto},
  booktitle={2nd European Workshop on Machine Learning and Systems (EuroMLSys '22)},
  year={2022},
  address = {Rennes, France},
  publisher = {ACM},
  month = apr,
}
Data from: Nonparametric Anomaly Detection on Time Series of Graphs
tandf.figshare.com
zip
Updated May 31, 2023
Cite
Dorcas Ofori-Boateng; Yulia R. Gel; Ivor Cribben (2023). Nonparametric Anomaly Detection on Time Series of Graphs [Dataset]. http://doi.org/10.6084/m9.figshare.13180181.v3
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.13180181.v3
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Dorcas Ofori-Boateng; Yulia R. Gel; Ivor Cribben
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: that is, network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, particularly, two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.
Tennessee Eastman Process Simulation Dataset
kaggle.com
zip
Updated Feb 9, 2020
Cite
Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
Available download formats: zip (1370814903 bytes)
Dataset updated
Feb 9, 2020
Authors
Sergei Averkiev
Description
Intro

This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

Content

Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

Each dataframe contains 55 columns:

Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

Columns 4 to 55 contain the process variables; the column names retain the original variable names.
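The timing figures above follow directly from the 3-minute sampling interval, which the short check below makes explicit. It is a sketch of the arithmetic only; the helper names are hypothetical, and whether the first faulty row is the sample at the onset or the one after it depends on whether sample 1 is taken at time zero, which the description does not state.

```python
SAMPLE_INTERVAL_MIN = 3  # sampling interval stated in the description

def duration_hours(n_samples):
    """Total simulated time covered by n_samples at the 3-minute interval."""
    return n_samples * SAMPLE_INTERVAL_MIN / 60

def samples_before_fault(onset_hours):
    """Number of sampling intervals that elapse before the fault is introduced."""
    return onset_hours * 60 // SAMPLE_INTERVAL_MIN

training_hours = duration_hours(500)        # 25.0
testing_hours = duration_hours(960)         # 48.0
training_onset = samples_before_fault(1)    # 20
testing_onset = samples_before_fault(8)     # 160
```

When labeling rows of the faulty dataframes, these onsets give the approximate 'sample' index up to which a faulty run still behaves like normal operation.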

Acknowledgements

This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

User Agreement

By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
Data from: PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY...
catalog.data.gov
datasets.ai
+2 more
Updated Apr 11, 2025
Cite
Dashlink (2025). PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY SCORES IN ONE-CLASS BAYES LEARNING FOR ANOMALY DETECTION [Dataset]. https://catalog.data.gov/dataset/probability-calibration-by-the-minimum-and-maximum-probability-scores-in-one-class-bayes-l
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY SCORES IN ONE-CLASS BAYES LEARNING FOR ANOMALY DETECTION

GUICHONG LI, NATHALIE JAPKOWICZ, IAN HOFFMAN, R. KURT UNGAR

ABSTRACT. One-class Bayes learning, such as one-class Naïve Bayes and one-class Bayesian Network, employs Bayes learning to build a classifier on the positive class only, for discriminating the positive class from the negative class. It has been applied to anomaly detection for identifying abnormal behaviors that deviate from normal behaviors. Because one-class Bayes classifiers can produce probability scores, which can be used to define an anomaly score for anomaly detection, they are preferable in many practical applications to other one-class learning techniques. However, previously proposed one-class Bayes classifiers might suffer from poor probability estimation when negative training examples are unavailable. In this paper, we propose a new method to improve the probability estimation. According to our empirical results, the improved one-class Bayes classifiers exhibit high performance compared with previously proposed one-class Bayes classifiers.
Additional Tennessee Eastman Process Simulation Data for Anomaly Detection...
dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B. (2023). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
Unique identifier
https://doi.org/10.7910/DVN/6C3JR1
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B.
Description
User Agreement, Public Domain Dedication, and Disclaimer of Liability.

By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

Description

This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

Each dataframe contains 55 columns:

Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

Columns 4 to 55 contain the process variables; the column names retain the original variable names.

Acknowledgments.

This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
Data from: Anomaly detection in the Zwicky Transient Facility DR3
zenodo.org
data.niaid.nih.gov
bin
Updated Aug 15, 2022
Cite
Konstantin Malanchev; Matwey Kornilov; Patrick Aleo; Vladimir Korolev (2022). Anomaly detection in the Zwicky Transient Facility DR3 [Dataset]. http://doi.org/10.5281/zenodo.4318700
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.4318700
Dataset updated
Aug 15, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Konstantin Malanchev; Matwey Kornilov; Patrick Aleo; Vladimir Korolev
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the feature data set extracted from ZTF DR3 light curves. It was used in Malanchev et al. 2020 to detect anomalous astrophysical sources in ZTF data.

"feature_XXX.dat" files contain object-ordered light curve feature data. Each object is described by 42 feature values, which are encoded as little-endian single-precision IEEE-754 (32-bit) floating-point numbers. Feature code-names are the same for all three data sets and are listed in the plain text files "feature_XXX.name", one code-name per line. "oid_XXX.dat" files contain ZTF DR object identifiers encoded as little-endian 64-bit unsigned integers. "oid_XXX.dat" and "feature_XXX.dat" have the same object order: for example, the first 8 bytes of the "oid_m31.dat" file contain the OID of the ZTF DR3 light curve whose features are stored in the first 168 bytes of the "feature_m31.dat" file. "m31", "deep" and "disk" denote different ZTF fields and contain 57 546, 406 611 and 1 790 565 objects, respectively. Note that only observations with 58194 ≤ MJD ≤ 58483 are used; see the paper for field and feature details.

The sample Python code to access the data as Numpy arrays:

import numpy as np

# Object identifiers: little-endian unsigned 64-bit integers.
oid = np.memmap('oid_m31.dat', mode='r', dtype='<u8')

# Feature code-names, one per line.
with open('feature_m31.name') as f:
    names = f.read().split()

# One record per object: a little-endian 32-bit float for each feature.
dtype = [(name, '<f4') for name in names]
feature = np.memmap('feature_m31.dat', mode='r', dtype=dtype, shape=oid.shape)

idx = np.argmax(feature['amplitude'])
print('Object {} has maximum amplitude {:.3f}'.format(oid[idx], feature['amplitude'][idx]))
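The byte layout described above (8 bytes per object identifier, 168 bytes per feature record) can also be checked with the standard library alone. The sketch below uses small synthetic buffers in place of the real files; the helper name `read_object` and the dummy values are hypothetical.

```python
import struct

N_FEATURES = 42
OID_FORMAT = "<Q"                    # little-endian unsigned 64-bit integer
FEATURE_FORMAT = f"<{N_FEATURES}f"   # 42 little-endian 32-bit floats

OID_SIZE = struct.calcsize(OID_FORMAT)         # bytes per identifier
RECORD_SIZE = struct.calcsize(FEATURE_FORMAT)  # bytes per feature record

def read_object(oid_bytes, feature_bytes, index):
    """Return the identifier and feature tuple of the object at position index."""
    (oid,) = struct.unpack_from(OID_FORMAT, oid_bytes, index * OID_SIZE)
    features = struct.unpack_from(FEATURE_FORMAT, feature_bytes, index * RECORD_SIZE)
    return oid, features

# Synthetic stand-ins for oid_XXX.dat and feature_XXX.dat holding two objects.
oid_bytes = struct.pack("<2Q", 1001, 1002)
feature_bytes = struct.pack("<84f", *[float(i) for i in range(84)])
oid, features = read_object(oid_bytes, feature_bytes, 1)

# Size implied for feature_m31.dat by the 57 546 objects quoted above.
expected_m31_size = 57_546 * RECORD_SIZE
```

Comparing `expected_m31_size` with the size of the downloaded file is a quick way to confirm that a download is complete before memory-mapping it.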
R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge
zenodo.org
explore.openaire.eu
bin
Updated Apr 17, 2022
Cite
Gregor Kasieczka; Ben Nachman; David Shih (2022). R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge [Dataset]. http://doi.org/10.5281/zenodo.4287694
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.4287694
Dataset updated
Apr 17, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gregor Kasieczka; Ben Nachman; David Shih
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the first R&D dataset for the LHC Olympics 2020 Anomaly Detection Challenge. It consists of 1M QCD dijet events and 100k W'->XY events, with X->qq and Y->qq. The W', X, and Y masses are 3.5 TeV, 500 GeV and 100 GeV respectively. The events are produced using Pythia8 and Delphes 3.4.1, with no pileup or MPI included. They are selected using a single fat-jet (R=1) trigger with pT threshold of 1.2 TeV.

The events are randomly shuffled together, but for the purposes of testing and development, we provide the user with a signal/background truth bit for each event. Obviously, the truth bit will not be included in the actual challenge.

These events are stored as pandas dataframes saved to compressed h5 format. For each event, all Delphes reconstructed particles in the event are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information such as particle charge is not included. Events are zero padded to constant size arrays of 700 particles, with the truth bit appended at the end. The array format is therefore (Nevents=1.1M, 2101).
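The row width of 2101 follows from 700 particles times 3 coordinates plus 1 truth bit. The sketch below unpacks one such row into particle triples and a label. It assumes the coordinates are interleaved per particle as (pT, eta, phi), which the description does not spell out, and the function name `split_event` and the sample row are hypothetical. Plain lists are used to keep the example dependency-free; with the real files a NumPy reshape would be the usual choice.

```python
N_PARTICLES = 700
ROW_LENGTH = 3 * N_PARTICLES + 1  # 2100 coordinates plus the truth bit

def split_event(row):
    """Return the non-padded (pT, eta, phi) triples and the truth bit of one event."""
    if len(row) != ROW_LENGTH:
        raise ValueError("expected %d values, got %d" % (ROW_LENGTH, len(row)))
    label = int(row[-1])
    # Assumed layout: pT, eta, phi repeated for each of the 700 particle slots.
    particles = [tuple(row[i:i + 3]) for i in range(0, 3 * N_PARTICLES, 3)]
    # Zero-padded slots have pT == 0 and carry no particle.
    return [p for p in particles if p[0] > 0], label

# Hypothetical event with two particles, zero padding, and a signal label.
row = [100.0, 0.5, 1.0, 50.0, -0.2, 2.0] + [0.0] * (ROW_LENGTH - 7) + [1.0]
particles, label = split_event(row)
```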

For more information, including an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.

https://lhco2020.github.io/homepage/

UPDATE May 18 2020

We have uploaded a second signal dataset for R&D, consisting of 100k W'->XY with X,Y->qqq (i.e. 3-prong substructure). Everything else about this signal dataset (particle masses, trigger, Pythia configuration, detector simulation) is the same as the previous one described above.

UPDATE November 23 2020

We now include high-level feature files for the background and 2-prong signal (events_anomalydetection_v2.features.h5) and for the 3-prong signal (events_anomalydetection_Z_XY_qqq.features.h5). To produce the features, we have clustered every event into R=1 jets using the anti-kT algorithm. The features (calculated using fastjet plugins) are the 3-momenta, invariant masses, and n-jettiness variables tau1, tau2 and tau3 for the highest pT jet (j1) and the second highest pT jet (j2):

'pxj1', 'pyj1', 'pzj1', 'mj1', 'tau1j1', 'tau2j1', 'tau3j1', 'pxj2', 'pyj2', 'pzj2', 'mj2', 'tau1j2', 'tau2j2', 'tau3j2'

The rows (events) in each feature file should be ordered exactly the same as in their corresponding raw event file. For convenience, we have also included the label (1 for signal and 0 for background) as an additional column in the first feature file (events_anomalydetection_v2.features.h5).
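A common derived quantity from these columns is the invariant mass of the two leading jets, which can be computed from the listed 3-momenta and jet masses. The sketch below is an illustration rather than part of the dataset's own tooling: the function names and the sample event are hypothetical, and it assumes all momenta and masses share the same energy unit.

```python
import math

def jet_energy(px, py, pz, m):
    """Energy of a jet from its 3-momentum and invariant mass (natural units)."""
    return math.sqrt(px * px + py * py + pz * pz + m * m)

def dijet_mass(f):
    """Invariant mass of j1 + j2 from a mapping keyed by the feature names above."""
    energy = (jet_energy(f["pxj1"], f["pyj1"], f["pzj1"], f["mj1"])
              + jet_energy(f["pxj2"], f["pyj2"], f["pzj2"], f["mj2"]))
    px = f["pxj1"] + f["pxj2"]
    py = f["pyj1"] + f["pyj2"]
    pz = f["pzj1"] + f["pzj2"]
    mass_squared = energy * energy - (px * px + py * py + pz * pz)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(mass_squared, 0.0))

# Hypothetical event: two massless, back-to-back jets.
event = {"pxj1": 1000.0, "pyj1": 0.0, "pzj1": 0.0, "mj1": 0.0,
         "pxj2": -1000.0, "pyj2": 0.0, "pzj2": 0.0, "mj2": 0.0}
mjj = dijet_mass(event)
```

For the back-to-back configuration used here the total momentum cancels, so the dijet mass equals the sum of the two jet energies.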
coldChainDataA.
figshare.com
bin
Updated Mar 10, 2025
Cite
Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo (2025). coldChainDataA. [Dataset]. http://doi.org/10.1371/journal.pone.0315322.s001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0315322.s001
Dataset updated
Mar 10, 2025
Dataset provided by
PLOS ONE
Authors
Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Anomaly detection is widely used in cold chain logistics (CCL). However, because of high cost and technical problems, anomaly detection performance is poor and anomalies cannot be detected in time, which affects the quality of goods. To solve these problems, the paper presents a new anomaly detection scheme for CCL. First, the characteristics of the data collected in CCL are analyzed, a mathematical model of the data flow is established, and the sliding window and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormal judgment conditions based on the correlation coefficient ρjk are deduced. A measurement anomaly detection algorithm based on an improved isolation forest algorithm is proposed. Subsampling and a cross factor are designed and used to overcome the shortcomings of the isolation forest algorithm (iForest). Experiments show that as the dimensionality of the data increases, the performance indicators of the new scheme, such as P (precision), R (recall), F1 score, and AUC (area under the curve), become increasingly superior to those of the commonly used support vector machine (SVM), local outlier factor (LOF), and iForest methods. Its average P is 0.8784, average R is 0.8731, average F1 score is 0.8639, and average AUC is 0.9064. However, the execution time of the improved algorithm is slightly longer than that of iForest.
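The abstract mentions a sliding window and a correlation coefficient between data streams without giving their definitions. As a rough illustration only, the sketch below computes a Pearson correlation between two sensor series over a sliding window; the choice of Pearson correlation, the function names, and the sample readings are all assumptions and may differ from the paper's formulation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def sliding_correlation(series_j, series_k, window):
    """Correlation between two series for every full window position."""
    return [pearson(series_j[i:i + window], series_k[i:i + window])
            for i in range(len(series_j) - window + 1)]

# Hypothetical temperature readings from two sensors in the same container.
sensor_j = [2.0, 2.1, 2.3, 2.2, 2.4, 2.6]
sensor_k = [value + 0.5 for value in sensor_j]
rho = sliding_correlation(sensor_j, sensor_k, window=4)
```

Two sensors that track each other stay close to a correlation of 1 in every window; a drop in the windowed value is the kind of signal a correlation-based judgment condition could act on.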
Spark Data containing logs and metrics (KPIs) for Hades
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 24, 2023
Cite
Cheryl Lee (2023). Spark Data containing logs and metrics (KPIs) for Hades [Dataset]. http://doi.org/10.5281/zenodo.7609780
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7609780
Dataset updated
Oct 24, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Cheryl Lee
Description
Please make sure to cite our paper whenever you use the data in your research:
@inproceedings{DBLP:conf/icse/LeeYCSYL23,
  author = {Cheryl Lee and Tianyi Yang and Zhuangbin Chen and Yuxin Su and Yongqiang Yang and Michael R. Lyu},
  title = {Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention},
  booktitle = {45th {IEEE/ACM} International Conference on Software Engineering, {ICSE} 2023, Melbourne, Australia, May 14-20, 2023},
  pages = {1724--1736},
  publisher = {{IEEE}},
  year = {2023},
  url = {https://doi.org/10.1109/ICSE48619.2023.00148},
  doi = {10.1109/ICSE48619.2023.00148},
  timestamp = {Wed, 19 Jul 2023 10:09:12 +0200},
  biburl = {https://dblp.org/rec/conf/icse/LeeYCSYL23.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Data from: Election Forensics: Using Machine Learning and Synthetic Data for...
dataverse.harvard.edu
tsv, txt +1
Updated Oct 14, 2019
Cite
Harvard Dataverse (2019). Election Forensics: Using Machine Learning and Synthetic Data for Possible Election Anomaly Detection [Dataset]. http://doi.org/10.7910/DVN/YZRJWD
Available download formats: type/x-r-syntax(21706), tsv(102384), tsv(10432), tsv(2725272), tsv(9444708), tsv(9754584), tsv(12915096), txt(1442), type/x-r-syntax(50815), tsv(734), tsv(368), tsv(39920)
Unique identifier
https://doi.org/10.7910/DVN/YZRJWD
Dataset updated
Oct 14, 2019
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This replication package replicates the findings reported in Mali Zhang, R. Michael Alvarez, and Ines Levin, “Election Forensics: Using Machine Learning and Synthetic Data for Possible Election Anomaly Detection.” Forthcoming in PLOS ONE.
Code underlying: Privacy-Preserving Membership Queries for Federated Anomaly...
data.4tu.nl
zip
Updated Oct 28, 2023
Cite
Jelle Vos; Sikha Pentyala; Steven Golob; Ricardo José Menezes Maia; Dean Kelley; Zekeriya Erkin; Martine De Cock; Anderson Nascimento (2023). Code underlying: Privacy-Preserving Membership Queries for Federated Anomaly Detection [Dataset]. http://doi.org/10.4121/4e1739c5-f743-47cc-aa01-df52481e3fb3.v1
Available download formats: zip
Unique identifier
https://doi.org/10.4121/4e1739c5-f743-47cc-aa01-df52481e3fb3.v1
Dataset updated
Oct 28, 2023
Dataset provided by
4TU.ResearchData
Authors
Jelle Vos; Sikha Pentyala; Steven Golob; Ricardo José Menezes Maia; Dean Kelley; Zekeriya Erkin; Martine De Cock; Anderson Nascimento
License
https://www.apache.org/licenses/LICENSE-2.0.html
Description
Privacy-Preserving Feature Extraction for Detection of
Anomalous Financial Transactions

------------------------------------------------------------------------

This repository holds the code written by the PPMLHuskies for the 2nd Place solution in the PETs Prize Challenge, Track A.

Description

The task is to predict probabilities for anomalous transactions, from a
synthetic database of international transactions, and several synthetic
databases of banking account information. We provide two solutions. One
solution, our centralized approach, found in `solution_centralized.py`,
uses the transactions database (PNS) and the banking database with no
privacy protections. The second solution, which provides robust privacy
guarantees outlined in our report, follows a federated architecture,
found in `solution_federated.py` and `model.py`. In this approach, PNS
data resides in one client, banking data is divided up across other
clients, and an aggregator handles all the communication between any
clients. We have built in privacy protections so that clients and the
aggregator learn minimal information about each other, while engaging in
communication to detect anomalous transactions in PNS.

The way in which we conduct training and inference in both the
centralized and the federated architectures is fundamentally the same
(other than the privacy protections in the latter). Several new features
are engineered from the given PNS data. Then a model is trained on those
features from PNS. Next, during inference, a check is made to determine
if attributes from a PNS transaction match with the banking data, or if
the associated account in the banking data is flagged. If any of these
attributes are amiss, we give it a value of 1, and a 0 otherwise.
Lastly, we take the maximum of the inferred probabilities from the PNS
model, and the result from the Banking data validation, which is used as
our final prediction for the probability that the transaction is
anomalous.
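
The final combination step described above can be sketched as follows. This is a minimal illustration only, not the code from `solution_centralized.py`; the function name `combine_predictions` and the example values are hypothetical.

```python
from typing import List, Sequence


def combine_predictions(
    model_probs: Sequence[float],
    bank_flags: Sequence[int],
) -> List[float]:
    """Return the element-wise maximum of model probabilities and bank flags.

    model_probs: anomaly probabilities inferred by the PNS model, in [0, 1].
    bank_flags:  1 if any attribute check against the banking data failed
                 (mismatch or flagged account), 0 otherwise.
    """
    if len(model_probs) != len(bank_flags):
        raise ValueError("inputs must have the same length")
    return [max(float(p), float(b)) for p, b in zip(model_probs, bank_flags)]


# Hypothetical example: the second transaction fails bank validation,
# so its final score becomes 1.0 regardless of the model output.
final_scores = combine_predictions([0.10, 0.25, 0.90], [0, 1, 0])
```

Taking the maximum means a failed banking check always overrides a low model probability, while a passed check leaves the model output unchanged.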

The difference between the federated and centralized logic is that in
the federated setup, where the banking data is split into one or more
partitions across clients, the PNS client engages in a cryptographic
protocol based on homomorphic encryption with the banking clients,
routed through the aggregator, to perform feature extraction. To ensure
privacy, so that PNS learns nothing from the banks beyond the set
membership of a select few features, this protocol is carried out over
r rounds, where r = 7 + n and n is the number of bank clients.
Test Sets for Jet Anomaly Detection at the LHC
zenodo.org
bin
Updated Mar 26, 2021
Taoli Cheng (2021). Test Sets for Jet Anomaly Detection at the LHC [Dataset]. http://doi.org/10.5281/zenodo.3901833
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.3901833
Dataset updated
Mar 26, 2021
Dataset provided by
Zenodo http://zenodo.org/
Authors
Taoli Cheng
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Description

These datasets are generated as a series of test sets for anomalous jet tagging at the LHC. They include boosted W jets, Top jets, and Higgs jets. Jet transverse momentum is focused around 600 GeV and 1200 GeV (with prefix "pt1200_" in file names). Each file includes 100k original events from MadGraph, but the final h5 files might contain slightly fewer events due to fatjet pre-selection. Production processes include:

pp -> W' -> W (jj) Z(\(\nu \nu\)); \(m_{W} = 59, 80, 120, 174 ~GeV\)

pp -> Z' -> t t~; \(m_t=80, 174 ~GeV\)

pp -> HH -> (hh) (hh), (h -> bb); \(m_H=174~GeV\), \(m_h = 20, 80 ~GeV\)

Data Generation

Jet samples in this dataset are generated with MadGraph, Pythia8 and Delphes (no pile-up effects simulated). Jets are clustered from particle-flow objects with FastJet, using the anti-kt algorithm with cone size R=1.0.

Leading jet: \(p_T>450 \textrm{GeV}\); sub-leading jet: \(p_T>200 \textrm{GeV}\)

Data Structure

To get jets: f['objects/jets']

For jets, there are two datasets: ['constituents', 'obs']. (Jet information is stored with the higher-pt jet first.)

`obs[:, n_j - 1]`: jet four-vectors and n-subjettiness for the \(n_j\)-th jet (pt, eta, phi, m, tau1, tau2, tau3, tau4, tau5)

`constituents[:, n_j - 1]`: pt-sorted (highest first) constituent information for the \(n_j\)-th jet, stored in variable-length arrays: \(\{ E_i, P_{xi}, P_{yi}, P_{zi}, \textrm{PID}_i\}\) (PID: PDG for tracks; [22] for photon; [0] for neutral hadron)

Extra Notes

Since the dataset is structured as events, only the leading jet is available for W jet samples, while for Top and Higgs jets both the leading and sub-leading jets are valid. One might need to restrict the jet \(p_T\) range when using the data.

e.g. to get leading jet constituents: `f["objects/jets/constituents"][:,0]`

The file names indicate the corresponding generation process.
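
As a rough sketch of how the `obs` layout described above could be used, the helper below names the nine columns and optionally restricts the jet pt range. It assumes `obs` has shape (n_events, n_jets, 9), which is inferred from the `obs[:, n_j - 1]` indexing and not confirmed by the record; the function name `jet_observables` and the file name in the usage comment are hypothetical.

```python
import numpy as np

# Column order of `obs`, as listed in the description above.
OBS_COLUMNS = ("pt", "eta", "phi", "m", "tau1", "tau2", "tau3", "tau4", "tau5")


def jet_observables(f, n_j=1, pt_min=None, pt_max=None):
    """Return a dict of per-event observables for the n_j-th jet.

    `f` is an open h5py.File, or any mapping that exposes the same
    "objects/jets/obs" key with array-style indexing.
    """
    # Assumed shape after indexing: (n_events, 9)
    obs = np.asarray(f["objects/jets/obs"][:, n_j - 1])
    keep = np.ones(len(obs), dtype=bool)
    if pt_min is not None:
        keep &= obs[:, 0] >= pt_min  # column 0 is pt
    if pt_max is not None:
        keep &= obs[:, 0] <= pt_max
    return {name: obs[keep, i] for i, name in enumerate(OBS_COLUMNS)}


# Usage with a real file (requires h5py; the file name is a placeholder):
# import h5py
# with h5py.File("some_test_set.h5", "r") as f:
#     leading = jet_observables(f, n_j=1, pt_min=500.0, pt_max=700.0)
```

For W jet samples only `n_j=1` is meaningful, since only the leading jet is available there.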

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

zenodo.org
elki-project.github.io

application/gzip

Updated May 2, 2024


Erich Schubert; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684

Available download formats: application/gzip

Unique identifier

https://doi.org/10.5281/zenodo.6355684

Dataset updated

May 2, 2024

Dataset provided by

Zenodo http://zenodo.org/

Authors

Erich Schubert; Arthur Zimek

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

2022

Description

These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Feature type	Description	Files
Object number	Sparse 1000 dimensional vectors that give the true object assignment	objs.arff.gz
RGB color histograms	Standard RGB color histograms (uniform binning)	aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz
HSV color histograms	Standard HSV/HSB color histograms in various binnings	aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz
Color similarity	Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black)	aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features	First 13 Haralick features (radius 1 pixel)	aloi-haralick-1.csv.gz
Front to back	Vectors representing front face vs. back faces of individual objects	front.arff.gz
Basic light	Vectors indicating basic light situations	light.arff.gz
Manual annotations	Manually annotated object groups of semantically related objects such as cups	manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type	Description	Files
RGB Histograms	Downsampled to 100000 objects (553 outliers)	aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz
	Downsampled to 75000 objects (717 outliers)	aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz
	Downsampled to 50000 objects (1508 outliers)	aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz
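
The outlier file names appear to encode the histogram dimensionality, the number of objects, and the total number of outliers. The sketch below parses these fields and computes the outlier fraction. The field interpretation is inferred from the table above, not taken from ELKI documentation; the meaning of the `max` field is not stated in this record, so it is parsed but not interpreted. The function name is hypothetical.

```python
import re

# Pattern inferred from names such as aloi-27d-100000-max10-tot553.csv.gz
_PATTERN = re.compile(
    r"^aloi-(?P<dims>\d+)d-(?P<objects>\d+)"
    r"-max(?P<max>\d+)-tot(?P<outliers>\d+)\.csv\.gz$"
)


def parse_outlier_filename(name):
    """Split an ALOI outlier file name into its numeric fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    info = {key: int(value) for key, value in match.groupdict().items()}
    # Share of objects labeled as outliers in this subset.
    info["outlier_fraction"] = info["outliers"] / info["objects"]
    return info


info = parse_outlier_filename("aloi-27d-100000-max10-tot553.csv.gz")
```

For the 100000-object subsets this gives an outlier fraction of 553 / 100000, or roughly 0.55%.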

Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text...
data.mendeley.com
ieee-dataport.org
Updated Apr 22, 2021
Nahuel González (2021). Dataset for The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings [Dataset]. http://doi.org/10.17632/94dwkbxf2d.1
Unique identifier
https://doi.org/10.17632/94dwkbxf2d.1
Dataset updated
Apr 22, 2021
Authors
Nahuel González
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". Source data contains CSV files with dataset result summaries, false positive lists, the evaluated sentences, and their keystroke timings. Results data contains training and evaluation ARFF files for each user and sentence with the calculated Manhattan and Euclidean distances, R metric, and the directionality index for each challenge instance. The source data comes from three free-text keystroke dynamics datasets used in previous studies, by the authors (LSIA) and two other unrelated groups (KM and PROSODY, the latter subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY.

The original dataset KM was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original dataset PROSODY was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.

We proposed a method to find, using only flight times (keydown/keydown), whether a medium-sized candidate list of possible texts includes the one to which the timings belong. Neither the text length nor the candidate text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased two- to three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive with current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by resorting to general population data.
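
To make the terms concrete, the sketch below derives keydown-to-keydown flight times from a list of keydown timestamps and compares them with a candidate timing vector using Manhattan and Euclidean distance. It is only an illustration of these quantities: how the article obtains expected timings for a candidate text is not described here, so the `candidate` vector and all function names are hypothetical.

```python
from math import sqrt


def flight_times(keydown_ms):
    """Differences between consecutive keydown timestamps (milliseconds)."""
    return [later - earlier for earlier, later in zip(keydown_ms, keydown_ms[1:])]


def manhattan(u, v):
    """Sum of absolute differences between two equal-length timing vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Four hypothetical keydown events yield three flight times.
observed = flight_times([0, 120, 250, 400])
candidate = [110, 140, 150]  # hypothetical expected timings for one candidate text
d_manhattan = manhattan(observed, candidate)
d_euclidean = euclidean(observed, candidate)
```

A smaller distance suggests that the candidate's timing profile is closer to the observed one.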
The pseudocode of the length calculation.
plos.figshare.com
xls
Updated Mar 10, 2025
Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo (2025). The pseudocode of the length calculation. [Dataset]. http://doi.org/10.1371/journal.pone.0315322.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315322.t003
Dataset updated
Mar 10, 2025
Dataset provided by
PLOS ONE
Authors
Zhibo Xie; Heng Long; Chengyi Ling; Yingjun Zhou; Yan Luo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Anomaly detection is widely used in cold chain logistics (CCL). However, because of high costs and technical problems, detection performance is poor and anomalies cannot be detected in time, which affects the quality of goods. To solve these problems, the paper presents a new anomaly detection scheme for CCL. First, the characteristics of the data collected in CCL are analyzed, a mathematical model of the data flow is established, and the sliding window and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormal judgment conditions based on the correlation coefficient ρjk are derived. A measurement anomaly detection algorithm based on an improved isolation forest (iForest) algorithm is proposed. Subsampling and a cross factor are designed and used to overcome the shortcomings of iForest. Experiments show that as the dimensionality of the data increases, the performance indicators of the new scheme, such as P (precision), R (recall), F1 score, and AUC (area under the curve), become increasingly superior to those of the commonly used support vector machine (SVM), local outlier factor (LOF), and iForest. Its average P is 0.8784, average R is 0.8731, average F1 score is 0.8639, and average AUC is 0.9064. However, the execution time of the improved algorithm is slightly longer than that of iForest.
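
The general idea of a sliding-window correlation coefficient between two sensor streams j and k can be sketched as below. This uses the standard Pearson coefficient over fixed-size windows and is not the paper's exact definition or its judgment conditions; the function names, window size, and readings are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen for this sketch: constant window -> 0
    return cov / sqrt(var_x * var_y)


def sliding_correlation(x, y, window):
    """Correlation of x and y over every full window of the given size."""
    if len(x) != len(y):
        raise ValueError("streams must have the same length")
    return [
        pearson(x[i:i + window], y[i:i + window])
        for i in range(len(x) - window + 1)
    ]


# Two hypothetical temperature streams that move together, then diverge.
sensor_j = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
sensor_k = [2.1, 2.2, 2.3, 2.2, 2.1, 2.0]
rho = sliding_correlation(sensor_j, sensor_k, window=3)
```

In this toy example the coefficient starts near 1 and ends near -1, the kind of change a correlation-based judgment condition could flag.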
Oil Palm Tree Detection for Anomaly Identification
data.mendeley.com
Updated Mar 10, 2025
Anderson Dominguez Meza (2025). Oil Palm Tree Detection for Anomaly Identification [Dataset]. http://doi.org/10.17632/nh7d23dgnw.1
Unique identifier
https://doi.org/10.17632/nh7d23dgnw.1
Dataset updated
Mar 10, 2025
Authors
Anderson Dominguez Meza
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset supports an advanced artificial vision system for detecting anomalies in oil palm (Elaeis guineensis) crops. It consists of RGB images captured using a DJI Phantom 4 Multispectral UAV. The dataset is labeled into two main classes: 'PalmSan' (healthy palms) and 'PalmAnom' (anomalous palms). It was used to train and validate a Faster R-CNN with ResNet-50 FPN model, fine-tuned in PyTorch. The dataset plays a crucial role in high-accuracy classification for automated disease detection and stress assessment, contributing to scalable and sustainable precision agriculture solutions.
PowerBench Dataset – Part 3: Cyber Attacks on EVCS
zenodo.org
bin
Updated May 14, 2025
Roshni Anna Jacob; Md. Joshem Uddin; Damilola R Olojede; Baris Coskunuzer; Jie Zhang (2025). PowerBench Dataset – Part 3: Cyber Attacks on EVCS [Dataset]. http://doi.org/10.5281/zenodo.15401290
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15401290
Dataset updated
May 14, 2025
Dataset provided by
Zenodo http://zenodo.org/
Authors
Roshni Anna Jacob; Md. Joshem Uddin; Damilola R Olojede; Baris Coskunuzer; Jie Zhang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PowerBench: EVCS Cyber Attack Datasets for Power Distribution Networks

This dataset is part of the PowerBench benchmark suite designed to support machine learning research in resilient and secure power distribution networks. It includes one of the three types of cyberattacks modeled on the IEEE 34-bus, 123-bus, and 8500-node test feeders:

EVCS Attacks

Adversarial manipulation of the charging behavior of grid-connected electric vehicle charging stations (EVCS).

Suitable for learning-based intrusion detection and localization of compromised EVCSs.

Each attack dataset contains .pkl simulation files, .gml grid topology, and scenario metadata. All simulations were generated using OpenDSS via OpenDSSDirect.py.

Please refer to the included README.md for detailed task guidance and loading instructions.
Replication data for: Robust Estimation and Outlier Detection for...
dataverse.harvard.edu
Updated Nov 28, 2007
Walter R. Mebane; Jasjeet S. Sekhon (2007). Replication data for: Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data [Dataset]. http://doi.org/10.7910/DVN/RDXADE
Unique identifier
https://doi.org/10.7910/DVN/RDXADE
Dataset updated
Nov 28, 2007
Dataset provided by
Harvard Dataverse
Authors
Walter R. Mebane; Jasjeet S. Sekhon
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
1993 - 2000
Description
We develop a robust estimator, the hyperbolic tangent (tanh) estimator, for overdispersed multinomial regression models of count data. The tanh estimator provides accurate estimates and reliable inferences even when the specified model is not good for as much as half of the data. Seriously ill-fitted counts (outliers) are identified as part of the estimation. A Monte Carlo sampling experiment shows that the tanh estimator produces good results at practical sample sizes even when ten percent of the data are generated by a significantly different process. The experiment shows that, with contaminated data, estimation fails using four other estimators: the non-robust maximum likelihood estimator, the additive logistic model and two SUR models. Using the tanh estimator to analyze data from Florida for the 2000 presidential election matches well-known features of the election that the other four estimators fail to capture. In an analysis of data from the 1993 Polish parliamentary election, the tanh estimator gives sharper inferences than does a previously proposed heteroskedastic SUR model.


Citation: Bogdan, R. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Ruszczak. https://doi.org/10.5281/zenodo.15108715
