Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OPSSAT-AD - anomaly detection dataset for satellite telemetry
This is the AI-ready benchmark dataset (OPSSAT-AD) containing the telemetry data acquired on board OPS-SAT, a CubeSat mission that has been operated by the European Space Agency.
It is accompanied by a paper with baseline results obtained using 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test dataset split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when comparing new anomaly detection algorithms on OPSSAT-AD. We believe that this work may become an important step toward building a fair, reproducible, and objective validation procedure that can be used to quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.
The dataset includes:
segments.csv with the acquired telemetry signals from the ESA OPS-SAT spacecraft,
dataset.csv with the extracted synthetic features computed for each manually split and labeled telemetry segment,
code files for data processing and example modeling (dataset_generator.ipynb for data processing, modeling_examples.ipynb with simple examples, requirements.txt with details on the Python configuration, and the LICENSE file).
Citation Bogdan, R. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Ruszczak. https://doi.org/10.5281/zenodo.15108715
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.
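The anomaly score described above can be illustrated with a small sketch. It follows one reading of the abstract: for each observation, sort the distances to its k nearest neighbors, find the largest gap between successive distances, and use the neighbor distance at the upper end of that gap as the score. The function name `knn_max_gap_scores` and the toy data are assumptions made for illustration, the extreme-value-theory threshold is omitted, and the stray R package remains the reference implementation.

```python
import math

def knn_max_gap_scores(points, k):
    """Return, for each point, the k-NN distance just above its largest gap."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to its k nearest neighbours (brute force).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        # Successive gaps, with 0 prepended so the first gap is the 1-NN distance.
        padded = [0.0] + dists
        gaps = [padded[j + 1] - padded[j] for j in range(len(dists))]
        # Score = the neighbour distance at the upper end of the largest gap.
        scores.append(dists[gaps.index(max(gaps))])
    return scores

# A tight cluster of three points, plus a pair of nearby points far from it.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
scores = knn_max_gap_scores(data, k=2)
```

With k=2 the two far-away points receive large scores even though they sit next to each other, which is the situation where a plain nearest-neighbor distance would miss them.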
This repository includes the dataset for the paper:
D. Sanvito, G. Siracusano, S. Santhanam, R. Gonzalez, R. Bifulco
syslrn: Learning What to Monitor for Efficient Anomaly Detection
ACM EuroMLSys 2022
The dataset contains two directories at the root level:
Each folder in the raw_dataset directory contains the raw monitoring data used to generate the graph associated to a single experiment together with additional metadata files.
Each folder in the processed_dataset directory contains the graph associated to a single experiment as a set of three CSV files: two for the graph edges (pid_childof_pid_df.csv and pid_speakswith_pid_df.csv) and one for the graph nodes (proc_df.csv).
A code snippet for parsing a graph from the processed_dataset directory is provided below.
In both folders the name of each sub-folder is based on the following schema: [SCENARIO]_[W]wl/test_[TEST_ID] where:
Each experiment includes the following data in the raw_dataset sub-folders:
[1] D. Cotroneo, L. De Simone, P. Liguori, R. Natella, N. Bidokhti - How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform [ACM ESEC/FSE 2019]
Example: parsing a graph from processed_dataset directory
import pandas as pd
import networkx as nx

def parse_csv(path):
    # Node table: one row per process.
    processes_df = pd.read_csv('%sproc_df.csv' % path, index_col=0).reset_index(drop=True)
    # Edge tables: tag each edge with its relation type before merging.
    speakswith_edges_df = pd.read_csv('%spid_speakswith_pid_df.csv' % path, index_col=0)
    speakswith_edges_df['type'] = 'speaksWith'
    childof_edges_df = pd.read_csv('%spid_childof_pid_df.csv' % path, index_col=0)
    childof_edges_df['type'] = 'childOf'
    return processes_df, pd.concat([speakswith_edges_df, childof_edges_df], ignore_index=True)

def make_graph(nodes_df, edges_df):
    # A MultiGraph allows both edge types between the same pair of processes.
    G = nx.MultiGraph()
    for _, node in nodes_df.iterrows():
        G.add_node(node.pid, **node)
    for _, edge in edges_df.iterrows():
        G.add_edge(edge.pid1, edge.pid2, type=edge.type)
    return G

# PATH must end with a slash because file names are appended to it directly.
PATH = 'processed_dataset/ff_1wl/test_1/'
nodes_df, edges_df = parse_csv(PATH)
G = make_graph(nodes_df, edges_df)
nx.draw_networkx(G, node_size=10, with_labels=False)
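The sub-folder naming schema [SCENARIO]_[W]wl/test_[TEST_ID] can be split into its fields with a regular expression. The sketch below is only an illustration: the helper name `parse_experiment_path` is hypothetical, and it assumes that W is an integer and that TEST_ID contains no slash.

```python
import re

# Assumed pattern for "[SCENARIO]_[W]wl/test_[TEST_ID]"; W is taken to be an integer.
FOLDER_RE = re.compile(r'^(?P<scenario>.+)_(?P<w>\d+)wl/test_(?P<test_id>[^/]+)/?$')

def parse_experiment_path(rel_path):
    """Split an experiment sub-folder path into its schema fields."""
    match = FOLDER_RE.match(rel_path)
    if match is None:
        raise ValueError('path does not follow the naming schema: %r' % rel_path)
    return match.group('scenario'), int(match.group('w')), match.group('test_id')

# The path used in the example above, relative to processed_dataset/.
fields = parse_experiment_path('ff_1wl/test_1/')
```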
If you use this dataset for your research, please cite the following paper:
@inproceedings{sanvito2022syslrn,
title={syslrn: Learning What to Monitor for Efficient Anomaly Detection},
author={Sanvito, Davide and Siracusano, Giuseppe and Santhanam, Sharan and Gonzalez, Roberto and Bifulco, Roberto},
booktitle={2nd European Workshop on Machine Learning and Systems (EuroMLSys '22)},
year={2022},
address = {Rennes, France},
publisher = {ACM},
month = apr,
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: that is, network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, particularly, two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.
This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.
Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.
Each dataframe contains 55 columns:
Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).
Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).
Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.
Columns 4 to 55 contain the process variables; the column names retain the original variable names.
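The sample counts and durations above are tied together by the 3-minute sampling interval. The following sketch only reproduces that arithmetic; the helper names are made up for illustration. Whether the first faulty reading is the sample with that index or the one after it depends on whether sample 1 is taken at time zero, which the description does not state.

```python
SAMPLE_INTERVAL_MIN = 3  # sampling period stated in the description

def duration_hours(n_samples):
    """Total simulated time covered by n_samples readings."""
    return n_samples * SAMPLE_INTERVAL_MIN / 60

def samples_elapsed(hours):
    """Number of sampling intervals that fit into the given number of hours."""
    return int(hours * 60 / SAMPLE_INTERVAL_MIN)

train_hours = duration_hours(500)        # "Training" datasets
test_hours = duration_hours(960)         # "Testing" datasets
train_fault_after = samples_elapsed(1)   # fault introduced 1 hour in
test_fault_after = samples_elapsed(8)    # fault introduced 8 hours in
```

Under these assumptions the faults begin roughly 20 samples into each faulty training run and 160 samples into each faulty testing run.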
This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.
The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.
In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.
Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.
When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.
This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
PROBABILITY CALIBRATION BY THE MINIMUM AND MAXIMUM PROBABILITY SCORES IN ONE-CLASS BAYES LEARNING FOR ANOMALY DETECTION
GUICHONG LI, NATHALIE JAPKOWICZ, IAN HOFFMAN, R. KURT UNGAR
ABSTRACT. One-class Bayes learning, such as one-class Naïve Bayes and one-class Bayesian networks, employs Bayes learning to build a classifier on the positive class only, in order to discriminate the positive class from the negative class. It has been applied to anomaly detection for identifying abnormal behaviors that deviate from normal behaviors. Because one-class Bayes classifiers can produce probability scores, which can be used to define anomaly scores for anomaly detection, they are preferable in many practical applications to other one-class learning techniques. However, previously proposed one-class Bayes classifiers might suffer from poor probability estimation when negative training examples are unavailable. In this paper, we propose a new method to improve the probability estimation. According to our empirical results, the improved one-class Bayes classifiers can exhibit high performance compared with previously proposed one-class Bayes classifiers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the feature data set extracted from ZTF DR3 light curves. It was used in Malanchev et al. 2020 to detect anomalous astrophysical sources in ZTF data.
The "feature_XXX.dat" files contain object-ordered light curve feature data. Each object is described by 42 feature values, encoded as little-endian single-precision IEEE 754 floats (32-bit). Feature code-names are the same for all three data sets and are listed in the plain text files "feature_XXX.name", one code-name per line. The "oid_XXX.dat" files contain ZTF DR object identifiers encoded as little-endian 64-bit unsigned integers. "oid_XXX.dat" and "feature_XXX.dat" have the same object order: for example, the first 8 bytes of "oid_m31.dat" contain the OID of the ZTF DR3 light curve whose features are stored in the first 168 bytes (42 × 4) of "feature_m31.dat". "m31", "deep" and "disk" denote different ZTF fields and contain 57 546, 406 611 and 1 790 565 objects, respectively. Note that only observations with 58194 ≤ MJD ≤ 58483 are used; see the paper for field and feature details.
Sample Python code to access the data as NumPy arrays:
import numpy as np

# Object identifiers: one little-endian uint64 per object.
oid = np.memmap('oid_m31.dat', mode='r', dtype='<u8')
with open('feature_m31.name') as f:
    names = f.read().split()
# One little-endian float32 field per feature name, so each record is one object.
dtype = [(name, '<f4') for name in names]
feature = np.memmap('feature_m31.dat', mode='r', dtype=dtype, shape=oid.shape)
idx = np.argmax(feature['amplitude'])
print('Object {} has maximum amplitude {:.3f}'.format(oid[idx], feature['amplitude'][idx]))
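For readers who want to check the byte layout without NumPy, the sketch below decodes one object with the standard `struct` module. It relies only on the layout stated in the description (8-byte little-endian OIDs, 42 little-endian 32-bit floats per object); the helper name `read_object` and the in-memory stand-in data are made up for illustration.

```python
import struct

N_FEATURES = 42                         # feature values per object, per the description
OID_FORMAT = '<Q'                       # little-endian unsigned 64-bit integer
FEATURE_FORMAT = '<%df' % N_FEATURES    # 42 little-endian 32-bit floats

def read_object(oid_bytes, feature_bytes, index):
    """Decode the OID and feature vector of the object at position `index`."""
    oid_size = struct.calcsize(OID_FORMAT)          # 8 bytes
    rec_size = struct.calcsize(FEATURE_FORMAT)      # 168 bytes
    (oid,) = struct.unpack_from(OID_FORMAT, oid_bytes, index * oid_size)
    features = struct.unpack_from(FEATURE_FORMAT, feature_bytes, index * rec_size)
    return oid, features

# Two made-up objects standing in for the contents of the real files.
fake_oids = struct.pack('<2Q', 1001, 1002)
fake_features = struct.pack('<84f', *([0.5] * 42 + [1.5] * 42))
oid, features = read_object(fake_oids, fake_features, 1)
```

With real data, `oid_bytes` and `feature_bytes` would be the contents of a matching pair of "oid_XXX.dat" and "feature_XXX.dat" files.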
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the first R&D dataset for the LHC Olympics 2020 Anomaly Detection Challenge. It consists of 1M QCD dijet events and 100k W'->XY events, with X->qq and Y->qq. The W', X, and Y masses are 3.5 TeV, 500 GeV and 100 GeV respectively. The events are produced using Pythia8 and Delphes 3.4.1, with no pileup or MPI included. They are selected using a single fat-jet (R=1) trigger with pT threshold of 1.2 TeV.
The events are randomly shuffled together, but for the purposes of testing and development, we provide the user with a signal/background truth bit for each event. Obviously, the truth bit will not be included in the actual challenge.
These events are stored as pandas dataframes saved to compressed h5 format. For each event, all Delphes reconstructed particles in the event are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information such as particle charge is not included. Events are zero padded to constant size arrays of 700 particles, with the truth bit appended at the end. The array format is therefore (Nevents=1.1M, 2101).
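The row width of 2101 follows from 700 particles × 3 coordinates + 1 truth bit. As a rough sketch, a single row can be unpacked as shown below. It assumes the values are interleaved as pT, eta, phi for each particle and that padded slots have pT equal to zero; the official example notebook is the authority on the exact layout. The function name and the synthetic row are illustrative only.

```python
N_PARTICLES = 700                 # padded particle slots per event, per the description
ROW_LENGTH = 3 * N_PARTICLES + 1  # 2101 values including the truth bit

def unpack_event(row):
    """Split one flat event row into a particle list and the truth bit.

    Assumes the row is laid out as pT, eta, phi repeated for each particle,
    and that padded slots have pT == 0.
    """
    if len(row) != ROW_LENGTH:
        raise ValueError('expected %d values, got %d' % (ROW_LENGTH, len(row)))
    particles = [tuple(row[3 * i:3 * i + 3]) for i in range(N_PARTICLES)]
    particles = [p for p in particles if p[0] > 0]   # drop zero padding
    truth_bit = int(row[-1])
    return particles, truth_bit

# Synthetic row: two particles, zero padding, and a signal label of 1.
row = [120.0, 0.3, 1.2, 80.0, -0.5, 2.9] + [0.0] * (3 * 698) + [1.0]
particles, truth_bit = unpack_event(row)
```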
For more information, including an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.
https://lhco2020.github.io/homepage/
UPDATE May 18 2020
We have uploaded a second signal dataset for R&D, consisting of 100k W'->XY with X,Y->qqq (i.e. 3-prong substructure). Everything else about this signal dataset (particle masses, trigger, Pythia configuration, detector simulation) is the same as the previous one described above.
UPDATE November 23 2020
We now include high-level feature files for the background and 2-prong signal (events_anomalydetection_v2.features.h5) and for the 3-prong signal (events_anomalydetection_Z_XY_qqq.features.h5). To produce the features, we have clustered every event into R=1 jets using the anti-kT algorithm. The features (calculated using fastjet plugins) are the 3-momenta, invariant masses, and n-jettiness variables tau1, tau2 and tau3 for the highest pT jet (j1) and the second highest pT jet (j2):
'pxj1', 'pyj1', 'pzj1', 'mj1', 'tau1j1', 'tau2j1', 'tau3j1', 'pxj2', 'pyj2', 'pzj2', 'mj2', 'tau1j2', 'tau2j2', 'tau3j2'
The rows (events) in each feature file should be ordered exactly the same as in their corresponding raw event file. For convenience, we have also included the label (1 for signal and 0 for background) as an additional column in the first feature file (events_anomalydetection_v2.features.h5).
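A common derived quantity from these features is the invariant mass of the two leading jets. The sketch below uses the standard relation E = sqrt(m² + |p|²) for each jet and m_jj² = (E1 + E2)² − |p1 + p2|². It is not part of the dataset tooling: the function names are made up, and the example values (assumed to be in GeV) are hypothetical.

```python
import math

def jet_energy(px, py, pz, m):
    """Energy of a jet from its 3-momentum and invariant mass (natural units)."""
    return math.sqrt(m * m + px * px + py * py + pz * pz)

def dijet_mass(j1, j2):
    """Invariant mass of the two-jet system; each jet is (px, py, pz, m)."""
    e = jet_energy(*j1) + jet_energy(*j2)
    px, py, pz = (j1[i] + j2[i] for i in range(3))
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))   # guard against tiny negative rounding errors

# Hypothetical back-to-back massless jets with 1500 units of momentum each.
mjj = dijet_mass((1500.0, 0.0, 0.0, 0.0), (-1500.0, 0.0, 0.0, 0.0))
```

For a row of a feature file, `j1` would be built from 'pxj1', 'pyj1', 'pzj1', 'mj1' and `j2` from the corresponding 'j2' columns.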
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Anomaly detection is widely used in cold chain logistics (CCL). However, because of high cost and technical problems, anomaly detection performance is poor and anomalies cannot be detected in time, which affects the quality of goods. To solve these problems, the paper presents a new anomaly detection scheme for CCL. First, the characteristics of the collected CCL data are analyzed, a mathematical model of the data flow is established, and the sliding window and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormality conditions based on the correlation coefficient ρjk are derived. A measurement anomaly detection algorithm based on an improved isolation forest algorithm is proposed. Subsampling and a cross factor are designed and used to overcome the shortcomings of the isolation forest algorithm (iForest). Experiments show that as the dimensionality of the data increases, the performance indicators of the new scheme, such as P (precision), R (recall), F1 score, and AUC (area under the curve), become increasingly superior to those of the commonly used support vector machine (SVM), local outlier factor (LOF), and iForest. Its average P is 0.8784, average R is 0.8731, average F1 score is 0.8639, and average AUC is 0.9064. However, the execution time of the improved algorithm is slightly longer than that of iForest.
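The abstract does not give the paper's definitions, so the following is only a generic illustration of a sliding-window correlation coefficient between two measurement streams j and k, using the ordinary Pearson coefficient. The sensor values, window width, and function names are all hypothetical, and the paper's three abnormality conditions are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def windowed_correlation(x, y, width):
    """Correlation of x and y inside each sliding window of `width` samples."""
    return [pearson(x[i:i + width], y[i:i + width]) for i in range(len(x) - width + 1)]

# Two hypothetical temperature sensors; the second one drifts apart at the end.
sensor_j = [2.0, 2.1, 2.3, 2.2, 2.4, 2.6, 2.8, 3.0]
sensor_k = [2.1, 2.2, 2.4, 2.3, 2.5, 2.3, 2.1, 1.9]
rho = windowed_correlation(sensor_j, sensor_k, width=4)
```

In this toy example the coefficient starts near +1 and falls to about −1 in the last window, the kind of change a correlation-based condition could flag.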
Please make sure to cite our paper whenever you use the data in your research:
@inproceedings{DBLP:conf/icse/LeeYCSYL23,
author = {Cheryl Lee and Tianyi Yang and Zhuangbin Chen and Yuxin Su and Yongqiang Yang and Michael R. Lyu},
title = {Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention},
booktitle = {45th {IEEE/ACM} International Conference on Software Engineering, {ICSE} 2023, Melbourne, Australia, May 14-20, 2023},
pages = {1724--1736},
publisher = {{IEEE}},
year = {2023},
url = {https://doi.org/10.1109/ICSE48619.2023.00148},
doi = {10.1109/ICSE48619.2023.00148},
timestamp = {Wed, 19 Jul 2023 10:09:12 +0200},
biburl = {https://dblp.org/rec/conf/icse/LeeYCSYL23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This replication package replicates the findings reported in Mali Zhang, R. Michael Alvarez, and Ines Levin, “Election Forensics: Using Machine Learning and Synthetic Data for Possible Election Anomaly Detection.” Forthcoming in PLOS ONE.
https://www.apache.org/licenses/LICENSE-2.0.html
Privacy-Preserving Feature Extraction for Detection of
Anomalous Financial Transactions
------------------------------------------------------------------------
This repository holds the code written by the PPMLHuskies for the 2nd Place solution in the PETs Prize Challenge, Track A.
Description
The task is to predict probabilities for anomalous transactions, from a
synthetic database of international transactions, and several synthetic
databases of banking account information. We provide two solutions. One
solution, our centralized approach, found in `solution_centralized.py`,
uses the transactions database (PNS) and the banking database with no
privacy protections. The second solution, which provides robust privacy
guarantees outlined in our report, follows a federated architecture,
found in `solution_federated.py` and `model.py`. In this approach, PNS
data resides in one client, banking data is divided up across other
clients, and an aggregator handles all the communication between any
clients. We have built in privacy protections so that clients and the
aggregator learn minimal information about each other, while engaging in
communication to detect anomalous transactions in PNS.
The way in which we conduct training and inference in both the
centralized and the federated architectures is fundamentally the same
(other than the privacy protections in the latter). Several new features
are engineered from the given PNS data. Then a model is trained on those
features from PNS. Next, during inference, a check is made to determine
if attributes from a PNS transaction match with the banking data, or if
the associated account in the banking data is flagged. If any of these
attributes are amiss, we give it a value of 1, and a 0 otherwise.
Lastly, we take the maximum of the inferred probabilities from the PNS
model, and the result from the Banking data validation, which is used as
our final prediction for the probability that the transaction is
anomalous.
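
The final combination step described above can be sketched as follows. This is
an illustration of the logic only, not the code in `solution_centralized.py`;
the function name and argument names are hypothetical.

```python
def final_score(model_probability, attributes_match, account_flagged):
    """Combine the PNS model output with the banking-data validation result."""
    # Validation result is 1 if anything is amiss, 0 otherwise.
    validation = 1.0 if (not attributes_match or account_flagged) else 0.0
    return max(model_probability, validation)

# Same model output, with and without a mismatch against the banking data.
clean = final_score(0.12, attributes_match=True, account_flagged=False)
mismatch = final_score(0.12, attributes_match=False, account_flagged=False)
```

Taking the maximum means a failed banking check always forces the prediction
to 1, while a passed check leaves the model probability unchanged.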
The difference between the federated and centralized logic is that in
the federated setup, where the banking data is split into one or more
partitions across clients, the PNS client engages in a
cryptographic protocol based on homomorphic encryption with the banking
clients, routed through the aggregator, to perform feature extraction.
To ensure privacy, and that PNS does not learn anything
from the banks beyond the set membership of a select few features, this
protocol is carried out over r rounds, where r = 7 + n and n is the
number of bank clients.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
These datasets are generated as a series of test sets for anomalous jet tagging at the LHC. They include boosted W jets, Top jets, and Higgs jets. Jet transverse momentum is concentrated around 600 GeV and 1200 GeV (the latter with the prefix "pt1200_" in file names). Each file includes 100k original events from MadGraph, but the final h5 files might contain slightly fewer events due to fatjet pre-selection. Production processes include:
Data Generation
Jet samples in this dataset are generated with MadGraph, Pythia8 and Delphes (no pile-up effects simulated). Particle flow objects are used to cluster jets. FastJet was used for jet clustering. Jets are clustered using anti-kt algorithm with cone size R=1.0.
Data Structure
Extra Notes
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets were originally created for the following publications:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.
The outlier data set versions were introduced in:
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.
They are derived from the original image data available at https://aloi.science.uva.nl/
The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005
Additional information is available at: https://elki-project.github.io/datasets/multi_view
The following views are currently available:
Feature type | Description | Files |
---|---|---|
Object number | Sparse 1000 dimensional vectors that give the true object assignment | objs.arff.gz |
RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz |
HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz |
Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other) |
Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz |
Front to back | Vectors representing front face vs. back faces of individual objects | front.arff.gz |
Basic light | Vectors indicating basic light situations | light.arff.gz |
Manual annotations | Manually annotated object groups of semantically related objects such as cups | manual1.arff.gz |
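The dimensionalities of the RGB histogram files (8, 27, 64, ..., 1000) are cubes, which suggests b uniform bins per channel and b³ joint bins. The sketch below shows that construction for illustration only. The bin ordering and the normalization used to produce the actual files are not stated here, so both are assumptions, and the function name and pixel values are made up.

```python
def rgb_histogram(pixels, bins_per_channel):
    """Normalised joint RGB histogram with uniform bins on each channel."""
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    for r, g, bl in pixels:
        # Map each 0..255 channel value to a bin index 0..b-1.
        ri, gi, bi = r * b // 256, g * b // 256, bl * b // 256
        hist[(ri * b + gi) * b + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]

# Four made-up pixels; 2 bins per channel gives an 8-dimensional vector.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
hist = rgb_histogram(pixels, bins_per_channel=2)
```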
Outlier Detection Versions
Additionally, we generated a number of subsets for outlier detection:
Feature type | Description | Files |
---|---|---|
RGB Histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz |
RGB Histograms | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz |
RGB Histograms | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". The source data contains CSV files with dataset result summaries, lists of false positives, the evaluated sentences, and their keystroke timings. The results data contains training and evaluation ARFF files for each user and sentence, with the calculated Manhattan and Euclidean distances, R metric, and directionality index for each challenge instance. The source data comes from three free-text keystroke dynamics datasets used in previous studies, by the authors (LSIA) and two other unrelated groups (KM and PROSODY, the latter subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY.
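As a reminder of what the two distance columns measure, the sketch below computes Manhattan and Euclidean distances between two vectors of flight times. The timing values are invented, and the exact features the dataset uses as vector components are not specified here. The R metric and directionality index are not shown.

```python
import math

def manhattan(a, b):
    """Sum of absolute differences between two timing vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two timing vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical flight times in milliseconds for the same five key transitions.
observed = [120, 95, 180, 140, 110]
reference = [117, 99, 180, 140, 110]
d_manhattan = manhattan(observed, reference)
d_euclidean = euclidean(observed, reference)
```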
The original KM dataset was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing Anomaly-Detection Algorithms for Keystroke Dynamics" by Killourhy, K.S. and Maxion, R.A. The original PROSODY dataset was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke Patterns as Prosody in Digital Writings: A Case Study with Deceptive Reviews and Essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.
We proposed a method to determine, using only flight times (keydown/keydown), whether a medium-sized list of candidate texts includes the one to which the timings belong. Neither the text length nor the candidate text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased two- to three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive with current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by resorting to general population data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports an advanced artificial vision system for detecting anomalies in oil palm (Elaeis guineensis) crops. It consists of RGB images captured using a DJI Phantom 4 Multispectral UAV. The dataset is labeled into two main classes: 'PalmSan' (healthy palms) and 'PalmAnom' (anomalous palms). It was used to train and validate a Faster R-CNN with ResNet-50 FPN model, fine-tuned in PyTorch. The dataset plays a crucial role in high-accuracy classification for automated disease detection and stress assessment, contributing to scalable and sustainable precision agriculture solutions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PowerBench: EVCS Cyber Attack Datasets for Power Distribution Networks
This dataset is part of the PowerBench benchmark suite designed to support machine learning research in resilient and secure power distribution networks. It includes one of the three types of cyberattacks modeled on the IEEE 34-bus, 123-bus, and 8500-node test feeders:
EVCS Attacks
Each attack dataset contains .pkl simulation files, .gml grid topology, and scenario metadata. All simulations were generated using OpenDSS via OpenDSSDirect.py.
Please refer to the included README.md for detailed task guidance and loading instructions.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We develop a robust estimator, the hyperbolic tangent (tanh) estimator, for overdispersed multinomial regression models of count data. The tanh estimator provides accurate estimates and reliable inferences even when the specified model is not good for as much as half of the data. Seriously ill-fitted counts (outliers) are identified as part of the estimation. A Monte Carlo sampling experiment shows that the tanh estimator produces good results at practical sample sizes even when ten percent of the data are generated by a significantly different process. The experiment shows that, with contaminated data, estimation fails using four other estimators: the non-robust maximum likelihood estimator, the additive logistic model, and two SUR models. Using the tanh estimator to analyze data from Florida for the 2000 presidential election matches well-known features of the election that the other four estimators fail to capture. In an analysis of data from the 1993 Polish parliamentary election, the tanh estimator gives sharper inferences than does a previously proposed heteroskedastic SUR model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OPSSAT-AD - anomaly detection dataset for satellite telemetry
This is the AI-ready benchmark dataset (OPSSAT-AD) containing the telemetry data acquired on board OPS-SAT, a CubeSat mission that has been operated by the European Space Agency.
It is accompanied by the paper with baseline results obtained using 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test dataset split introduced in this work, and we present a suggested set of quality metrics that should always be calculated to confront the new algorithms for anomaly detection while exploiting OPSSAT-AD. We believe that this work may become an important step toward building a fair, reproducible, and objective validation procedure that can be used to quantify the capabilities of the emerging anomaly detection techniques in an unbiased and fully transparent way.
segments.csv with the acquired telemetry signals from the ESA OPS-SAT spacecraft,
dataset.csv with the extracted synthetic features computed for each manually split and labeled telemetry segment,
code files for data processing and example modeling (dataset_generator.ipynb for data processing, modeling_examples.ipynb with simple examples, requirements.txt with details on the Python configuration, and the LICENSE file)
Citation Bogdan, R. (2024). OPSSAT-AD - anomaly detection dataset for satellite telemetry [Data set]. Ruszczak. https://doi.org/10.5281/zenodo.15108715