39 datasets found

MNIST dataset for Outliers Detection - [ MNIST4OD ]
figshare.com
application/gzip
Updated May 17, 2024
Cite
Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.9954986.v2
Dataset updated
May 17, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Giovanni Stilo; Bardh Prenkaj
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here we present MNIST4OD, a dataset that is large in both number of dimensions and number of instances and is suitable for outlier detection tasks. The dataset is based on the well-known MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD as follows. To distinguish between outliers and inliers, we choose the images belonging to one digit (e.g. digit 1) as inliers and sample outliers with uniform probability from the remaining images, such that the number of outliers equals 10% of the number of inliers. We repeat this generation process for all digits. For implementation simplicity, we then flatten the images (28 x 28) into vectors.

Each file MNIST_x.csv.gz contains the dataset in which the inlier class is x. Each line contains one instance (vector); the last column holds the outlier label (yes/no) of the data point. The data also contains a column indicating the original image class (0-9).

Statistics of each dataset ( Name | Instances | Dimensions | Number of Outliers in % ):

MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
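
The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' generation script: the function name `build_od_dataset`, the rounding rule for the 10% outlier count, and the random seed are all hypothetical, and random arrays stand in for the real MNIST images.

```python
import numpy as np

def build_od_dataset(images, labels, inlier_digit, outlier_frac=0.10, seed=0):
    """Build one outlier-detection dataset from labelled 28 x 28 images.

    Hypothetical helper: all images of `inlier_digit` become inliers, and
    outliers are drawn uniformly (without replacement) from the other digits.
    """
    rng = np.random.default_rng(seed)
    inlier_idx = np.flatnonzero(labels == inlier_digit)
    other_idx = np.flatnonzero(labels != inlier_digit)
    # Assumption: the outlier count is 10% of the inlier count, rounded.
    n_out = int(round(outlier_frac * len(inlier_idx)))
    outlier_idx = rng.choice(other_idx, size=n_out, replace=False)
    idx = np.concatenate([inlier_idx, outlier_idx])
    X = images[idx].reshape(len(idx), -1)  # flatten 28 x 28 -> 784
    is_outlier = np.concatenate(
        [np.zeros(len(inlier_idx), dtype=int), np.ones(n_out, dtype=int)]
    )
    return X, labels[idx], is_outlier

# Synthetic stand-in for MNIST: 1000 random images, 100 per digit.
rng = np.random.default_rng(42)
images = rng.integers(0, 256, size=(1000, 28, 28), dtype=np.uint8)
labels = np.arange(1000) % 10
X, original_class, is_outlier = build_od_dataset(images, labels, inlier_digit=1)
```

The three returned arrays mirror the columns described above: the flattened pixel vector, the original image class, and the outlier label.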
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog.data.gov
datasets.ai
Updated Apr 11, 2025
+ more versions
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
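
The idea in the abstract (a nested loop over randomly ordered data, with a pruning rule) can be illustrated with a minimal Python sketch. It assumes the outlier score is the distance to the k-th nearest neighbour and processes one example at a time; the abstract does not fix either choice, and the function name `top_outliers` and its parameters are hypothetical.

```python
import heapq
import numpy as np

def top_outliers(X, k=3, n_top=2, seed=0):
    """Return the n_top points with the largest k-th nearest-neighbour distance.

    Sketch only: assumes len(X) > k and Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # scan the data in random order
    cutoff = 0.0                     # score of the weakest outlier kept so far
    top = []                         # min-heap of (score, index)
    for i in order:
        neigh = []                   # max-heap (negated) of the k closest distances
        pruned = False
        for j in order:
            if i == j:
                continue
            d = float(np.linalg.norm(X[i] - X[j]))
            if len(neigh) < k:
                heapq.heappush(neigh, -d)
            elif d < -neigh[0]:
                heapq.heapreplace(neigh, -d)
            # Pruning rule: the running k-th distance only shrinks, so once it
            # drops below the cutoff this point cannot be a top outlier.
            if len(neigh) == k and -neigh[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-neigh[0], int(i)))
        if len(top) > n_top:
            heapq.heappop(top)
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# Usage: a Gaussian cloud with two far-away points planted at indices 0 and 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[0] = [50.0, 50.0]
X[1] = [-40.0, 60.0]
result = top_outliers(X, k=3, n_top=2)
```

Most non-outliers find k close neighbours after only a few comparisons and leave the inner loop early, which is the source of the efficiency the abstract describes.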
Data from: Multivariate Functional Data Visualization and Outlier Detection
tandf.figshare.com
application/x-rar
Updated Jun 2, 2023
Cite
Wenlin Dai; Marc G. Genton (2023). Multivariate Functional Data Visualization and Outlier Detection [Dataset]. http://doi.org/10.6084/m9.figshare.6308771.v1
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.6308771.v1
Dataset updated
Jun 2, 2023
Dataset provided by
Taylor & Francis
Authors
Wenlin Dai; Marc G. Genton
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This article proposes a new graphical tool, the magnitude-shape (MS) plot, for visualizing both the magnitude and shape outlyingness of multivariate functional data. The proposed tool builds on the recent notion of functional directional outlyingness, which measures the centrality of functional data by simultaneously considering the level and the direction of their deviation from the central region. The MS-plot intuitively presents not only levels but also directions of magnitude outlyingness on the horizontal axis or plane, and demonstrates shape outlyingness on the vertical axis. A dividing curve or surface is provided to separate nonoutlying data from the outliers. Both the simulated data and the practical examples confirm that the MS-plot is superior to existing tools for visualizing centrality and detecting outliers for functional data. Supplementary material for this article is available online.
Marketing_bank
kaggle.com
Updated May 20, 2021
+ more versions
Cite
ANANYA SINHA 20MCA0210 (2021). Marketing_bank [Dataset]. https://www.kaggle.com/ananyasinha20mca0210/marketing-bank/code
Explore at:
Dataset updated
May 20, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
ANANYA SINHA 20MCA0210
Description
Data Set Information:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs)
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs)

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).
Registration examples with different initial outlier ratios.
plos.figshare.com
figshare.com
tiff
Updated May 31, 2023
Cite
Lei Peng; Guangyao Li; Mang Xiao; Li Xie (2023). Registration examples with different initial outlier ratios. [Dataset]. http://doi.org/10.1371/journal.pone.0148483.g008
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0148483.g008
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Lei Peng; Guangyao Li; Mang Xiao; Li Xie
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The same point sets are assigned different initial outlier ratios (0.1, 0.3, 0.5, 0.7, and 0.9). The top shows the registration results of the CPD algorithm, and the bottom shows those of our method.
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
Updated Apr 11, 2025
+ more versions
Cite
Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
Data from: Simultaneous Outlier Detection and Prediction for Kriging with...
tandf.figshare.com
zip
Updated May 30, 2025
Cite
Youjie Zeng; Zhanfeng Wang; Youngjo Lee; Niansheng Tang (2025). Simultaneous Outlier Detection and Prediction for Kriging with True Identification [Dataset]. http://doi.org/10.6084/m9.figshare.28715504.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28715504.v1
Dataset updated
May 30, 2025
Dataset provided by
Taylor & Francis
Authors
Youjie Zeng; Zhanfeng Wang; Youngjo Lee; Niansheng Tang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Kriging with interpolation is widely used in various noise-free areas, such as computer experiments. However, owing to its Gaussian assumption, it is susceptible to outliers, which affects statistical inference, and the resulting conclusions could be misleading. Little work has explored outlier detection for kriging. Therefore, we propose a novel kriging method for simultaneous outlier detection and prediction by introducing a normal-gamma prior, which results in an unbounded penalty on the biases to distinguish outliers from normal data points. We develop a simple and efficient method, avoiding the expensive computation of the Markov chain Monte Carlo algorithm, to simultaneously detect outliers and make a prediction. We establish the true identification property for outlier detection and the consistency of the estimated hyperparameters in kriging under the increasing domain framework as if the number and locations of the outliers were known in advance. Under appropriate regularity conditions, we demonstrate information consistency for prediction in the presence of outliers. Numerical studies and real data examples show that the proposed method generally provides robust analyses in the presence of outliers. Supplementary materials for this article are available online.
Replication data for: Linear Models with Outliers: Choosing between...
datasearch.gesis.org
dataverse.harvard.edu
+1more
Updated Jan 22, 2020
Cite
Harden, Jeffrey; Desmarais, Bruce (2020). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. https://datasearch.gesis.org/dataset/httpsdataverse.unc.eduoai--hdl1902.2911608
Explore at:
Dataset updated
Jan 22, 2020
Dataset provided by
Odum Institute Dataverse Network
Authors
Harden, Jeffrey; Desmarais, Bruce
Description
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
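
To make the contrast between the two estimators concrete, here is a minimal Python sketch comparing OLS with median regression on toy data that contains a few gross outliers. It does not implement the hypothesis test the authors propose; the median regression is approximated by iteratively reweighted least squares, which is an assumption of this sketch, and the function names and data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def median_regression_irls(X, y, n_iter=100, eps=1e-8):
    """Approximate least-absolute-deviations (median) regression.

    Each pass solves a weighted least squares problem with weights
    1 / |residual|, so large residuals lose influence.
    """
    coef = ols_fit(X, y)
    for _ in range(n_iter):
        resid = np.abs(y - X @ coef)
        sw = np.sqrt(1.0 / np.maximum(resid, eps))
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# Toy data: y = 1 + 2x, with the five largest-x points shifted up by 100.
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x
y[-5:] += 100.0
X = np.column_stack([np.ones_like(x), x])

ols = ols_fit(X, y)
mr = median_regression_irls(X, y)
```

On data like this the OLS slope is pulled toward the shifted points, while the median-regression slope stays much closer to the value used to generate the bulk of the data.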
Controlled Anomalies Time Series (CATS) Dataset
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Jul 11, 2024
+ more versions
Cite
Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8338435
Explore at:
Available download formats: csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.8338435
Dataset updated
Jul 11, 2024
Dataset provided by
Solenix Engineering GmbH
Authors
Patrick Fleith; Patrick Fleith
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:

4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.

3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.

5 million timestamps. Sensor readings are sampled at 1 Hz.

1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.

4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).

200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.

Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.

Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.

Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful for evaluating how well algorithms trace anomalies back to the right root cause channel.

Affected channels. In addition to the root cause channel in which the anomaly first developed, we provide information on the channels possibly affected by the anomaly. This can also be useful for evaluating the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).

Obvious anomalies. The simulated anomalies have been designed to be "easy" for the human eye to detect (i.e., there are very large spikes or oscillations), and hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that cannot detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.

Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.

Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.

No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
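
Two of the properties listed above, the purely nominal prefix and the noise-free signals, suggest a simple experimental setup, sketched here in Python. The sketch uses a scaled-down synthetic array instead of the real files (5,000 timestamps rather than 5 million, with the first 1,000 treated as nominal); the function names, the Gaussian noise model, and the amplitude are assumptions made for illustration.

```python
import numpy as np

def split_nominal_prefix(series, n_nominal):
    """Split a (timestamps, channels) array into a nominal training prefix
    and the remaining evaluation part."""
    return series[:n_nominal], series[n_nominal:]

def add_gaussian_noise(series, amplitude, seed=0):
    """Return a copy of the series with zero-mean Gaussian noise added."""
    rng = np.random.default_rng(seed)
    return series + rng.normal(0.0, amplitude, size=series.shape)

# Synthetic stand-in: 17 clean sinusoidal channels, 5000 timestamps.
t = np.arange(5000)
clean = np.sin(2 * np.pi * t[:, None] / 200.0 + np.arange(17)[None, :])

# Learn "normal" behaviour on the prefix, evaluate on the rest.
train, evaluation = split_nominal_prefix(clean, n_nominal=1000)

# Robustness-to-noise test: perturb the evaluation part at a chosen amplitude.
noisy_eval = add_gaussian_noise(evaluation, amplitude=0.05)
```

Repeating the last step over a range of amplitudes gives a noise-sensitivity curve for a detector against the clean baseline.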

Change Log

Version 2

Metadata: we include a metadata.csv with information about:

Anomaly categories

Root cause channel (signal in which the anomaly is first visible)

Affected channel (signal to which the anomaly might propagate through coupled system dynamics)

Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.

Two data files: CSV and parquet for convenience.

[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

About Solenix

Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Performances of the registration methods for different initial outlier...
plos.figshare.com
tiff
Updated May 31, 2023
Cite
Lei Peng; Guangyao Li; Mang Xiao; Li Xie (2023). Performances of the registration methods for different initial outlier ratios. [Dataset]. http://doi.org/10.1371/journal.pone.0148483.g009
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0148483.g009
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Lei Peng; Guangyao Li; Mang Xiao; Li Xie
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Two groups of examples with outlier-to-data ratios of 0.5 (a) and 2.0 (b) are chosen to test the influence of different initial outlier ratios (from 0 to 0.9) on the registration results. The error means and standard deviations over the 100 examples in each group are compared between our method and CPD.
Model checkpoints for "Outlier-Aware Training for Improving Group Accuracy...
explore.openaire.eu
Updated Oct 28, 2022
Cite
Li-Kuang Chen; Canasai Kruengkrai; Junichi Yamagishi (2022). Model checkpoints for "Outlier-Aware Training for Improving Group Accuracy Disparities" [Dataset]. http://doi.org/10.5281/zenodo.7260028
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7260028
Dataset updated
Oct 28, 2022
Authors
Li-Kuang Chen; Canasai Kruengkrai; Junichi Yamagishi
Description
This is the collection of model checkpoints (as well as example outputs and data) for reproducing the experiments in "Outlier-Aware Training for Improving Group Accuracy Disparities". Our code is available at: https://github.com/nii-yamagishilab/jtt-m.
Supporting data for "A Standard Operating Procedure for Outlier Removal in...
search.dataone.org
dataverse.no
+1more
Updated Jul 29, 2024
Cite
Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. http://doi.org/10.18710/FGVLKS
Explore at:
Unique identifier
https://doi.org/10.18710/FGVLKS
Dataset updated
Jul 29, 2024
Dataset provided by
DataverseNO
Authors
Holsbø, Einar
Description
This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.
Goodness-of-fit filtering in classical metric multidimensional scaling with...
tandf.figshare.com
pdf
Updated Jun 1, 2023
Cite
Jan Graffelman (2023). Goodness-of-fit filtering in classical metric multidimensional scaling with large datasets [Dataset]. http://doi.org/10.6084/m9.figshare.11389830.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.11389830.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
Jan Graffelman
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.
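
The eigenvalue-based overall goodness-of-fit mentioned in the description can be illustrated with a short Python sketch of classical metric MDS. It covers only the overall statistic, not the point and pairwise statistics the paper proposes, and it assumes one common convention: the sum of the first k eigenvalues divided by the sum of the positive eigenvalues. The function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical metric MDS of a distance matrix D.

    Returns the k-dimensional coordinates, all eigenvalues (descending),
    and the overall goodness-of-fit under the convention assumed here.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coords = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0.0))
    gof = eigvals[:k].sum() / eigvals[eigvals > 0].sum()
    return coords, eigvals, gof

# Usage: distances between 30 random planar points are recovered in 2-D.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords, eigvals, gof = classical_mds(D, k=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

For exactly planar data the two leading eigenvalues carry essentially all of the positive spectrum, so the overall fit is close to 1; with outliers or non-Euclidean distances it drops, which is the situation the per-observation statistics are meant to diagnose.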
Controlled Anomalies Time Series (CATS) Dataset
zenodo.org
bin
Updated Jul 12, 2024
+ more versions
Cite
Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7646897
Dataset updated
Jul 12, 2024
Dataset provided by
Solenix Engineering GmbH
Authors
Patrick Fleith; Patrick Fleith
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:

4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.

3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.

5 million timestamps. Sensor readings are sampled at 1 Hz.

1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.

4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).

200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.

Different types of anomalies to understand what anomaly types can be detected by different approaches.

Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.

Obvious anomalies. The simulated anomalies have been designed to be "easy" for the human eye to detect (i.e., there are very large spikes or oscillations), and hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that cannot detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.

Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.

Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.

No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

About Solenix

Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Clinical Examples of the Various Categories of Each Characteristic of...
plos.figshare.com
xls
Updated May 22, 2024
Cite
Ghayath Janoudi; Mara Uzun (Rada); Deshayne B. Fell; Joel G. Ray; Angel M. Foster; Randy Giffen; Tammy Clifford; Mark C. Walker (2024). Clinical Examples of the Various Categories of Each Characteristic of Outlier. [Dataset]. http://doi.org/10.1371/journal.pdig.0000515.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000515.t001
Dataset updated
May 22, 2024
Dataset provided by
PLOS Digital Health
Authors
Ghayath Janoudi; Mara Uzun (Rada); Deshayne B. Fell; Joel G. Ray; Angel M. Foster; Randy Giffen; Tammy Clifford; Mark C. Walker
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clinical Examples of the Various Categories of Each Characteristic of Outlier.
Data to accompany the outlier-waveform-detection Github repository (internal...
zenodo.org
bin, txt
Updated May 23, 2024
Cite
Karin Cox; Karin Cox; Daisuke Kase; Daisuke Kase; Robert Turner; Robert Turner (2024). Data to accompany the outlier-waveform-detection Github repository (internal globus pallidus, GPi) [Dataset]. http://doi.org/10.5281/zenodo.11077189
Explore at:
Available download formats: bin, txt
Unique identifier
https://doi.org/10.5281/zenodo.11077189
Dataset updated
May 23, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Karin Cox; Karin Cox; Daisuke Kase; Daisuke Kase; Robert Turner; Robert Turner
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains data based on neuronal recordings from two monkeys (G and I, in the pre- and post-MPTP states) that serve as input to the code provided at https://github.com/turner-lab-pitt/outlier-waveform-detection. Text files located within that Github repository provide detailed instructions on how these data may be used with that code. As described in those text files, extra data are provided for Monkey G, in the pre-MPTP state.

The data-description.txt file provides detailed information regarding the contents of each zipped tar archive. Briefly, the most important components of the files are the "snips" (individual spike waveforms) from the two monkeys and MPTP states, as extracted for each of a series of single sorted units from the internal globus pallidus (GPi). The additional G-Pre data provides examples of the high-pass filtered voltage signals from which these snips were extracted. All data are stored in the Matlab .mat format.

All zipped files can be decompressed with 7-zip: https://www.7-zip.org/

These data and the associated Github code were used for analyses reported in an in-preparation manuscript (Kase et al., "Movement-related activity in the internal globus pallidus of the parkinsonian macaque"), and also with a preprint that is currently under review:

Detecting rhythmic spiking through the power spectra of point process model residuals

Karin M. Cox, Daisuke Kase, Taieb Znati, Robert S. Turner

bioRxiv 2023.09.08.556120; doi: https://doi.org/10.1101/2023.09.08.556120

This research was funded in part by Aligning Science Across Parkinson's [ASAP-020519] through the Michael J. Fox Foundation for Parkinson's Research (MJFF). For the purpose of open access, the authors have applied a Creative Commons Attribution 4.0 International (CC BY) public copyright license to this dataset.
Data from: Statistical context dictates the relationship between...
datadryad.org
data.niaid.nih.gov
zip
Updated Aug 21, 2019
Cite
Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank (2019). Statistical context dictates the relationship between feedback-related EEG signals and learning [Dataset]. http://doi.org/10.5061/dryad.570pf8n
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.570pf8n
Dataset updated
Aug 21, 2019
Dataset provided by
Dryad
Authors
Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank
Time period covered
2019
Description
201_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data from subject 201
203_Cannon_FILT_altLow_STIM.mat: Cleaned EEG data from participant 203
204_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 204
205_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 205
206_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 206
207_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 207
210_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 210
211_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 211
212_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 212
213_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 213
214_Cannon_FILT_altLow_STIM.mat
215_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 215
216_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 216
229_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 229
233_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for particip...
Data from: The search for loci under selection: trends, biases and progress
figshare.mq.edu.au
researchdata.edu.au
bin
Updated Jun 15, 2023
Cite
Collin W. Ahrens; Paul D. Rymer; Adam Stow; Jason Bragg; Shannon Dillon; Kate D. L. Umbers; Rachael Y. Dudaniec (2023). Data from: The search for loci under selection: trends, biases and progress [Dataset]. http://doi.org/10.5061/dryad.jq5g627
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5061/dryad.jq5g627
Dataset updated
Jun 15, 2023
Dataset provided by
Macquarie University
Authors
Collin W. Ahrens; Paul D. Rymer; Adam Stow; Jason Bragg; Shannon Dillon; Kate D. L. Umbers; Rachael Y. Dudaniec
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
FST outlier analysis (OA) and environmental association analyses (EAA) are popular approaches for detecting genetic variants under selection, and they provide insight into the genetic basis of local adaptation. Despite the frequent use of OA and EAA approaches and their increasing attractiveness for detecting signatures of selection, their application to field-based empirical data has not been synthesized. Here, we review 66 empirical studies that use Single Nucleotide Polymorphisms (SNPs) in OA and EAA. We report trends and biases across biological systems, sequencing methods, approaches, parameters, environmental variables and their influence on detecting signatures of selection. We found striking variability in both the use and reporting of environmental data and statistical parameters. For example, linkage disequilibrium among SNPs and numbers of unique SNP associations identified with EAA were rarely reported. The proportion of putatively adaptive SNPs detected varied widely among studies, and decreased with the number of SNPs analyzed. We found that genomic sampling effort had a greater impact than biological sampling effort on the proportion of identified SNPs under selection. OA identified a higher proportion of outliers when more individuals were sampled, but this was not the case for EAA. To facilitate repeatability, interpretation and synthesis of studies detecting selection, we recommend that future studies consistently report geographic coordinates, environmental data, model parameters, linkage disequilibrium, and measures of genetic structure. Identifying standards for how OA and EAA studies are designed and reported will aid future transparency and comparability of SNP-based selection studies and help to progress landscape and evolutionary genomics.

Usage Notes

Table S1 - Full data set. Data was collected by reading papers associated with environmental association analyses. Data includes location, species, methods used, genetic parameters of data sets reviewed, and analytical parameters of the analyses. File: Table S1_data.xlsx

R code for mixed-effects linear models. The R code used to create the figures and estimate regressions of the data set. File: Ahrens et al 2018_MolEcol_review.R

Data for: "Model-free estimation of completeness, uncertainties, and...

zenodo.org

application/gzip

Updated Mar 14, 2025

Cite

Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi (2025). Data for: "Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory" [Dataset]. http://doi.org/10.5281/zenodo.15025644

Explore at:

Available download formats: application/gzip

Unique identifier

https://doi.org/10.5281/zenodo.15025644

Dataset updated

Mar 14, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Mar 14, 2025

Description

# Data for: Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory

This dataset contains the raw data to reproduce the paper:

D. Schwalbe-Koda, S. Hamel, B. Sadigh, F. Zhou, V. Lordi. "Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory". arXiv:2404.12367 (2024). DOI: [10.48550/arXiv.2404.12367](https://doi.org/10.48550/arXiv.2404.12367)

The archive `2025-quests-data.tar.gz` contains all the raw data needed to reproduce the paper.
Its contents are organized by section of the paper (01 through 05) and by supplementary information section (A01 through A11).
The directory structure is the following:


```
  data/
  ├── 02-Aluminum
  ├── 02-GAP20
  ├── 02-rMD17
  ├── 04-TM23
  ├── 05-Cu
  ├── 05-Ta
  ├── A08-Denoiser
  ├── A11-Cu
  ├── A11-QTB
  └── A11-Sn
```

The tarfile contains files of the following formats:

- CSV files containing tables with the data for the analysis
- JSON files containing structured data for the analysis
- logfiles from LAMMPS simulations
- Extended XYZ files containing the results of MD trajectories or materials structure data
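
As a rough illustration of the layout described above, the following Python sketch counts the files found under each top-level section directory of an archive. It is only a sketch under stated assumptions: the helper names `count_files_by_section` and `_demo_archive` are hypothetical, and the file names inside the demo archive are placeholders invented so the example runs without downloading the real data.

```python
import io
import tarfile
from collections import Counter


def count_files_by_section(tar: tarfile.TarFile) -> Counter:
    """Count regular files under each data/<section>/ directory."""
    counts = Counter()
    for member in tar.getmembers():
        # Drop empty and "." components so "./data/..." is handled too.
        parts = [p for p in member.name.split("/") if p not in ("", ".")]
        if member.isfile() and len(parts) >= 3 and parts[0] == "data":
            counts[parts[1]] += 1
    return counts


def _demo_archive() -> bytes:
    """Build a tiny in-memory tar.gz with placeholder (invented) file names."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in (
            "data/02-Aluminum/example.csv",
            "data/05-Cu/example.json",
            "data/05-Cu/example.log",
        ):
            payload = b"placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(_demo_archive()), mode="r:gz") as tar:
    counts = count_files_by_section(tar)

for section, n_files in sorted(counts.items()):
    print(section, n_files)
```

To inspect the real archive, one would presumably replace the in-memory demo with `tarfile.open("2025-quests-data.tar.gz", mode="r:gz")`, assuming the file has been downloaded to the working directory.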

### Citing

If you use QUESTS or its data/examples in a publication, please cite the following paper:

```bibtex
@article{schwalbekoda2024information,
  title = {Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory},
  author = {Schwalbe-Koda, Daniel and Hamel, Sebastien and Sadigh, Babak and Zhou, Fei and Lordi, Vincenzo},
  year = {2024},
  journal = {arXiv:2404.12367},
  url = {https://arxiv.org/abs/2404.12367},
  doi = {10.48550/arXiv.2404.12367},
}
```

Registration examples on the fish point set.
plos.figshare.com
tiff
Updated Jun 1, 2023
Cite
Lei Peng; Guangyao Li; Mang Xiao; Li Xie (2023). Registration examples on the fish point set. [Dataset]. http://doi.org/10.1371/journal.pone.0148483.g001
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0148483.g001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Lei Peng; Guangyao Li; Mang Xiao; Li Xie
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
From top to bottom are the four largest degradations: deformation (0.08), noise (0.05), occlusion (0.5), and outlier (2.0). The goal is to align the model point set (blue pluses) onto the scene point set (red circles).

Cite

Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2

MNIST dataset for Outliers Detection - [ MNIST4OD ]

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: application/gzip

Unique identifier

https://doi.org/10.6084/m9.figshare.9954986.v2

Dataset updated

May 17, 2024

Dataset provided by

Figshare (http://figshare.com/)

Authors

Giovanni Stilo; Bardh Prenkaj

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Here we present a dataset, MNIST4OD, of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and we sample with uniform probability from the remaining images as outliers, such that their number is equal to 10% of that of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) in each line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

The statistics of each dataset are listed below:

Name | Instances | Dimensions | Number of Outliers in %
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
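
The construction procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_outlier_dataset` is hypothetical, synthetic labels stand in for the real MNIST labels, and rounding the outlier count up to the next integer is an assumption, since the description does not state a rounding rule.

```python
import random


def build_outlier_dataset(labels, inlier_digit, outlier_percent=10, seed=0):
    """Pick all images of one digit as inliers and sample outliers from the rest.

    Returns the selected indices and a parallel list of outlier flags.
    """
    rng = random.Random(seed)
    inliers = [i for i, y in enumerate(labels) if y == inlier_digit]
    candidates = [i for i, y in enumerate(labels) if y != inlier_digit]
    # Number of outliers is outlier_percent % of the inlier count.
    # Integer ceiling division; the rounding direction is an assumption.
    n_outliers = (len(inliers) * outlier_percent + 99) // 100
    # Uniform sampling without replacement from the non-inlier images.
    outliers = rng.sample(candidates, n_outliers)
    indices = inliers + outliers
    is_outlier = [False] * len(inliers) + [True] * len(outliers)
    return indices, is_outlier


# Synthetic stand-in for MNIST labels: 100 examples of each digit 0-9.
labels = [i % 10 for i in range(1000)]
indices, is_outlier = build_outlier_dataset(labels, inlier_digit=1)
print(len(indices), sum(is_outlier))  # prints: 110 10
```

Because the outlier count is defined relative to the number of inliers, outliers make up slightly less than 10% of each resulting dataset (10 of 110 instances in this toy example).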
