100+ datasets found

MNIST dataset for Outliers Detection - [ MNIST4OD ]
figshare.com
application/gzip
Updated May 17, 2024
Cite
Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.9954986.v2
Dataset updated
May 17, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Giovanni Stilo; Bardh Prenkaj
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here we present MNIST4OD, a dataset that is large in both number of dimensions and number of instances and is suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample the outliers with uniform probability from the remaining images, such that their number equals 10% of the number of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

Each file MNIST_x.csv.gz contains the corresponding dataset, where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

Statistics of each dataset:

Name | Instances | Dimensions | Number of Outliers in %
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
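The generation process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `build_outlier_dataset`, the rounding rule for the outlier count, and the position of the class column (second to last) are all hypothetical, and a tiny synthetic array stands in for the real MNIST images.

```python
import random

def build_outlier_dataset(images, labels, inlier_digit, outlier_ratio=0.10, seed=0):
    """Return rows laid out as [pixels..., original_class, outlier_flag].

    The description only says the outlier label is the last column and that a
    class column exists; placing the class second to last is an assumption.
    """
    rng = random.Random(seed)
    inliers = [i for i, y in enumerate(labels) if y == inlier_digit]
    others = [i for i, y in enumerate(labels) if y != inlier_digit]
    # Outliers number 10% of the inliers; the rounding rule is an assumption.
    n_outliers = round(outlier_ratio * len(inliers))
    outliers = rng.sample(others, n_outliers)  # uniform, without replacement
    rows = []
    for idx, flag in [(i, "no") for i in inliers] + [(i, "yes") for i in outliers]:
        flat = [pixel for row in images[idx] for pixel in row]  # 28 x 28 -> 784
        rows.append(flat + [labels[idx], flag])
    return rows

# Synthetic stand-in for MNIST: 200 fake 28 x 28 images, 20 per digit.
labels = [i % 10 for i in range(200)]
images = [[[(i + r + c) % 256 for c in range(28)] for r in range(28)] for i in range(200)]
rows = build_outlier_dataset(images, labels, inlier_digit=1)
print(len(rows), len(rows[0]))
```

Fixing the seed keeps the sampled outliers reproducible, which matters when the same file is meant to be shared as a benchmark.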
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Apr 11, 2025
Cite
Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
KMASH Data Repository for outlier detection
search.datacite.org
research-repository.rmit.edu.au
Updated Aug 17, 2021
Cite
datacite (2021). KMASH Data Repository for outlier detection [Dataset]. http://doi.org/10.26180/5c6253c0b3323
Unique identifier
https://doi.org/10.26180/5c6253c0b3323
Dataset updated
Aug 17, 2021
Dataset provided by
DataCite (https://www.datacite.org/)
RMIT University
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
Australian Research Council
Description
The zip files contain 12338 datasets for outlier detection investigated in the following papers:
(1) Instance space analysis for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Kate Smith-Miles
(2) On normalization and algorithm selection for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Rob J. Hyndman, Kate Smith-Miles

Some of these datasets were originally discussed in the paper:

On the evaluation of unsupervised outlier detection: measures, datasets and an empirical study. Authors: G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenkova, E. Schubert, I. Assent, M. E. Houle.
Algorithms for Speeding up Distance-Based Outlier Detection
catalog.data.gov
cloud.csiss.gmu.edu
Updated Apr 10, 2025
Cite
Dashlink (2025). Algorithms for Speeding up Distance-Based Outlier Detection [Dataset]. https://catalog.data.gov/dataset/algorithms-for-speeding-up-distance-based-outlier-detection
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
The problem of distance-based outlier detection is difficult to solve efficiently in very large datasets because of potential quadratic time complexity. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm iOrca that is up to an order-of-magnitude faster than the state-of-the-art. The indexing scheme is based on sorting the data points in order of increasing distance from a fixed reference point and then accessing those points based on this sorted order. To speed up the basic outlier detection technique, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines, connected on a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme discussed earlier. In our experiments, both distributed algorithms exhibit significant improvements compared to the state-of-the-art distributed methods.
Data from: Methodology to filter out outliers in high spatial density data...
scielo.figshare.com
jpeg
Updated Jun 4, 2023
Cite
Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
Available download formats
jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.14305658.v1
Dataset updated
Jun 4, 2023
Dataset provided by
SciELO journals
Authors
Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point into the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI respectively, compared to interpolation errors of raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
Data from: Error and anomaly detection for intra-participant time-series...
tandf.figshare.com
xlsx
Updated Jun 1, 2023
Cite
David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
Available download formats
xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.5189002
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
David R. Mullineaux; Gareth Irwin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.
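As a rough illustration of the stage 1 idea in this abstract (one-dimensional outliers at each time point, judged by the median absolute deviation), the following Python sketch flags whole cycles. It is not the supplied Matlab code: the function name `mad_outlier_cycles` is hypothetical, and the plain multiplier `scale` replaces the t-statistic-based scaling the authors describe, whose exact form is not given here.

```python
from statistics import median

def mad_outlier_cycles(cycles, scale=3.0):
    """Return indices of cycles that deviate from the per-time-point median
    by more than scale * MAD at one or more time points.

    `cycles` is a list of equal-length sequences (one per stride cycle).
    The fixed multiplier `scale` is an assumed stand-in for the paper's
    significance-level-based scaling.
    """
    n_points = len(cycles[0])
    flagged = set()
    for t in range(n_points):
        values = [cycle[t] for cycle in cycles]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread at this time point, nothing to compare against
        for i, v in enumerate(values):
            if abs(v - med) > scale * mad:
                flagged.add(i)
    return sorted(flagged)

cycles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [1.2, 2.2, 3.2, 4.2],
    [1.0, 2.0, 9.0, 4.0],  # one aberrant value at the third time point
]
print(mad_outlier_cycles(cycles))
```

As the abstract notes, any such threshold should be tuned to the data set, and flagged cycles inspected before they are removed.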
Data from: A Diagnostic Procedure for Detecting Outliers in Linear...
tandf.figshare.com
figshare.com
txt
Updated Feb 9, 2024
Cite
Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow (2024). A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models [Dataset]. http://doi.org/10.6084/m9.figshare.12162075.v1
Available download formats
txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12162075.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francis
Authors
Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.
Data from: Outlier detection in cylindrical data based on Mahalanobis...
tandf.figshare.com
text/x-tex
Updated Jan 2, 2025
Cite
Prashant S. Dhamale; Akanksha S. Kashikar (2025). Outlier detection in cylindrical data based on Mahalanobis distance [Dataset]. http://doi.org/10.6084/m9.figshare.24092089.v1
Available download formats
text/x-tex
Unique identifier
https://doi.org/10.6084/m9.figshare.24092089.v1
Dataset updated
Jan 2, 2025
Dataset provided by
Taylor & Francis
Authors
Prashant S. Dhamale; Akanksha S. Kashikar
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cylindrical data are bivariate data formed from the combination of circular and linear variables. Identifying outliers is a crucial step in any data analysis work. This paper proposes a new distribution-free procedure to detect outliers in cylindrical data using the Mahalanobis distance concept. The use of Mahalanobis distance incorporates the correlation between the components of the cylindrical distribution, which had not been accounted for in the earlier papers on outlier detection in cylindrical data. The threshold for declaring an observation to be an outlier can be obtained via parametric or non-parametric bootstrap, depending on whether the underlying distribution is known or unknown. The performance of the proposed method is examined via extensive simulations from the Johnson-Wehrly distribution. The proposed method is applied to two real datasets, and the outliers are identified in those datasets.
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog.data.gov
datasets.ai
Updated Apr 11, 2025
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
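The nested loop with randomization and pruning described in this abstract can be sketched as follows. This is a minimal single-threaded illustration, not the authors' implementation: the function name `top_outliers`, the choice of "distance to the k-th nearest neighbour" as the outlier score, and the parameters `k` and `n` are assumptions made for the example.

```python
import heapq
import math
import random

def top_outliers(points, k=2, n=1, seed=0):
    """Return the n points with the largest distance to their k-th nearest
    neighbour, as (score, point) pairs in descending order.

    Assumes len(points) > k and that points are tuples of numbers.
    """
    rng = random.Random(seed)
    data = list(points)
    rng.shuffle(data)          # random order is what makes pruning effective
    top = []                   # min-heap holding the n strongest outliers so far
    cutoff = 0.0               # score of the weakest outlier currently in `top`
    for p_idx, p in enumerate(data):
        nearest = []           # negated distances: max-heap of the k smallest
        pruned = False
        for q_idx, q in enumerate(data):
            if q_idx == p_idx:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the k-th neighbour distance can only shrink, so
            # once it drops below the cutoff, p cannot be a top-n outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-nearest[0], p))
        if len(top) > n:
            heapq.heappop(top)
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(top_outliers(points, k=2, n=1))
```

Most non-outliers find k close neighbours after scanning only a few points and are discarded early, which is the intuition behind the near linear average-case behaviour the abstract reports.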
Data from: Privacy Preserving Outlier Detection through Random Nonlinear...
catalog.data.gov
data.amerigeoss.org
Updated Apr 10, 2025
Cite
Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
Outlier Detection on Sensor Data - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Outlier Detection on Sensor Data - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/outlier-detection-on-sensor-data
Dataset updated
Dec 2, 2024
Description
A dataset for outlier detection on sensor data, collected from temperature and humidity sensors deployed in sensorized farms and manufacturing units on Purdue University's campus.

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

zenodo.org
elki-project.github.io

application/gzip

Updated May 2, 2024

Cite

Erich Schubert; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684

Available download formats

application/gzip

Unique identifier

https://doi.org/10.5281/zenodo.6355684

Dataset updated

May 2, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Erich Schubert; Arthur Zimek

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

2022

Description

These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Feature type	Description	Files
Object number	Sparse 1000 dimensional vectors that give the true object assignment	objs.arff.gz
RGB color histograms	Standard RGB color histograms (uniform binning)	aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz
HSV color histograms	Standard HSV/HSB color histograms in various binnings	aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz
Color similarity	Average similarity to 77 reference colors (not histograms) 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black)	aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features	First 13 Haralick features (radius 1 pixel)	aloi-haralick-1.csv.gz
Front to back	Vectors representing front face vs. back faces of individual objects	front.arff.gz
Basic light	Vectors indicating basic light situations	light.arff.gz
Manual annotations	Manually annotated object groups of semantically related objects such as cups	manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type	Description	Files
RGB Histograms	Downsampled to 100000 objects (553 outliers)	aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz
	Downsampled to 75000 objects (717 outliers)	aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz
	Downsampled to 50000 objects (1508 outliers)	aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz

Vision Based Building Energy Data Outlier Detection Dataset
universe.roboflow.com
zip
Updated Apr 3, 2024
Cite
energy data outlier detection (2024). Vision Based Building Energy Data Outlier Detection Dataset [Dataset]. https://universe.roboflow.com/energy-data-outlier-detection/vision-based-building-energy-data-outlier-detection
Available download formats
zip
Dataset updated
Apr 3, 2024
Dataset authored and provided by
energy data outlier detection
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
11785 Bounding Boxes
Description
Vision Based Building Energy Data Outlier Detection

## Overview

Vision Based Building Energy Data Outlier Detection is a dataset for object detection tasks - it contains 11785 annotations for 2,159 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Anomaly Detection in High-Dimensional Data
tandf.figshare.com
txt
Updated May 30, 2023
Cite
Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles (2023). Anomaly Detection in High-Dimensional Data [Dataset]. http://doi.org/10.6084/m9.figshare.12844508.v2
Available download formats
txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12844508.v2
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.
The 12 outliers identified in the Tonga dataset.
plos.figshare.com
figshare.com
xls
Updated Jun 1, 2023
Cite
Anderson B. Mayfield; Chii-Shiarng Chen; Alexandra C. Dempsey (2023). The 12 outliers identified in the Tonga dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0185857.t004
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0185857.t004
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Anderson B. Mayfield; Chii-Shiarng Chen; Alexandra C. Dempsey
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Tonga
Description
Gene expression data have been presented as non-normalized (2^-Ct × 10^9) in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score>2.5) have been highlighted in bold. When there was a statistically significant difference (student’s t-test, p0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.
Gender_Classification_Dataset
kaggle.com
Updated Jun 19, 2024
Cite
Sameh Raouf (2024). Gender_Classification_Dataset [Dataset]. https://www.kaggle.com/datasets/samehraouf/gender-classification-dataset/suggestions
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jun 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sameh Raouf
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Title: Gender Classification Dataset

Description: This dataset contains anonymized information on height, weight, age, and gender of 10,000 individuals. The data is equally distributed between males and females, with 5,000 samples for each gender. The purpose of this dataset is to provide a comprehensive sample for studies and analyses related to physical attributes and demographics.

Content: The CSV file contains the following columns:

Gender: The gender of the individual (Male/Female)
Height: The height of the individual in centimeters
Weight: The weight of the individual in kilograms
Age: The age of the individual in years

License: This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license. This means you are free to share the data, provided that you attribute the source, do not use it for commercial purposes, and do not distribute modified versions of the data.

Usage:

This dataset can be used for:
- Analyzing the distribution of height, weight, and age across genders
- Developing and testing machine learning models for predicting physical attributes
- Educational purposes in statistics and data science courses
Anomaly Detection in Sequences
catalog.data.gov
s.cnmilf.com
Updated Apr 11, 2025
Cite
Dashlink (2025). Anomaly Detection in Sequences [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-in-sequences
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
We present a set of novel algorithms, which we call sequenceMiner, that detect and characterize anomalies in large sets of high-dimensional symbol sequences that arise from recordings of switch sensors in the cockpits of commercial airliners. While the algorithms we present are general and domain-independent, we focus on a specific problem that is critical to determining system-wide health of a fleet of aircraft. The approach taken uses unsupervised clustering of sequences using the normalized length of the longest common subsequence (nLCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. In this method, an outlier sequence is defined as a sequence that is far away from a cluster. We present new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence is deemed to be an outlier. The algorithm provides a coherent description to an analyst of the anomalies in the sequence when compared to more normal sequences. The final section of the paper demonstrates the effectiveness of sequenceMiner for anomaly detection on a real set of discrete sequence data from a fleet of commercial airliners. We show that sequenceMiner discovers actionable and operationally significant safety events. We also compare our innovations with standard Hidden Markov Models, and show that our methods are superior.
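To make the similarity measure concrete, here is a small Python sketch of a normalized longest-common-subsequence score. It is an illustration only: the abstract does not state how the LCS length is normalized, so dividing by the geometric mean of the two sequence lengths is an assumption, and the function names `lcs_length` and `nlcs` are hypothetical.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming
    that keeps only the previous row of the table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs(a, b):
    """LCS length divided by sqrt(len(a) * len(b)); this normalization is an
    assumption. The score lies in [0, 1], with 1 for identical sequences."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))

# Symbol sequences could be, for example, lists of switch identifiers.
print(nlcs(["gear", "flaps", "brake"], ["gear", "brake"]))
```

A clustering step can then use `1 - nlcs(a, b)` as a dissimilarity, so that sequences far from every cluster become outlier candidates.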
cifar10-outlier
huggingface.co
Updated Jul 3, 2023
Cite
Renumics (2023). cifar10-outlier [Dataset]. https://huggingface.co/datasets/renumics/cifar10-outlier
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jul 3, 2023
Dataset authored and provided by
Renumics
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "cifar10-outlier"

📚 This dataset is an enriched version of the CIFAR-10 Dataset. The workflow is described in the medium article: Changes of Embeddings during Fine-Tuning of Transformers.

Explore the Dataset

The open-source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here:

Full Version (High hardware requirement)… See the full description on the dataset page: https://huggingface.co/datasets/renumics/cifar10-outlier.
R code
figshare.com
txt
Updated Jun 5, 2017
Cite
Christine Dodge (2017). R code [Dataset]. http://doi.org/10.6084/m9.figshare.5021297.v1
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.5021297.v1
Dataset updated
Jun 5, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Christine Dodge
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
Data from: Statistical context dictates the relationship between...
datadryad.org
data.niaid.nih.gov
zip
Updated Aug 21, 2019
Cite
Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank (2019). Statistical context dictates the relationship between feedback-related EEG signals and learning [Dataset]. http://doi.org/10.5061/dryad.570pf8n
Explore at:
zip (available download formats)
Unique identifier
https://doi.org/10.5061/dryad.570pf8n
Dataset updated
Aug 21, 2019
Dataset provided by
Dryad
Authors
Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank
Time period covered
2019
Description
201_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data from subject 201
203_Cannon_FILT_altLow_STIM.mat: Cleaned EEG data from participant 203
204_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 204
205_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 205
206_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 206
207_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 207
210_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 210
211_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 211
212_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 212
213_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 213
214_Cannon_FILT_altLow_STIM.mat
215_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 215
216_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 216
229_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 229
233_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for particip...

MNIST dataset for Outliers Detection - [ MNIST4OD ]

2 scholarly articles cite this dataset (View in Google Scholar)


MNIST dataset for Outliers Detection - [ MNIST4OD ]

Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

KMASH Data Repository for outlier detection

Algorithms for Speeding up Distance-Based Outlier Detection

Data from: Methodology to filter out outliers in high spatial density data...

Data from: Error and anomaly detection for intra-participant time-series...

Data from: A Diagnostic Procedure for Detecting Outliers in Linear...

Data from: Outlier detection in cylindrical data based on Mahalanobis...

Data from: Mining Distance-Based Outliers in Near Linear Time

Data from: Privacy Preserving Outlier Detection through Random Nonlinear...

Outlier Detection on Sensor Data - Dataset - LDM

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

Vision Based Building Energy Data Outlier Detection Dataset

Vision Based Building Energy Data Outlier Detection

Anomaly Detection in High-Dimensional Data

The 12 outliers identified in the Tonga dataset.

Gender_Classification_Dataset

Anomaly Detection in Sequences

cifar10-outlier

R code

Data from: Statistical context dictates the relationship between...

MNIST dataset for Outliers Detection - [ MNIST4OD ]