82 datasets found

MNIST dataset for Outliers Detection - [ MNIST4OD ]
figshare.com
application/gzip
Updated May 17, 2024
Cite
Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.9954986.v2
Dataset updated
May 17, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Giovanni Stilo; Bardh Prenkaj
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Here we present MNIST4OD, a large dataset (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).

We build MNIST4OD in the following way. To distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample with uniform probability from the remaining images as outliers, such that their number is equal to 10% of the number of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.

Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).

Statistics of each dataset:

Name | Instances | Dimensions | Number of Outliers in %
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
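The construction described for MNIST4OD can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the function name `build_od_dataset`, the position of the original-class column (placed just before the outlier label), and rounding the outlier count up are all assumptions, and the toy 2 x 2 "images" stand in for real 28 x 28 MNIST digits.

```python
import math
import random


def build_od_dataset(images, labels, inlier_digit, outlier_percent=10, seed=0):
    """Build one MNIST4OD-style table for a chosen inlier digit.

    images: list of 2-D pixel grids; labels: list of digit classes.
    Returns rows shaped [pixel..., original_class, outlier_label].
    """
    rng = random.Random(seed)
    inlier_idx = [i for i, y in enumerate(labels) if y == inlier_digit]
    other_idx = [i for i, y in enumerate(labels) if y != inlier_digit]

    # Assumption: the outlier count (10% of the inliers) is rounded up.
    n_outliers = math.ceil(len(inlier_idx) * outlier_percent / 100)
    # Uniform sampling without replacement from all remaining images.
    outlier_idx = rng.sample(other_idx, n_outliers)

    tagged = [(i, "no") for i in inlier_idx] + [(i, "yes") for i in outlier_idx]
    rows = []
    for idx, flag in tagged:
        # Flatten the 2-D image into a single vector.
        flat = [pixel for image_row in images[idx] for pixel in image_row]
        # Hypothetical column order: pixels, original class, outlier label.
        rows.append(flat + [labels[idx], flag])
    rng.shuffle(rows)
    return rows


# Toy stand-in for MNIST: 20 images of digit 1 and 10 each of digits 0, 2, 3.
toy_labels = [1] * 20 + [d for d in (0, 2, 3) for _ in range(10)]
toy_images = [[[label, 0], [0, label]] for label in toy_labels]
toy_rows = build_od_dataset(toy_images, toy_labels, inlier_digit=1)
```

With 20 inliers the sketch draws 2 outliers, so `toy_rows` has 22 rows. On real MNIST, each row would carry 784 pixel values followed by the class and the yes/no label.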
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Apr 11, 2025
Cite
Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
KMASH Data Repository for outlier detection
search.datacite.org
research-repository.rmit.edu.au
Updated Aug 17, 2021
Cite
datacite (2021). KMASH Data Repository for outlier detection [Dataset]. http://doi.org/10.26180/5c6253c0b3323
Unique identifier
https://doi.org/10.26180/5c6253c0b3323
Dataset updated
Aug 17, 2021
Dataset provided by
DataCite (https://www.datacite.org/)
RMIT University
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
Australian Research Council
Description
The zip files contain 12,338 datasets for outlier detection investigated in the following papers:
(1) Instance space analysis for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Kate Smith-Miles
(2) On normalization and algorithm selection for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Rob J. Hyndman, Kate Smith-Miles

Some of these datasets were originally discussed in the paper:

On the evaluation of unsupervised outlier detection: measures, datasets and an empirical study. Authors: G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenkova, E. Schubert, I. Assent, M. E. Houle.
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog.data.gov
datasets.ai
Updated Apr 11, 2025
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
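The idea in this abstract, a nested loop over randomly ordered data with a pruning rule, can be illustrated with a small Python sketch. It is a simplified reading and not the paper's implementation: the score used here (distance to the k-th nearest neighbour), the one-example-at-a-time loop, and the names `top_outliers`, `k` and `n` are assumptions made for illustration.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return the n points with the largest k-th nearest-neighbour distance.

    Assumes len(points) > k. Result is a list of (score, index), best first.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # random order lets close neighbours turn up early

    cutoff = 0.0  # score of the weakest outlier currently in the top n
    top = []      # min-heap of (score, index)
    for i in order:
        nearest = []  # max-heap (negated distances) of the k closest so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The current k-th distance can only shrink as more points are
            # seen, so once it drops below the cutoff, i cannot reach the top n.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-nearest[0], i))
        if len(top) > n:
            heapq.heappop(top)
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)


# Five clustered points and one far-away point (index 5).
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (100, 100)]
demo_top = top_outliers(demo_points, k=2, n=1)
```

Most non-outliers leave the inner loop after a handful of comparisons once the cutoff has risen. This matches the abstract's remark that the time to process non-outliers does not depend on the size of the data set.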
Data from: Privacy Preserving Outlier Detection through Random Nonlinear...
catalog.data.gov
data.amerigeoss.org
Updated Apr 10, 2025
Cite
Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
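To make the notion of a random nonlinear distortion concrete, the sketch below applies a random linear map followed by a sigmoid to each record. It is a hypothetical illustration only: the abstract does not give the exact form of the transformation, so the Gaussian random matrix, the `scale` knob (small values keep the sigmoid in its nearly linear region, larger values make it more nonlinear) and the function name are all assumptions.

```python
import math
import random


def random_sigmoid_distortion(rows, out_dim, scale=1.0, seed=0):
    """Map each record x to sigmoid(scale * W x) for a random matrix W.

    W is drawn once from a standard normal distribution (an assumption);
    only the distorted records would be released to the data miner.
    """
    rng = random.Random(seed)
    in_dim = len(rows[0])
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def sigmoid(z):
        z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    distorted = []
    for row in rows:
        projected = [sum(w * x for w, x in zip(w_row, row)) for w_row in weights]
        distorted.append([sigmoid(scale * p) for p in projected])
    return distorted


private_rows = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5]]
released_rows = random_sigmoid_distortion(private_rows, out_dim=4)
```

An outlier detector would then be run on `released_rows` rather than on the private records. How well outliers survive the distortion is exactly what the paper's accuracy bounds address, and this sketch makes no claim about it.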
Gender_Classification_Dataset
kaggle.com
Updated Jun 19, 2024
Cite
Sameh Raouf (2024). Gender_Classification_Dataset [Dataset]. https://www.kaggle.com/datasets/samehraouf/gender-classification-dataset/suggestions
Dataset updated
Jun 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sameh Raouf
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Title: Gender Classification Dataset

Description: This dataset contains anonymized information on height, weight, age, and gender of 10,000 individuals. The data is equally distributed between males and females, with 5,000 samples for each gender. The purpose of this dataset is to provide a comprehensive sample for studies and analyses related to physical attributes and demographics.

Content: The CSV file contains the following columns:

Gender: The gender of the individual (Male/Female)
Height: The height of the individual in centimeters
Weight: The weight of the individual in kilograms
Age: The age of the individual in years

License: This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license. This means you are free to share the data, provided that you attribute the source, do not use it for commercial purposes, and do not distribute modified versions of the data.

Usage:

This dataset can be used for:
- Analyzing the distribution of height, weight, and age across genders
- Developing and testing machine learning models for predicting physical attributes
- Educational purposes in statistics and data science courses
Measure While Drilling (MWD) dataset with rock type labels for 15 Norwegian...
data.niaid.nih.gov
Updated Nov 26, 2024
Cite
Hansen, Tom Frode (2024). Measure While Drilling (MWD) dataset with rock type labels for 15 Norwegian hard rock tunnels [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10358373
Dataset updated
Nov 26, 2024
Dataset authored and provided by
Hansen, Tom Frode
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is presented in the paper:

Building and analysing a labelled Measure While Drilling dataset from 15 hard rock tunnels in Norway, by T.F. Hansen, Z. Liu, J. Torressen

The paper has a preprint on SSRN: http://dx.doi.org/10.2139/ssrn.4729646 and is under review in a peer-reviewed journal.

The dataset is utilised in a machine learning analysis in the paper:

Predicting rock type from MWD tunnel data using a reproducible ML-modelling process, by T.F. Hansen, Z. Liu, J. Torressen

The paper is published in the journal Tunnelling and Underground Space Technology:

https://doi.org/10.1016/j.tust.2024.105843

Description of the dataset:

Measure While Drilling (MWD) is a technique in rock drilling, mainly used in drill and blast tunnelling, where data about the rock mass is registered by sensors while drilling. The extensive and geologically diversified dataset contains corresponding MWD data and rock mass mappings for 5205 blasting rounds from 15 hard rock tunnels in Norway. MWD data are presented as tabular data. Ten different rock types serve as the corresponding labels.

Four files are given:

A csv-file of the training dataset - with outliers removed

A csv-file of the testing dataset (split train/test 0.75/0.25) - with outliers removed

A csv-file with the full unsplit dataset, cleaned and with outliers removed

A csv-file with the raw dataset, before cleaning, processing and outlier removal

The author gratefully acknowledges the tunnel software/hardware company Bever Control, which has facilitated data from the clients Bane NOR, Statens Vegvesen, Nye Veier, and the contractor AF-Gruppen.

NOTE: The dataset is available for research only; commercial use is not permitted.
Multi-Domain Outlier Detection Dataset
data.niaid.nih.gov
zenodo.org
Updated Mar 31, 2022
Cite
Raman, Vinay (2022). Multi-Domain Outlier Detection Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5941338
Dataset updated
Mar 31, 2022
Dataset provided by
Kerner, Hannah
Wagstaff, Kiri
Huff, Eric
Francis, Raymond
Lu, Steven
Dubayah, Bryce
Rebbapragada, Umaa
Raman, Vinay
Kulshrestha, Sakshum
Lee, Jake
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:

Astrophysics - detecting anomalous observations in the Dark Energy Survey (DES) catalog (data type: feature vectors)

Planetary science - selecting novel geologic targets for follow-up observation onboard the Mars Science Laboratory (MSL) rover (data type: grayscale images)

Earth science - detecting anomalous samples in satellite time series corresponding to ground-truth observations of maize crops (data type: time series/feature vectors)

Fashion-MNIST/MNIST - benchmark task to detect anomalous MNIST images among Fashion-MNIST images (data type: grayscale images)

Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
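The fit / score / label split maps naturally onto a three-step evaluation loop. The sketch below is only an illustration with a deliberately simple detector (sum of squared per-feature z-scores). It is not one of the DORA algorithms; the function names, the in-memory list format, and the 0/1 label encoding are all assumptions.

```python
import statistics


def fit_zscore_model(fit_rows):
    """Learn per-feature mean and standard deviation from the "fit" dataset."""
    columns = list(zip(*fit_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def score_rows(model, rows):
    """Outlier score per sample: sum of squared z-scores (higher = stranger)."""
    means, stds = model
    return [sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stds))
            for row in rows]


def precision_at_n(scores, labels, n):
    """Fraction of the n highest-scoring samples that are labelled outliers."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:n]) / n


fit_set = [[0, 0], [1, 1], [0, 1], [1, 0]]        # used for fitting only
score_set = [[0.5, 0.5], [10, 10], [0.4, 0.6]]    # analogous to a test set
label_set = [0, 1, 0]                             # 1 = outlier (assumed encoding)

model = fit_zscore_model(fit_set)
scores = score_rows(model, score_set)
p_at_1 = precision_at_n(scores, label_set, n=1)
```

Any ranking-based detector can replace the z-score model as long as it is fitted on the fit set alone and evaluated against the labels of the score set.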

To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:

Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.
ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...
elki-project.github.io
explore.openaire.eu
Updated Sep 2, 2011
Cite
Erich Schubert; Arthur Zimek (2011). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
Unique identifier
https://doi.org/10.5281/zenodo.6355684
Dataset updated
Sep 2, 2011
Dataset provided by
TU Dortmund University
University of Southern Denmark, Denmark
Authors
Erich Schubert; Arthur Zimek
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. The images are thus taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do not, however, resemble a "typical" image collection. The data set has a property that is unusual for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
Manual snow course observations, raw met data, raw snow depth observations,...
catalog.data.gov
Updated Jun 15, 2024
Cite
Climate Adaptation Science Centers (2024). Manual snow course observations, raw met data, raw snow depth observations, locations, and associated metadata for Oregon sites [Dataset]. https://catalog.data.gov/dataset/manual-snow-course-observations-raw-met-data-raw-snow-depth-observations-locations-and-ass
Dataset updated
Jun 15, 2024
Dataset provided by
Climate Adaptation Science Centers
Area covered
Oregon
Description
OSU_SnowCourse Summary: Manual snow course observations were collected over WY 2012-2014 from four paired forest-open sites chosen to span a broad elevation range. Study sites were located in the upper McKenzie (McK) River watershed, approximately 100 km east of Corvallis, Oregon, on the western slope of the Cascade Range and in the Middle Fork Willamette (MFW) watershed, located to the south of the McKenzie. The sites were designated based on elevation, with a range of 1110-1480 m. Distributed snow depth and snow water equivalent (SWE) observations were collected via monthly manual snow courses from 1 November through 1 April and bi-weekly thereafter. Snow courses spanned 500 m of forested terrain and 500 m of adjacent open terrain. Snow depth observations were collected approximately every 10 m and SWE was measured every 100 m along the snow courses with a federal snow sampler. These data are raw observations and have not been quality controlled in any way. Distance along the transect was estimated in the field.

OSU_SnowDepth Summary: 10-minute snow depth observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meteorological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These data have undergone basic quality control. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes. We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN flags for missing data to NA, and added site attributes such as site name and cover. First, we replaced positive values with NA: snow depth values in the raw data are negative (i.e., flipped, with some correction to use the height of the sensor as zero), so positive values in the raw data correspond to negative snow depths. Second, the sign of the data was switched to make the values positive. Then the smooth.m (MATLAB) function was used to roughly smooth the data, with a moving window of 50 points. Third, outliers were removed: all values higher than the smoothed values +10 were replaced with NA. In some cases, further single-point outliers were removed.

OSU_Met Summary: Raw, 10-minute meteorological observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meteorological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These stations were deployed to collect numerous meteorological variables, of which snow depth and wind speed are included here. These data are raw datalogger output and have not been quality controlled in any way. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes. We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN and 7999 flags for missing data to NA, and added site attributes such as site name and cover.

OSU_Location Summary: Location metadata for manual snow course observations and meteorological sensors. These data are compiled from GPS data for which the horizontal accuracy is unknown, and from processed hemispherical photographs. They have not been quality controlled in any way.
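The snow depth cleaning steps described in the OSU_SnowDepth summary (drop positive raw values, flip the sign, smooth with a 50-point window, remove values more than 10 above the smoothed series) can be sketched as follows. This is an approximation in Python, not the original MATLAB workflow: a centered moving average that skips missing values stands in for smooth.m, `None` stands in for NA, and the function and parameter names are assumptions.

```python
def clean_snow_depth(raw, window=50, threshold=10.0):
    """Apply the three described cleaning steps to raw snow depth readings."""
    # Steps 1 and 2: positive raw readings become missing, the rest are
    # sign-flipped so that snow depth is positive.
    depth = [None if (v is None or v > 0) else -v for v in raw]

    # Rough smoothing: centered moving average over about `window` points,
    # ignoring missing values (a stand-in for MATLAB's smooth.m).
    half = window // 2
    smoothed = []
    for i in range(len(depth)):
        values = [v for v in depth[max(0, i - half): i + half + 1]
                  if v is not None]
        smoothed.append(sum(values) / len(values) if values else None)

    # Step 3: values more than `threshold` above the smoothed series
    # are treated as outliers and set to missing.
    return [None if (v is None or s is None or v > s + threshold) else v
            for v, s in zip(depth, smoothed)]


# 61 raw readings of -20 with one spike of -80, plus one invalid positive value.
raw_depth = [-20.0] * 30 + [-80.0] + [-20.0] * 30 + [5.0]
cleaned_depth = clean_snow_depth(raw_depth)
```

In this toy series only the spike and the positive reading are set to missing. The additional manual removal of single-point outliers mentioned in the summary is not reproduced.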
Vision Based Building Energy Data Outlier Detection Dataset
universe.roboflow.com
zip
Updated Apr 3, 2024
Cite
energy data outlier detection (2024). Vision Based Building Energy Data Outlier Detection Dataset [Dataset]. https://universe.roboflow.com/energy-data-outlier-detection/vision-based-building-energy-data-outlier-detection
Available download formats: zip
Dataset updated
Apr 3, 2024
Dataset authored and provided by
energy data outlier detection
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
11785 Bounding Boxes
Description
Vision Based Building Energy Data Outlier Detection

## Overview
Vision Based Building Energy Data Outlier Detection is a dataset for object detection tasks. It contains 11,785 annotations for 2,159 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Replication data for: Linear Models with Outliers: Choosing between...
dataverse-staging.rdmc.unc.edu
dataverse.harvard.edu
pdf +1
Updated Dec 17, 2013
Cite
Jeffrey Harden; Bruce Desmarais (2013). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. https://dataverse-staging.rdmc.unc.edu/dataset.xhtml?persistentId=hdl:1902.29/11608
Available download formats: pdf (198705), text/plain; charset=us-ascii (3590), text/plain; charset=us-ascii (5482)
Dataset updated
Dec 17, 2013
Dataset provided by
UNC Dataverse
Authors
Jeffrey Harden; Bruce Desmarais
License
https://dataverse-staging.rdmc.unc.edu/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=hdl:1902.29/11608
Description
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
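The contrast between conditional-mean (OLS) and conditional-median (MR) fits can be seen on a toy example with one predictor. The sketch below is an illustration, not the authors' replication code, and it does not implement their hypothesis test. The median fit relies on the fact that a least-absolute-deviations line can always be chosen to pass through two data points, so it simply searches all pairs, which is only practical for small samples. Function names are hypothetical.

```python
from itertools import combinations


def ols_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def median_fit(xs, ys):
    """Median (least absolute deviations) regression by exhaustive search
    over all lines through two data points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression fit
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# y = 2x + 1 everywhere except one heavy-tailed error at x = 9.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[9] = 100

ols_intercept, ols_slope = ols_fit(xs, ys)
mr_intercept, mr_slope = median_fit(xs, ys)
```

Here the single outlier pulls the OLS slope far above 2, while the median fit stays on the line followed by the other nine points.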
11: Streamwater sample constituent concentration outliers from 15 watersheds...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). 11: Streamwater sample constituent concentration outliers from 15 watersheds in Gwinnett County, Georgia for water years 2003-2020 [Dataset]. https://catalog.data.gov/dataset/11-streamwater-sample-constituent-concentration-outliers-from-15-watersheds-in-gwinne-2003
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Gwinnett County, Georgia
Description
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). A total of 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.
Data from: Analyzing contentious relationships and outlier genes in...
data.niaid.nih.gov
search.dataone.org
zip
Updated Jun 5, 2018
Cite
Joseph F. Walker; Joseph W. Brown; Stephen A. Smith (2018). Analyzing contentious relationships and outlier genes in phylogenomics [Dataset]. http://doi.org/10.5061/dryad.br381mg
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.br381mg
Dataset updated
Jun 5, 2018
Dataset provided by
University of Sheffield
University of Michigan
Authors
Joseph F. Walker; Joseph W. Brown; Stephen A. Smith
License
https://spdx.org/licenses/CC0-1.0.html
Description
Recent studies have demonstrated that conflict is common among gene trees in phylogenomic studies, and that less than one percent of genes may ultimately drive species tree inference in supermatrix analyses. Here, we examined two datasets where supermatrix and coalescent-based species trees conflict. We identified two highly influential “outlier” genes in each dataset. When removed from each dataset, the inferred supermatrix trees matched the topologies obtained from coalescent analyses. We also demonstrate that, while the outlier genes in the vertebrate dataset have been shown in a previous study to be the result of errors in orthology detection, the outlier genes from a plant dataset did not exhibit any obvious systematic error and therefore may be the result of some biological process yet to be determined. While topological comparisons among a small set of alternate topologies can be helpful in discovering outlier genes, they can be limited in several ways, such as assuming all genes share the same topology. Coalescent species tree methods relax this assumption but do not explicitly facilitate the examination of specific edges. Coalescent methods often also assume that conflict is the result of incomplete lineage sorting (ILS). Here we explored a framework that allows for quickly examining alternative edges and support for large phylogenomic datasets that does not assume a single topology for all genes. For both datasets, these analyses provided detailed results confirming the support for coalescent-based topologies. This framework suggests that we can improve our understanding of the underlying signal in phylogenomic datasets by asking more targeted edge-based questions.
Controlled Anomalies Time Series (CATS) Dataset
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Jul 11, 2024
Cite
Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8338435
Available download formats: csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.8338435
Dataset updated
Jul 11, 2024
Dataset provided by
Solenix Engineering GmbH
Authors
Patrick Fleith
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:

4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.

3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.

5 million timestamps. Sensor readings are sampled at 1 Hz.

1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.

4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).

200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.

Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.

Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.

Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful to evaluate how well algorithms trace anomalies back to the right root cause channel.

Affected channels. In addition to the root cause channel in which the anomaly first developed, we provide information on the channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).

Obvious anomalies. The simulated anomalies have been designed to be "easy" for the human eye to detect (i.e., there are very large spikes or oscillations), and hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.

Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.

Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.

No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
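The last two properties (noise-free signals and no missing data) can be illustrated with a minimal sketch. It assumes a synthetic sine wave as a stand-in for one telemetry channel; the helper names `add_gaussian_noise` and `drop_values`, the Gaussian noise model, and all parameter values are assumptions made for illustration and are not part of the dataset or its tooling.

```python
import math
import random


def add_gaussian_noise(signal, sigma, seed=0):
    """Return a copy of `signal` with zero-mean Gaussian noise (std `sigma`) added."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def drop_values(signal, fraction, seed=0):
    """Return a copy of `signal` with a random `fraction` of samples set to NaN."""
    rng = random.Random(seed)
    n_drop = round(fraction * len(signal))
    dropped = set(rng.sample(range(len(signal)), n_drop))
    return [math.nan if i in dropped else x for i, x in enumerate(signal)]


# Hypothetical noise-free channel sampled at 1 Hz (not taken from the dataset).
clean = [math.sin(2 * math.pi * t / 60.0) for t in range(600)]

noisy = add_gaussian_noise(clean, sigma=0.1, seed=42)  # robustness-to-noise test input
gappy = drop_values(clean, fraction=0.05, seed=42)     # missing-value test input
```

Running a detector on `clean`, `noisy`, and `gappy` and comparing the scores would give one rough picture of how sensitive it is to each degradation relative to the clean baseline.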

Change Log

Version 2

Metadata: we include a metadata.csv with information about:

Anomaly categories

Root cause channel (signal in which the anomaly is first visible)

Affected channel (signal to which the anomaly might propagate through coupled system dynamics)

Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.

Two data files: CSV and parquet for convenience.
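The effect of overlapping anomalies on the count of distinct segments, mentioned in the change log above, can be sketched as a standard interval merge. This is an illustration only, not the authors' procedure: the function name `merge_overlapping`, the inclusive `(start, end)` convention, and the timestamps are all assumptions.

```python
def merge_overlapping(segments):
    """Merge overlapping (start, end) intervals (end inclusive) into distinct segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous segment: extend it instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical injected anomalies, in seconds (not taken from the dataset).
injected = [(100, 150), (140, 200), (500, 520), (900, 950), (950, 960)]
distinct = merge_overlapping(injected)
```

Here five injected anomalies collapse into three distinct segments, which is the same kind of reduction that turned 200 injected anomalies into 190 distinct segments in version 1.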

[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

About Solenix

Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Data from: NPP Multi-Biome: Global Primary Production Data Initiative...
s.cnmilf.com
datasets.ai
+7 more
Updated Jun 28, 2025
Cite
ORNL_DAAC (2025). NPP Multi-Biome: Global Primary Production Data Initiative Products, R2 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/npp-multi-biome-global-primary-production-data-initiative-products-r2-e7869
Dataset updated
Jun 28, 2025
Dataset provided by
Oak Ridge National Laboratory Distributed Active Archive Center
Description
Net primary productivity (NPP) estimates were compiled by the Global Primary Production Data Initiative (GPPDI). The database covers 2,523 individual sites and 5,164 half-degree grid cells and underwent extensive review under the Ecosystem Model-Data Intercomparison (EMDI) process. The GPPDI database includes NPP measurements that were collected over a long time period by many investigators using a variety of methods. The measurements are categorized as Class A, from intensively studied sites; Class B, from extensive sites; or Class C, reported for 0.5-degree latitude-longitude grid cells. The data set contains six comma-separated files (.csv format), two for each class. One file for each class contains site locations, elevation, NPP estimates, climate data, biome and dominant species information, and references. The other file for each class contains model validation outlier flags derived from site-specific reviews. This document and a companion file (Olson et al., 2001) describe the compilation of NPP estimates under the GPPDI. The results of the EMDI review and outlier analysis produced a refined set of NPP estimates and model driver data (the EMDI database; Olson et al., 2001; 2013). Another ORNL DAAC data set (Zheng et al., 2013) contributed to the compilation of GPPDI. Revision Notes: This data set has been revised to correct previously reported ANPP, BNPP, and TNPP estimates for three OTTER Transect sites, USA, in the Class A NPP data file, and BNPP and TNPP estimates for Vindhyan, India, in the Class B NPP data file. Please see the Data Set Revisions section of this document for detailed information.
Malaria disease and grading system dataset from public hospitals reflecting...
data.niaid.nih.gov
datadryad.org
+1 more
zip
Updated Nov 10, 2023
Cite
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.4xgxd25gn
Dataset updated
Nov 10, 2023
Dataset provided by
Nasarawa State University
Authors
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
License
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Description
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets from public hospitals, but there are still limitations in modeling with the multinomial Naïve Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the relationships between attributes, with the ability to predict new situations. The MNB model has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

Methods: Prior to data collection, the researcher was guided by the ethical training certification on data collection and the right to confidentiality and privacy required by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result of the diagnosis. The data were divided into two tables: the first table, data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.

Data Source Collection: The malaria incidence data set was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis. The geographical location and socio-economic factors available for patients inhabiting those areas were also taken into account. Naïve Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.

Data Preprocessing: The data are preprocessed to remove noise and outliers.

Transformation: The data are transformed from analog to electronic records.

Data Partitioning: The collected data are divided into two portions; one portion is extracted as a training set, while the other portion is used for testing. One training portion is taken from a table stored in the database and is called training set 1, while the other training portion, taken from another table in the database, is called training set 2. The dataset was split into two parts: 70% for training and 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with those of the other machine learning models using the standard metrics.

Classification and prediction: Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification techniques in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
i. Data are collected and preprocessed.
ii. The preprocessed data are stored in training set 1 and training set 2. These datasets are used during classification.
iii. The test data set is stored in the database as the test data set.
iv. Part of the test data set is classified using classifier 1 and the remaining part is classified with classifier 2, as follows:
Classifier phase 1: It classifies records into positive or negative classes. A patient who has malaria is classified as positive (P), while a patient who does not have malaria is classified as negative (N).
Classifier phase 2: It classifies only the records that classifier 1 has classified as positive, and further assigns them a complicated or uncomplicated class label. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, supply their values.
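The 70/30 partition and the two-phase classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the study trains Multinomial Naïve Bayes models, whereas `phase1_rule` and `phase2_rule` below are hypothetical rule-based stand-ins, and the record fields `lab_positive` and `severe_signs` are invented for the example.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle the records and return (train, test) with a 70/30 split."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def classify_two_phase(record, phase1, phase2):
    """Phase 1 decides positive (P) or negative (N); phase 2 grades positives only."""
    if phase1(record) != "P":
        return "N"
    return phase2(record)  # "complicated" or "uncomplicated"


# Hypothetical stand-ins for the two trained classifiers.
def phase1_rule(record):
    return "P" if record["lab_positive"] else "N"


def phase2_rule(record):
    return "complicated" if record["severe_signs"] > 0 else "uncomplicated"


# Hypothetical patient records (not taken from the dataset).
patients = [
    {"lab_positive": False, "severe_signs": 0},
    {"lab_positive": True, "severe_signs": 0},
    {"lab_positive": True, "severe_signs": 2},
]
grades = [classify_two_phase(p, phase1_rule, phase2_rule) for p in patients]
train, test = split_70_30(range(10), seed=1)
```

Passing the two classifiers in as callables keeps the cascade logic independent of the model type, so trained models could replace the stand-in rules without changing `classify_two_phase`.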
Seasonal time series with random collective outliers
zenodo.org
zip
Updated Oct 9, 2020
Cite
Peter Müllner; Peter Müllner (2020). Seasonal time series with random collective outliers [Dataset]. http://doi.org/10.5281/zenodo.3732413
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3732413
Dataset updated
Oct 9, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Peter Müllner; Peter Müllner
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset containing collections of time series ECG 1[1], ECG 2[1] and Sunspot[2].

Each collection comprises 10 time series, where each time series has exactly one collective outlier.

[1] https://www.cs.ucr.edu/~eamonn/discords

[2] Andrews, D.F., Herzberg, A.M.: Data: a collection of problems from many fields for the student and research worker. Springer Science & Business Media (2012)
The 12 outliers identified in the Tonga dataset.
plos.figshare.com
figshare.com
xls
Updated Jun 1, 2023
Cite
Anderson B. Mayfield; Chii-Shiarng Chen; Alexandra C. Dempsey (2023). The 12 outliers identified in the Tonga dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0185857.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0185857.t004
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Anderson B. Mayfield; Chii-Shiarng Chen; Alexandra C. Dempsey
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Tonga
Description
Gene expression data have been presented as non-normalized (2^(-Ct) × 10^9) in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student's t-test, p0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. "." = missing data.
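The back-calculation of Ct and the z-score rule mentioned above can be sketched as follows. The helper names, the example values, the use of the sample standard deviation, and the use of the absolute z-score are assumptions made for illustration; the source does not state how its z-scores were computed.

```python
import math
import statistics


def ct_from_expression(value):
    """Invert value = 2**(-Ct) * 1e9 to recover the raw threshold cycle Ct."""
    return -math.log2(value / 1e9)


def flag_aberrant(values, threshold=2.5):
    """Return indices whose absolute z-score exceeds `threshold` (sample std assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


# Hypothetical values, not taken from the table.
reported = 2 ** -20 * 1e9                 # expression value for Ct = 20
recovered_ct = ct_from_expression(reported)
flagged = flag_aberrant([10] * 9 + [100])  # only the last value is far from the rest
```

With only a handful of samples the sample z-score is bounded, so a 2.5 cutoff can only be exceeded when enough values are present; ten values, as in the example, are sufficient.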
MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...
tandf.figshare.com
pdf
Updated Jun 2, 2023
Cite
Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche (2023). MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as Cellwise and Rowwise Outliers [Dataset]. http://doi.org/10.6084/m9.figshare.7624424.v2
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.7624424.v2
Dataset updated
Jun 2, 2023
Dataset provided by
Taylor & Francis
Authors
Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, that is, rows that deviate from the majority of the rows in the data (e.g., they might belong to a different population). In recent years also cellwise outliers are receiving attention. These are suspicious cells (entries) that can occur anywhere in the table. Even a relatively small proportion of outlying cells can contaminate over half the rows, which causes rowwise robust methods to break down. In this article, a new PCA method is constructed which combines the strengths of two existing robust methods to be robust against both cellwise and rowwise outliers. At the same time, the algorithm can cope with missing values. As of yet it is the only PCA method that can deal with all three problems simultaneously. Its name MacroPCA stands for PCA allowing for Missingness And Cellwise & Rowwise Outliers. Several simulations and real datasets illustrate its robustness. New residual maps are introduced, which help to determine which variables are responsible for the outlying behavior. The method is well-suited for online process control.
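The remark that a small proportion of outlying cells can contaminate over half the rows can be checked with a short calculation. The sketch below assumes that cells are contaminated independently with the same probability, which is a simplifying assumption made here and not a claim from the abstract; the function name and the example numbers are likewise illustrative.

```python
def contaminated_row_fraction(cell_rate, n_columns):
    """Expected fraction of rows with at least one outlying cell (independent cells)."""
    return 1.0 - (1.0 - cell_rate) ** n_columns


# Hypothetical setting: 5% outlying cells in tables with 13 and 14 variables.
with_13_columns = contaminated_row_fraction(0.05, 13)
with_14_columns = contaminated_row_fraction(0.05, 14)
```

Under this assumption, 5% outlying cells leave just under half of the rows contaminated with 13 variables and just over half with 14, which is the regime where rowwise robust methods can no longer rely on a clean majority of rows.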

MNIST dataset for Outliers Detection - [ MNIST4OD ]

Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

KMASH Data Repository for outlier detection

Data from: Mining Distance-Based Outliers in Near Linear Time

Data from: Privacy Preserving Outlier Detection through Random Nonlinear...

Gender_Classification_Dataset

Measure While Drilling (MWD) dataset with rock type labels for 15 Norwegian...

Multi-Domain Outlier Detection Dataset

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

Manual snow course observations, raw met data, raw snow depth observations,...

Vision Based Building Energy Data Outlier Detection Dataset

Vision Based Building Energy Data Outlier Detection

Replication data for: Linear Models with Outliers: Choosing between...

11: Streamwater sample constituent concentration outliers from 15 watersheds...

Data from: Analyzing contentious relationships and outlier genes in...

Controlled Anomalies Time Series (CATS) Dataset

Data from: NPP Multi-Biome: Global Primary Production Data Initiative...

Malaria disease and grading system dataset from public hospitals reflecting...

Seasonal time series with random collective outliers

The 12 outliers identified in the Tonga dataset.

MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...

MNIST dataset for Outliers Detection - [ MNIST4OD ]