95 datasets found
  1. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Dataset updated
    May 17, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present MNIST4OD, a dataset of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the well-known MNIST dataset (http://yann.lecun.com/exdb/mnist/). We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample uniformly from the remaining images as outliers, such that their number equals 10% of the number of inliers. We repeat this generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors. Each file MNIST_x.csv.gz contains the dataset whose inlier class is x. Each line holds one instance (vector); the last column is the outlier label (yes/no) of the data point, and a further column indicates the original image class (0-9). Statistics of each dataset ( Name | Instances | Dimensions | Outliers in % ):
    MNIST_0 | 7594 | 784 | 10
    MNIST_1 | 8665 | 784 | 10
    MNIST_2 | 7689 | 784 | 10
    MNIST_3 | 7856 | 784 | 10
    MNIST_4 | 7507 | 784 | 10
    MNIST_5 | 6945 | 784 | 10
    MNIST_6 | 7564 | 784 | 10
    MNIST_7 | 8023 | 784 | 10
    MNIST_8 | 7508 | 784 | 10
    MNIST_9 | 7654 | 784 | 10
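
    A minimal loading sketch under the layout described above (784 flattened pixel columns, a column with the original digit class, and a final yes/no outlier label); the exact column order and the absence of a header row are assumptions, not a documented schema:

      import pandas as pd

      def load_mnist4od(path="MNIST_1.csv.gz"):
          # Column positions follow the description above (pixels, then original
          # class, then outlier label) and are an assumption.
          df = pd.read_csv(path, compression="gzip", header=None)
          X = df.iloc[:, :784].to_numpy()              # flattened 28 x 28 images
          original_class = df.iloc[:, -2].to_numpy()   # original digit (0-9)
          is_outlier = df.iloc[:, -1].to_numpy()       # yes/no outlier label
          return X, original_class, is_outlier

      X, digit, outlier = load_mnist4od()              # requires the downloaded file
      print(X.shape)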

  2. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, a huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  3. Data from: Outlier detection in cylindrical data based on Mahalanobis distance

    • tandf.figshare.com
    text/x-tex
    Updated Jan 2, 2025
    Cite
    Prashant S. Dhamale; Akanksha S. Kashikar (2025). Outlier detection in cylindrical data based on Mahalanobis distance [Dataset]. http://doi.org/10.6084/m9.figshare.24092089.v1
    Dataset updated
    Jan 2, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Prashant S. Dhamale; Akanksha S. Kashikar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cylindrical data are bivariate data formed from the combination of circular and linear variables. Identifying outliers is a crucial step in any data analysis work. This paper proposes a new distribution-free procedure to detect outliers in cylindrical data using the Mahalanobis distance concept. The use of Mahalanobis distance incorporates the correlation between the components of the cylindrical distribution, which had not been accounted for in the earlier papers on outlier detection in cylindrical data. The threshold for declaring an observation to be an outlier can be obtained via parametric or non-parametric bootstrap, depending on whether the underlying distribution is known or unknown. The performance of the proposed method is examined via extensive simulations from the Johnson-Wehrly distribution. The proposed method is applied to two real datasets, and the outliers are identified in those datasets.
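
    A simplified sketch of the general idea with plain sample estimates and a non-parametric bootstrap threshold; representing the circular component by its sine and cosine is one common convention and an assumption here, not the paper's exact procedure:

      import numpy as np

      rng = np.random.default_rng(0)

      def mahalanobis_sq(X, mean, cov_inv):
          d = X - mean
          return np.einsum("ij,jk,ik->i", d, cov_inv, d)

      # Toy cylindrical sample: a circular angle theta and a linear variable y.
      theta = rng.vonmises(0.0, 2.0, size=500)
      y = 1.5 * np.cos(theta) + rng.normal(0, 0.3, size=500)
      X = np.column_stack([np.cos(theta), np.sin(theta), y])   # embedding is an assumption

      mean = X.mean(axis=0)
      cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
      d2 = mahalanobis_sq(X, mean, cov_inv)

      # Non-parametric bootstrap threshold on the maximum squared distance.
      boot_max = []
      for _ in range(200):
          Xb = X[rng.integers(0, len(X), len(X))]
          mb, cb = Xb.mean(axis=0), np.linalg.inv(np.cov(Xb, rowvar=False))
          boot_max.append(mahalanobis_sq(Xb, mb, cb).max())
      threshold = np.quantile(boot_max, 0.95)
      print("flagged outliers:", np.where(d2 > threshold)[0])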

  4. Outlier Set Two-step Method (OSTI)

    • orda.shef.ac.uk
    application/x-rar
    Updated Jul 1, 2025
    Cite
    Amal Sarfraz; Abigail Birnbaum; Flannery Dolan; Jonathan Lamontagne; Lyudmila Mihaylova; Charles Rouge (2025). Outlier Set Two-step Method (OSTI) [Dataset]. http://doi.org/10.15131/shef.data.28227974.v3
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    The University of Sheffield
    Authors
    Amal Sarfraz; Abigail Birnbaum; Flannery Dolan; Jonathan Lamontagne; Lyudmila Mihaylova; Charles Rouge
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files are supplements to the paper titled 'A Robust Two-step Method for Detection of Outlier Sets'. This paper identifies and addresses the need for a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls "outlier sets", while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets based on cluster weights below a predetermined threshold. Second, OSTI measures the inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8,000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which identified outlier sets correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods, to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
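
    A compact sketch of the two steps described above, using scikit-learn's GaussianMixture and a chi-square cut-off; the number of components, the weight threshold and the significance level are illustrative assumptions:

      import numpy as np
      from scipy.stats import chi2
      from sklearn.mixture import GaussianMixture

      def osti_like(X, n_components=5, weight_threshold=0.05, alpha=0.01):
          """Rough OSTI-style detection: small-weight GMM clusters whose centroids
          lie significantly far (Mahalanobis) from the overall mean are flagged."""
          gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
          labels = gmm.predict(X)
          mean = X.mean(axis=0)
          cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
          cutoff = chi2.ppf(1 - alpha, df=X.shape[1])

          outlier_sets = []
          for k in np.where(gmm.weights_ < weight_threshold)[0]:   # step 1: candidates
              centroid = X[labels == k].mean(axis=0)
              d = centroid - mean
              if d @ cov_inv @ d > cutoff:                          # step 2: distance test
                  outlier_sets.append(k)
          return outlier_sets, labels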

  5. mnist-outlier

    • huggingface.co
    Updated Jun 16, 2023
    Cite
    Renumics (2023). mnist-outlier [Dataset]. https://huggingface.co/datasets/renumics/mnist-outlier
    Dataset updated
    Jun 16, 2023
    Dataset authored and provided by
    Renumics
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "mnist-outlier"

    📚 This dataset is an enriched version of the MNIST Dataset. The workflow is described in the medium article: Changes of Embeddings during Fine-Tuning of Transformers.

      Explore the Dataset
    

    The open source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here: https://huggingface.co/spaces/renumics/mnist-outlier.

    Or you can explore it locally; see the full description on the dataset page: https://huggingface.co/datasets/renumics/mnist-outlier.
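
    A hedged sketch of loading the dataset locally with the Hugging Face `datasets` library; the package choice and the commented Spotlight call reflect typical usage and are our assumptions, not steps copied from the dataset card:

      from datasets import load_dataset

      ds = load_dataset("renumics/mnist-outlier")
      print(ds)   # inspect the splits and enrichment columns

      # Optional, assuming the renumics-spotlight package is installed:
      # from renumics import spotlight
      # spotlight.show(ds["train"].to_pandas())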

  6. Data from: Error and anomaly detection for intra-participant time-series data

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    David R. Mullineaux; Gareth Irwin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identification of errors or anomalous values, collectively considered outliers, assists in exploring data, or, through the removal of outliers, improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
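
    A rough sketch of the two-stage idea on a cycles-by-timepoints array; the scaling constants and the window handling are simplified assumptions and this is not the authors' supplied Matlab code:

      import numpy as np

      def flag_outlier_cycles(cycles, mad_scale=5.0, win=3, sd_scale=4.0):
          """cycles: array of shape (n_cycles, n_timepoints)."""
          # Stage 1: per-timepoint spatial outliers via the median absolute deviation.
          med = np.median(cycles, axis=0)
          mad = np.median(np.abs(cycles - med), axis=0) + 1e-12
          stage1 = np.any(np.abs(cycles - med) > mad_scale * mad, axis=1)

          # Stage 2: spatial-temporal outliers via a moving-window standard deviation.
          kept = cycles[~stage1]
          mean = kept.mean(axis=0)
          kernel = np.ones(2 * win + 1) / (2 * win + 1)
          sd_smooth = np.convolve(kept.std(axis=0), kernel, mode="same")
          stage2 = np.any(np.abs(cycles - mean) > sd_scale * sd_smooth, axis=1)

          return stage1 | stage2

      strides = np.random.default_rng(1).normal(size=(38, 101))   # toy stride data
      print(flag_outlier_cycles(strides).sum(), "cycles flagged")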

  7. Data from: Methodology to filter out outliers in high spatial density data to improve maps reliability

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point into the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI, respectively, compared to interpolation errors of raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing the estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
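
    A simplified sketch of the local (neighbourhood-median) stage only; the radius and the deviation threshold are illustrative assumptions, and the published filter additionally applies global and anisotropic analyses:

      import numpy as np
      from scipy.spatial import cKDTree

      def local_median_filter(xy, values, radius=10.0, max_rel_dev=0.25):
          """Flag points whose value deviates strongly from the median of their
          neighbours inside `radius` (local stage only, simplified)."""
          tree = cKDTree(xy)
          keep = np.ones(len(values), dtype=bool)
          for i, neighbours in enumerate(tree.query_ball_point(xy, r=radius)):
              neighbours = [j for j in neighbours if j != i]
              if not neighbours:
                  continue
              med = np.median(values[neighbours])
              if med != 0 and abs(values[i] - med) / abs(med) > max_rel_dev:
                  keep[i] = False
          return keep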

  8. Anomaly Detection in Sequences

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    Dashlink (2025). Anomaly Detection in Sequences [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-in-sequences
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    We present a set of novel algorithms, which we call sequenceMiner, that detect and characterize anomalies in large sets of high-dimensional symbol sequences that arise from recordings of switch sensors in the cockpits of commercial airliners. While the algorithms we present are general and domain-independent, we focus on a specific problem that is critical to determining system-wide health of a fleet of aircraft. The approach taken uses unsupervised clustering of sequences using the normalized length of the longest common subsequence (nLCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. In this method, an outlier sequence is defined as a sequence that is far away from a cluster. We present new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence is deemed to be an outlier. The algorithm provides a coherent description to an analyst of the anomalies in the sequence when compared to more normal sequences. The final section of the paper demonstrates the effectiveness of sequenceMiner for anomaly detection on a real set of discrete sequence data from a fleet of commercial airliners. We show that sequenceMiner discovers actionable and operationally significant safety events. We also compare our innovations with standard Hidden Markov Models, and show that our methods are superior.
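
    A small sketch of the nLCS similarity mentioned above (dynamic-programming LCS); normalising by the geometric mean of the two sequence lengths is a common convention and an assumption here:

      def lcs_length(a, b):
          """Length of the longest common subsequence of two symbol sequences."""
          dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
          for i, x in enumerate(a, 1):
              for j, y in enumerate(b, 1):
                  dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
          return dp[len(a)][len(b)]

      def nlcs(a, b):
          return lcs_length(a, b) / (len(a) * len(b)) ** 0.5

      print(nlcs(["FLAP_UP", "GEAR_UP", "AP_ON"], ["FLAP_UP", "AP_ON"]))   # ~0.816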

  9. Chemical outlier dataset

    • zenodo.org
    bin
    Updated Jan 24, 2020
    Cite
    Mario Lovric; Mario Lovric (2020). Chemical outlier dataset [Dataset]. http://doi.org/10.5281/zenodo.1167835
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mario Lovric; Mario Lovric
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The objects are numbered. The Y-variable is the boiling point. The other features are structural features of the molecules. In the outlier column, outliers are assigned a value of 1.

    The data is derived from a published chemical dataset on boiling point measurements [1] and from public data [2]. Features were generated by means of the RDKit Python library [3]. The dataset was infused with known outliers (~5%) based on significant structural differences, i.e. polar and non-polar molecules.

    1. Cherqaoui D., Villemin D. Use of a Neural Network to determine the Boiling Point of Alkanes. J CHEM SOC FARADAY TRANS. 1994;90(1):97–102.
    2. https://pubchem.ncbi.nlm.nih.gov/
    3. RDKit: Open-source cheminformatics; http://www.rdkit.org
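
    A hedged sketch of generating molecular descriptors from SMILES with the RDKit; the descriptors shown are only illustrative, since the exact feature set used to build this dataset is not listed here:

      from rdkit import Chem
      from rdkit.Chem import Descriptors

      def simple_features(smiles):
          mol = Chem.MolFromSmiles(smiles)
          return {
              "MolWt": Descriptors.MolWt(mol),
              "LogP": Descriptors.MolLogP(mol),
              "TPSA": Descriptors.TPSA(mol),
              "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
          }

      print(simple_features("CCCCCC"))   # n-hexane, a non-polar example molecule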

  10. cifar10-outlier

    • huggingface.co
    Updated Jul 3, 2023
    Cite
    Renumics (2023). cifar10-outlier [Dataset]. https://huggingface.co/datasets/renumics/cifar10-outlier
    Dataset updated
    Jul 3, 2023
    Dataset authored and provided by
    Renumics
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for "cifar10-outlier"

    📚 This dataset is an enriched version of the CIFAR-10 Dataset. The workflow is described in the medium article: Changes of Embeddings during Fine-Tuning of Transformers.

      Explore the Dataset
    

    The open source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here:

    Full Version (High hardware requirement)
 See the full description on the dataset page: https://huggingface.co/datasets/renumics/cifar10-outlier.

  11. Multi-Domain Outlier Detection Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 31, 2022
    Cite
    Raman, Vinay (2022). Multi-Domain Outlier Detection Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5941338
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Kulshrestha, Sakshum
    Lee, Jake
    Kerner, Hannah
    Huff, Eric
    Wagstaff, Kiri
    Lu, Steven
    Raman, Vinay
    Francis, Raymond
    Dubayah, Bryce
    Rebbapragada, Umaa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:

    Astrophysics - detecting anomalous observations in the Dark Energy Survey (DES) catalog (data type: feature vectors)

    Planetary science - selecting novel geologic targets for follow-up observation onboard the Mars Science Laboratory (MSL) rover (data type: grayscale images)

    Earth science: detecting anomalous samples in satellite time series corresponding to ground-truth observations of maize crops (data type: time series/feature vectors)

    Fashion-MNIST/MNIST: benchmark task to detect anomalous MNIST images among Fashion-MNIST images (data type: grayscale images)

    Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
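
    A minimal sketch of the fit/score/label usage described above, with scikit-learn's IsolationForest standing in for the detection algorithms; the file names and the label coding are assumptions:

      import numpy as np
      from sklearn.ensemble import IsolationForest
      from sklearn.metrics import roc_auc_score

      # Hypothetical file names; each domain ships fit, score and label data.
      X_fit = np.load("des_fit.npy")       # used only to fit the detector
      X_score = np.load("des_score.npy")   # samples to be ranked
      y_label = np.load("des_labels.npy")  # 1 = outlier, 0 = inlier (assumed coding)

      detector = IsolationForest(random_state=0).fit(X_fit)
      scores = -detector.score_samples(X_score)   # higher = more anomalous
      print("ROC AUC:", roc_auc_score(y_label, scores))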

    To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:

    Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.

  12. Anomaly Detection in High-Dimensional Data

    • tandf.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles (2023). Anomaly Detection in High-Dimensional Data [Dataset]. http://doi.org/10.6084/m9.figshare.12844508.v2
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.
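
    A simplified sketch of the k-nearest-neighbour "distance at the maximum gap" score described above; the stray algorithm derives its threshold from extreme value theory, which is replaced here by a plain quantile cut-off for brevity:

      import numpy as np
      from sklearn.neighbors import NearestNeighbors

      def knn_gap_scores(X, k=10):
          """Anomaly score: the neighbour distance just after the largest gap in
          each point's sorted k-NN distances (simplified version of the idea)."""
          nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
          dist, _ = nn.kneighbors(X)           # dist[:, 0] is the point itself (0)
          d = dist[:, 1:]                      # the k nearest-neighbour distances
          gaps = np.diff(d, axis=1)
          idx = gaps.argmax(axis=1) + 1
          return d[np.arange(len(X)), idx]

      X = np.vstack([np.random.default_rng(0).normal(size=(500, 5)),
                     np.full((3, 5), 8.0)])    # three obvious anomalies
      scores = knn_gap_scores(X)
      threshold = np.quantile(scores, 0.99)    # simple stand-in for the EVT threshold
      print(np.where(scores > threshold)[0])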

  13. DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES

    • gimi9.com
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Sep 24, 2010
    Cite
    (2010). DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES [Dataset]. https://gimi9.com/dataset/data-gov_distributed-anomaly-detection-using-satellite-data-from-multiple-modalities/
    Dataset updated
    Sep 24, 2010
    Description

    DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES. Kanishka Bhaduri, Kamalika Das, and Petr Votava. Abstract: There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.

  14. Integrated Building Health Management

    • catalog.data.gov
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Integrated Building Health Management [Dataset]. https://catalog.data.gov/dataset/integrated-building-health-management
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Abstract: Building health management is an important part in running an efficient and cost-effective building. Many problems in a building's system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building's health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.
    Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating & cooling temperatures for the air and water systems as well as other various subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis was focused from July 1st through August 10th 2009.
    Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca has the ability to use a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal; however, days with lower overall correlations were more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can be applied to a set of test data to assess how anomalous the particular data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were then ranked by the algorithm.
    Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system.
    • 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    • 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    IMS identified significantly higher temperatures compared to other days during the months of July and August.
    Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data and produce a real-time anomaly score to help building maintenance with detection and diagnosis of problems.
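
    A sketch of the day-ranking step described above (correlating each day's anomaly-score profile against every other day and ranking days by mean correlation); this is our reading of the described workflow, not the original Orca/IMS code:

      import numpy as np

      def rank_days_by_correlation(day_profiles):
          """day_profiles: array (n_days, n_samples_per_day) of per-sample anomaly
          scores. Days with low mean correlation to all other days look anomalous."""
          corr = np.corrcoef(day_profiles)
          np.fill_diagonal(corr, np.nan)
          mean_corr = np.nanmean(corr, axis=1)
          return np.argsort(mean_corr)          # most anomalous days first

      profiles = np.random.default_rng(2).normal(size=(41, 288))   # toy daily profiles
      print(rank_days_by_correlation(profiles)[:5])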

  15. Bank Transaction Dataset for Fraud Detection

    • kaggle.com
    Updated Nov 4, 2024
    Cite
    vala khorasani (2024). Bank Transaction Dataset for Fraud Detection [Dataset]. https://www.kaggle.com/datasets/valakhorasani/bank-transaction-dataset-for-fraud-detection
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    vala khorasani
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.

    Key Features:

    • TransactionID: Unique alphanumeric identifier for each transaction.
    • AccountID: Unique identifier for each account, with multiple transactions per account.
    • TransactionAmount: Monetary value of each transaction, ranging from small everyday expenses to larger purchases.
    • TransactionDate: Timestamp of each transaction, capturing date and time.
    • TransactionType: Categorical field indicating 'Credit' or 'Debit' transactions.
    • Location: Geographic location of the transaction, represented by U.S. city names.
    • DeviceID: Alphanumeric identifier for devices used to perform the transaction.
    • IP Address: IPv4 address associated with the transaction, with occasional changes for some accounts.
    • MerchantID: Unique identifier for merchants, showing preferred and outlier merchants for each account.
    • AccountBalance: Balance in the account post-transaction, with logical correlations based on transaction type and amount.
    • PreviousTransactionDate: Timestamp of the last transaction for the account, aiding in calculating transaction frequency.
    • Channel: Channel through which the transaction was performed (e.g., Online, ATM, Branch).
    • CustomerAge: Age of the account holder, with logical groupings based on occupation.
    • CustomerOccupation: Occupation of the account holder (e.g., Doctor, Engineer, Student, Retired), reflecting income patterns.
    • TransactionDuration: Duration of the transaction in seconds, varying by transaction type.
    • LoginAttempts: Number of login attempts before the transaction, with higher values indicating potential anomalies.

    This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
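
    A minimal screening sketch over this table; the column names follow the field list above, the file name is an assumption, and IsolationForest is only one illustrative detector:

      import pandas as pd
      from sklearn.ensemble import IsolationForest

      df = pd.read_csv("bank_transactions_data.csv")   # hypothetical local file name

      features = df[["TransactionAmount", "TransactionDuration",
                     "LoginAttempts", "AccountBalance", "CustomerAge"]]
      model = IsolationForest(contamination=0.02, random_state=0).fit(features)
      df["anomaly_score"] = -model.score_samples(features)   # higher = more anomalous

      print(df.sort_values("anomaly_score", ascending=False).head(10)[
          ["TransactionID", "TransactionAmount", "LoginAttempts", "anomaly_score"]])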

  16. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Jul 11, 2024
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8338435
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection). A minimal loading sketch for this split follows the list below.
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed itself is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithm to trace back anomalies to the right root cause channel.
    • Affected channels. In addition to the knowledge of the root cause channel in which the anomaly first developed itself, we provide information of channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems which may point out to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to be detected for human eyes (i.e., there are very large spikes or oscillations), hence also detectable for most algorithms. It makes this synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable to detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
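
    A minimal loading sketch reflecting the nominal/anomalous split described in the list above; the file name and the use of pandas are assumptions:

      import pandas as pd

      df = pd.read_csv("cats_dataset.csv")   # hypothetical file name; a parquet file is also provided

      train = df.iloc[:1_000_000]            # nominal-only segment for learning "normal" behaviour
      evaluate = df.iloc[1_000_000:]         # nominal + anomalous segments for scoring

      # 17 channels in total: 4 commands, 3 external stimuli, 10 telemetry readings.
      print(train.shape, evaluate.shape)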

    Change Log

    Version 2

    • Metadata: we include a metadata.csv with information about:
      • Anomaly categories
      • Root cause channel (signal in which the anomaly is first visible)
      • Affected channel (signal in which the anomaly might propagate) through coupled system dynamics
    • Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.
    • Two data files: CSV and parquet for convenience.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  17. DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf

    • frontiersin.figshare.com
    pdf
    Updated Oct 13, 2023
    Cite
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
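
    For orientation only, a generic minimum-spanning-tree outlier heuristic (score each point by its longest incident MST edge); this is not the authors' MMOD, which builds mini-MSTs iteratively with a scaled Euclidean distance measure:

      import numpy as np
      from scipy.sparse.csgraph import minimum_spanning_tree
      from scipy.spatial.distance import pdist, squareform

      def mst_edge_scores(X):
          """Score each point by the longest MST edge attached to it; points joined
          to the rest of the data only through long edges look like outliers."""
          dist = squareform(pdist(X))
          mst = minimum_spanning_tree(dist).toarray()
          mst = np.maximum(mst, mst.T)          # symmetrise the edge matrix
          mst[mst == 0] = np.nan                # ignore non-edges
          return np.nanmax(mst, axis=1)

      X = np.vstack([np.random.default_rng(3).normal(size=(200, 2)),
                     [[6.0, 6.0], [6.2, 5.8]]])
      scores = mst_edge_scores(X)
      print(np.argsort(scores)[-3:])            # indices of the most isolated points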

  18. Anomaly Detection in Sequences - Dataset - NASA Open Data Portal

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    Updated Feb 19, 2025
    Cite
    nasa.gov (2025). Anomaly Detection in Sequences - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/anomaly-detection-in-sequences
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    We present a set of novel algorithms, which we call sequenceMiner, that detect and characterize anomalies in large sets of high-dimensional symbol sequences that arise from recordings of switch sensors in the cockpits of commercial airliners. While the algorithms we present are general and domain-independent, we focus on a specific problem that is critical to determining system-wide health of a fleet of aircraft. The approach taken uses unsupervised clustering of sequences using the normalized length of the longest common subsequence (nLCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. In this method, an outlier sequence is defined as a sequence that is far away from a cluster. We present new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence is deemed to be an outlier. The algorithm provides a coherent description to an analyst of the anomalies in the sequence when compared to more normal sequences. The final section of the paper demonstrates the effectiveness of sequenceMiner for anomaly detection on a real set of discrete sequence data from a fleet of commercial airliners. We show that sequenceMiner discovers actionable and operationally significant safety events. We also compare our innovations with standard Hidden Markov Models, and show that our methods are superior.

  19. Integrated Building Health Management - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Integrated Building Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integrated-building-health-management
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Abstract: Building health management is an important part in running an efficient and cost-effective building. Many problems in a building's system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building's health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.
    Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating & cooling temperatures for the air and water systems as well as other various subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis was focused from July 1st through August 10th 2009.
    Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca has the ability to use a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal; however, days with lower overall correlations were more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can be applied to a set of test data to assess how anomalous the particular data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were then ranked by the algorithm.
    Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system.
    • 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    • 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT), Calculated Average Return Air Temperature
    IMS identified significantly higher temperatures compared to other days during the months of July and August.
    Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data and produce a real-time anomaly score to help building maintenance with detection and diagnosis of problems.

  20. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    HKU Data Repository
    Authors
    Wen Ma
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    1. NCCTG Lung cancer dataset: Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.
    2. CNV measurements of GBM: This dataset records the information about copy number variation of Glioblastoma (GBM).
    Abstract: In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values, and they work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. Neither case is found in biomedical datasets such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: the Last Observation Carried Forward (LOCF) and the K-nearest Algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset with fewer columns, except when the dataset contains 45% missing data. The quality of the values imputed with existing methods is poor because these datasets do not meet the two assumptions.
    In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values. RBM is an undirected, probabilistic and parameterized two-layer neural network model, which is often used for extracting abstract information from data, especially for high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG, and 2) TCGA. The running time and root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with an RMSE 4.64 lower than the best KNN. For the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN.
    In addition to imputation, RBM can achieve simultaneous predictions. We compared the RBM model with four traditional prediction methods. The running time and area under the curve (AUC) were measured to evaluate the performance. Our RBM-based approach outperformed traditional methods. Specifically, the AUC was up to 19.8% higher than the multivariate logistic regression model in the NCCTG lung cancer dataset, and 28.1% higher than the Cox proportional hazards regression model in the TCGA dataset.
    Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing the reconstruction of all the inputs in the visible layer within a single backward pass. Our results show that RBM models have achieved higher precision and recall in detecting outliers than other methods.
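
    A small, generic illustration of RBM-style outlier scoring with scikit-learn's BernoulliRBM (which expects inputs in [0, 1]); the thesis uses its own RBM formulation for mixed biomedical data, so this is only a stand-in, not the author's method:

      import numpy as np
      from sklearn.neural_network import BernoulliRBM

      rng = np.random.default_rng(4)
      X = rng.random((300, 20)) * 0.3              # toy "typical" rows, values in [0, 0.3]
      X[:5] = 0.7 + rng.random((5, 20)) * 0.3      # a few atypical rows near 1.0

      rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=50,
                         random_state=0).fit(X)

      # Low pseudo-likelihood = poorly modelled by the RBM = candidate outlier
      # (illustrative only; scores on toy data are noisy).
      scores = rbm.score_samples(X)
      print(np.argsort(scores)[:5])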