59 datasets found
  1. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 17, 2024
    Dataset provided by
    figshare
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present MNIST4OD, a dataset of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/). We built MNIST4OD as follows: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample uniformly from the remaining images as outliers, such that their number equals 10% of the inliers. We repeat this generation process for all digits. For implementation simplicity we then flatten the images (28 × 28) into vectors. Each file MNIST_x.csv.gz contains the dataset whose inlier class is x. Each line holds one instance (vector); the last column is the outlier label (yes/no) of the data point, and another column indicates the original image class (0-9). The statistics of each dataset are (Name | Instances | Dimensions | Outliers in %):

    MNIST_0 | 7594 | 784 | 10
    MNIST_1 | 8665 | 784 | 10
    MNIST_2 | 7689 | 784 | 10
    MNIST_3 | 7856 | 784 | 10
    MNIST_4 | 7507 | 784 | 10
    MNIST_5 | 6945 | 784 | 10
    MNIST_6 | 7564 | 784 | 10
    MNIST_7 | 8023 | 784 | 10
    MNIST_8 | 7508 | 784 | 10
    MNIST_9 | 7654 | 784 | 10
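    A minimal loading sketch under the column layout described above (pixel values first, then the original class, then the yes/no outlier label in the last column); the exact column order is an assumption, as the description does not pin it down:

```python
import pandas as pd

def split_mnist4od(df, n_pixels=784):
    """Split a MNIST4OD frame into pixel features, original class,
    and a boolean outlier flag. Assumed layout: pixel columns first,
    then the original class (0-9), then the yes/no label last."""
    X = df.iloc[:, :n_pixels].to_numpy()
    orig_class = df.iloc[:, -2].to_numpy()
    is_outlier = df.iloc[:, -1].astype(str).str.lower().eq("yes").to_numpy()
    return X, orig_class, is_outlier

# Reading one dataset file (path illustrative):
# df = pd.read_csv("MNIST_1.csv.gz", compression="gzip", header=None)
# X, y_class, y_outlier = split_mnist4od(df)
```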

  2. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations, with only a subset of features available at any one location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm that can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization at only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS).

  3. Multi-Domain Outlier Detection Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 31, 2022
    Cite
    Hannah Kerner; Umaa Rebbapragada; Kiri Wagstaff; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha (2022). Multi-Domain Outlier Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.5941339
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hannah Kerner; Umaa Rebbapragada; Kiri Wagstaff; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:

    1. Astrophysics - detecting anomalous observations in the Dark Energy Survey (DES) catalog (data type: feature vectors)
    2. Planetary science - selecting novel geologic targets for follow-up observation onboard the Mars Science Laboratory (MSL) rover (data type: grayscale images)
    3. Earth science: detecting anomalous samples in satellite time series corresponding to ground-truth observations of maize crops (data type: time series/feature vectors)
    4. Fashion-MNIST/MNIST: benchmark task to detect anomalous MNIST images among Fashion-MNIST images (data type: grayscale images)

    Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples to evaluate model performance, analogous to a test set), and a label dataset (indicating whether samples in the score dataset are considered outliers in the domain of each dataset).
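    The fit/score/label split maps directly onto standard outlier-detection APIs. A sketch with scikit-learn's IsolationForest on synthetic stand-ins for the three files (the arrays below are illustrative, not the actual data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
fit_data = rng.normal(size=(500, 8))                    # stands in for a "fit" file
score_data = np.vstack([rng.normal(size=(95, 8)),
                        rng.normal(loc=6.0, size=(5, 8))])  # "score" file
labels = np.array([0] * 95 + [1] * 5)                   # stands in for the label file

model = IsolationForest(random_state=0).fit(fit_data)
scores = -model.score_samples(score_data)               # higher = more outlying
ranked = np.argsort(scores)[::-1]                       # rank samples for evaluation
```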

    To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:

    Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.

  4. Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog.data.gov
    • datasets.ai
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

    Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that is quadratic in the worst case can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency arises because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
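    The randomize-and-prune idea can be sketched in a few lines; this is a simplified reconstruction of the algorithm as described in the abstract, not the authors' code:

```python
import numpy as np

def top_outliers(X, k=3, n=5, seed=0):
    """Randomized nested loop with the simple pruning rule: an example's
    outlier score is the distance to its k-th nearest neighbour; we keep
    the n highest scores seen so far, and prune an example as soon as its
    running k-NN distance falls below the current cutoff."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))     # random order makes pruning effective
    cutoff, top = 0.0, []               # top holds (score, original index)
    for i in order:
        knn = np.full(k, np.inf)        # running k smallest distances
        pruned = False
        for j in order:
            if i == j:
                continue
            d = np.linalg.norm(X[i] - X[j])
            if d < knn[-1]:
                knn[-1] = d
                knn.sort()
            if knn[-1] < cutoff:        # cannot reach the top-n: prune
                pruned = True
                break
        if not pruned:
            top.append((knn[-1], int(i)))
            top.sort(reverse=True)
            top = top[:n]
            if len(top) == n:
                cutoff = top[-1][0]     # weakest score still in the top-n
    return top
```

    Non-outliers sit in dense regions, so their running k-NN distance drops below the cutoff after a few comparisons, which is where the near-linear average-case behaviour comes from.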

  5. Unsupervised Anomaly Detection Benchmark

    • dataverse.harvard.edu
    Updated Oct 6, 2015
    Cite
    Markus Goldstein (2015). Unsupervised Anomaly Detection Benchmark [Dataset]. http://doi.org/10.7910/DVN/OPQMVF
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Markus Goldstein
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These datasets can be used for benchmarking unsupervised anomaly detection algorithms (for example "Local Outlier Factor" LOF). The datasets have been obtained from multiple sources and are mainly based on datasets originally used for supervised machine learning. By publishing these modifications, a comparison of different algorithms is now possible for unsupervised anomaly detection.
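    A minimal example of the kind of benchmark run these datasets target, using the Local Outlier Factor algorithm mentioned above via scikit-learn (the data here is synthetic, with one planted outlier):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])  # planted outlier last

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                  # -1 marks detected outliers
scores = -lof.negative_outlier_factor_     # higher = more outlying
```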

  6. Gender_Classification_Dataset

    • kaggle.com
    Updated Jun 19, 2024
    Cite
    Sameh Raouf (2024). Gender_Classification_Dataset [Dataset]. https://www.kaggle.com/datasets/samehraouf/gender-classification-dataset/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sameh Raouf
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Title: Gender Classification Dataset

    Description: This dataset contains anonymized information on height, weight, age, and gender of 10,000 individuals. The data is equally distributed between males and females, with 5,000 samples for each gender. The purpose of this dataset is to provide a comprehensive sample for studies and analyses related to physical attributes and demographics.

    Content: The CSV file contains the following columns:

    Gender: The gender of the individual (Male/Female)
    Height: The height of the individual in centimeters
    Weight: The weight of the individual in kilograms
    Age: The age of the individual in years

    License: This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license. This means you are free to share the data, provided that you attribute the source, do not use it for commercial purposes, and do not distribute modified versions of the data.

    Usage:

    This dataset can be used for:

    • Analyzing the distribution of height, weight, and age across genders
    • Developing and testing machine learning models for predicting physical attributes
    • Educational purposes in statistics and data science courses
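    A sketch of the second listed use, training a classifier on the three numeric columns; the data below is a synthetic stand-in with the same schema, and the effect sizes are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the CSV's columns (Gender, Height, Weight, Age)
rng = np.random.default_rng(0)
n = 1000
gender = rng.integers(0, 2, n)                 # 0 = Female, 1 = Male
height = rng.normal(162, 6, n) + 13 * gender   # cm (offsets invented)
weight = rng.normal(62, 8, n) + 12 * gender    # kg (offsets invented)
age = rng.integers(18, 70, n)
X = np.column_stack([height, weight, age])

X_tr, X_te, y_tr, y_te = train_test_split(X, gender, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```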

  7. Sample(s) removed as outliers in each iteration of MFMW-outlier for all the...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Yuk Yee Leung; Chun Qi Chang; Yeung Sam Hung (2023). Sample(s) removed as outliers in each iteration of MFMW-outlier for all the six microarray datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0046700.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yuk Yee Leung; Chun Qi Chang; Yeung Sam Hung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample(s) removed as outliers in each iteration of MFMW-outlier for all the six microarray datasets.

  8. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.no
    • +1more
    Updated Jul 29, 2024
    Cite
    Holsbø, Einar (2024). Supporting data for \"A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets\" [Dataset]. http://doi.org/10.18710/FGVLKS
    Explore at:
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.

  9. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" for human eyes to detect (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting these obvious anomalies). However, during our initial experiments the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
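    The last two properties (pure signal, clean nominal prefix) suggest a simple evaluation recipe: train on the nominal prefix, then add your own noise to the evaluation segment. A miniature sketch with array sizes scaled down (the CATS files themselves are not loaded here; the signals and the noise amplitude are illustrative):

```python
import numpy as np

# Miniature stand-in for the CATS layout: a pure multivariate signal,
# nominal prefix for training, remainder for evaluation.
rng = np.random.default_rng(0)
t = np.arange(10_000)
signal = np.column_stack([np.sin(0.01 * t), np.cos(0.02 * t)])

n_nominal = 2_000                  # in CATS, the first 1M points are nominal
train, test = signal[:n_nominal], signal[n_nominal:]

# The signals ship noise-free, so users pick their own noise model/amplitude:
noisy_test = test + rng.normal(scale=0.1, size=test.shape)
```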

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  10. A novel outlier-adapted multi-stage ensemble model with feature...

    • figshare.com
    txt
    Updated Mar 28, 2020
    + more versions
    Cite
    Xiaoxia WU; Dongqi Yang; Wenyu Zhang (2020). A novel outlier-adapted multi-stage ensemble model with feature transformation for credit scoring [Dataset]. http://doi.org/10.6084/m9.figshare.11894682.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 28, 2020
    Dataset provided by
    figshare
    Authors
    Xiaoxia WU; Dongqi Yang; Wenyu Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three datasets are chosen from the UCI machine learning repository in this study, all of which have been extensively adopted in data-driven research: the Australian and Japanese credit datasets (Asuncion & Newman, 2007) and the Polish bankruptcy dataset (Zięba et al., 2016). The three datasets contain different numbers of samples and features. Each sample in a credit dataset can be classified into good credit or bad credit. The Australian credit dataset has 690 samples, with 307 in good credit and 383 in bad, and its feature dimension is 14, with 6 numerical and 8 categorical features. The Japanese credit dataset also has 690 samples, with 307 in good credit and 383 in bad, and its feature dimension is 15, with 6 numerical and 9 categorical features. Similarly, there are 7027 samples in the Polish bankruptcy dataset, with 6756 in good credit and 271 in bad, and its 64 input features are all numerical. The input-feature dimensions of the three datasets listed in Table 1 do not include the class labels.

  11. 11: Streamwater sample constituent concentration outliers from 15 watersheds...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 11: Streamwater sample constituent concentration outliers from 15 watersheds in Gwinnett County, Georgia for water years 2003-2020 [Dataset]. https://catalog.data.gov/dataset/11-streamwater-sample-constituent-concentration-outliers-from-15-watersheds-in-gwinne-2003
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Gwinnett County, Georgia
    Description

    This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). In total, 885 outlier concentrations were identified. Outliers were excluded from the model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads calculated using the Beale ratio estimator. Notes on the reason(s) for considering a concentration an outlier are included.

  12. Bank Transaction Dataset for Fraud Detection

    • kaggle.com
    Updated Nov 4, 2024
    Cite
    vala khorasani (2024). Bank Transaction Dataset for Fraud Detection [Dataset]. https://www.kaggle.com/datasets/valakhorasani/bank-transaction-dataset-for-fraud-detection
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    vala khorasani
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.

    Key Features:

    • TransactionID: Unique alphanumeric identifier for each transaction.
    • AccountID: Unique identifier for each account, with multiple transactions per account.
    • TransactionAmount: Monetary value of each transaction, ranging from small everyday expenses to larger purchases.
    • TransactionDate: Timestamp of each transaction, capturing date and time.
    • TransactionType: Categorical field indicating 'Credit' or 'Debit' transactions.
    • Location: Geographic location of the transaction, represented by U.S. city names.
    • DeviceID: Alphanumeric identifier for devices used to perform the transaction.
    • IP Address: IPv4 address associated with the transaction, with occasional changes for some accounts.
    • MerchantID: Unique identifier for merchants, showing preferred and outlier merchants for each account.
    • AccountBalance: Balance in the account post-transaction, with logical correlations based on transaction type and amount.
    • PreviousTransactionDate: Timestamp of the last transaction for the account, aiding in calculating transaction frequency.
    • Channel: Channel through which the transaction was performed (e.g., Online, ATM, Branch).
    • CustomerAge: Age of the account holder, with logical groupings based on occupation.
    • CustomerOccupation: Occupation of the account holder (e.g., Doctor, Engineer, Student, Retired), reflecting income patterns.
    • TransactionDuration: Duration of the transaction in seconds, varying by transaction type.
    • LoginAttempts: Number of login attempts before the transaction, with higher values indicating potential anomalies.

    This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
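    As a toy illustration of the anomaly cues the description mentions (high LoginAttempts, unusual amounts), here is a rule-based screen using a robust z-score; the rows, column subset, and thresholds are invented for the example:

```python
import pandas as pd

# Tiny stand-in with two of the listed columns (values illustrative).
df = pd.DataFrame({
    "TransactionAmount": [25.0, 40.0, 31.0, 2900.0, 18.0],
    "LoginAttempts":     [1, 1, 2, 6, 1],
})

# Robust z-score via median/MAD: resistant to the very outliers we hunt.
med = df["TransactionAmount"].median()
mad = (df["TransactionAmount"] - med).abs().median()
df["amount_z"] = (df["TransactionAmount"] - med).abs() / (1.4826 * mad + 1e-9)

# Flag repeated login attempts or an amount far from typical spend.
df["flagged"] = (df["LoginAttempts"] >= 4) | (df["amount_z"] > 3.5)
```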

  13. Outlier diagnostics datasets

    • figshare.com
    tar
    Updated Apr 28, 2022
    Cite
    Matthew Brett (2022). Outlier diagnostics datasets [Dataset]. http://doi.org/10.6084/m9.figshare.19673493.v1
    Explore at:
    Available download formats: tar
    Dataset updated
    Apr 28, 2022
    Dataset provided by
    figshare
    Authors
    Matthew Brett
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A sample fMRI dataset with functional runs and task files.

    For project on automatic outlier detection.

  14. Replication data for: Linear Models with Outliers: Choosing between...

    • datasearch.gesis.org
    • dataverse.harvard.edu
    • +1more
    Updated Jan 22, 2020
    Cite
    Harden, Jeffrey; Desmarais, Bruce (2020). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. https://datasearch.gesis.org/dataset/httpsdataverse.unc.eduoai--hdl1902.2911608
    Explore at:
    Dataset updated
    Jan 22, 2020
    Dataset provided by
    Odum Institute Dataverse Network
    Authors
    Harden, Jeffrey; Desmarais, Bruce
    Description

    State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.

  15. Road Traffic Mel Spectrogram for Anomaly Detection

    • kaggle.com
    Updated Feb 5, 2025
    Cite
    荒川永人 (2025). Road Traffic Mel Spectrogram for Anomaly Detection [Dataset]. https://www.kaggle.com/datasets/arakawaeito/road-traffic-mel-spectrogram-for-anomaly-detection/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    荒川永人
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    🚗 Mel Spectrogram Dataset for Anomaly Detection in Road Traffic Sounds

    📌 Overview

    This dataset is designed for anomaly detection in road traffic sounds.
    - Normal Data: Mel spectrograms of vehicle running sounds
    - Anomalous Data: Mel spectrograms of non-vehicle sounds

    📂 Dataset Structure

    The dataset is organized into two main folders:

    Folder Name | Description | Number of Samples
    road_traffic_noise | Mel spectrograms of vehicle running sounds (normal data) | 1,723
    other_sounds | Mel spectrograms of non-vehicle sounds (anomalous data) | 294

    📷 Data Format

    • File Format: PNG images
    • Image Size: 224 × 251 pixels
    • Sampling Rate: 16,000 Hz
    • Mel Filter Bank Size: 224
    • Time Axis: Horizontal direction
    • Frequency Axis: Vertical direction
    • Time Duration: 1 second per image
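    A numpy-only sketch of producing mel spectrograms with the stated parameters (16 kHz sampling, 224 mel bands, 1-second clips). The FFT size and hop length are assumptions, since the card does not state them, and with 224 bands over a small FFT some low-frequency filters collapse to zero:

```python
import numpy as np

def mel_filterbank(sr=16_000, n_fft=512, n_mels=224):
    """Textbook triangular mel filterbank (parameters follow the card:
    16 kHz audio, 224 mel bands; the FFT size is a guess)."""
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, centre):
            fb[i, j] = (j - left) / max(centre - left, 1)
        for j in range(centre, right):
            fb[i, j] = (right - j) / max(right - centre, 1)
    return fb

def mel_spectrogram(y, sr=16_000, n_fft=512, hop=64):
    """Power mel spectrogram of a 1-D signal; the hop length is chosen
    to give roughly the image widths seen in the dataset."""
    frames = np.lib.stride_tricks.sliding_window_view(y, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return mel_filterbank(sr, n_fft) @ power.T   # shape: (n_mels, n_frames)
```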

  16. Data from: Statistical context dictates the relationship between...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 21, 2019
    Cite
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank (2019). Statistical context dictates the relationship between feedback-related EEG signals and learning [Dataset]. http://doi.org/10.5061/dryad.570pf8n
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 21, 2019
    Dataset provided by
    Dryad
    Authors
    Matthew R. Nassar; Rasmus Bruckner; Michael J. Frank
    Time period covered
    2019
    Description

    201_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data from subject 201
    203_Cannon_FILT_altLow_STIM.mat: cleaned EEG data from participant 203
    204_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 204
    205_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 205
    206_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 206
    207_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 207
    210_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 210
    211_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for subject 211
    212_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 212
    213_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 213
    214_Cannon_FILT_altLow_STIM.mat
    215_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 215
    216_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 216
    229_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for participant 229
    233_Cannon_FILT_altLow_STIM.mat: preprocessed EEG data for particip...

  17. Building and updating software datasets: an empirical assessment

    • zenodo.org
    zip
    Updated Mar 11, 2025
    Cite
    Juan Andrés Carruthers; Juan Andrés Carruthers (2025). Building and updating software datasets: an empirical assessment [Dataset]. http://doi.org/10.5281/zenodo.15008288
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan Andrés Carruthers; Juan Andrés Carruthers
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".

    Data collected

    The data generated for the study can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprises three files, "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects, respectively. Here is a description of the datasets:

    • qualitas: includes code metrics and repository metrics from the projects in the release 20130901r of the Qualitas Corpus.
    • currentSample: includes code metrics and repository metrics from a recent sample collected with our sampling procedure.
    • qualitasUpdated: includes code metrics and repository metrics from an updated version of the Qualitas Corpus applying our maintenance procedure.
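As a rough illustration of the layout described above, the three per-dataset CSV files can be combined with pandas. The column names below are placeholders, not the study's actual metric names:

```python
import pandas as pd

# Stand-ins for one dataset folder's files (placeholder columns).
pd.DataFrame({"project": ["p1"], "loc": [1200]}).to_csv("class.csv", index=False)
pd.DataFrame({"project": ["p1"], "cyclo": [3.1]}).to_csv("method.csv", index=False)
pd.DataFrame({"project": ["p1"], "stars": [42]}).to_csv("sample.csv", index=False)

classes = pd.read_csv("class.csv")   # class metrics
methods = pd.read_csv("method.csv")  # method metrics
meta = pd.read_csv("sample.csv")     # repository metadata

# Attach repository metadata to the class metrics by project.
merged = classes.merge(meta, on="project")
```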

    Plot graphics

    To plot the results and graphics in the article there is a Jupyter Notebook, "Experiment.ipynb". It is initially configured to use the data in the "datasets" folder.

    Replication Kit

    For replication purposes, the datasets containing recent projects from GitHub can be re-generated. To do so, install the dependencies listed in the "requirements.txt" file in a virtual environment, add GitHub tokens to the "./token" file, re-define (or keep) the paths declared as constants (the variables written in caps) in the main method, and finally run the "main.py" script. The portable versions of the source code scanner SourceMeter are located as zip files in the "./Sourcemeter/tool" directory. To install SourceMeter, the appropriate zip file must be decompressed, excluding the root folder "SourceMeter-10.2.0-x64-

    The script comprises 5 steps:

    1. Project retrieval from GitHub: first, a sampling frame of projects complying with specific quality criteria is retrieved from GitHub's API.
    2. Create samples: from the retrieved sampling frame, the current samples are selected (currentSample and qualitasUpdated). In the case of qualitasUpdated, it is important to first have the "sample.csv" file inside the qualitas folder of the dataset originally created for the study; this file contains the metadata of the projects in the Qualitas Corpus.
    3. Project download and analysis: when all the samples are selected from the sampling frame (currentSample and qualitasUpdated), the repositories are downloaded and scanned with SourceMeter. Projects for which the analysis is not possible are replaced with other projects of similar size.
    4. Outlier detection: once the datasets are collected, possible outliers in the code metrics under study must be identified manually. The notebook "Experiment.ipynb" has sections dedicated to this ("Outlier detection (Section 4.2.2)").
    5. Outlier replacement: once outliers are detected, the same notebook has a section for outlier replacement ("Replace Outliers") where the outliers' URLs have to be listed to find appropriate replacements.
    • If required, the metrics from the Qualitas Corpus can also be re-generated. First, download release 20130901r from its official webpage. Second, decompress the downloaded .tar files. Third, make sure the compressed files with the projects' source code (.java files) are placed in the "compressed" folder; in some cases it is necessary to read the "QC_README" file in the project's folder. Finally, run the "Generate metrics for the Qualitas Corpus (QC) dataset" part of the original main script.
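The outlier inspection in step 4 is manual; as a rough aid, a simple interquartile-range rule over a single code metric can flag candidate projects. The metric values and thresholds below are illustrative, not the study's actual procedure:

```python
import numpy as np

loc = np.array([120, 140, 135, 150, 128, 9000])  # toy per-project metric values
q1, q3 = np.percentile(loc, [25, 75])
iqr = q3 - q1

# Flag values outside the classic 1.5*IQR fences as outlier candidates.
outliers = loc[(loc < q1 - 1.5 * iqr) | (loc > q3 + 1.5 * iqr)]
print(outliers)  # [9000]
```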
  18. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI)

    • elki-project.github.io
    • explore.openaire.eu
    • +2more
    Updated Sep 2, 2011
    Cite
    Erich Schubert; Arthur Zimek (2011). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Explore at:
    Dataset updated
    Sep 2, 2011
    Dataset provided by
    TU Dortmund University
    University of Southern Denmark, Denmark
    Authors
    Erich Schubert; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do not, however, resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
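The downsampling idea mentioned above (turning one object class into an outlier class) can be sketched with synthetic labels; the class sizes and kept fraction here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)   # 10 toy objects, 100 images each

# Keep all images of the inlier objects and only 5 images of object 9,
# so object 9 behaves as a rare (outlier) class.
rare = np.flatnonzero(labels == 9)
keep = np.concatenate([np.flatnonzero(labels != 9),
                       rng.choice(rare, size=5, replace=False)])
print(len(keep))  # 905
```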

  19. DeformedTissue Dataset

    • heidata.uni-heidelberg.de
    txt, zip
    Updated Apr 10, 2025
    Cite
    Sara Monji Azad; Sara Monji Azad; Claudia Scherl; David Männle; Claudia Scherl; David Männle (2025). DeformedTissue Dataset [Dataset]. http://doi.org/10.11588/DATA/OAUXWS
    Explore at:
    Available download formats: zip(2491037553), zip(719071), zip(712034810), zip(2898531610), txt(4878), zip(2913417023)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    heiDATA
    Authors
    Sara Monji Azad; Sara Monji Azad; Claudia Scherl; David Männle; Claudia Scherl; David Männle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    AiF
    MWK Baden-Württemberg, DFG
    Description

    Tissue deformation is a critical issue in soft-tissue surgery, particularly during tumor resection, as it causes landmark displacement, complicating tissue orientation. The authors conducted an experimental study on 45 pig head cadavers to simulate tissue deformation, approved by the Mannheim Veterinary Office (DE 08 222 1019 21). We used 3D cameras and head-mounted displays to capture tissue shapes before and after controlled deformation induced by heating. The data were processed using software such as Meshroom, MeshLab, and Blender to create and evaluate 2½D meshes. The dataset includes different levels of deformation, noise, and outliers, generated using the same approach as the SynBench dataset.

    1. Deformation_Level: 10 different deformation levels are considered; 0.1 and 0.7 represent the minimum and maximum deformation, respectively. Source and target files are available in each folder; the deformation process is applied only to the target files. For simplicity, the source file corresponding to each target is available in the same folder with the same name, with source files starting with Source_ and target files starting with Target_. The number after Source_ and Target_ identifies the primitive object in the "Data" folder; for example, Target_3 is generated from object number 3 in the "Data" folder. The two other numbers in the file name represent the percentage of control points and the width of the Gaussian radial basis function, respectively.
    2. Noisy_Data: for all files in the "Deformation_Level" folder (all deformation levels), noisy data are generated at 4 noise levels, namely 0.01, 0.02, 0.03, and 0.04 (more details on the implementation can be found in the paper). File names match those in the "Deformation_Level" folder.
    3. Outlier_Data: for all files in the "Deformation_Level" folder (all deformation levels), data with outliers are generated at 5 outlier levels, namely 5%, 15%, 25%, 35%, and 45% (more details on the implementation can be found in the paper). File names match those in the "Deformation_Level" folder. In addition, each file has a companion file with the same name but starting with "Outlier_", containing a matrix with the coordinates of the outliers; these files can therefore serve as benchmarks to check the validity of future algorithms.

    Additional notes: because all challenges are generated under small to large deformation levels, the DeformedTissue dataset lets users select data matched to the capability of their proposed method and show how robust their methods are to complex challenges.
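A small helper for the file-naming convention described above (object number, percentage of control points, RBF width after the Source_/Target_ prefix). The underscore separators and field types are an assumption; adjust to the actual filenames in the dataset:

```python
def parse_name(stem: str) -> dict:
    """Parse a hypothetical name like 'Target_3_10_2.0' into its fields."""
    role, obj, cp_pct, width = stem.split("_")
    return {"role": role,                      # 'Source' or 'Target'
            "object": int(obj),                # object number in the "Data" folder
            "control_points_pct": int(cp_pct), # percentage of control points
            "rbf_width": float(width)}         # Gaussian RBF width

info = parse_name("Target_3_10_2.0")
print(info)  # {'role': 'Target', 'object': 3, 'control_points_pct': 10, 'rbf_width': 2.0}
```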

  20. Data from: Subtle limits to connectivity revealed by outlier loci within two divergent metapopulations of the deep-sea hydrothermal gastropod Ifremeria nautilei

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Feb 28, 2022
    Cite
    Adrien Tran Lu Y; Stephanie Ruault; Claire Daguin-Thiébaut; Jade Castel; Nicolas Bierne; Thomas Broquet; Patrick Wincker; Aude Perdereau; Sophie Arnaud-Haond; Pierre-Alexandre Gagnaire; Didier Jollivet; Stephane Hourdez; François Bonhomme (2022). Subtle limits to connectivity revealed by outlier loci within two divergent metapopulations of the deep-sea hydrothermal gastropod Ifremeria nautilei [Dataset]. http://doi.org/10.5061/dryad.ffbg79cwq
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    Ifremer
    Genoscope
    Sorbonne Université
    Institute of Evolutionary Science of Montpellier
    Authors
    Adrien Tran Lu Y; Stephanie Ruault; Claire Daguin-Thiébaut; Jade Castel; Nicolas Bierne; Thomas Broquet; Patrick Wincker; Aude Perdereau; Sophie Arnaud-Haond; Pierre-Alexandre Gagnaire; Didier Jollivet; Stephane Hourdez; François Bonhomme
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Hydrothermal vents form archipelagos of ephemeral deep-sea habitats that raise interesting questions about the evolution and dynamics of the associated endemic fauna, constantly subject to extinction-recolonization processes. These metal-rich environments are coveted for the mineral resources they harbor, thus raising recent conservation concerns. The evolutionary fate and demographic resilience of hydrothermal species strongly depend on the degree of connectivity among and within their fragmented metapopulations. In the deep sea, however, assessing connectivity is difficult and usually requires indirect genetic approaches. Improved detection of fine-scale genetic connectivity is now possible based on genome-wide screening for genetic differentiation. Here, we explored population connectivity in the hydrothermal vent snail Ifremeria nautilei across its species range encompassing five distinct back-arc basins in the Southwest Pacific. The global analysis, based on 10 570 single nucleotide polymorphism (SNP) markers derived from double digest restriction-site associated DNA sequencing (ddRAD-seq), depicted two semi-isolated and homogeneous genetic clusters. Demo-genetic modeling suggests that these two groups began to diverge about 70 000 generations ago, but continue to exhibit weak and slightly asymmetrical gene flow. Furthermore, a careful analysis of outlier loci showed subtle limitations to connectivity between neighboring basins within both groups. This finding indicates that migration is not strong enough to totally counterbalance drift or local selection, hence questioning the potential for demographic resilience at this latter geographical scale. These results illustrate the potential of large genomic datasets to understand fine-scale connectivity patterns in hydrothermal vents and the deep sea. 
    Methods: VCF datasets were generated de novo with Stacks v2.52 from reads produced by the protocols used and provided in the manuscript. Sample-associated metadata were collected during field sampling.
