Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for Outliers Detection task.The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).We build MNIST4OD in the following way:To distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such as their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 X 28) into vectors.Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data contains also a column which indicates the original image class (0-9).See the following numbers for a complete list of the statistics of each datasets ( Name | Instances | Dimensions | Number of Outliers in % ):MNIST_0 | 7594 | 784 | 10MNIST_1 | 8665 | 784 | 10MNIST_2 | 7689 | 784 | 10MNIST_3 | 7856 | 784 | 10MNIST_4 | 7507 | 784 | 10MNIST_5 | 6945 | 784 | 10MNIST_6 | 7564 | 784 | 10MNIST_7 | 8023 | 784 | 10MNIST_8 | 7508 | 784 | 10MNIST_9 | 7654 | 784 | 10
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:
Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:
Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These datasets can be used for benchmarking unsupervised anomaly detection algorithms (for example "Local Outlier Factor" LOF). The datasets have been obtained from multiple sources and are mainly based on datasets originally used for supervised machine learning. By publishing these modifications, a comparison of different algorithms is now possible for unsupervised anomaly detection.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Title: Gender Classification Dataset
Description: This dataset contains anonymized information on height, weight, age, and gender of 10,000 individuals. The data is equally distributed between males and females, with 5,000 samples for each gender. The purpose of this dataset is to provide a comprehensive sample for studies and analyses related to physical attributes and demographics.
Content: The CSV file contains the following columns:
Gender: The gender of the individual (Male/Female) Height: The height of the individual in centimeters Weight: The weight of the individual in kilograms Age: The age of the individual in years
License: This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license. This means you are free to share the data, provided that you attribute the source, do not use it for commercial purposes, and do not distribute modified versions of the data.
Usage:
This dataset can be used for: - Analyzing the distribution of height, weight, and age across genders - Developing and testing machine learning models for predicting physical attributes - Educational purposes in statistics and data science courses
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample(s) removed as outliers in each iteration of MFMW-outlier for all the six microarray datasets.
This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets." (In submission) The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three datasets are chosen from the UCI machine learning repository in this study, which have been extensively adopted in data-driven researches, including Australian and Japanese datasets (Asuncion & Newman, 2007), and Polish bankruptcy dataset (Zięba et al., 2016). The three datasets contain different numbers of samples and features. Each sample in a credit dataset can be classified into good credit or bad credit. The size of Australian credit dataset is 690, with 307 samples in good credit and 383 in bad, and its feature dimension is 14, with 6 numerical and 8 categorical features. The size of Japanese credit dataset is 690, with 307 samples in good credit and 383 in bad, and its feature dimension is 15, with 6 numerical and 9 categorical features. Similarly, there are 7027 samples in Polish bankruptcy dataset, with 6756 samples in good credit and 271 in bad, and its 64 input features are numerical. All the dimensions of the input features of the three datasets listed in Table 1 do not include the class labels.
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater sample collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on reason(s) for considering a concentration as an outlier are included.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.
Key Features:
This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A sample FMRI dataset with functional runs and task files.
For project on automatic outlier detection.
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is designed for anomaly detection in road traffic sounds.
- Normal Data: Mel spectrograms of vehicle running sounds
- Anomalous Data: Mel spectrograms of non-vehicle sounds
The dataset is organized into two main folders:
Folder Name | Description | Number of Samples |
---|---|---|
road_traffic_noise | Contains Mel spectrograms of vehicle running sounds (normal data). | 1,723 |
other_sounds | Contains Mel spectrograms of non-vehicle sounds (anomalous data). | 294 |
224 × 251 pixels
16,000 Hz
224
201_Cannon_FILT_altLow_STIM.matpreprocessed EEG data from subject 201203_Cannon_FILT_altLow_STIM.matCleaned EEG data from participant 203204_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for subject 204205_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for subject 205206_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for subject 206207_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for subject 207210_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for subject 210211_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for subject 211212_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for participant 212213_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for participant 213214_Cannon_FILT_altLow_STIM.mat215_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for participant 215216_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for participant 216229_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for participant 229233_Cannon_FILT_altLow_STIM.matpreprocessed EEG data for particip...
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".
The data generated for the study it can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprised three files "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects respectively. Here is a description of the datasets:
To plot the results and graphics in the article there is a Jupyter Notebook "Experiment.ipynb". It is initially configured to use the data in "datasets" folder.
For replication purposes, the datasets containing recent projects from Github can be re-generated. To do so, the virtual environment must have installed the dependencies in "requirements.txt" file, add Github's tokens in "./token" file, re-define or leave as is the paths declared in the constants (variables written in caps) in the main method, and finally run "main.py" script. The portable versions of the source code scanner Sourcemeter are located as zip files in "./Sourcemeter/tool" directory. To install Sourcemeter the appropriate zip file must be decompressed excluding the root folder "SourceMeter-10.2.0-x64-
The script comprise 5 steps:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tissue deformation is a critical issue in soft-tissue surgery, particularly during tumor resection, as it causes landmark displacement, complicating tissue orientation. The authors conducted an experimental study on 45 pig head cadavers to simulate tissue deformation, approved by the Mannheim Veterinary Office (DE 08 222 1019 21). We used 3D cameras and head-mounted displays to capture tissue shapes before and after controlled deformation induced by heating. The data were processed using software such as Meshroom, MeshLab, and Blender to create and evaluate 2½D meshes. The dataset includes different levels of deformation, noise, and outliers, generated using the same approach as the SynBench dataset. 1. Deformation_Level: 10 different deformation levels are considered. 0.1 and 0.7 are representing minimum and maximum deformation, respectively. Source and target files are available in each folder. The deformation process is just applied to target files. For simplicity, the corresponding source files to the target ones are available in this folder with the same name, but source ones start with Source_ and the target files start with Target_. The number after Source_ and Target_ represents the primitive object in the “Data” folder. For example, Target_3 represents that this file is generated from object number 3 in the “Data” folder. The two other numbers in the file name represent the percentage number of control points and the width of the Gaussian radial basis function, respectively. 2. Noisy_Data For all available files in the “Deformation_Level” folder (for all deformation levels), Noisy data is generated. They are generated in 4 different noise levels namely, 0.01, 0.02, 0.03, and 0.04 (More explanation about implementation can be found in the paper). The name of the files is the same as the files in the “Deformation_Level” folder. 3. Outlier_Data For all available files in the “Deformation_Level” folder (for all deformation levels), data with outliers is generated. They are generated in different outlier levels, in 5 categories, namely, 5%, 15%, 25%, 35%, and 45% (More explanation about implementation can be found in the paper). The name of the files is the same as the files in the “Deformation_Level” folder. Furthermore, for each file, there is one additional file with the same name but is started with “Outlier_”. This represents a matrix with the coordinates of outliers. Then, it would be possible to use these files as benchmarks to check the validity of future algorithms. Additional notes: Considering the fact that all challenges are generated under small to large deformation levels, the DeformedTissue dataset makes it possible for users to select their desired data based on the ability of their proposed method, to show how robust to complex challenges their methods are.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Hydrothermal vents form archipelagos of ephemeral deep-sea habitats that raise interesting questions about the evolution and dynamics of the associated endemic fauna, constantly subject to extinction-recolonization processes. These metal-rich environments are coveted for the mineral resources they harbor, thus raising recent conservation concerns. The evolutionary fate and demographic resilience of hydrothermal species strongly depend on the degree of connectivity among and within their fragmented metapopulations. In the deep sea, however, assessing connectivity is difficult and usually requires indirect genetic approaches. Improved detection of fine-scale genetic connectivity is now possible based on genome-wide screening for genetic differentiation. Here, we explored population connectivity in the hydrothermal vent snail Ifremeria nautilei across its species range encompassing five distinct back-arc basins in the Southwest Pacific. The global analysis, based on 10 570 single nucleotide polymorphism (SNP) markers derived from double digest restriction-site associated DNA sequencing (ddRAD-seq), depicted two semi-isolated and homogeneous genetic clusters. Demo-genetic modeling suggests that these two groups began to diverge about 70 000 generations ago, but continue to exhibit weak and slightly asymmetrical gene flow. Furthermore, a careful analysis of outlier loci showed subtle limitations to connectivity between neighboring basins within both groups. This finding indicates that migration is not strong enough to totally counterbalance drift or local selection, hence questioning the potential for demographic resilience at this latter geographical scale. These results illustrate the potential of large genomic datasets to understand fine-scale connectivity patterns in hydrothermal vents and the deep sea. Methods VCF datasets were generated “de novo” with Stacks V.2.52 from reads produce by the protocols used and provided in the manuscript.Sample associated metadata were collected during field sampling.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for Outliers Detection task.The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).We build MNIST4OD in the following way:To distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such as their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 X 28) into vectors.Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data contains also a column which indicates the original image class (0-9).See the following numbers for a complete list of the statistics of each datasets ( Name | Instances | Dimensions | Number of Outliers in % ):MNIST_0 | 7594 | 784 | 10MNIST_1 | 8665 | 784 | 10MNIST_2 | 7689 | 784 | 10MNIST_3 | 7856 | 784 | 10MNIST_4 | 7507 | 784 | 10MNIST_5 | 6945 | 784 | 10MNIST_6 | 7564 | 784 | 10MNIST_7 | 8023 | 784 | 10MNIST_8 | 7508 | 784 | 10MNIST_9 | 7654 | 784 | 10