Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we present MNIST4OD, a dataset of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).
We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and we sample uniformly at random from the remaining images as outliers, such that their number is equal to 10% of that of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors.
Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9).
The statistics of each dataset are listed below:
Name | Instances | Dimensions | Number of Outliers in %
MNIST_0 | 7594 | 784 | 10
MNIST_1 | 8665 | 784 | 10
MNIST_2 | 7689 | 784 | 10
MNIST_3 | 7856 | 784 | 10
MNIST_4 | 7507 | 784 | 10
MNIST_5 | 6945 | 784 | 10
MNIST_6 | 7564 | 784 | 10
MNIST_7 | 8023 | 784 | 10
MNIST_8 | 7508 | 784 | 10
MNIST_9 | 7654 | 784 | 10
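
The generation procedure described above can be sketched roughly as follows. This is a minimal illustration on toy labels, not the authors' script: the function name `build_outlier_dataset`, the rounding rule and the fixed seed are assumptions made for the example.

```python
import random

def build_outlier_dataset(labels, inlier_digit, outlier_ratio=0.10, seed=0):
    """Return shuffled (index, is_outlier) pairs for one inlier digit.

    All samples of `inlier_digit` are kept as inliers; outliers are drawn
    uniformly at random from the other digits until they amount to
    `outlier_ratio` times the number of inliers (rounding is an assumption).
    """
    rng = random.Random(seed)
    inliers = [i for i, y in enumerate(labels) if y == inlier_digit]
    others = [i for i, y in enumerate(labels) if y != inlier_digit]
    n_outliers = round(outlier_ratio * len(inliers))
    outliers = rng.sample(others, n_outliers)  # uniform, without replacement
    rows = [(i, False) for i in inliers] + [(i, True) for i in outliers]
    rng.shuffle(rows)
    return rows

# Toy stand-in for the MNIST labels: 100 samples of each digit 0-9.
toy_labels = [digit for digit in range(10) for _ in range(100)]
dataset = build_outlier_dataset(toy_labels, inlier_digit=1)
```

Repeating the call for `inlier_digit` in 0 to 9 would give one dataset per digit, mirroring the ten MNIST_x files.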

Understanding the statistics of fluctuation-driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs in cases where the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid or invalid, we find that such a classification may be ambiguous for up to 40% of the data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that most strongly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic. In this article we highlight the fact that it can lead to invalid inference, and we show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.

The following report outlines the workflow used to optimize your Find Outliers result:
Initial Data Assessment
There were 721 valid input features.
GRM Properties:
Min: 0.0000
Max: 157.0200
Mean: 9.1692
Std. Dev.: 8.4220
There were 4 outlier locations; these will not be used to compute the optimal fixed distance band.
Scale of Analysis
The optimal fixed distance band selected was based on peak clustering found at 1894.5039 Meters.
Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 248 output features statistically significant based on a FDR correction for multiple testing and spatial dependence.
There are 30 statistically significant high outlier features.
There are 7 statistically significant low outlier features.
There are 202 features part of statistically significant low clusters.
There are 9 features part of statistically significant high clusters.
Output
Pink output features are part of a cluster of high GRM values.
Light Blue output features are part of a cluster of low GRM values.
Red output features represent high outliers within a cluster of low GRM values.
Blue output features represent low outliers within a cluster of high GRM values.
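
The report mentions a reference distribution built from 499 permutations and an FDR correction. As a rough illustration of how those two pieces can fit together, the sketch below computes a permutation pseudo p-value and applies the Benjamini-Hochberg procedure. Both function names are hypothetical, and the report does not state which exact FDR procedure the tool applies, so this is an assumption rather than a description of the tool's internals.

```python
def pseudo_p_value(observed, permuted):
    """Pseudo p-value: share of reference statistics at least as extreme
    as the observed one, counting the observed value itself."""
    extreme = sum(1 for value in permuted if abs(value) >= abs(observed))
    return (extreme + 1) / (len(permuted) + 1)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if significant under BH FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant

# With 499 permutations the smallest attainable pseudo p-value is 1/500.
smallest_p = pseudo_p_value(5.0, [0.0] * 499)
flags = benjamini_hochberg([0.002, 0.5, 0.01, 0.04, 0.2], alpha=0.05)
```

In this toy run only the two smallest p-values survive the correction, even though 0.04 would pass an uncorrected 0.05 threshold.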

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cylindrical data are bivariate data formed from the combination of circular and linear variables. Identifying outliers is a crucial step in any data analysis work. This paper proposes a new distribution-free procedure to detect outliers in cylindrical data using the Mahalanobis distance concept. The use of Mahalanobis distance incorporates the correlation between the components of the cylindrical distribution, which had not been accounted for in the earlier papers on outlier detection in cylindrical data. The threshold for declaring an observation to be an outlier can be obtained via parametric or non-parametric bootstrap, depending on whether the underlying distribution is known or unknown. The performance of the proposed method is examined via extensive simulations from the Johnson-Wehrly distribution. The proposed method is applied to two real datasets, and the outliers are identified in those datasets.

This dataset was created by Bharat Gokhale

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets were originally created for the following publications:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
H.-P. Kriegel, E. Schubert, A. Zimek. Evaluation of Multiple Clustering Solutions. In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011, Athens, Greece, 2011.
The outlier data set versions were introduced in:
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.
They are derived from the original image data available at https://aloi.science.uva.nl/
The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005
Additional information is available at: https://elki-project.github.io/datasets/multi_view
The following views are currently available:
Feature type | Description | Files
Object number | Sparse 1000-dimensional vectors that give the true object assignment | objs.arff.gz
RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz
HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz
Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz
Front to back | Vectors representing front face vs. back faces of individual objects | front.arff.gz
Basic light | Vectors indicating basic light situations | light.arff.gz
Manual annotations | Manually annotated object groups of semantically related objects such as cups | manual1.arff.gz
Outlier Detection Versions
Additionally, we generated a number of subsets for outlier detection:
Feature type | Description | Files
RGB Histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
RGB Histograms | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
RGB Histograms | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz

The following report outlines the workflow used to optimize your Find Outliers result:
Initial Data Assessment
There were 1684 valid input features.
POVERTY Properties:
Min: 0.0000
Max: 91.8000
Mean: 18.9902
Std. Dev.: 12.7152
There were 22 outlier locations; these will not be used to compute the optimal fixed distance band.
Scale of Analysis
The optimal fixed distance band was based on the average distance to 30 nearest neighbors: 3709.0000 Meters.
Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 1155 output features statistically significant based on a FDR correction for multiple testing and spatial dependence.
There are 68 statistically significant high outlier features.
There are 84 statistically significant low outlier features.
There are 557 features part of statistically significant low clusters.
There are 446 features part of statistically significant high clusters.
Output
Pink output features are part of a cluster of high POVERTY values.
Light Blue output features are part of a cluster of low POVERTY values.
Red output features represent high outliers within a cluster of low POVERTY values.
Blue output features represent low outliers within a cluster of high POVERTY values.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains product reports from different companies. We need to find a real solution to detect outliers and inliers in the data each company reports regarding their product costs. This will help in identifying any discrepancies in reported prices. We have to find an algorithm that can detect outlier and inlier datasets effectively.
1. org_id: A numerical identifier for an organization.
2. year: The year when the data was recorded.
3. month: The month when the data was recorded.
4. product_code: A code that identifies a product.
5. sub_product_code: A sub-code that further identifies specifics of the product.
6. value: A numerical value associated with the product, which could represent quantities, monetary value, or another metric depending on the context.
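
One possible starting point for the task described above is a robust (median/MAD) z-score computed within each organization and product. The sketch below is only an assumption about what a reasonable baseline might look like: the grouping key, the function name `flag_value_outliers`, the 3.5 cutoff and the toy records are all illustrative and do not come from the dataset.

```python
from collections import defaultdict
from statistics import median

def flag_value_outliers(records, threshold=3.5):
    """Flag records whose `value` is far from the median of its
    (org_id, product_code) group, using the modified z-score
    0.6745 * (x - median) / MAD."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["org_id"], rec["product_code"])].append(rec)

    flagged = []
    for recs in groups.values():
        values = [r["value"] for r in recs]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread in this group; the score is undefined
        for r in recs:
            score = 0.6745 * (r["value"] - med) / mad
            if abs(score) > threshold:
                flagged.append(r)
    return flagged

# Hypothetical monthly reports for one organization and one product.
toy_records = [
    {"org_id": 1, "year": 2023, "month": m, "product_code": 7,
     "sub_product_code": 1, "value": v}
    for m, v in enumerate([10, 11, 9, 10, 12, 95], start=1)
]
suspicious = flag_value_outliers(toy_records)
```

Median-based statistics are used here because a single extreme report would inflate a mean and standard deviation enough to partly hide itself.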

There are three files containing Stata data, and do and log-files. These are associated with the empirical models reported in the replication study, “Outlier Analysis: Natural Resources and Immigration Policy,” POLS ONE. Questions or comments regarding these materials should be directed to Seung-Whan Choi, Department of Political Science, University of Illinois at Chicago. His email address is whanchoi@uic.edu and his homepage address is https://whanchoi.people.uic.edu/.

https://spdx.org/licenses/CC0-1.0.html
The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compiling species occurrences by time and marking as outliers calibrated fractions of the youngest and oldest occurrence data for each species. A subset of biostratigraphic marker species whose ranges have been previously documented is used to calibrate the fraction of occurrences to mark as outliers. These outlier occurrences are compiled for samples, and profiles of outlier frequency are made from the sections used to compile the data; the profiles can then identify samples and sections with problematic data caused, for example, by taxonomic errors, incorrect age models, or reworking of sediment. These samples/sections can then be targeted for re-study.
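
The trimming step described above can be sketched as follows. This is an illustrative reading of the description, not the published Pacman implementation: the function names, the use of a single pair of fixed fractions (rather than fractions calibrated on marker species) and the toy occurrence data are assumptions.

```python
def mark_range_end_outliers(occurrences, young_fraction=0.05, old_fraction=0.05):
    """For each species, flag the given fractions of its youngest and
    oldest occurrences. `occurrences` maps species -> [(sample_id, age)]."""
    flagged = {}
    for species, records in occurrences.items():
        ordered = sorted(records, key=lambda rec: rec[1])  # smallest age first
        n_young = round(young_fraction * len(ordered))
        n_old = round(old_fraction * len(ordered))
        flagged[species] = ordered[:n_young] + ordered[len(ordered) - n_old:]
    return flagged

def outlier_counts_by_sample(flagged):
    """Count how many flagged occurrences each sample contributes."""
    counts = {}
    for records in flagged.values():
        for sample_id, _age in records:
            counts[sample_id] = counts.get(sample_id, 0) + 1
    return counts

# Hypothetical data: two species, 20 occurrences each, ages in Ma.
toy_occurrences = {
    "species_a": [(f"s{i}", 10.0 + 0.1 * i) for i in range(18)]
                 + [("s_young", 3.0), ("s_reworked", 25.0)],
    "species_b": [(f"s{i}", 12.0 + 0.1 * i) for i in range(18)]
                 + [("s0b", 11.0), ("s_reworked", 30.0)],
}
flagged = mark_range_end_outliers(toy_occurrences)
counts = outlier_counts_by_sample(flagged)
```

A sample that collects flagged occurrences from several species, like the hypothetical `s_reworked` here, is the kind of sample the outlier-frequency profiles are meant to expose.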

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article proposes a framework that provides early detection of anomalous series within a large collection of nonstationary streaming time-series data. We define an anomaly as an observation that is very unlikely given the recent distribution of a given system. The proposed framework first calculates a boundary for the system’s typical behavior using extreme value theory. Then a sliding window is used to test for anomalous series within a newly arrived collection of series. The model uses time series features as inputs, and a density-based comparison to detect any significant changes in the distribution of the features. Using various synthetic and real-world datasets, we demonstrate the wide applicability and usefulness of our proposed framework. We show that the proposed algorithm can work well in the presence of noisy nonstationary data within multiple classes of time series. This framework is implemented in the open source R package oddstream. R code and data are available in the online supplementary materials.

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The spatial signature of microevolutionary processes structuring genetic variation may play an important role in the detection of loci under selection. However, the spatial location of samples has not yet been used to quantify this. Here, we present a new two-step method of spatial outlier detection at the individual and deme levels using the power spectrum of Moran eigenvector maps (MEM). The MEM power spectrum quantifies how the variation in a variable, such as the frequency of an allele at a SNP locus, is distributed across a range of spatial scales defined by MEM spatial eigenvectors. The first step (Moran spectral outlier detection: MSOD) uses genetic and spatial information to identify outlier loci by their unusual power spectrum. The second step uses Moran spectral randomization (MSR) to test the association between outlier loci and environmental predictors, accounting for spatial autocorrelation. Using simulated data from two published papers, we tested this two-step method in different scenarios of landscape configuration, selection strength, dispersal capacity and sampling design. Under scenarios that included spatial structure, MSOD alone was sufficient to detect outlier loci at the individual and deme levels without the need for incorporating environmental predictors. Follow-up with MSR generally reduced (already low) false-positive rates, though in some cases led to a reduction in power. The results were surprisingly robust to differences in sample size and sampling design. Our method represents a new tool for detecting potential loci under selection with individual-based and population-based sampling by leveraging spatial information that has hitherto been neglected.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:
Astrophysics - detecting anomalous observations in the Dark Energy Survey (DES) catalog (data type: feature vectors)
Planetary science - selecting novel geologic targets for follow-up observation onboard the Mars Science Laboratory (MSL) rover (data type: grayscale images)
Earth science - detecting anomalous samples in satellite time series corresponding to ground-truth observations of maize crops (data type: time series/feature vectors)
Fashion-MNIST/MNIST: benchmark task to detect anomalous MNIST images among Fashion-MNIST images (data type: grayscale images)
Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:
Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.

https://spdx.org/licenses/CC0-1.0.html
Genetic differentiation is characteristically weak in marine species, making assessments of population connectivity and structure difficult. However, the advent of genomic methods has increased genetic resolution, enabling studies to detect weak but significant population differentiation within marine species. With an increasing number of studies employing high-resolution genome-wide techniques, we are realising that the connectivity of marine populations is often complex, and that quantifying this complexity can provide an understanding of the processes shaping the genetic structure of marine species and inform long-term, sustainable management strategies. This study aims to assess the genetic structure, connectivity and local adaptation of the Eastern Rock Lobster (Sagmariasus verreauxi), which has a maximum pelagic larval duration of 12 months and inhabits both subtropical and temperate environments. We used 645 neutral and 15 outlier SNPs to genotype lobsters collected from the only two known breeding populations and a third episodic population, together encompassing S. verreauxi’s known range. Through examination of the neutral SNP panel, we detected genetic homogeneity across the three regions, which extended across the Tasman Sea, encompassing both Australian and New Zealand populations. We discuss differences in the neutral genetic signature of S. verreauxi and a closely related, co-distributed rock lobster, Jasus edwardsii, determining a regional pattern of genetic disparity between the species, which have largely similar life histories. Examination of the outlier SNP panel detected weak genetic differentiation between the three regions. Outlier SNPs showed promise in assigning individuals to their sampling origin and may prove useful as a management tool for species exhibiting genetic homogeneity.

https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Dynamic Apparel Sales with Anomalies Dataset is based on 100,000 sales transactions from the fashion industry and includes extreme outliers, missing values, and sales_categories, reflecting the varied data characteristics of real retail environments.
2) Data Utilization
(1) Characteristics of the Dynamic Apparel Sales with Anomalies Dataset:
• The dataset consists of 9 categorical variables and 10 numerical variables, including product name, brand, gender clothing, price, discount rate, inventory level, and customer behavior, making it suitable for analyzing product and customer characteristics.
(2) The Dynamic Apparel Sales with Anomalies Dataset can be used for:
• Sales anomaly detection and quality control: transaction data with outliers and missing values can be used to detect outliers, manage quality, refine data, and develop outlier processing techniques.
• Sales forecasting and customer analysis modeling: the variety of product and customer characteristics can support data-driven decision-making, such as machine learning-based sales forecasting, customer segmentation, and customized marketing strategies.

The COVID-19 pandemic has led to enormous movements in economic data that strongly affect parameters and forecasts obtained from standard VARs. One way to address these issues is to model extreme observations as random shifts in the stochastic volatility (SV) of VAR residuals. Specifically, we propose VAR models with outlier-augmented SV that combine transitory and persistent changes in volatility. The resulting density forecasts for the COVID-19 period are much less sensitive to outliers in the data than standard VARs. Evaluating forecast performance over the last few decades, we find that outlier-augmented SV schemes do at least as well as a conventional SV model. Predictive Bayes factors indicate that our outlier-augmented SV model provides the best data fit for the period since the pandemic’s outbreak, as well as for earlier subsamples of relatively high volatility. This version has been accepted for publication in The Review of Economics and Statistics.

Comprehensive YouTube channel statistics for Outliers Overland, featuring 110,000 subscribers and 21,783,002 total views. This dataset includes detailed performance metrics such as subscriber growth, video views, engagement rates, and estimated revenue. The channel operates in the Lifestyle category and is based in the US. Track 1,011 videos with daily and monthly performance data, including view counts, subscriber changes, and earnings estimates. Analyze growth trends, engagement patterns, and compare performance against similar channels in the same category.

Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building’s systems can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building’s health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system’s health as well as localize the problem to one, or possibly two, components. With this level of feedback, the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.
Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and was therefore chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis ran from July 1st through August 10th, 2009.
Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, whereas days with lower overall correlations were considered more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can then be applied to a set of test data to assess how anomalous that data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.
Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system:
- 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT); Calculated Average Return Air Temperature
- 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT); Calculated Average Return Air Temperature
IMS identified significantly higher temperatures compared to other days during the months of July and August.
Conclusion: The proposed algorithms Orca and IMS were able to pick up significant anomalies in the building system as well as diagnose each anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data to produce a real-time anomaly score that helps building maintenance with the detection and diagnosis of problems.
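
The day-to-day comparison of score profiles described in the Methodology can be sketched as follows. This only illustrates the correlation step, under the assumption that each day has already been reduced to a list of per-sample anomaly scores. It does not implement Orca or IMS, and the function names, day names and toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences
    (assumes neither sequence is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mean_correlation_per_day(profiles):
    """For each day, average the correlation of its score profile
    with the profiles of all other days."""
    result = {}
    for day, profile in profiles.items():
        others = [pearson(profile, other)
                  for name, other in profiles.items() if name != day]
        result[day] = sum(others) / len(others)
    return result

# Hypothetical score profiles: three similar days and one unusual day.
toy_profiles = {
    "day_a": [1.0, 2.0, 3.0, 4.0, 3.0, 2.0],
    "day_b": [1.1, 2.0, 3.2, 3.9, 3.1, 2.0],
    "day_c": [0.9, 2.1, 2.9, 4.1, 2.8, 2.2],
    "day_d": [4.0, 1.0, 4.0, 1.0, 4.0, 1.0],
}
mean_corr = mean_correlation_per_day(toy_profiles)
most_anomalous = min(mean_corr, key=mean_corr.get)
```

Days with a high mean correlation would then serve as the reference set for a model-based method, while low-correlation days are candidates for closer inspection.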

https://creativecommons.org/publicdomain/zero/1.0/
Your task is to write a small Python or R script that predicts the engine rating based on the inspection parameters using only the provided dataset. You need to find all the cases/outliers where the rating has been given incorrectly as compared to the current condition of the engine.
This task is designed to test your Python or R ability, your knowledge of Data Science techniques, your ability to find trends, and outliers, the relative importance of variables with deviation in target variable, and your ability to work effectively, efficiently, and independently within a commercial setting.
This task is also designed to test your hyperparameter-tuning abilities or lateral thinking.
Deliverables:
· One Python or R script
· One requirements text file including an exhaustive list of packages and version numbers used in your solution
· Summary of your insights
· List of cases that are outliers/incorrectly rated as high or low, backed with analysis/reasons
· Model object files for reproducibility
Your solution should at a minimum do the following:
· Load the data into memory
· Prepare the data for modeling
· EDA of the variables
· Build a model on training data
· Test the model on testing data
· Provide some measure of performance
· Outlier analysis and detection