100+ datasets found
  1. Data from: Methodology to filter out outliers in high spatial density data to improve maps reliability

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Explore at:
    jpeg
    Available download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, an isotropic, and an anisotropic local analysis of the data, which considered the respective neighborhood values. For that purpose, we used the median as the main statistical parameter to classify a given spatial point in the data set, taking into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85%, 97%, and 79% in corn yield, soil ECa, and SVI, respectively, compared to interpolation errors of the raw data sets. The filter excluded local outliers, which considerably reduced the nugget effects, reducing the estimation error of the interpolated data. The methodology proposed in this work performed better at removing outlier data than two other methodologies from the literature.
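    The local-median core of such a filter lends itself to a compact implementation. Below is a minimal sketch of that idea only (the published filter adds global and anisotropic passes); the radius and deviation threshold are illustrative parameters, not the authors' calibrated values:

```python
import numpy as np
from scipy.spatial import cKDTree

def median_radius_filter(coords, values, radius=10.0, max_dev=2.0):
    """Keep points whose value stays close to the median of neighbors
    within `radius`; deviation is measured in robust (MAD) units."""
    coords = np.asarray(coords)
    values = np.asarray(values, dtype=float)
    tree = cKDTree(coords)
    keep = np.ones(len(values), dtype=bool)
    for i, point in enumerate(coords):
        idx = [j for j in tree.query_ball_point(point, r=radius) if j != i]
        if not idx:
            continue                      # isolated point: nothing to compare
        local = values[idx]
        med = np.median(local)
        mad = max(np.median(np.abs(local - med)), 1e-9)
        keep[i] = abs(values[i] - med) <= max_dev * 1.4826 * mad
    return keep

# usage sketch: mask = median_radius_filter(xy, corn_yield, radius=15.0)
```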

  2. Identifying outliers in asset pricing data with a new weighted forward search estimator

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated Feb 5, 2020
    Cite
    Aronne, Alexandre; Bressan, Aureliano Angel; Grossi, Luigi (2020). Identifying outliers in asset pricing data with a new weighted forward search estimator [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000459853
    Explore at:
    Dataset updated
    Feb 5, 2020
    Authors
    Aronne, Alexandre; Bressan, Aureliano Angel; Grossi, Luigi
    Description

    ABSTRACT The purpose of this work is to present the Weighted Forward Search (FSW) method for the detection of outliers in asset pricing data. This new estimator, which is based on an algorithm that downweights the most anomalous observations of the dataset, is tested using both simulated and empirical asset pricing data. The impact of outliers on the estimation of asset pricing models is assessed under different scenarios, and the results are evaluated with associated statistical tests based on this new approach. Our proposal generates an alternative procedure for robust estimation of portfolio betas, allowing for the comparison between concurrent asset pricing models. The algorithm, which is both efficient and robust to outliers, is used to provide robust estimates of the models' parameters in a comparison with traditional econometric estimation methods usually used in the literature. In particular, the precision of the alphas is highly increased when the Forward Search (FS) method is used. We use Monte Carlo simulations, and also the well-known dataset of equity factor returns provided by Prof. Kenneth French, consisting of the 25 Fama-French portfolios on the United States equity market, using single- and three-factor models on a monthly and an annual basis. Our results indicate that the marginal rejection of the Fama-French three-factor model is influenced by the presence of outliers in the portfolios when using monthly returns. In annual data, the use of robust methods increases the rejection level of null alphas in the Capital Asset Pricing Model (CAPM) and the Fama-French three-factor model, with more efficient estimates in the absence of outliers and consistent alphas when outliers are present.
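    For readers unfamiliar with forward-search estimators, the following is a loose sketch of the plain Forward Search ordering that the FSW method builds on; the paper's weighting scheme is omitted, and the robust-start heuristic and subset sizes here are illustrative assumptions:

```python
import numpy as np

def forward_search_order(X, y, seed=None):
    """Plain Forward Search ordering for OLS: start from a small subset with
    a robust elemental fit, then grow it one observation at a time, always
    keeping the observations with the smallest squared residuals. Units that
    enter last are the most outlying. (The FSW estimator additionally
    downweights anomalous observations; that step is omitted here.)"""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = p + 1
    best, best_med = None, np.inf
    for _ in range(200):                          # crude robust start
        idx = rng.choice(n, size=m, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        med = np.median((y - X @ beta) ** 2)
        if med < best_med:
            best, best_med = idx, med
    subset, entry_order = best, []
    while len(subset) < n:
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        resid2 = (y - X @ beta) ** 2
        subset = np.argsort(resid2)[: len(subset) + 1]
        entry_order.append(int(subset[-1]))       # roughly, the newest entrant
    return entry_order
```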

  3. Data from: Valid Inference Corrected for Outlier Removal

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard "detect-and-forget" approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
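    The "detect-and-forget" practice the article critiques is easy to reproduce. The sketch below (synthetic data, an assumed 3-sigma studentized-residual rule) shows the naive refit whose intervals the paper argues are invalid without a selective-inference correction:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)
y[:5] += 8                                   # plant a few gross outliers

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# "detect": drop observations with |studentized residual| >= 3 ...
keep = np.abs(fit.get_influence().resid_studentized_external) < 3
# ... "forget": refit on the survivors and report naive intervals
refit = sm.OLS(y[keep], X[keep]).fit()
print(refit.conf_int())   # these intervals ignore the selection step
```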

  4. Data from: Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS).
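    As a rough illustration of the centralize-a-small-sample idea (not the paper's actual communication protocol or sampling rule), one might prototype the vertically partitioned setup like this, with all sizes and parameters invented for the example:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# two "sites" hold different feature subsets of the same rows (vertical partition)
site_a = rng.normal(size=(10_000, 3))            # features 0-2 at location A
site_b = rng.normal(size=(10_000, 2))            # features 3-4 at location B

# centralize only a small random sample of row indices from both sites
sample = rng.choice(10_000, size=500, replace=False)
central = np.hstack([site_a[sample], site_b[sample]])

det = OneClassSVM(nu=0.01, gamma="scale").fit(central)

# in the real algorithm the fitted model travels back to the sites; here we
# simply score all rows to show which ones the sample-trained model flags
scores = det.decision_function(np.hstack([site_a, site_b]))
outlier_rows = np.where(scores < 0)[0]
```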

  5. Data from: Outlier detection in cylindrical data based on Mahalanobis distance

    • tandf.figshare.com
    text/x-tex
    Updated Jan 2, 2025
    Cite
    Prashant S. Dhamale; Akanksha S. Kashikar (2025). Outlier detection in cylindrical data based on Mahalanobis distance [Dataset]. http://doi.org/10.6084/m9.figshare.24092089.v1
    Explore at:
    text/x-tex
    Available download formats
    Dataset updated
    Jan 2, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Prashant S. Dhamale; Akanksha S. Kashikar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cylindrical data are bivariate data formed from the combination of circular and linear variables. Identifying outliers is a crucial step in any data analysis work. This paper proposes a new distribution-free procedure to detect outliers in cylindrical data using the Mahalanobis distance concept. The use of Mahalanobis distance incorporates the correlation between the components of the cylindrical distribution, which had not been accounted for in the earlier papers on outlier detection in cylindrical data. The threshold for declaring an observation to be an outlier can be obtained via parametric or non-parametric bootstrap, depending on whether the underlying distribution is known or unknown. The performance of the proposed method is examined via extensive simulations from the Johnson-Wehrly distribution. The proposed method is applied to two real datasets, and the outliers are identified in those datasets.
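    A distribution-free version of this recipe is straightforward to prototype. The sketch below embeds the circular component as (cos θ, sin θ) so that circular-linear correlation enters through the covariance matrix, and sets the outlier threshold by non-parametric bootstrap; the embedding, bootstrap statistic, and quantile level are simplifying assumptions, not the paper's exact construction:

```python
import numpy as np

def mahalanobis_cyl(theta, x):
    """Mahalanobis distances for cylindrical data (theta circular, x linear),
    embedding theta as (cos, sin) so circular-linear correlation enters
    through the sample covariance."""
    Z = np.column_stack([np.cos(theta), np.sin(theta), x])
    D = Z - Z.mean(axis=0)
    VI = np.linalg.inv(np.cov(Z, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", D, VI, D))

def bootstrap_threshold(theta, x, q=0.99, B=1000, seed=None):
    """Non-parametric bootstrap cutoff: resample, track the max distance."""
    rng = np.random.default_rng(seed)
    n = len(theta)
    maxima = [mahalanobis_cyl(theta[i], x[i]).max()
              for i in (rng.integers(0, n, size=n) for _ in range(B))]
    return np.quantile(maxima, q)

# outliers: mahalanobis_cyl(theta, x) > bootstrap_threshold(theta, x)
```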

  6. Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas

    • osti.gov
    • dataverse.harvard.edu
    Updated Jun 2, 2021
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Explore at:
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Office of Science: http://www.er.doe.gov/
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
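    A minimal sketch of the approach described, assuming a small fully connected autoencoder rather than the authors' architecture: train on samples already known to be valid, then classify ambiguous samples by their coordinates in the learned latent space instead of by hand-tuned thresholds.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Small autoencoder: compress valid samples to a few latent features."""
    def __init__(self, d_in, d_lat=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(),
                                 nn.Linear(32, d_lat))
        self.dec = nn.Sequential(nn.Linear(d_lat, 32), nn.ReLU(),
                                 nn.Linear(32, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))

def train_ae(model, valid_x, epochs=200, lr=1e-3):
    """Fit the AE on samples already known to be valid."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(valid_x), valid_x)
        loss.backward()
        opt.step()
    return model

# ambiguous samples are then encoded with model.enc(x) and classified in the
# latent space by any standard classifier (k-NN, SVM, ...), replacing
# hand-tuned threshold rules
```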

  7. Find Outliers GRM

    • hub.arcgis.com
    Updated Aug 8, 2020
    Cite
    Tippecanoe County Assessor Hub Community (2020). Find Outliers GRM [Dataset]. https://hub.arcgis.com/maps/tippecanoehub::find-outliers-grm
    Explore at:
    Dataset updated
    Aug 8, 2020
    Dataset authored and provided by
    Tippecanoe County Assessor Hub Community
    Area covered
    Description

    The following report outlines the workflow used to optimize your Find Outliers result:

    Initial Data Assessment. There were 721 valid input features. GRM properties: Min 0.0000, Max 157.0200, Mean 9.1692, Std. Dev. 8.4220. There were 4 outlier locations; these will not be used to compute the optimal fixed distance band.

    Scale of Analysis. The optimal fixed distance band selected was based on peak clustering found at 1894.5039 Meters.

    Outlier Analysis. Creating the random reference distribution with 499 permutations. There are 248 output features statistically significant based on an FDR correction for multiple testing and spatial dependence. There are 30 statistically significant high outlier features. There are 7 statistically significant low outlier features. There are 202 features part of statistically significant low clusters. There are 9 features part of statistically significant high clusters.

    Output. Pink output features are part of a cluster of high GRM values. Light blue output features are part of a cluster of low GRM values. Red output features represent high outliers within a cluster of low GRM values. Blue output features represent low outliers within a cluster of high GRM values.
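    This report reads like the output of a local spatial autocorrelation (Anselin Local Moran's I) workflow. A rough open-source equivalent using PySAL is sketched below; `coords` and `grm` are assumed input arrays, and the distance band is taken from the report:

```python
import numpy as np
from libpysal.weights import DistanceBand
from esda.moran import Moran_Local
from statsmodels.stats.multitest import multipletests

# coords: (n, 2) array of feature locations; grm: (n,) attribute values
w = DistanceBand(coords, threshold=1894.5039)   # fixed distance band from report
lm = Moran_Local(grm, w, permutations=499)      # random reference distribution

sig = multipletests(lm.p_sim, alpha=0.05, method="fdr_bh")[0]   # FDR correction
# quadrant codes: 1 high-high cluster, 2 low-high outlier,
#                 3 low-low cluster,  4 high-low outlier
high_outliers = np.where(sig & (lm.q == 4))[0]
low_outliers = np.where(sig & (lm.q == 2))[0]
```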

  8. Cost of living(Treat Outliers)

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    Bharat Gokhale (2023). Cost of living(Treat Outliers) [Dataset]. https://www.kaggle.com/datasets/bharatgokhale/cost-of-livingtreat-outliers
    Explore at:
    zip (14244 bytes)
    Available download formats
    Dataset updated
    Jun 7, 2023
    Authors
    Bharat Gokhale
    Description

    Dataset

    This dataset was created by Bharat Gokhale

    Contents

  9. Product Cost Analysis for Out/inler Detection

    • kaggle.com
    zip
    Updated May 10, 2024
    Cite
    Botir (2024). Product Cost Analysis for Out/inler Detection [Dataset]. https://www.kaggle.com/datasets/botir2/product-cost-analysis-for-outinler-detection
    Explore at:
    zip (712974 bytes)
    Available download formats
    Dataset updated
    May 10, 2024
    Authors
    Botir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains product reports from different companies. We need a practical way to detect outliers and inliers in the product-cost data each company reports, which will help identify discrepancies in reported prices. The goal is to find an algorithm that detects outliers and inliers in these datasets effectively (a simple baseline is sketched below the schema).

    1. org_id: A numerical identifier for an organization.
    2. year: The year when the data was recorded.
    3. month: The month when the data was recorded.
    4. product_code: A code that identifies a product.
    5. sub_product_code: A sub-code that further identifies specifics of the product.
    6. value: A numerical value associated with the product, which could represent quantities, monetary value, or another metric depending on the context.
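    As one simple baseline for the task (not part of the dataset itself), Tukey IQR fences computed per product group flag values that sit far outside each group's typical range; column names follow the schema above:

```python
import pandas as pd

def flag_cost_outliers(df, k=1.5):
    """Tukey IQR fences per (product_code, sub_product_code) group."""
    def fences(g):
        q1, q3 = g["value"].quantile([0.25, 0.75])
        iqr = q3 - q1
        return g.assign(outlier=(g["value"] < q1 - k * iqr) |
                                (g["value"] > q3 + k * iqr))
    return (df.groupby(["product_code", "sub_product_code"], group_keys=False)
              .apply(fences))

# reports = pd.read_csv("product_costs.csv")   # columns as in the schema above
# flagged = flag_cost_outliers(reports)
```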

  10. Data from: Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data

    • data-staging.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jul 8, 2011
    Cite
    David Lazarus; Manuel Weinkauf; Patrick Diver (2011). Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data [Dataset]. http://doi.org/10.5061/dryad.2m7b0
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jul 8, 2011
    Authors
    David Lazarus; Manuel Weinkauf; Patrick Diver
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Marine, Global
    Description

    The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compiling species occurrences by time and marking as outliers calibrated fractions of the youngest and oldest occurrence data for each species. A subset of biostratigraphic marker species whose ranges have been previously documented is used to calibrate the fraction of occurrences to mark as outliers. These outlier occurrences are compiled for samples, and profiles of outlier frequency are made from the sections used to compile the data; the profiles can then identify samples and sections with problematic data caused, for example, by taxonomic errors, incorrect age models, or reworking of sediment. These samples/sections can then be targeted for re-study.
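    The trimming step is simple enough to sketch in a few lines. The fractions below are placeholders; in the actual procedure they are calibrated on biostratigraphic marker species with well-documented ranges:

```python
import pandas as pd

def pacman_trim(occ, oldest=0.05, youngest=0.05):
    """Mark the oldest/youngest fractions of each species' occurrences as
    outliers; the fractions here are placeholders for values calibrated on
    marker species. `occ` needs columns: species, age, sample_id."""
    def mark(g):
        lo = g["age"].quantile(youngest)          # youngest-end cutoff
        hi = g["age"].quantile(1 - oldest)        # oldest-end cutoff
        return g.assign(outlier=(g["age"] < lo) | (g["age"] > hi))
    flagged = occ.groupby("species", group_keys=False).apply(mark)
    # outlier frequency per sample profiles problematic samples/sections
    profile = flagged.groupby("sample_id")["outlier"].mean().sort_values()
    return flagged, profile
```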

  11. Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS).

  12. Find Outliers Percent of households with income below the Federal Poverty Level

    • uscssi.hub.arcgis.com
    Updated Dec 5, 2021
    Cite
    Spatial Sciences Institute (2021). Find Outliers Percent of households with income below the Federal Poverty Level [Dataset]. https://uscssi.hub.arcgis.com/maps/USCSSI::find-outliers-percent-of-households-with-income-below-the-federal-poverty-level
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset authored and provided by
    Spatial Sciences Institute
    Area covered
    Description

    The following report outlines the workflow used to optimize your Find Outliers result:

    Initial Data Assessment. There were 1684 valid input features. POVERTY properties: Min 0.0000, Max 91.8000, Mean 18.9902, Std. Dev. 12.7152. There were 22 outlier locations; these will not be used to compute the optimal fixed distance band.

    Scale of Analysis. The optimal fixed distance band was based on the average distance to 30 nearest neighbors: 3709.0000 Meters.

    Outlier Analysis. Creating the random reference distribution with 499 permutations. There are 1155 output features statistically significant based on an FDR correction for multiple testing and spatial dependence. There are 68 statistically significant high outlier features. There are 84 statistically significant low outlier features. There are 557 features part of statistically significant low clusters. There are 446 features part of statistically significant high clusters.

    Output. Pink output features are part of a cluster of high POVERTY values. Light blue output features are part of a cluster of low POVERTY values. Red output features represent high outliers within a cluster of low POVERTY values. Blue output features represent low outliers within a cluster of high POVERTY values.

  13. Data from: specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows

    • datadryad.org
    zip
    Updated Nov 4, 2025
    Cite
    Anthony Basooma; Astrid Schmidt-Kloiber; Sami Domisch; Yusdiel Torres-Cambas; Marija Smederevac-Lalić; Vanessa Bremerich; Martin Tschikof; Paul Meulenbroek; Andrea Funk; Thomas Hein; Florian Borgwardt (2025). specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows [Dataset]. http://doi.org/10.5061/dryad.6m905qgd7
    Explore at:
    zip
    Available download formats
    Dataset updated
    Nov 4, 2025
    Dataset provided by
    Dryad
    Authors
    Anthony Basooma; Astrid Schmidt-Kloiber; Sami Domisch; Yusdiel Torres-Cambas; Marija Smederevac-Lalić; Vanessa Bremerich; Martin Tschikof; Paul Meulenbroek; Andrea Funk; Thomas Hein; Florian Borgwardt
    Time period covered
    Sep 24, 2025
    Description

    specleanr: An R package for automated flagging of environmental outliers in ecological data for modeling workflows

    Dataset DOI: 10.5061/dryad.6m905qgd7

    Description of the data and file structure

    1. The files include species occurrences from the Global Biodiversity Information Facility. Refer to the data links file to access the original data.
    2. Environmental data were retrieved from CHELSA and Hydrography90m. These files included Bio1 to Bio19 for CHELSA and cti, order_strahler, slopecurv_dw_cel, accumulation, spi, sti, and subcatchment from Hydrography90m. The data link file has the URL to connect to the original dataset.
    3. Model outputs were data outputs packaged after model implementation, including modeloutput and modeloutput2.
    4. The sdm function was implemented in the sdm_function file.
    5. The sdmodeling file processed all files.
    6. Species predictions were archived in the species model prediction output.

    Files and variables

    ...

  14. Data from: Anomalous values and missing data in clinical and experimental studies

    • scielo.figshare.com
    jpeg
    Updated Jun 2, 2023
    Cite
    Hélio Amante Miot (2023). Anomalous values and missing data in clinical and experimental studies [Dataset]. http://doi.org/10.6084/m9.figshare.8227163.v1
    Explore at:
    jpeg
    Available download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    SciELO: http://www.scielo.org/
    Authors
    Hélio Amante Miot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract During analysis of scientific research data, it is customary to encounter anomalous values or missing data. Anomalous values can be the result of errors of recording, typing, or measurement by instruments, or may be true outliers. This review discusses concepts, examples and methods for identifying and dealing with such contingencies. In the case of missing data, techniques for imputation of the values are discussed, in order to avoid exclusion of the research subject, if it is not possible to retrieve information from registration forms or to re-address the participant.

  15. Data from: Expected total thyroxine (TT4) concentrations and outlier values in 531,765 cats in the United States (2014-2015)

    • zenodo.org
    • datadryad.org
    Updated May 31, 2022
    Cite
    Maya Lottati; David Bruyette; David Aucoin; Maya Lottati; David Bruyette; David Aucoin (2022). Data from: Expected total thyroxine (TT4) concentrations and outlier values in 531,765 cats in the United States (2014-2015) [Dataset]. http://doi.org/10.5061/dryad.m6f721d
    Explore at:
    Dataset updated
    May 31, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Maya Lottati; David Bruyette; David Aucoin; Maya Lottati; David Bruyette; David Aucoin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Background: Levels exceeding the standard reference interval (RI) for total thyroxine (TT4) concentrations are diagnostic for hyperthyroidism; however, some hyperthyroid cats have TT4 values within the RI. Determining outlier TT4 concentrations should aid practitioners in identification of hyperthyroidism. The objective of this study was to determine the expected distribution of TT4 concentration using a large population of cats (531,765) of unknown health status to identify unexpected (outlier) TT4 concentrations, and to determine whether this concentration changes with age. Methodology/Principal Findings: This study is a population-based, retrospective study evaluating an electronic database of laboratory results to identify unique TT4 measurements between January 2014 and July 2015. An expected distribution of TT4 concentrations was determined using a large population of cats (531,765) of unknown health status, and this in turn was used to identify unexpected (outlier) TT4 concentrations and determine whether this concentration changes with age. All cats between the age of 1 and 9 years (n=141,294) had the same expected distribution of TT4 concentration (0.5-3.5 ug/dL), and cats with a TT4 value >3.5 ug/dL were determined to be unexpected outliers. There was a steep and progressive rise in both the total number and percentage of statistical outliers in the feline population as a function of age. The greatest acceleration in the percentage of outliers occurred between the age of 7 and 14 years, which was up to 4.6 times the rate seen between the age of 3 and 7 years. Conclusions: TT4 concentrations >3.5 ug/dL represent outliers from the expected distribution of TT4 concentration. Furthermore, age has a strong influence on the proportion of cats with outlier TT4 concentrations. These findings suggest that patients with TT4 concentrations >3.5 ug/dL should be more closely evaluated for hyperthyroidism, particularly between the ages of 7 and 14 years. This finding may aid clinicians in earlier identification of hyperthyroidism in at-risk patients.
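    The study's outlier rule reduces to a fixed empirical threshold tracked by age group; a minimal sketch, with the file and column names purely hypothetical:

```python
import pandas as pd

tt4 = pd.read_csv("tt4_results.csv")    # hypothetical file: age_years, tt4_ugdl
tt4["outlier"] = tt4["tt4_ugdl"] > 3.5  # above the expected 0.5-3.5 ug/dL range
pct_by_age = tt4.groupby(tt4["age_years"].astype(int))["outlier"].mean() * 100
print(pct_by_age)  # the study saw this fraction rise fastest from 7 to 14 years
```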

  16. Superstore Sales Analysis

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis/versions/1
    Explore at:
    zip (3009057 bytes)
    Available download formats
    Dataset updated
    Oct 21, 2023
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies (a scripted version of these checks is sketched after this list).
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately.

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
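    Although the project is built in Excel and Power Query, the data quality checks in step 2 and a COGS proxy for step 3 can also be scripted. A minimal pandas sketch, with the file name, sheet name, and column names all assumed for illustration:

```python
import pandas as pd

orders = pd.read_excel("superstore.xlsx", sheet_name="Orders")  # hypothetical

# step 2: data quality checks
print(orders.isna().sum())                # missing values per column
orders = orders.drop_duplicates()

q1, q3 = orders["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[(orders["Sales"] < q1 - 1.5 * iqr) |
                  (orders["Sales"] > q3 + 1.5 * iqr)]

# step 3: one common COGS proxy when no explicit cost columns exist
orders["COGS"] = orders["Sales"] - orders["Profit"]
```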

  17. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    application/gzip
    Available download formats
    Dataset updated
    May 17, 2024
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/). We build MNIST4OD in the following way: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and we sample with uniform probability from the remaining images as outliers, such that their number is equal to 10% of that of the inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors. Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x. The data contains one instance (vector) per line, where the last column represents the outlier label (yes/no) of the data point. The data also contains a column which indicates the original image class (0-9). The statistics of each dataset are listed below (Name | Instances | Dimensions | Outliers in %):

    MNIST_0 | 7594 | 784 | 10
    MNIST_1 | 8665 | 784 | 10
    MNIST_2 | 7689 | 784 | 10
    MNIST_3 | 7856 | 784 | 10
    MNIST_4 | 7507 | 784 | 10
    MNIST_5 | 6945 | 784 | 10
    MNIST_6 | 7564 | 784 | 10
    MNIST_7 | 8023 | 784 | 10
    MNIST_8 | 7508 | 784 | 10
    MNIST_9 | 7654 | 784 | 10
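    The construction recipe above is easy to reproduce. A sketch that rebuilds one split from raw MNIST arrays, assuming `images` with shape (n, 28, 28) and integer `labels`:

```python
import numpy as np

def build_mnist4od(images, labels, inlier_digit=1, out_frac=0.10, seed=None):
    """One MNIST4OD-style split: a digit's images are the inliers; outliers
    are sampled uniformly from the other digits until they reach 10% of the
    inlier count; all images are flattened to 784-vectors."""
    rng = np.random.default_rng(seed)
    inliers = images[labels == inlier_digit]
    rest = np.where(labels != inlier_digit)[0]
    pick = rng.choice(rest, size=int(out_frac * len(inliers)), replace=False)
    X = np.vstack([inliers, images[pick]]).reshape(-1, 28 * 28)
    y = np.r_[np.zeros(len(inliers), dtype=int),   # 0 = inlier
              np.ones(len(pick), dtype=int)]       # 1 = outlier
    return X, y
```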

  18. Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW)

    • technavio.com
    pdf
    Updated Jun 12, 2025
    Cite
    Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Anomaly Detection Market Size 2025-2029

    The anomaly detection market is forecast to grow by USD 4.44 billion at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the market.

    Major Market Trends & Insights

    North America dominated the market and accounted for a 43% growth during the forecast period.
    By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
    By Component - Solution segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 173.26 million
    Market Future Opportunities: USD 4441.70 million
    CAGR from 2024 to 2029: 14.4%
    

    Market Summary

    Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity fraud has created demand for robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage.
    According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.
    


    How is the Anomaly Detection Market Segmented?

    The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment: Cloud, On-premises
    Component: Solution, Services
    End-user: BFSI, IT and telecom, Retail and e-commerce, Manufacturing, Others
    Technology: Big data analytics, AI and ML, Data mining and business intelligence
    Geography: North America (US, Canada, Mexico), Europe (France, Germany, Spain, UK), APAC (China, India, Japan), Rest of World (ROW)

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period.

    The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.

    This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered at flexible payment options like monthly subscriptions and pay-as-you-go models. Companies like Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.

    Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.

  19. Distributed Anomaly Detection Using Satellite Data From Multiple Modalities

    • data.nasa.gov
    • datasets.ai
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Distributed Anomaly Detection Using Satellite Data From Multiple Modalities [Dataset]. https://data.nasa.gov/dataset/distributed-anomaly-detection-using-satellite-data-from-multiple-modalities-c7516
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.

  20. DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES

    • data.nasa.gov
    • gimi9.com
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES [Dataset]. https://data.nasa.gov/dataset/distributed-anomaly-detection-using-satellite-data-from-multiple-modalities
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    Description

    DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES. Kanishka Bhaduri, Kamalika Das, and Petr Votava. Abstract: There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
