44 datasets found

f
Data from: A Diagnostic Procedure for Detecting Outliers in Linear...
tandf.figshare.com
figshare.com
txt
Updated Feb 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow (2024). A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models [Dataset]. http://doi.org/10.6084/m9.figshare.12162075.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12162075.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francis
Authors
Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.

Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

technavio.com

Updated Oct 11, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Technavio (2022). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis

Explore at:

Dataset updated

Oct 11, 2022

Dataset provided by

TechNavio

Authors

Technavio

Time period covered

2021 - 2025

Area covered

Germany, Canada, United States, Mexico, Global

Description

Snapshot img

Anomaly Detection Market Size 2025-2029

The anomaly detection market size is forecast to increase by USD 4.44 billion at a CAGR of 14.4% between 2024 and 2029.

The market is experiencing significant growth, particularly in the BFSI sector, as organizations increasingly prioritize identifying and addressing unusual patterns or deviations from normal business operations. The rising incidence of internal threats and cyber frauds necessitates the implementation of advanced anomaly detection tools to mitigate potential risks and maintain security. However, implementing these solutions comes with challenges, primarily infrastructural requirements. Ensuring compatibility with existing systems, integrating new technologies, and training staff to effectively utilize these tools pose significant hurdles for organizations.
Despite these challenges, the potential benefits of anomaly detection, such as improved risk management, enhanced operational efficiency, and increased security, make it an essential investment for businesses seeking to stay competitive and agile in today's complex and evolving threat landscape. Companies looking to capitalize on this market opportunity must carefully consider these challenges and develop strategies to address them effectively. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.

What will be the Size of the Anomaly Detection Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free Sample

In the dynamic and evolving market, advanced technologies such as resource allocation, linear regression, pattern recognition, and support vector machines are increasingly being adopted for automated decision making. Businesses are leveraging these techniques to enhance customer experience through behavioral analytics, object detection, and sentiment analysis. Machine learning algorithms, including random forests, naive Bayes, decision trees, clustering algorithms, and k-nearest neighbors, are essential tools for risk management and compliance monitoring. AI-powered analytics, time series forecasting, and predictive modeling are revolutionizing business intelligence, while process optimization is achieved through the application of decision support systems, natural language processing, and predictive analytics.
Computer vision, image recognition, logistic regression, and operational efficiency are key areas where principal component analysis and artificial neural networks contribute significantly. Speech recognition and operational efficiency are also benefiting from these advanced technologies, enabling businesses to streamline processes and improve overall performance.

How is this Anomaly Detection Industry segmented?

The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Deployment

  Cloud
  On-premises


Component

  Solution
  Services


End-user

  BFSI
  IT and telecom
  Retail and e-commerce
  Manufacturing
  Others


Technology

  Big data analytics
  AI and ML
  Data mining and business intelligence


Geography

  North America

    US
    Canada
    Mexico


  Europe

    France
    Germany
    Spain
    UK


  APAC

    China
    India
    Japan


  Rest of World (ROW)

By Deployment Insights

The cloud segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing adoption of advanced technologies such as machine learning models, statistical methods, and real-time monitoring. These technologies enable the identification of anomalous behavior in real-time, thereby enhancing network security and data privacy. Anomaly detection algorithms, including unsupervised learning, reinforcement learning, and deep learning networks, are used to identify outliers and intrusions in large datasets. Data security is a major concern, leading to the adoption of data masking, data pseudonymization, data de-identification, and differential privacy.

Data leakage prevention and incident response are critical components of an effective anomaly detection system. False positive and false negative rates are essential metrics to evaluate the performance of these systems. Time series analysis and concept drift are important techniques used in anomaly detection. Data obfuscation, data suppression, and data aggregation are other strategies employed to maintain data privacy. Companies such as Anodot, Cisco Systems Inc, IBM Corp, and SAS Institute Inc offer both cloud-based and on-premises anomaly detection solutions. These solutions use v

Additional file 2 of Outlier identification and monitoring of institutional...
springernature.figshare.com
txt
Updated Jun 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Menelaos Pavlou; Gareth Ambler; Rumana Z. Omar; Andrew T. Goodwin; Uday Trivedi; Peter Ludman; Mark de Belder (2023). Additional file 2 of Outlier identification and monitoring of institutional or clinician performance: an overview of statistical methods and application to national audit data [Dataset]. http://doi.org/10.6084/m9.figshare.22612465.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22612465.v1
Dataset updated
Jun 21, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Menelaos Pavlou; Gareth Ambler; Rumana Z. Omar; Andrew T. Goodwin; Uday Trivedi; Peter Ludman; Mark de Belder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 2.
f
Data from: Nonparametric Anomaly Detection on Time Series of Graphs
tandf.figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dorcas Ofori-Boateng; Yulia R. Gel; Ivor Cribben (2023). Nonparametric Anomaly Detection on Time Series of Graphs [Dataset]. http://doi.org/10.6084/m9.figshare.13180181.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.13180181.v3
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Dorcas Ofori-Boateng; Yulia R. Gel; Ivor Cribben
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identifying change points and/or anomalies in dynamic network structures has become increasingly popular across various domains, from neuroscience to telecommunication to finance. One particular objective of anomaly detection from a neuroscience perspective is the reconstruction of the dynamic manner of brain region interactions. However, most statistical methods for detecting anomalies have the following unrealistic limitation for brain studies and beyond: that is, network snapshots at different time points are assumed to be independent. To circumvent this limitation, we propose a distribution-free framework for anomaly detection in dynamic networks. First, we present each network snapshot of the data as a linear object and find its respective univariate characterization via local and global network topological summaries. Second, we adopt a change point detection method for (weakly) dependent time series based on efficient scores, and enhance the finite sample properties of change point method by approximating the asymptotic distribution of the test statistic using the sieve bootstrap. We apply our method to simulated and to real data, particularly, two functional magnetic resonance imaging (fMRI) datasets and the Enron communication graph. We find that our new method delivers impressively accurate and realistic results in terms of identifying locations of true change points compared to the results reported by competing approaches. The new method promises to offer a deeper insight into the large-scale characterizations and functional dynamics of the brain and, more generally, into the intrinsic structure of complex dynamic networks. Supplemental materials for this article are available online.
d
Data from: Theoretically Optimal Distributed Anomaly Detection
datasets.ai
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
33
Updated Sep 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Aeronautics and Space Administration (2024). Theoretically Optimal Distributed Anomaly Detection [Dataset]. https://datasets.ai/datasets/theoretically-optimal-distributed-anomaly-detection
Explore at:
33Available download formats
Dataset updated
Sep 14, 2024
Dataset authored and provided by
National Aeronautics and Space Administration
Description
A novel general framework for distributed anomaly detection with theoretical performance guarantees is proposed. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. Under a Gaussian assumption, our distributed algorithm is guaranteed to perform as well as its centralized counterpart, a condition we call Ôzero information lossÕ. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach.
d
Solving a prisoner's dilemma in distributed anomaly detection
catalog.data.gov
datasets.ai
+4more
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Solving a prisoner's dilemma in distributed anomaly detection [Dataset]. https://catalog.data.gov/dataset/solving-a-prisoners-dilemma-in-distributed-anomaly-detection
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data providing participants' incentive to cooperate.This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.
Housing Price Analysis and Prediction
kaggle.com
Updated Feb 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ali Reda Elblgihy (2024). Housing Price Analysis and Prediction [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/housing-price-analysis-and-prediction/suggestions
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 3, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ali Reda Elblgihy
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Steps Throughout the Full Project:

1- Initial Data Exploration: Introduction to the dataset and its variables. Identification of potential relationships between variables. Examination of data quality issues such as missing values and outliers.

2- Correlation Analysis: Utilization of correlation matrices and heatmaps to identify relationships between variables. Focus on variables highly correlated with the target variable, 'SalePrice'.

3- Handling Missing Data: Analysis of missing data prevalence and patterns. Deletion of variables with high percentages of missing data. Treatment of missing observations for remaining variables based on their importance.

4- Dealing with Outliers: Identification and handling of outliers using data visualization and statistical methods. Removal of outliers that significantly deviate from the overall pattern.

5- Testing Statistical Assumptions: Assessment of normality, homoscedasticity, linearity, and absence of correlated errors. Application of data transformations to meet statistical assumptions.

6- Conversion of Categorical Variables: Conversion of categorical variables into dummy variables to prepare for modeling.

Summary: The project undertook a comprehensive analysis of housing price data, encompassing data exploration, correlation analysis, missing data handling, outlier detection, and testing of statistical assumptions. Through visualization and statistical methods, the project identified key relationships between variables and prepared the data for predictive modeling.

Recommendations: Further exploration of advanced modeling techniques such as regularized linear regression and ensemble methods for predicting housing prices. Consideration of additional variables or feature engineering to improve model performance. Evaluation of model performance using cross-validation and other validation techniques. Documentation and communication of findings and recommendations for stakeholders or further research.
t
Anomaly Detection Solution Global Market Report 2025
tbrctest.tbrc.info
pdf,excel,csv,ppt
Updated Jan 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Business Research Company (2025). Anomaly Detection Solution Global Market Report 2025 [Dataset]. http://tbrctest.tbrc.info/report/anomaly-detection-solution-global-market-report
Explore at:
pdf,excel,csv,pptAvailable download formats
Dataset updated
Jan 15, 2025
Dataset authored and provided by
The Business Research Company
License
https://www.thebusinessresearchcompany.com/privacy-policyhttps://www.thebusinessresearchcompany.com/privacy-policy
Description
Global Anomaly Detection Solution market size is expected to reach $18 billion by 2029 at 17.4%, segmented as by statistical anomaly detection, time series analysis, control chart methods, z-score analysis
Synthetic Datasets for DINAMO: Dynamic and INterpretable Anomaly MOnitoring...
zenodo.org
tar
Updated Jun 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arsenii Gavrikov; Arsenii Gavrikov; Julián García Pardiñas; Julián García Pardiñas; Alberto Garfagnini; Alberto Garfagnini (2025). Synthetic Datasets for DINAMO: Dynamic and INterpretable Anomaly MOnitoring for Large-Scale Particle Physics Experiments [Dataset]. http://doi.org/10.5281/zenodo.15610342
Explore at:
tarAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15610342
Dataset updated
Jun 6, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Arsenii Gavrikov; Arsenii Gavrikov; Julián García Pardiñas; Julián García Pardiñas; Alberto Garfagnini; Alberto Garfagnini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This archive contains 1000 synthetic datasets for benchmarking the DINAMO framework (https://arxiv.org/abs/2501.19237), an automated anomaly detection solution featuring both a generalized EWMA-based statistical method and a transformer encoder-based ML approach for Data Quality Monitoring (DQM) in particle physics experiments.

The datasets overview:

Size: 1000 datasets in .npz format, each containing 5000 runs with one-dimensional Gaussian-based histograms

Labels: each run is labeled as "good" (4500 runs) or "bad" (500 runs)

Features: datasets mimic particle physics DQM data with emphasis on dynamic operational conditions:

Gradual operational drifts (sinusoidal evolution)

Abrupt hardware/software changes

Varying event statistics and Poisson uncertainties

Systematic detector uncertainties

Bad runs contain additional distortions and dead histogram bins

These datasets enable systematic evaluation of anomaly detection algorithms in time-dependent settings for the DQM problem. More details can be found in the paper and in the GitHub repository at https://github.com/ArseniiGav/DINAMO/
f
Data from: Predictive Control Charts (PCC): A Bayesian approach in online...
tandf.figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Konstantinos Bourazas; Dimitrios Kiagias; Panagiotis Tsiamyrtzis (2023). Predictive Control Charts (PCC): A Bayesian approach in online monitoring of short runs [Dataset]. http://doi.org/10.6084/m9.figshare.14588607.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14588607.v1
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Konstantinos Bourazas; Dimitrios Kiagias; Panagiotis Tsiamyrtzis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performing online monitoring for short horizon data is a challenging, though cost effective benefit. Self-starting methods attempt to address this issue adopting a hybrid scheme that executes calibration and monitoring simultaneously. In this work, we propose a Bayesian alternative that will utilize prior information and possible historical data (via power priors), offering a head-start in online monitoring, putting emphasis on outlier detection. For cases of complete prior ignorance, the objective Bayesian version will be provided. Charting will be based on the predictive distribution and the methodological framework will be derived in a general way, to facilitate discrete and continuous data from any distribution that belongs to the regular exponential family (with Normal, Poisson and Binomial being the most representative). Being in the Bayesian arena, we will be able to not only perform process monitoring, but also draw online inference regarding the unknown process parameter(s). An extended simulation study will evaluate the proposed methodology against frequentist based competitors and it will cover topics regarding prior sensitivity and model misspecification robustness. A continuous and a discrete real data set will illustrate its use in practice. Technical details, algorithms, guidelines on prior elicitation and R-codes are provided in appendices and supplementary material. Short production runs and online phase I monitoring are among the best candidates to benefit from the developed methodology.
n
Data from: Composite measures of selection can improve the signal-to-noise...
data.niaid.nih.gov
datadryad.org
zip
Updated Mar 15, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katie E. Lotterhos; Daren C. Card; Sara M. Schaal; Liuyang Wang; Caitlin Collins; Bob Verity (2018). Composite measures of selection can improve the signal-to-noise ratio in genome scans [Dataset]. http://doi.org/10.5061/dryad.bp11m
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.bp11m
Dataset updated
Mar 15, 2018
Dataset provided by
Imperial College London
The University of Texas at Arlington
Duke University
Northeastern University
Authors
Katie E. Lotterhos; Daren C. Card; Sara M. Schaal; Liuyang Wang; Caitlin Collins; Bob Verity
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The growing wealth of genomic data is yielding new insights into the genetic basis of adaptation, but it also presents the challenge of extracting the relevant signal from multi-dimensional datasets. Different statistical approaches vary in their power to detect selection depending on the demographic history, type of selection, genetic architecture and experimental design. Here, we develop and evaluate new approaches for combining results from multiple tests, including multivariate distance measures and methods for combining P-values. We evaluate these methods on (i) simulated landscape genetic data analysed for differentiation outliers and genetic-environment associations and (ii) empirical genomic data analysed for selective sweeps within dog breeds for loci known to be selected for during domestication. We also introduce and evaluate how robust statistical algorithms can be used for parameter estimation in statistical genomics. On the simulated data, many of the composite measures performed well and had decreased variation in outcomes across many sampling designs. On the empirical dataset, methods based on combining P-values generally performed better with clearer signals of selection, higher significance of the signal, and in closer proximity to the known selected locus. Although robust algorithms could identify neutral loci in our simulations, they did not universally improve power to detect selection. Overall, a composite statistic that measured a robust multivariate distance from rank-based P-values performed the best. We found that composite measures of selection could improve the signal of selection in many cases, but they were not a panacea and their power is limited by the power of the univariate statistics they summarize. Since genome scans are widely used, improving inference for prioritizing candidate genes may be beneficial to medicine, agriculture, and breeding. Our results also have application to outlier detection in high-dimensional datasets and to combining results in meta-analyses in many disciplines. The compound measures we evaluate are implemented in the r package minotaur.
d
Data from: Mean and variance of phylogenetic trees
search.dataone.org
datadryad.org
Updated Apr 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel G. Brown; Megan Owen (2025). Mean and variance of phylogenetic trees [Dataset]. http://doi.org/10.5061/dryad.h5f4117
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.h5f4117
Dataset updated
Apr 10, 2025
Dataset provided by
Dryad Digital Repository
Authors
Daniel G. Brown; Megan Owen
Time period covered
Jan 1, 2019
Description
We describe the use of the FrÃ©chet mean and variance in the Billera-Holmes-Vogtmann (BHV) treespace to summarize and explore the diversity of a set of phylogenetic trees. We show that the FrÃ©chet mean is comparable to other summary methods, and, despite its stickiness property, is more likely to be binary than the majority-rules consensus tree. We show that the FrÃ©chet variance is faster and more precise than commonly used variance measures. The FrÃ©chet mean and variance are more theoretically justified, and more robust, than previous estimates of this type, and can be estimated reasonably efficiently, providing a foundation for building more advanced statistical methods and leading to applications such as mean hypothesis testing and outlier detection.
f
Data from: Functional Outlier Detection for Density-Valued Data with...
tandf.figshare.com
txt
Updated Feb 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xinyi Lei; Zhicheng Chen; Hui Li (2024). Functional Outlier Detection for Density-Valued Data with Application to Robustify Distribution-to-Distribution Regression [Dataset]. http://doi.org/10.6084/m9.figshare.21926087.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21926087.v1
Dataset updated
Feb 26, 2024
Dataset provided by
Taylor & Francis
Authors
Xinyi Lei; Zhicheng Chen; Hui Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Distributional data analysis, concerned with the statistical analysis of data objects consisting of random probability distributions in the framework of functional data analysis (FDA), has received considerable interest in recent years and is increasingly applied in various fields including engineering. Outlier detection and robustness are of great practical interest; however, these aspects remain unexplored for distributional data. To this end, this study focuses on density-valued outlier detection and its application in robust distributional regression. Specifically, we propose a transformation-based approach for single-dataset outlying density detection with an emphasis on converting the less detectable shape outliers to easily detectable magnitude outliers. We also propose a distributional regression-based approach for detecting the abnormal associations of the density-valued two-tuples associated with two datasets. Then, the proposed outlier detection methods are applied to robustify a distribution-to-distribution regression method used in engineering, and we develop a robust estimator for the regression operator by downweighting the detected outliers. The proposed methods are validated and evaluated via extensive simulation studies. The relevant results reveal the superiority of our method over other competitors in distributional outlier detection. A case study in structural health monitoring demonstrates the great potential of our proposal in engineering applications. Supplementary materials for this article are available online.
f
Identifying outliers in asset pricing data with a new weighted forward...
scielo.figshare.com
jpeg
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexandre Aronne; Luigi Grossi; Aureliano Angel Bressan (2023). Identifying outliers in asset pricing data with a new weighted forward search estimator [Dataset]. http://doi.org/10.6084/m9.figshare.11804652.v1
Explore at:
jpegAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.11804652.v1
Dataset updated
May 30, 2023
Dataset provided by
SciELO journals
Authors
Alexandre Aronne; Luigi Grossi; Aureliano Angel Bressan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT The purpose of this work is to present the Weighted Forward Search (FSW) method for the detection of outliers in asset pricing data. This new estimator, which is based on an algorithm that downweights the most anomalous observations of the dataset, is tested using both simulated and empirical asset pricing data. The impact of outliers on the estimation of asset pricing models is assessed under different scenarios, and the results are evaluated with associated statistical tests based on this new approach. Our proposal generates an alternative procedure for robust estimation of portfolio betas, allowing for the comparison between concurrent asset pricing models. The algorithm, which is both efficient and robust to outliers, is used to provide robust estimates of the models’ parameters in a comparison with traditional econometric estimation methods usually used in the literature. In particular, the precision of the alphas is highly increased when the Forward Search (FS) method is used. We use Monte Carlo simulations, and also the well-known dataset of equity factor returns provided by Prof. Kenneth French, consisting of the 25 Fama-French portfolios on the United States of America equity market using single and three-factor models, on monthly and annual basis. Our results indicate that the marginal rejection of the Fama-French three-factor model is influenced by the presence of outliers in the portfolios, when using monthly returns. In annual data, the use of robust methods increases the rejection level of null alphas in the Capital Asset Pricing Model (CAPM) and the Fama-French three-factor model, with more efficient estimates in the absence of outliers and consistent alphas when outliers are present.
f
DataSheet1_Sequential Detection of Microgrid Bad Data via a Data-Driven...
figshare.com
pdf
Updated Jun 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heming Huang; Fei Liu; Tinghui Ouyang; Xiaoming Zha (2023). DataSheet1_Sequential Detection of Microgrid Bad Data via a Data-Driven Approach Combining Online Machine Learning With Statistical Analysis.pdf [Dataset]. http://doi.org/10.3389/fenrg.2022.861563.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fenrg.2022.861563.s001
Dataset updated
Jun 8, 2023
Dataset provided by
Frontiers
Authors
Heming Huang; Fei Liu; Tinghui Ouyang; Xiaoming Zha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Bad data is required to be detected and removed from the microgrid data stream because it misleads the decision-making of the Energy Management Systems (EMS) and puts the microgrid at risk of instability. In this paper, the authors propose a sequential detection method that combines three data mining algorithms, that is the Online Sequential Extreme Learning Machine (OSELM), statistical analysis within a sliding time window, and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN). After sequential data training, OSELM is used to construct an online updated error-filtering map to extract the electrical feature of the microgrid data sequence. Meanwhile, the statistical features, i.e. the surge of the variance and the corresponding correlation coefficients under a sliding time window are first proposed as another two complementary feature dimensions. The three-dimensional features are finally analyzed by DBSCAN to discriminate the bad data. The detection performance of this approach is verified by the data sequence collected from a four-terminal ring-shaped DC microgrid prototype. Compared with bad data detection using a single electrical feature or only statistical features, this approach shows the best performance. Moreover, it can be further applied to the online detection of microgrid bad data in the future.
f
Data from: A Cluster-Based Outlier Detection Scheme for Multivariate Data
tandf.figshare.com
pdf
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
J. Marcus Jobe; Michael Pokojovy (2023). A Cluster-Based Outlier Detection Scheme for Multivariate Data [Dataset]. http://doi.org/10.6084/m9.figshare.1241577.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1241577.v2
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
J. Marcus Jobe; Michael Pokojovy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Detection power of the squared Mahalanobis distance statistic is significantly reduced when several outliers exist within a multivariate dataset of interest. To overcome this masking effect, we propose a computer-intensive cluster-based approach that incorporates a reweighted version of Rousseeuw’s minimum covariance determinant method with a multi-step cluster-based algorithm that initially filters out potential masking points. Compared to the most robust procedures, simulation studies show that our new method is better for outlier detection. Additional real data comparisons are given. Supplementary materials for this article are available online.
f
Data from: Evaluation of Two Different Methods of Data Quality Analysis...
scielo.figshare.com
jpeg
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Renzo Angelo Viloche Morales; Carlos Eduardo Salles de Araujo (2023). Evaluation of Two Different Methods of Data Quality Analysis Using Local Daily Precipitation Measurements at Santa Catarina State [Dataset]. http://doi.org/10.6084/m9.figshare.8227127.v1
Explore at:
jpegAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.8227127.v1
Dataset updated
Jun 4, 2023
Dataset provided by
SciELO journals
Authors
Renzo Angelo Viloche Morales; Carlos Eduardo Salles de Araujo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
State of Santa Catarina
Description
Abstract Daily rainfall data from a meteorological network in Southern Brazil is used to assess the performance of two different outliers detection algorithms. Both methods use a statistical and spatial consistency approach based in distance and elevation difference between two rain gauge measurements. A variation of the Multiple Interval Gamma Distribution method of You, Hubbard, Nadarajah e Kunkel (2007) is considered in this study. Neighboring stations data is gathered to get the local average rainfall distribution. The precipitation range of values is partitioned so one makes the assumption that every interval can be modeled by a Gamma distribution. The second method assumes no prior distribution characteristic, and instead uses point spatial and cumulated temporal information from neighboring rain gauge stations to consist daily rainfall data. In order to assess the reliability of the detected outliers, as well the accuracy, seeded errors are introduced in the historical rainfall series. A two dimensional probability model of introduced/detected error (yes-no) is used to compute metrics related to the correct detection and false alarm probabilities made by the algorithm. We verify that the new proposed method overcomes the Multiple Interval Gamma Distribution method.
Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...
technavio.com
Updated Feb 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
Explore at:
Dataset updated
Feb 15, 2025
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
Global, United States
Description
Snapshot img

Data Science Platform Market Size 2025-2029

The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.

The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.

What will be the Size of the Data Science Platform Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free SampleThe market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with APIs integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection. Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.

How is this Data Science Platform Industry segmented?

The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments. DeploymentOn-premisesCloudComponentPlatformServicesEnd-userBFSIRetail and e-commerceManufacturingMedia and entertainmentOthersSectorLarge enterprisesSMEsApplicationData PreparationData VisualizationMachine LearningPredictive AnalyticsData GovernanceOthersGeographyNorth AmericaUSCanadaEuropeFranceGermanyUKMiddle East and AfricaUAEAPACChinaIndiaJapanSouth AmericaBrazilRest of World (ROW)

By Deployment Insights

The on-premises segment is estimated to witness significant growth during the forecast period.In the dynamic the market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sen
Data from: Multi-Source Distributed System Data for AI-powered Analytics
zenodo.org
explore.openaire.eu
+1more
zip
Updated Nov 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3549604
Dataset updated
Nov 10, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract:

In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers took the challenge of exploring one specific type of observability data sources, such as application logs, metrics, and distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

General Information:

This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.

You may find details of this dataset from the original paper:

Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

If you use the data, implementation, or any details of the paper, please cite!

BIBTEX:

_

@inproceedings{nedelkoski2020multi, title={Multi-source Distributed System Data for AI-Powered Analytics}, author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej}, booktitle={European Conference on Service-Oriented and Cloud Computing}, pages={161--176}, year={2020}, organization={Springer} }

_

The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

The raw logs in both datasets contain the same files. If the user wants the logs filetered by time with respect to the two datasets, should refer to the timestamps at the metrics (they provide the time window). In addition, we suggest to use the provided aggregated time ranged logs for both datasets in CSV format.

Important: The logs and the metrics are synchronized with respect time and they are both recorded on CEST (central european standard time). The traces are on UTC (Coordinated Universal Time -2 hours). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.

Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
Simulated cut-off points of the DMCEs statistic (α = 1.5, β = 1.5,ω = 0.5).
plos.figshare.com
xls
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adzhar Rambli; Ali H. M. Abuzaid; Ibrahim Bin Mohamed; Abdul Ghapor Hussin (2023). Simulated cut-off points of the DMCEs statistic (α = 1.5, β = 1.5,ω = 0.5). [Dataset]. http://doi.org/10.1371/journal.pone.0153074.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0153074.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Adzhar Rambli; Ali H. M. Abuzaid; Ibrahim Bin Mohamed; Abdul Ghapor Hussin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Simulated cut-off points of the DMCEs statistic (α = 1.5, β = 1.5,ω = 0.5).

Facebook

Twitter

Click to copy link

Link copied

Cite

Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow (2024). A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models [Dataset]. http://doi.org/10.6084/m9.figshare.12162075.v1

Data from: A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.12162075.v1

Dataset updated

Feb 9, 2024

Dataset provided by

Taylor & Francis

Authors

Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.

Clear search

Close search

Google apps

Main menu

Data from: A Diagnostic Procedure for Detecting Outliers in Linear...

Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

Snapshot img

Additional file 2 of Outlier identification and monitoring of institutional...

Data from: Nonparametric Anomaly Detection on Time Series of Graphs

Data from: Theoretically Optimal Distributed Anomaly Detection

Solving a prisoner's dilemma in distributed anomaly detection

Housing Price Analysis and Prediction

Anomaly Detection Solution Global Market Report 2025

Synthetic Datasets for DINAMO: Dynamic and INterpretable Anomaly MOnitoring...

Data from: Predictive Control Charts (PCC): A Bayesian approach in online...

Data from: Composite measures of selection can improve the signal-to-noise...

Data from: Mean and variance of phylogenetic trees

Data from: Functional Outlier Detection for Density-Valued Data with...

Identifying outliers in asset pricing data with a new weighted forward...

DataSheet1_Sequential Detection of Microgrid Bad Data via a Data-Driven...

Data from: A Cluster-Based Outlier Detection Scheme for Multivariate Data

Data from: Evaluation of Two Different Methods of Data Quality Analysis...

Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

Snapshot img

Data from: Multi-Source Distributed System Data for AI-powered Analytics

Simulated cut-off points of the DMCEs statistic (α = 1.5, β = 1.5,ω = 0.5).

Data from: A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models