62 datasets found

Data from: Error and anomaly detection for intra-participant time-series...
tandf.figshare.com
xlsx
Updated Jun 1, 2023
Cite
David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.5189002
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
David R. Mullineaux; Gareth Irwin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.
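Stage 1 of the method described above (per-time-point screening with the median absolute deviation) can be sketched as follows. This is a minimal illustration, not the authors' supplied Matlab code: the function name `flag_cycles_mad` and the plain multiplier `scale` are assumptions made here, whereas the paper scales its thresholds by significance levels of the t-statistic. The stage 2 moving-window standard deviation is not shown.

```python
import math
from statistics import median

def flag_cycles_mad(cycles, scale=3.0):
    """Flag cycles that leave the median +/- scale * MAD band at any time point.

    cycles: list of equal-length lists, one time-normalised cycle per row.
    scale:  hypothetical multiplier (the paper uses t-statistic based scaling).
    """
    flagged = [False] * len(cycles)
    for column in zip(*cycles):  # one time point across all cycles
        centre = median(column)
        mad = median(abs(v - centre) for v in column)
        for i, v in enumerate(column):
            if abs(v - centre) > scale * mad:
                flagged[i] = True
    return flagged

# Synthetic data: 38 sine-shaped cycles with small, evenly spread offsets.
n_cycles, n_points = 38, 101
cycles = [
    [math.sin(2 * math.pi * j / (n_points - 1)) + 0.2 * (i / (n_cycles - 1) - 0.5)
     for j in range(n_points)]
    for i in range(n_cycles)
]
# Inject a short spike into cycle 5.
for j in range(40, 45):
    cycles[5][j] += 1.0

flags = flag_cycles_mad(cycles)
print([i for i, f in enumerate(flags) if f])  # [5]
```

As the description notes, any threshold should be tuned per data set, and flagged cycles should be inspected before deciding whether to remove them.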
Data from: Outlier classification using autoencoders: application for...
osti.gov
dataverse.harvard.edu
Updated Jun 2, 2021
Cite
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
Unique identifier
https://doi.org/10.7910/DVN/SKEHRJ
Dataset updated
Jun 2, 2021
Dataset provided by
Office of Science: http://www.er.doe.gov/
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
Description
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Data from: Valid Inference Corrected for Outlier Removal
figshare.com
pdf
Updated May 30, 2023
Cite
Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.9762731.v1
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Shuxiao Chen; Jacob Bien
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.
Find Outliers GRM
hub.arcgis.com
Updated Aug 8, 2020
Cite
Tippecanoe County Assessor Hub Community (2020). Find Outliers GRM [Dataset]. https://hub.arcgis.com/maps/tippecanoehub::find-outliers-grm
Dataset updated
Aug 8, 2020
Dataset authored and provided by
Tippecanoe County Assessor Hub Community
Description
The following report outlines the workflow used to optimize your Find Outliers result:

Initial Data Assessment
There were 721 valid input features.
GRM Properties:
Min: 0.0000
Max: 157.0200
Mean: 9.1692
Std. Dev.: 8.4220
There were 4 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis
The optimal fixed distance band selected was based on peak clustering found at 1894.5039 Meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 248 output features statistically significant based on a FDR correction for multiple testing and spatial dependence.
There are 30 statistically significant high outlier features.
There are 7 statistically significant low outlier features.
There are 202 features part of statistically significant low clusters.
There are 9 features part of statistically significant high clusters.

Output
Pink output features are part of a cluster of high GRM values.
Light Blue output features are part of a cluster of low GRM values.
Red output features represent high outliers within a cluster of low GRM values.
Blue output features represent low outliers within a cluster of high GRM values.
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
s.cnmilf.com
Updated Apr 11, 2025
Cite
Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
Identifying outliers in asset pricing data with a new weighted forward...
scielo.figshare.com
datasetcatalog.nlm.nih.gov
jpeg
Updated May 30, 2023
Cite
Alexandre Aronne; Luigi Grossi; Aureliano Angel Bressan (2023). Identifying outliers in asset pricing data with a new weighted forward search estimator [Dataset]. http://doi.org/10.6084/m9.figshare.11804652.v1
Available download formats: jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.11804652.v1
Dataset updated
May 30, 2023
Dataset provided by
SciELO journals
Authors
Alexandre Aronne; Luigi Grossi; Aureliano Angel Bressan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT The purpose of this work is to present the Weighted Forward Search (FSW) method for the detection of outliers in asset pricing data. This new estimator, which is based on an algorithm that downweights the most anomalous observations of the dataset, is tested using both simulated and empirical asset pricing data. The impact of outliers on the estimation of asset pricing models is assessed under different scenarios, and the results are evaluated with associated statistical tests based on this new approach. Our proposal generates an alternative procedure for robust estimation of portfolio betas, allowing for the comparison between concurrent asset pricing models. The algorithm, which is both efficient and robust to outliers, is used to provide robust estimates of the models’ parameters in a comparison with traditional econometric estimation methods usually used in the literature. In particular, the precision of the alphas is highly increased when the Forward Search (FS) method is used. We use Monte Carlo simulations, and also the well-known dataset of equity factor returns provided by Prof. Kenneth French, consisting of the 25 Fama-French portfolios on the United States of America equity market using single and three-factor models, on monthly and annual basis. Our results indicate that the marginal rejection of the Fama-French three-factor model is influenced by the presence of outliers in the portfolios, when using monthly returns. In annual data, the use of robust methods increases the rejection level of null alphas in the Capital Asset Pricing Model (CAPM) and the Fama-French three-factor model, with more efficient estimates in the absence of outliers and consistent alphas when outliers are present.
Data from: A Diagnostic Procedure for Detecting Outliers in Linear...
tandf.figshare.com
figshare.com
txt
Updated Feb 9, 2024
Cite
Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow (2024). A Diagnostic Procedure for Detecting Outliers in Linear State–Space Models [Dataset]. http://doi.org/10.6084/m9.figshare.12162075.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12162075.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francis
Authors
Dongjun You; Michael Hunter; Meng Chen; Sy-Miin Chow
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Outliers can be more problematic in longitudinal data than in independent observations due to the correlated nature of such data. It is common practice to discard outliers as they are typically regarded as a nuisance or an aberration in the data. However, outliers can also convey meaningful information concerning potential model misspecification, and ways to modify and improve the model. Moreover, outliers that occur among the latent variables (innovative outliers) have distinct characteristics compared to those impacting the observed variables (additive outliers), and are best evaluated with different test statistics and detection procedures. We demonstrate and evaluate the performance of an outlier detection approach for multi-subject state-space models in a Monte Carlo simulation study, with corresponding adaptations to improve power and reduce false detection rates. Furthermore, we demonstrate the empirical utility of the proposed approach using data from an ecological momentary assessment study of emotion regulation together with an open-source software implementation of the procedures.
Outliers and similarity in APOGEE - Dataset - B2FIND
b2find.eudat.eu
Updated Nov 2, 2017
Cite
(2017). Outliers and similarity in APOGEE - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b624b506-541b-5a09-b615-14b8e202c468
Dataset updated
Nov 2, 2017
Description
In this work we apply and expand on a recently introduced outlier detection algorithm that is based on an unsupervised random forest. We use the algorithm to calculate a similarity measure for stellar spectra from the Apache Point Observatory Galactic Evolution Experiment (APOGEE). We show that the similarity measure traces non-trivial physical properties and contains information about complex structures in the data. We use it for visualization and clustering of the dataset, and discuss its ability to find groups of highly similar objects, including spectroscopic twins. Using the similarity matrix to search the dataset for objects allows us to find objects that are impossible to find using their best fitting model parameters. This includes extreme objects for which the models fail, and rare objects that are outside the scope of the model. We use the similarity measure to detect outliers in the dataset, and find a number of previously unknown Be-type stars, spectroscopic binaries, carbon rich stars, young stars, and a few that we cannot interpret. Our work further demonstrates the potential for scientific discovery when combining machine learning methods with modern survey data. Cone search capability for table J/MNRAS/476/2117/apogeenn (Nearest neighbors APOGEE IDs)
outlier detection algorithm for SDSS galaxies - Dataset - B2FIND
b2find.eudat.eu
Updated Dec 28, 2016
Cite
(2016). outlier detection algorithm for SDSS galaxies - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/53c648e9-7853-564c-95c8-21ebdd18ad16
Dataset updated
Dec 28, 2016
Description
How can we discover objects we did not know existed within the large data sets that now abound in astronomy? We present an outlier detection algorithm that we developed, based on an unsupervised Random Forest. We test the algorithm on more than two million galaxy spectra from the Sloan Digital Sky Survey and examine the 400 galaxies with the highest outlier score. We find objects which have extreme emission line ratios and abnormally strong absorption lines, objects with unusual continua, including extremely reddened galaxies. We find galaxy-galaxy gravitational lenses, double-peaked emission line galaxies and close galaxy pairs. We find galaxies with high ionization lines, galaxies that host supernovae and galaxies with unusual gas kinematics. Only a fraction of the outliers we find were reported by previous studies that used specific and tailored algorithms to find a single class of unusual objects. Our algorithm is general and detects all of these classes, and many more, regardless of what makes them peculiar. It can be executed on imaging, time series and other spectroscopic data, operates well with thousands of features, is not sensitive to missing values and is easily parallelizable.
Effect sizes calculated using MD and MC, excluding outliers
dro.deakin.edu.au
researchdata.edu.au
txt
Updated Nov 7, 2024
Cite
Don Driscoll (2024). Effect sizes calculated using MD and MC, excluding outliers [Dataset]. http://doi.org/10.26187/deakin.26264351.v1
Available download formats: txt
Unique identifier
https://doi.org/10.26187/deakin.26264351.v1
Dataset updated
Nov 7, 2024
Dataset provided by
Deakin University: http://www.deakin.edu.au/
Authors
Don Driscoll
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Effect sizes were calculated using mean difference for burnt-unburnt study designs and mean change for before-after designs. Outliers, as defined in the methods section of the paper, were excluded prior to calculating effect sizes.
DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...
frontiersin.figshare.com
pdf
Updated Oct 13, 2023
Cite
Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fphys.2023.1233341.s001
Dataset updated
Oct 13, 2023
Dataset provided by
Frontiers
Authors
Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
The eleven outliers identified in the Lau Archipelago dataset.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated May 24, 2017
Cite
Chen, Chii-Shiarng; Mayfield, Anderson B.; Dempsey, Alexandra C. (2017). The eleven outliers identified in the Lau Archipelago dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001772374
Dataset updated
May 24, 2017
Authors
Chen, Chii-Shiarng; Mayfield, Anderson B.; Dempsey, Alexandra C.
Description
Gene expression data have been presented as non-normalized (2^-Ct × 10^9) in all but the last two rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that the typical range of expression of each gene can be more easily assessed by interested individuals. The sample number fraction following the island name represents the number of outliers over the total number of samples for which a Mahalanobis distance could be calculated (rather than the number of samples analyzed from that site). Values representing aberrant levels for a particular response variable (i.e., that contributed to the heat map score) have been highlighted in bold. When there was a statistically significant difference (Student’s t-test, p<0.05) between the outlier and non-outlier averages for a parameter (instead using normalized gene expression data), the lower of the two values has been underlined. No outliers were detected amongst the colonies sampled from Tuvuca (n = 8 samples analyzed in full) and Cicia (n = 8 samples analyzed in full). Fulaga sample 54 was also determined to be an outlier after imputation of missing data (discussed in the main text), though it is not featured in this table. In the “Color” column, the values are as follows: 1 = normal, 2 = pale, 3 = very pale, and 4 = bleached. PAR = photosynthetically active radiation. SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.
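The back-calculation of raw Ct values mentioned above can be illustrated with a short sketch. It assumes the non-normalized expression value is 2 raised to the power of minus Ct, multiplied by 10^9 (the superscripts appear to have been lost in extraction); the function names and the `SCALE` constant are hypothetical and chosen here only for illustration.

```python
import math

SCALE = 1e9  # assumed scaling factor (10^9) applied to 2^-Ct

def expression_from_ct(ct):
    """Non-normalized expression value for a given threshold cycle."""
    return 2.0 ** (-ct) * SCALE

def ct_from_expression(value):
    """Invert the transform to recover the raw threshold cycle."""
    return -math.log2(value / SCALE)

ct = 25.0
value = expression_from_ct(ct)
recovered = ct_from_expression(value)
print(round(recovered, 6))  # 25.0
```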
Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...
technavio.com
pdf
Updated Jun 12, 2025
Cite
Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
Available download formats: pdf
Dataset updated
Jun 12, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Area covered
Canada, Germany, United Kingdom, Mexico, United States
Description

Anomaly Detection Market Size 2025-2029

The anomaly detection market size is valued to increase by USD 4.44 billion, at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the anomaly detection market.

Major Market Trends & Insights

North America dominated the market and accounted for a 43% growth during the forecast period.
By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
By Component - Solution segment accounted for the largest market revenue share in 2023

Market Size & Forecast

Market Opportunities: USD 173.26 million
Market Future Opportunities: USD 4441.70 million
CAGR from 2024 to 2029: 14.4%

Market Summary

Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity frauds necessitates the need for robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage. According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.

What will be the Size of the Anomaly Detection Market during the forecast period?


How is the Anomaly Detection Market Segmented?

The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Deployment: Cloud; On-premises
Component: Solution; Services
End-user: BFSI; IT and telecom; Retail and e-commerce; Manufacturing; Others
Technology: Big data analytics; AI and ML; Data mining and business intelligence
Geography: North America (US, Canada, Mexico); Europe (France, Germany, Spain, UK); APAC (China, India, Japan); Rest of World (ROW)

By Deployment Insights

The cloud segment is estimated to witness significant growth during the forecast period.

The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.

This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered at flexible payment options like monthly subscriptions and pay-as-you-go models. Companies like Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.

Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.
Density-based outlier scoring on Kepler data - Dataset - B2FIND
b2find.eudat.eu
Updated Apr 23, 2024
Cite
(2024). Density-based outlier scoring on Kepler data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/049456b7-7080-5ff0-a5ff-bbb6180c4120
Dataset updated
Apr 23, 2024
Description
In the present era of large-scale surveys, big data present new challenges to the discovery process for anomalous data. Such data can be indicative of systematic errors, extreme (or rare) forms of known phenomena, or most interestingly, truly novel phenomena that exhibit as-of-yet unobserved behaviours. In this work, we present an outlier scoring methodology to identify and characterize the most promising unusual sources to facilitate discoveries of such anomalous data. We have developed a data mining method based on k-nearest neighbour distance in feature space to efficiently identify the most anomalous light curves. We test variations of this method including using principal components of the feature space, removing select features, the effect of the choice of k, and scoring to subset samples. We evaluate the performance of our scoring on known object classes and find that our scoring consistently scores rare (<1000) object classes higher than common classes. We have applied scoring to all long cadence light curves of Quarters 1-17 of Kepler's prime mission and present outlier scores for all 2.8 million light curves for the roughly 200k objects.
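The idea of scoring by k-nearest-neighbour distance in feature space can be sketched in a few lines. This is a brute-force illustration under assumptions made here: the score is taken as the distance to the k-th nearest neighbour, the function name `knn_outlier_scores` is hypothetical, and the toy two-dimensional features stand in for the light-curve features used in the study, which are not specified in this description.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour.

    Larger scores indicate points lying in sparser regions of feature space.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy feature vectors: a tight group of four points and one isolated point.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(features, k=2)
most_anomalous = max(range(len(features)), key=scores.__getitem__)
print(most_anomalous)  # 4
```

A brute-force search like this scales quadratically with the number of objects, so a data set of millions of light curves would need a spatial index or an approximate neighbour search instead.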
Data from: Drivers of contemporary and future changes in Arctic seasonal...
data.niaid.nih.gov
search.dataone.org
zip
Updated Dec 30, 2023
Cite
Yijing Liu; Peiyan Wang; Bo Elberling; Andreas Westergaard-Nielsen (2023). Drivers of contemporary and future changes in Arctic seasonal transition dates for a tundra site in coastal Greenland [Dataset]. http://doi.org/10.5061/dryad.jsxksn0hp
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.jsxksn0hp
Dataset updated
Dec 30, 2023
Dataset provided by
University of Copenhagen
Institute of Geographic Sciences and Natural Resources Research
Authors
Yijing Liu; Peiyan Wang; Bo Elberling; Andreas Westergaard-Nielsen
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Arctic, Greenland
Description
Climate change has had a significant impact on the seasonal transition dates of Arctic tundra ecosystems, causing diverse variations between distinct land surface classes. However, the combined effect of multiple controls as well as their individual effects on these dates remains unclear at various scales and across diverse land surface classes. Here we quantified spatiotemporal variations of three seasonal transition dates (start of spring, maximum Normalized Difference Vegetation Index (NDVImax) day, end of fall) for five dominant land surface classes in the ice-free Greenland and analyzed their drivers for current and future climate scenarios, respectively.
Methods
To quantify the seasonal transition dates, we used NDVI derived from Sentinel-2 MultiSpectral Instrument (Level-1C) images during 2016–2020 based on Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2). We performed an atmospheric correction (Yin et al., 2019) on the images before calculating NDVI. The months from May to October were set as the study period each year. The quality control process includes 3 steps: (i) the cloud was masked according to the QA60 band; (ii) images were removed if the number of pixels with NDVI values outside the range of -1–1 exceeds 30% of the total pixels while extracting the median value of each date; (iii) NDVI outliers resulting from cloud mask errors (Coluzzi et al., 2018) and sporadic snow were deleted pixel by pixel. NDVI outliers mentioned here appear as a sudden drop to almost zero in the growing season and do not form a sequence in this study (Komisarenko et al., 2022). To identify outliers, we iterated through every two consecutive NDVI values in the time series and calculated the difference between the second and first values for each pixel every year.
We defined anomalous NDVI differences as points outside the [10, 90] percentile thresholds. If an anomalous NDVI difference is positive, the first NDVI value used to calculate the difference is the outlier; otherwise, the second one is the outlier. Finally, 215 images were used to reflect seasonal transition dates in all 5 study periods of 2016–2020 after the quality control. Each image was resampled to 32 m spatial resolution to match the resolution of the ArcticDEM data and SnowModel outputs. To detect seasonal transition dates, we used a double sigmoid model to fit the NDVI time series; the points where the curvature of the fitted curve changes most rapidly appear at the beginning, middle, and end of each season (Klosterman et al., 2014). The applicability of this phenology method in the Arctic has been demonstrated (Ma et al., 2022; Westergaard-Nielsen et al., 2013; Westergaard-Nielsen et al., 2017). We focused on 3 seasonal transition dates, i.e., SOS, NDVImax day, and EOF. The NDVI values for some pixels are still below zero in spring and summer due to topographical shadow. We therefore set a quality control rule before calculating seasonal transition dates for each pixel: if the number of days with positive NDVI values from June to September is less than 60% of the total number of observed days, the pixel is not considered in subsequent calculations. As verification of the fitted dates, the seasonal transition dates in dry heaths and corresponding time-lapse photos acquired from the snow fence area are shown in Fig. 2. On the SOS, snow cover extent is greatly reduced and vegetation is exposed with lower NDVI values. All visible vegetation is green on the NDVImax day. On EOF, snow partly covers the ground, and NDVI decreases to a value close to zero.
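The consecutive-difference rule described above can be sketched for a single pixel's one-year NDVI series as follows. This is a minimal illustration, not the authors' code: the function name `flag_ndvi_outliers`, its parameters, and the sample series are hypothetical, and the percentiles are computed with NumPy's default linear interpolation, which the description does not specify.

```python
import numpy as np

def flag_ndvi_outliers(ndvi, lower_pct=10.0, upper_pct=90.0):
    """Flag NDVI outliers in one pixel's series (hypothetical helper).

    A difference between consecutive values is anomalous when it falls
    outside the [lower_pct, upper_pct] percentiles of all differences.
    """
    ndvi = np.asarray(ndvi, dtype=float)
    diffs = np.diff(ndvi)  # second value minus first value, pairwise
    low, high = np.percentile(diffs, [lower_pct, upper_pct])
    flags = np.zeros(ndvi.shape, dtype=bool)
    for i, d in enumerate(diffs):
        if d < low or d > high:
            if d > 0:
                flags[i] = True      # sharp rise: the first value is the outlier
            else:
                flags[i + 1] = True  # sharp drop: the second value is the outlier
    return flags

# Made-up series with a sudden drop to almost zero at index 3.
series = [0.30, 0.35, 0.40, 0.02, 0.45, 0.50, 0.52, 0.50, 0.45, 0.40]
flags = flag_ndvi_outliers(series)
```

Because the thresholds are percentiles of the differences themselves, some differences always fall outside them, so in practice the flagged values would still need to be checked against the "sudden drop to almost zero" pattern mentioned in the description.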
Data from: Outlier classification using autoencoders: application for...
osti.gov
Updated Jun 2, 2021
Cite
Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B. (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1882649-outlier-classification-using-autoencoders-application-fluctuation-driven-flows-fusion-plasmas
Dataset updated
Jun 2, 2021
Dataset provided by
United States Department of Energy (http://energy.gov/)
Office of Science (http://www.er.doe.gov/)
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
Authors
Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B.
Description
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Sample of 45 H{alpha}EW outliers - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 23, 2023
Cite
(2023). Sample of 45 H{alpha}EW outliers - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/7782063a-207c-571b-bad5-80eedba236cf
Dataset updated
Oct 23, 2023
Description
In this work, we calibrate the relationship between H{alpha} emission and M-dwarf ages. We compile a sample of 892 M-dwarfs with H{alpha} equivalent width (H{alpha}EW) measurements from the literature that are either comoving with a white dwarf of known age (21 stars) or in a known young association (871 stars). In this sample we identify 7 M-dwarfs that are new candidate members of known associations. By dividing the stars into active and inactive categories according to their H{alpha}EW and spectral type (SpT), we find that the fraction of active dwarfs decreases with increasing age, and the form of the decline depends on SpT. Using the compiled sample of age calibrators, we find that H{alpha} EW and fractional H{alpha} luminosity (L_H{alpha}/L_bol) decrease with increasing age. H{alpha}EW for SpT<~M7 decreases gradually up until ~1Gyr. For older ages, we found only two early M dwarfs that are both inactive and seem to continue the gradual decrease. We also found 14 mid-type M-dwarfs, out of which 11 are inactive and present a significant decrease in H{alpha}EW, suggesting that the magnetic activity decreases rapidly after ~1Gyr. We fit L_H{alpha}/L_bol versus age with a broken power law and find an index of -0.11_-0.01_^+0.02^ for ages >1Gyr) leaves this part of the relation far less constrained. Finally, from repeated independent measurements for the same stars, we find that 94% of them have a level of H{alpha}EW variability <~5{AA} at young ages (<1Gyr).
Integrated Building Health Management
catalog.data.gov
Updated Apr 10, 2025
Cite
Dashlink (2025). Integrated Building Health Management [Dataset]. https://catalog.data.gov/dataset/integrated-building-health-management
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building's system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building's health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls. Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The analysis covered July 1st through August 10th, 2009. Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca has the ability to use a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, whereas days with lower overall correlations were considered more anomalous. 
IMS, on the other hand, needs a normal set of data to build a model, which can be applied to a set of test data to assess how anomalous that data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm. Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system. For 7/03/09 (Figures 2 and 3): AHU-1 Return Air Temperature (RAT) and Calculated Average Return Air Temperature. For 7/19/09 (Figures 3 and 4): AHU-2 Return Air Temperature (RAT) and Calculated Average Return Air Temperature. IMS identified significantly higher temperatures compared to other days during the months of July and August. Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data and produce a real-time anomaly score to help building maintenance with detection and diagnosis of problems.
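The two post-processing steps in the methodology (comparing daily score profiles by correlation, then ranking days by their mean score) can be sketched as below. Orca and IMS themselves are not implemented here; the sketch assumes their per-sample anomaly scores are already available as a days-by-samples array, and the function names and sample numbers are hypothetical.

```python
import numpy as np

def mean_correlation_with_other_days(score_profiles):
    """Average Pearson correlation of each day's score profile with all other days.

    score_profiles: array of shape (n_days, n_samples), assumed precomputed.
    Lower values suggest a day whose profile is less typical.
    """
    profiles = np.asarray(score_profiles, dtype=float)
    corr = np.corrcoef(profiles)                     # (n_days, n_days), rows are days
    off_diagonal_sum = corr.sum(axis=1) - np.diag(corr)
    return off_diagonal_sum / (corr.shape[0] - 1)

def rank_days_by_summary_score(score_profiles):
    """Rank days from most to least anomalous by the mean of their score profile."""
    summary = np.asarray(score_profiles, dtype=float).mean(axis=1)
    order = np.argsort(summary)[::-1]                # highest mean score first
    return order, summary

# Made-up profiles: days 0-2 share a rising shape, day 3 runs the other way.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, 3.0, 5.0],
    [4.0, 3.0, 2.0, 1.0],
]
typicality = mean_correlation_with_other_days(profiles)
order, summary = rank_days_by_summary_score(profiles)
```

In the workflow described above, the first function would be applied to Orca profiles to pick typical days for training, and the second to IMS profiles of the remaining days; how "high" and "low" correlation were thresholded is not stated in the description.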
Water quality test data
scidb.cn
Updated Oct 26, 2022
Cite
HuiyunFeng; JingangJiang (2022). Water quality test data [Dataset]. http://doi.org/10.57760/sciencedb.05375
Unique identifier
https://doi.org/10.57760/sciencedb.05375
Dataset updated
Oct 26, 2022
Dataset provided by
Science Data Bank
Authors
HuiyunFeng; JingangJiang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Outliers are often present in large datasets of water quality monitoring time series data. A method of combining the sliding window technique with Dixon detection criterion for the automatic detection of outliers in time series data is limited by the empirical determination of sliding window sizes. The scientific determination of the optimal sliding window size is very meaningful research work. This paper presents a new Monte Carlo Search Method (MCSM) based on random sampling to optimize the size of the sliding window, which fully takes advantage of computers and statistics. The MCSM was applied in a case study to automatic monitoring data of water quality factors in order to test its validity and usefulness. The results of comparing the accuracy and efficiency of the MCSM show that the new method in this paper is scientific and effective. The experimental results show that, at different sample sizes, the average accuracy is between 58.70% and 75.75%, and the average computation time increase is between 17.09% and 45.53%. In the era of big data in environmental monitoring, the proposed new methods can meet the required accuracy of outlier detection and improve the efficiency of calculation.
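The baseline technique mentioned above, a sliding window combined with the Dixon criterion, can be sketched as follows. This is only an illustration of that baseline and does not implement the Monte Carlo Search Method. The function name, the window size of 5, and the critical value 0.710 (a commonly tabulated 95% value for n = 5, to be verified against a published table) are assumptions, as are the sample readings.

```python
def dixon_q_outliers(values, window, q_crit):
    """Flag indices whose value fails Dixon's Q test in any sliding window.

    q_crit is the critical Q value for the chosen window size and confidence
    level; it is supplied by the caller and is an assumption of this sketch.
    """
    flagged = set()
    for start in range(len(values) - window + 1):
        # Sort the window's values while remembering their original indices.
        chunk = sorted((v, start + k) for k, v in enumerate(values[start:start + window]))
        spread = chunk[-1][0] - chunk[0][0]
        if spread == 0:
            continue  # all values identical, nothing to test
        q_low = (chunk[1][0] - chunk[0][0]) / spread     # gap below the smallest value's neighbour
        q_high = (chunk[-1][0] - chunk[-2][0]) / spread  # gap above the largest value's neighbour
        if q_low > q_crit:
            flagged.add(chunk[0][1])
        if q_high > q_crit:
            flagged.add(chunk[-1][1])
    return sorted(flagged)

# Made-up readings with one spike at index 4.
readings = [7.1, 7.2, 7.0, 7.1, 9.8, 7.2, 7.1, 7.0, 7.2, 7.1, 7.3]
suspect_indices = dixon_q_outliers(readings, window=5, q_crit=0.710)
```

The result depends directly on `window`, which is the sensitivity the description says motivates searching for an optimal window size.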
Data from: Expected total thyroxine (TT4) concentrations and outlier values...
zenodo.org
Updated May 31, 2022
Cite
Maya Lottati; David Bruyette; David Aucoin (2022). Data from: Expected total thyroxine (TT4) concentrations and outlier values in 531,765 cats in the United States (2014-2015) [Dataset]. http://doi.org/10.5061/dryad.m6f721d
Unique identifier
https://doi.org/10.5061/dryad.m6f721d
Dataset updated
May 31, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Maya Lottati; David Bruyette; David Aucoin
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
United States
Description
Background: Levels exceeding the standard reference interval (RI) for total thyroxine (TT4) concentrations are diagnostic for hyperthyroidism; however, some hyperthyroid cats have TT4 values within the RI. Determining outlier TT4 concentrations should aid practitioners in identification of hyperthyroidism. The objective of this study was to determine the expected distribution of TT4 concentration using a large population of cats (531,765) of unknown health status to identify unexpected TT4 concentrations (outliers), and determine whether this concentration changes with age. Methodology/Principal Findings: This study is a population-based, retrospective study evaluating an electronic database of laboratory results to identify unique TT4 measurements between January 2014 and July 2015. An expected distribution of TT4 concentrations was determined using a large population of cats (531,765) of unknown health status, and this in turn was used to identify unexpected TT4 concentrations (outliers) and determine whether this concentration changes with age. All cats between the age of 1 and 9 years (n=141,294) had the same expected distribution of TT4 concentration (0.5-3.5ug/dL), and cats with a TT4 value >3.5ug/dL were determined to be unexpected outliers. There was a steep and progressive rise in both the total number and percentage of statistical outliers in the feline population as a function of age. The greatest acceleration in the percentage of outliers occurred between the age of 7 and 14 years, which was up to 4.6 times the rate seen between the age of 3 and 7 years. Conclusions: TT4 concentrations >3.5ug/dL represent outliers from the expected distribution of TT4 concentration. Furthermore, age has a strong influence on the proportion of cats with outlier values. These findings suggest that patients with TT4 concentrations >3.5ug/dL should be more closely evaluated for hyperthyroidism, particularly between the ages of 7 and 14 years. 
This finding may aid clinicians in earlier identification of hyperthyroidism in at-risk patients.

