Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The results represent experiments on four datasets based on 20 simulated experiments and the class size of 220. The proposed method (NewAlgo) produces the best overall results.
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower as when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Anomaly Detection Market Size 2025-2029
The anomaly detection market size is forecast to increase by USD 4.44 billion at a CAGR of 14.4% between 2024 and 2029.
The market is experiencing significant growth, particularly in the BFSI sector, as organizations increasingly prioritize identifying and addressing unusual patterns or deviations from normal business operations. The rising incidence of internal threats and cyber frauds necessitates the implementation of advanced anomaly detection tools to mitigate potential risks and maintain security. However, implementing these solutions comes with challenges, primarily infrastructural requirements. Ensuring compatibility with existing systems, integrating new technologies, and training staff to effectively utilize these tools pose significant hurdles for organizations.
Despite these challenges, the potential benefits of anomaly detection, such as improved risk management, enhanced operational efficiency, and increased security, make it an essential investment for businesses seeking to stay competitive and agile in today's complex and evolving threat landscape. Companies looking to capitalize on this market opportunity must carefully consider these challenges and develop strategies to address them effectively. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.
What will be the Size of the Anomaly Detection Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free Sample
In the dynamic and evolving market, advanced technologies such as resource allocation, linear regression, pattern recognition, and support vector machines are increasingly being adopted for automated decision making. Businesses are leveraging these techniques to enhance customer experience through behavioral analytics, object detection, and sentiment analysis. Machine learning algorithms, including random forests, naive Bayes, decision trees, clustering algorithms, and k-nearest neighbors, are essential tools for risk management and compliance monitoring. AI-powered analytics, time series forecasting, and predictive modeling are revolutionizing business intelligence, while process optimization is achieved through the application of decision support systems, natural language processing, and predictive analytics.
Computer vision, image recognition, logistic regression, and operational efficiency are key areas where principal component analysis and artificial neural networks contribute significantly. Speech recognition and operational efficiency are also benefiting from these advanced technologies, enabling businesses to streamline processes and improve overall performance.
How is this Anomaly Detection Industry segmented?
The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment
Cloud
On-premises
Component
Solution
Services
End-user
BFSI
IT and telecom
Retail and e-commerce
Manufacturing
Others
Technology
Big data analytics
AI and ML
Data mining and business intelligence
Geography
North America
US
Canada
Mexico
Europe
France
Germany
Spain
UK
APAC
China
India
Japan
Rest of World (ROW)
By Deployment Insights
The cloud segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing adoption of advanced technologies such as machine learning models, statistical methods, and real-time monitoring. These technologies enable the identification of anomalous behavior in real-time, thereby enhancing network security and data privacy. Anomaly detection algorithms, including unsupervised learning, reinforcement learning, and deep learning networks, are used to identify outliers and intrusions in large datasets. Data security is a major concern, leading to the adoption of data masking, data pseudonymization, data de-identification, and differential privacy.
Data leakage prevention and incident response are critical components of an effective anomaly detection system. False positive and false negative rates are essential metrics to evaluate the performance of these systems. Time series analysis and concept drift are important techniques used in anomaly detection. Data obfuscation, data suppression, and data aggregation are other strategies employed to maintain data privacy. Companies such as Anodot, Cisco Systems Inc, IBM Corp, and SAS Institute Inc offer both cloud-based and on-premises anomaly detection solutions. These solutions use v
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objects are numbered. The Y-variable are boiling points. Other features are structural features of molecules. In the outlier column the outliers are assigned with a value of 1.
The data is derived from a published chemical dataset on boiling point measurements [1] and from public data [2]. Features were generated by means of the RDKit Python library [3]. The dataset was infused with known outliers (~5%) based on significant structural differences, i.e. polar and non-polar molecules.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
Data for Figure 3.30 from Chapter 3 of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6). Figure 3.30 shows observed and CMIP6 simulated AMOC mean state, variability and long-term trends. --------------------------------------------------- How to cite this dataset --------------------------------------------------- When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates: Eyring, V., N.P. Gillett, K.M. Achuta Rao, R. Barimalala, M. Barreiro Parrillo, N. Bellouin, C. Cassou, P.J. Durack, Y. Kosaka, S. McGregor, S. Min, O. Morgenstern, and Y. Sun, 2021: Human Influence on the Climate System. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 423–552, doi:10.1017/9781009157896.005. --------------------------------------------------- Figure subpanels --------------------------------------------------- The figure has 6 subpanels with data provided for all panels in subdirectories named panel_a, panel_b, panel_c, panel_d, panel_e and panel_f. --------------------------------------------------- List of data provided --------------------------------------------------- This dataset contains: - AMOC streamfunction profiles from CMIP5 (1860-2004) and CMIP6 (1860-2014) historical simulations - AMOC mean maximum overturning depth from CMIP5 (1860-2004) and CMIP6 (1860-2014) historical simulations - AMOC mean maximum overturning depth from RAPID observational dataset (2004-2018) - AMOC mean maximum overturning streamfunction from CMIP5 (1860-2004) and CMIP6 (1860-2014) historical simulations - AMOC mean maximum overturning streamfunction from RAPID observational dataset (2004-2018) - AMOC 8-year trends from CMIP5 and CMIP6 simulations and RAPID observations (2004-2012) - Interannual AMOC changes from CMIP5 and CMIP6 simulations and RAPID observations (2008-2010) - Longterm AMOC trends (1850-2014) from CMIP6 simulations - Longterm AMOC trends (1940-1985) from CMIP6 simulations - Longterm AMOC trends (1985-2014) from CMIP6 simulations --------------------------------------------------- Data provided in relation to figure --------------------------------------------------- - panel_a/amoc_mean_state_boxes.csv has the data for the grey observations lines and blue and red boxes with whiskers - panel_a/amoc_profiles_shadings.csv has data for the blue and red profile shadings. - panel_a/amoc_profile_cmip5.csv has data for the blue profile - panel_a/amoc_profile_cmip6.csv has data for the red profile - panel_b/amoc_trends_2004_2012.csv has data for boxes and whiskers and outlier dots - panel_b/amoc_trends_cmip5_cmip6_additional_outliers.csv has data for additional outlier dots for CMIP5 and CMIP6 - panel_c/interannual_variability_AMOC.csv has data for boxes and whiskers and outlier dots - panel_c/interannual_variability_AMOC_cmip5_cmip6_additional_outliers.csv has data for additional outlier dots for CMIP5 and CMIP6 - panel_d/amoc_longtern_trend_1850_2014.csv has data for grey, green, blue and orange boxes and whiskers - panel_e/amoc_longtern_trend_1940_1985.csv has data for grey, green, blue and orange boxes and whiskers - panel_f/amoc_longtern_trend_1985_2014.csv has data for grey, green, blue and orange boxes and whiskers CMIP6 is the sixth phase of the Coupled Model Intercomparison Project. AMOC is the Atlantic Meridional Overturning Circulation. --------------------------------------------------- Sources of additional information --------------------------------------------------- The following weblinks are provided in the Related Documents section of this catalogue record: - Link to the report component containing the figure (Chapter 3) - Link to the Supplementary Material for Chapter 3, which contains details on the input data used in Table 3.SM.1 - Link to the code for the figure, archived on Zenodo - Link to the figure on the IPCC AR6 website
QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two groups examples with outlier-to-data ratios of 0.5 (a) and 2.0 (b) are chosen to test the influence of different initial outlier ratios (from 0 to 0.9) on the registration results. The error means and the standard deviations of 100 examples in each group of our method are compared with CPD.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Selection was α = 600 and Nm = 10. Mean localization is given in distance (kb) from the real selective position.
This paper presents a sliding window constrained fault-tolerant filtering method for sampling data in petrochemical instrumentation. The method requires the design of an appropriate sliding window width based on the time series, as well as the expansion of both ends of the series. By utilizing a sliding window constraint function, the method produces a smoothed estimate for the current moment within the window. As the window advances, a series of smoothed estimates of the original sampled data is generated. Subsequently, the original series is subtracted from this smoothed estimate to create a new series that represents the differences between the two. This difference series is then subjected to an additional smoothing estimation process, and the resulting smoothed estimates are employed to compensate for the smoothed estimates of original sampled series. The experimental results indicate that, compared with sliding mean filtering, sliding median filtering, and Savitzky-Golay filtering,..., , , # Sliding window constrained fault-tolerant filtering of compressor vibration data
https://doi.org/10.5061/dryad.pc866t20z
Data type
Files containing ‘fdata1case1’ in the file represents the case "1" of the location of the outlier in the measured data "1", and so on;
Files containing ‘fwavedata’ in the file name are wave signals with outliers;
Files containing ‘fwave2data’ in the file name are polynomial signals with outliers;
Files containing ‘normaldata’ in the file name are normal measured data;
Files containing ‘normalwavedata’ in the file name are normal wave signals;
Files containing ‘normalwave2data’ in the file name are normal polynomial signals;
Files containing ‘ftffiltered’ in the file name indicate that the data have been processed by sliding-window constrained error-tolerant filtering;
Files containing ‘sgfiltered’ in the file name indicate data after Savitzky-Golay filtering...
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Here are six files that provide details for all 44,120 identified single nucleotide polymorphisms (SNPs) or the 215 outlier SNPs associated with the evolution of rapid character displacement among replicate islands with (2Spp) and without competition (1Spp) between two Anolis species. On 2Spp islands, A. carolinensis occurs higher in trees and have evolved larger toe pads. Among 1Spp and 2Spp island populations, we identify 44,120 SNPs, with 215-outlier SNPs with improbably large FST values, low nucleotide variation, greater linkage than expected, and these SNPs are enriched for animal walking behavior. Thus, we conclude that these 215-outliers are evolving by natural selection in response to the phenotypic convergent evolution of character displacement. There are two, non-mutually exclusive perspective of these nucleotide variants. One is character displacement is convergent: all 215 outlier SNPs are shared among 3 out of 5 2Spp island and 24% of outlier SNPS are shared among all five out of five 2Spp island. Second, character displacement is genetically redundant because the allele frequencies in one or more 2Spp are similar to 1Spp islands: among one or more 2Spp islands 33% of outlier SNPS are within the range of 1Spp MiAF and 76% of outliers are more similar to 1Spp island than mean MiAF of 2Spp islands. Focusing on convergence SNP is scientifically more robust, yet it distracts from the perspective of multiple genetic solutions that enhances the rate and stability of adaptive change. The six files include: a description of eight islands, details of 94 individuals, and four files on SNPs. The four SNP files include the VCF files for 94 individuals with 44KSNPs and two files (Excel sheet/tab-delimited file) with FST, p-values and outlier status for all 44,120 identified single nucleotide polymorphisms (SNPs) associated with the evolution of rapid character displacement. The sixth file is a detailed file on the 215 outlier SNPs. Complete sequence data is available at Bioproject PRJNA833453, which including samples not included in this study. The 94 individuals used in this study are described in “Supplemental_Sample_description.txt” Methods Anoles and genomic DNA: Tissue or DNA for 160 Anolis carolinensis and 20 A. sagrei samples were provided by the Museum of Comparative Zoology at Harvard University (Table S2). Samples were previously used to examine evolution of character displacement in native A. carolinensis following invasion by A. sagrei onto man-made spoil islands in Mosquito Lagoon Florida (Stuart et al. 2014). One hundred samples were genomic DNAs, and 80 samples were tissues (terminal tail clip, Table S2). Genomic DNA was isolated from 80 of 160 A. carolinensis individuals (MCZ, Table S2) using a custom SPRI magnetic bead protocol (Psifidi et al. 2015). Briefly, after removing ethanol, tissues were placed in 200 ul of GH buffer (25 mM Tris- HCl pH 7.5, 25 mM EDTA, , 2M GuHCl Guanidine hydrochloride, G3272 SIGMA, 5 mM CaCl2, 0.5% v/v Triton X-100, 1% N-Lauroyl-Sarcosine) with 5% per volume of 20 mg/ml proteinase K (10 ul/200 ul GH) and digested at 55º C for at least 2 hours. After proteinase K digestion, 100 ul of 0.1% carboxyl-modified Sera-Mag Magnetic beads (Fisher Scientific) resuspended in 2.5 M NaCl, 20% PEG were added and allowed to bind the DNA. Beads were subsequently magnetized and washed twice with 200 ul 70% EtOH, and then DNA was eluted in 100 ul 0.1x TE (10 mM Tris, 0.1 mM EDTA). All DNA samples were gel electrophoresed to ensure high molecular mass and quantified by spectrophotometry and fluorescence using Biotium AccuBlueTM High Sensitivity dsDNA Quantitative Solution according to manufacturer’s instructions. Genotyping-by-sequencing (GBS) libraries were prepared using a modified protocol after Elshire et al. (Elshire et al. 2011). Briefly, high-molecular-weight genomic DNA was aliquoted and digested using ApeKI restriction enzyme. Digests from each individual sample were uniquely barcoded, pooled, and size selected to yield insert sizes between 300-700 bp (Borgstrom et al. 2011). Pooled libraries were PCR amplified (15 cycles) using custom primers that extend into the genomic DNA insert by 3 bases (CTG). Adding 3 extra base pairs systematically reduces the number of sequenced GBS tags, ensuring sufficient sequencing depth. The final library had a mean size of 424 bp ranging from 188 to 700 bp . Anolis SNPs: Pooled libraries were sequenced on one lane on the Illumina HiSeq 4000 in 2x150 bp paired-end configuration, yielding approximately 459 million paired-end reads ( ~138 Gb). The medium Q-Score was 42 with the lower 10% Q-Scores exceeding 32 for all 150 bp. The initial library contained 180 individuals with 8,561,493 polymorphic sites. Twenty individuals were Anolis sagrei, and two individuals (Yan 1610 & Yin 1411) clustered with A. sagrei and were not used to define A. carolinesis’ SNPs. Anolis carolinesis reads were aligned to the Anolis carolinensis genome (NCBI RefSeq accession number:/GCF_000090745.1_AnoCar2.0). Single nucleotide polymorphisms (SNPs) for A. carolinensis were called using the GBeaSy analysis pipeline (Wickland et al. 2017) with the following filter settings: minimum read length of 100 bp after barcode and adapter trimming, minimum phred-scaled variant quality of 30 and minimum read depth of 5. SNPs were further filtered by requiring SNPs to occur in > 50% of individuals, and 66 individuals were removed because they had less than 70% of called SNPs. These filtering steps resulted in 51,155 SNPs among 94 individuals. Final filtering among 94 individuals required all sites to be polymorphic (with fewer individuals, some sites were no longer polymorphic) with a maximum of 2 alleles (all are bi-allelic), minimal allele frequency 0.05, and He that does not exceed HWE (FDR <0.01). SNPs with large He were removed (2,280 SNPs). These SNPs with large significant heterozygosity may result from aligning paralogues (different loci), and thus may not represent polymorphisms. No SNPs were removed with low He (due to possible demography or other exceptions to HWE). After filtering, 94 individual yielded 44,120 SNPs. Thus, the final filtered SNP data set was 44K SNPs from 94 indiviuals. Statistical Analyses: Eight A. carolinensis populations were analyzed: three populations from islands with native species only (1Spp islands) and 5 populations from islands where A. carolinesis co-exist with A. sagrei (2Spp islands, Table 1, Table S1). Most analyses pooled the three 1Spp islands and contrasted these with the pooled five 2Spp islands. Two approaches were used to define SNPs with unusually large allele frequency differences between 1Spp and 2Spp islands: 1) comparison of FST values to random permutations and 2) a modified FDIST approach to identify outlier SNPs with large and statistically unlikely FST values. Random Permutations: FST values were calculated in VCFTools (version 4.2, (Danecek et al. 2011)) where the p-value per SNP were defined by comparing FST values to 1,000 random permutations using a custom script (below). Basically, individuals and all their SNPs were randomly assigned to one of eight islands or to 1Spp versus 2Spp groups. The sample sizes (55 for 2Spp and 39 for 1Spp islands) were maintained. FST values were re-calculated for each 1,000 randomizations using VCFTools. Modified FDIST: To identify outlier SNPs with statistically large FST values, a modified FDIST (Beaumont and Nichols 1996) was implemented in Arlequin (Excoffier et al. 2005). This modified approach applies 50,000 coalescent simulations using hierarchical population structure, in which demes are arranged into k groups of d demes and in which migration rates between demes are different within and between groups. Unlike the finite island models, which have led to large frequencies of false positive because populations share different histories (Lotterhos and Whitlock 2014), the hierarchical island model avoids these false positives by avoiding the assumption of similar ancestry (Excoffier et al. 2009). References Beaumont, M. A. and R. A. Nichols. 1996. Evaluating loci for use in the genetic analysis of population structure. P Roy Soc B-Biol Sci 263:1619-1626. Borgstrom, E., S. Lundin, and J. Lundeberg. 2011. Large scale library generation for high throughput sequencing. PLoS One 6:e19119. Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635. Cingolani, P., A. Platts, L. Wang le, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6:80-92. Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and G. Genomes Project Analysis. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158. Earl, D. A. and B. M. vonHoldt. 2011. Structure Harvester: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genet Resour 4:359-361. Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379. Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611-2620. Excoffier, L., T. Hofer, and M. Foll. 2009. Detecting loci under selection in a hierarchically structured population. Heredity 103:285-298. Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software package for population genetics data analysis.
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Natural selection favors sexual dimorphism that reduces resource competition between the sexes of the same species. However, niche partitioning among interspecific competitors should counter such divergence, as partitioning the niche results in smaller total niche widths for each individual species, leaving less room for the sexes to diverge. A straightforward (and long-standing) hypothesis emerges: species in competitor-rich ecological communities should show less sexual dimorphism than species in competitor-poor ecological communities. Here, we test this prediction using a well-documented natural experiment generated by the recent arrival of Anolis sagrei to a set of small islands in Mosquito Lagoon, Florida, containing Anolis carolinensis. Despite known interspecific habitat partitioning and rapid evolution in habitat-use traits by A. carolinensis in this system, sexual dimorphism between male and female A. carolinensis was not reduced as predicted on two-species islands relative to islands with only A. carolinensis. This is consistent with a small but growing body of empirical tests of the dimorphism-richness hypothesis that have been ambiguous in their support at best. A rethinking of the validity of this intuitive hypothesis is needed.
Methods See details in the Methods section of main paper. See Stuart et al., 2014, Science 346: 463-466 and its supporting information for coordinates of island location.
Outlier detection was performed with outlier defined as values > 3 standard deviations from the mean for a given island (see 1.data.manip.R). For individuals with many outliers in the size corrected, the size (SVL) data were checked against the original data sheet and corrected where errors were detected in data entry. A small number of individuals were detected with one or two outliers which is to be expected given the large data set used. One individual with nine outliers was excluded from the analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022). Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset. For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include: -huc12: The 12-digit hydrologic unit code -ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month calculated as the sum of daily ET interpolated between Landsat overpasses -statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month -year: 4-digit year -month: 2-digit month -count: Number of Landsat overpasses included in the ET calculation in the month -et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12 -count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent For the Ensemble data, these additional variables are included in the CSV files: -et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD) -et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD -et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD -et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD -et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal Below are the locations of each OpenET image collection used in this summary: DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0 eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0 geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0 PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0 SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0 SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0 Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.
This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:
ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)
ESM_2.py – Python script to calculate Z-scores from raw financial ratios
ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios
ESM_4.py – Python script for generating the correlation heatmap of the Z-scores
ESM_5.xlsx – Mahalanobis distance values for each firm
ESM_6.py – Python script to compute Mahalanobis distances
ESM_7.py – Python script to visualize Mahalanobis distances
ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)
ESM_9.py – Python script to compute mean Z-scores
ESM_10.xlsx – Re-standardized Z-scores based on firm-level means
ESM_11.py – Python script to re-standardize mean Z-scores
ESM_12.py – Python script to generate the hierarchical clustering dendrogram
All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
The U.S. Geological Survey (USGS), in cooperation with the Missouri Department of Natural Resources (MDNR), collects data pertaining to the surface-water resources of Missouri. These data are collected as part of the Missouri Ambient Water-Quality Monitoring Network (AWQMN) and are stored and maintained by the USGS National Water Information System (NWIS) database. These data constitute a valuable source of reliable, impartial, and timely information for developing an improved understanding of the water resources of the State. Water-quality data collected between water years 1993 and 2017 were analyzed for long term trends and the network was investigated to identify data gaps or redundant data to assist MDNR on how to optimize the network in the future. This is a companion data release product to the Scientific Investigation Report: Richards, J.M., and Barr, M.N., 2021, General water-quality conditions, long-term trends, and network analysis at selected sites within the Ambient Water-Quality Monitoring Network in Missouri, water years 1993–2017: U.S. Geological Survey Scientific Investigations Report 2021–5079, 75 p., https://doi.org/10.3133/sir20215079. The following selected tables are included in this data release in compressed (.zip) format: AWQMN_EGRET_data.xlsx -- Data retrieved from the USGS National Water Information System database that was quality assured and conditioned for network analysis of the Missouri Ambient Water-Quality Monitoring Network AWQMN_R-QWTREND_data.xlsx -- Data retrieved from the USGS National Water Information System database that was quality assured and conditioned for analysis of flow-weighted trends for selected sites in the Missouri Ambient Water-Quality Monitoring Network AWQMN_R-QWTREND_outliers.xlsx -- Data flagged as outliers during analysis of flow-weighted trends for selected sites in the Missouri Ambient Water-Quality Monitoring Network AWQMN_R-QWTREND_outliers_quarterly.xlsx -- Data flagged as outliers during analysis of flow-weighted trends using a simulated quarterly sampling frequency dataset for selected sites in the Missouri Ambient Water-Quality Monitoring Network AWQMN_descriptive_statistics_WY1993-2017.xlsx -- Descriptive statistics for selected water-quality parameters at selected sites in the Missouri Ambient Water-Quality Monitoring Network The following selected graphics are included in this data release in .pdf format. Also included in this data release are web pages accessible for people with disabilities provided in compressed .zip format. The web pages present the same information as the .pdf files: Annual and seasonal discharge trends.pdf -- Graphics of discharge trends produced from the EGRET software for selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Annual_and_seasonal_discharge_trends_htm.zip -- Compressed web page presenting graphics of discharge trends produced from the EGRET software for selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Graphics of simulated quarterly sampling frequency trends.pdf -- Graphics of results of simulated quarterly sampling frequency trends produced by the R-QWTREND software at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Graphics_of_simulated_quarterly_sampling_frequency_trends_htm.zip -- Compressed web page presenting graphics of results of simulated quarterly sampling frequency trends produced by the R-QWTREND software at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Graphics of median parameter values.pdf -- Graphics of median values for selected parameters at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Graphics_of_median_parameter_values_htm.zip -- Compressed web page presenting graphics of median values for selected parameters at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Parameter value versus time.pdf -- Scatter plots of the value of selected parameters versus time at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Parameter_value_versus_time_htm.zip -- Compressed web page presenting scatter plots of the value of selected parameters versus time at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Parameter value versus discharge.pdf -- Scatter plots of the value of selected parameters versus discharge at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Parameter_value_versus_discharge_htm.zip -- Compressed web page presenting scatter plots of the value of selected parameters versus discharge at selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Boxplot of parameter value distribution by season.pdf -- Seasonal boxplots of selected parameters from selected sites in the Missouri Ambient Water-Quality Monitoring Network. Seasons defined as Winter (December, January, and February), Spring (March, April, and May), Summer (June, July, and August), and Fall (September, October, and November). Graphics provided to support the interpretations in the Scientific Investigations Report. Boxplot_of_parameter_value_distribution_by_season_htm.zip -- Compressed web page presenting seasonal boxplots of selected parameters from selected sites in the Missouri Ambient Water-Quality Monitoring Network. Seasons defined as Winter (December, January, and February), Spring (March, April, and May), Summer (June, July, and August), and Fall (September, October, and November). Graphics provided to support the interpretations in the Scientific Investigations Report. Boxplot of sampled discharge compared with mean daily discharge.pdf -- Boxplots of the distribution of discharge collected at the time of sampling of selected parameters compared with the period of record discharge distribution from selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Boxplot_of_sampled_discharge_compared_with_mean_daily_discharge_htm.zip -- Compressed web page presenting boxplots of the distribution of discharge collected at the time of sampling of selected parameters compared with the period of record discharge distribution from selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Boxplot of parameter value distribution by month.pdf -- Monthly boxplots of selected parameters from selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report. Boxplot_of_parameter_value_distribution_by_month_htm.zip -- Compressed web page presenting monthly boxplots of selected parameters from selected sites in the Missouri Ambient Water-Quality Monitoring Network. Graphics provided to support the interpretations in the Scientific Investigations Report.
We describe the use of the Fréchet mean and variance in the Billera-Holmes-Vogtmann (BHV) treespace to summarize and explore the diversity of a set of phylogenetic trees. We show that the Fréchet mean is comparable to other summary methods, and, despite its stickiness property, is more likely to be binary than the majority-rules consensus tree. We show that the Fréchet variance is faster and more precise than commonly used variance measures. The Fréchet mean and variance are more theoretically justified, and more robust, than previous estimates of this type, and can be estimated reasonably efficiently, providing a foundation for building more advanced statistical methods and leading to applications such as mean hypothesis testing and outlier detection.
Abstract: Building health management is an important part in running an efficient and cost-effective building. Many problems in a building’s system can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building‘s health with data driven methods throughout the day. Orca and IMS are two state of the art algorithms that observe an array of building health sensors and provide feedback on the overall system’s health as well as localize the problem to one, or possibly two, components. With this level of feedback the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls. Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and therefore was chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating & cooling temperatures for the air and water systems as well as other various subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis was focused from July 1st through August 10th 2009. Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distanced based scoring approach. Orca has the ability to use a single data set and find outliers within that data set. This tactic was applied to each day. After scoring each time sample throughout a given day the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal however days with lower overall correlations were more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can be applied to a set of test data to asses how anomaly the particular data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified the contributing parameters were then ranked by the algorithm. Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system. -7/03/09 (Figure 2 & 3) AHU-1 Return Air Temperature (RAT) Calculated Average Return Air Temperature -7/19/09 (Figure 3 & 4) AHU-2 Return Air Temperature (RAT) Calculated Average Return Air Temperature IMS identified significantly higher temperatures compared to other days during the month of July and August. Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system as well as diagnose the anomaly by identifying the sensor values that were anomalous. In the future these methods can be used on live streaming data and produce a real time anomaly score to help building maintenance with detection and diagnosis of problems.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE Scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores were representative of the new scoring system and therefore were not able to be compared to the older scores based on raw score. After removal of these 96 records from our analyses, a total of 420 student records remained which included students that were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency in the participants, we removed 100 additional records so that our analyses only considered students that had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed for a final data set of 286 (see Outliers below). Outliers We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed. Sample See detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality regarding outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences between mean GRE scores, percentiles, and undergraduate GPA and candidacy exam results. Other variables were also observed such as gender, race, ethnicity, and citizenship status within the samples. Predictive Metrics. The input variables used in this study were GPA and scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests. Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student’s candidacy examination result.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The results represent experiments on four datasets based on 20 simulated experiments and the class size of 220. The proposed method (NewAlgo) produces the best overall results.