Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table identifies all state-level causes of death that were at least five times the national rate in at least one of the periods 1999-2003, 2004-2008, and 2009-2013. Data follow the 113 Cause of Death list and are drawn from the CDC's Underlying Cause of Death file, accessible at: http://wonder.cdc.gov/ucd-icd10.html.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Selection was simulated with α = 600 and Nm = 10. Mean localization is given as the distance (kb) from the true position of the selected site.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation whose k-nearest neighbor distance with the maximum gap differs significantly from what we would expect if the distribution of k-nearest neighbor distances with the maximum gap were in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used to calculate the anomalous threshold. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies in other data structures using feature engineering. We show the situations in which the stray algorithm outperforms the HDoutliers algorithm in both accuracy and computational time. This framework is implemented in the open-source R package stray. Supplementary materials for this article are available online.
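A hedged sketch of this workflow, running the stray package's find_HDoutliers() on synthetic data with a few injected anomalies; the argument defaults and the returned outliers element follow the package documentation and may differ across versions.

# install.packages("stray")
library(stray)

set.seed(1)
X <- rbind(matrix(rnorm(1000), ncol = 10),          # 100 background points
           matrix(rnorm(50, mean = 8), ncol = 10))  # 5 injected anomalies

# Flag observations whose k-NN distance with the maximum gap exceeds
# the EVT-based threshold (alpha is the significance level)
out <- find_HDoutliers(X, alpha = 0.01, k = 10)
out$outliers  # indices of the flagged observations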
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article introduces lookout, a new approach to detecting outliers using leave-one-out kernel density estimates and extreme value theory. Outlier detection methods that use kernel density estimates generally employ a user-defined parameter to determine the bandwidth. Lookout uses persistent homology to construct a bandwidth suitable for outlier detection without any user input. We demonstrate the effectiveness of lookout on an extensive data repository by comparing its performance with other outlier detection methods based on extreme value theory. Furthermore, we introduce outlier persistence, a useful concept that explores the birth and cessation of outliers with changing bandwidth and significance levels. The R package lookout implements this algorithm. Supplementary files for this article are available online.
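A minimal usage sketch under the same caveat: lookout() is taken to be the package's main entry point per its documentation, with alpha as the significance level.

# install.packages("lookout")
library(lookout)

set.seed(1)
X <- rbind(matrix(rnorm(400), ncol = 2),                 # 200 background points
           matrix(runif(8, min = 5, max = 6), ncol = 2)) # 4 outliers
res <- lookout(X, alpha = 0.05)
res  # flagged outliers with their leave-one-out probabilities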
In this work we apply and expand on a recently introduced outlier detection algorithm that is based on an unsupervised random forest. We use the algorithm to calculate a similarity measure for stellar spectra from the Apache Point Observatory Galactic Evolution Experiment (APOGEE). We show that the similarity measure traces non-trivial physical properties and contains information about complex structures in the data. We use it for visualization and clustering of the dataset, and discuss its ability to find groups of highly similar objects, including spectroscopic twins. Using the similarity matrix to search the dataset for objects allows us to find objects that are impossible to find using their best-fitting model parameters. This includes extreme objects for which the models fail, and rare objects that are outside the scope of the model. We use the similarity measure to detect outliers in the dataset, and find a number of previously unknown Be-type stars, spectroscopic binaries, carbon-rich stars, young stars, and a few that we cannot interpret. Our work further demonstrates the potential for scientific discovery when combining machine learning methods with modern survey data. Cone search capability for table J/MNRAS/476/2117/apogeenn (Nearest neighbors APOGEE IDs)
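The generic unsupervised-random-forest similarity trick (contrast real data against a permuted synthetic copy, then read similarities off the proximity matrix) can be sketched as below; this illustrates the idea only and is not the authors' APOGEE pipeline.

library(randomForest)

set.seed(1)
real  <- matrix(rnorm(200 * 5), ncol = 5)  # stand-in for spectra
synth <- apply(real, 2, sample)            # permuting columns destroys joint structure
X <- rbind(real, synth)
y <- factor(rep(c("real", "synthetic"), each = nrow(real)))

rf <- randomForest(X, y, ntree = 500, proximity = TRUE)
# Proximity between real observations serves as the similarity measure;
# low mean similarity to everything else marks an outlier.
prox <- rf$proximity[1:nrow(real), 1:nrow(real)]
outlier_score <- 1 - rowMeans(prox)
head(order(outlier_score, decreasing = TRUE))  # most isolated objects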
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Dynamic Apparel Sales with Anomalies Dataset is based on 100,000 sales transaction records from the fashion industry, including extreme outliers, missing values, and sales_categories, reflecting the varied data characteristics of real retail environments.
2) Data Utilization
(1) The Dynamic Apparel Sales with Anomalies Dataset has the following characteristics:
• The dataset consists of nine categorical and ten numerical variables, including product name, brand, gender, clothing category, price, discount rate, inventory level, and customer behavior, making it suitable for analyzing product and customer characteristics.
(2) The Dynamic Apparel Sales with Anomalies Dataset can be used for:
• Sales anomaly detection and quality control: transaction data with outliers and missing values can be used to detect outliers, manage quality, refine data, and develop outlier-processing techniques (a minimal sketch follows below).
• Sales forecasting and customer analysis modeling: the variety of product and customer characteristics can support data-driven decision-making, such as machine-learning-based sales forecasting, customer segmentation, and customized marketing strategies.
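A minimal sketch of the quality-control step mentioned above, flagging IQR outliers in a hypothetical price column; the column name and values are illustrative, not the dataset's actual schema.

# IQR screen: values beyond k * IQR from the quartiles are flagged
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

set.seed(1)
sales <- data.frame(price = c(rlnorm(1000, 3, 0.4), 5000, 9999))  # two extremes
sales$price[sample(nrow(sales), 20)] <- NA   # simulate missing values
table(flag_iqr_outliers(sales$price), useNA = "ifany")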
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This study provides a first nationwide estimate of extreme water levels along the coastline of metropolitan France. It is intended to be refined locally with all available data and knowledge. The method is based on a statistical analysis of the tide-gauge records available at ports; it does not take wave observations into account. Results between ports are obtained by interpolation. The study produces statistical estimates at reference ports: extreme values of open-sea storm surges in the Channel and the Atlantic, and extreme water levels for the whole of metropolitan France, together with a set of maps of statistical estimates of extreme water levels along the coastline. The estimates provided extend to the 1000-year return period. Given the lengths of the observation records at the ports, users should check whether estimates for return periods beyond 50 or 100 years remain meaningful.
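As a hedged illustration of what a T-year return level means here, the sketch below computes GEV return levels with made-up parameters (not values from the study), using evd::qgev().

library(evd)

mu <- 4.2; sigma <- 0.25; xi <- 0.05  # illustrative GEV parameters (metres)
T <- c(10, 50, 100, 1000)             # return periods in years
data.frame(T, return_level = qgev(1 - 1/T, loc = mu, scale = sigma, shape = xi))
# Levels far beyond the record length (e.g. 1000 years from ~50 years of
# observations) are extrapolations, echoing the caution above.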
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset simulates CPU performance logs collected from multiple systems; a minimal cleaning sketch follows the column list below.
Columns:
CPU Usage (%) - CPU utilization (with missing values, outliers, and anomalies).
CPU Temperature (°C) - Temperature readings (with random noise and extreme values).
Clock Speed (GHz) - CPU clock speed (with inconsistent formatting and missing values).
Cache Miss Rate (%) - Percentage of cache misses (skewed distribution and corrupted values).
Power Consumption (W) - Power usage in watts (with extreme outliers and inconsistent scaling).
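A minimal cleaning sketch for such a log; the file name and column order are assumptions, so adjust them to the actual dataset.

logs <- read.csv("cpu_performance_logs.csv")  # hypothetical file name
names(logs) <- c("usage", "temp", "clock", "cache_miss", "power")  # assumed order

# Coerce inconsistently formatted clock speeds ("3.2 GHz" -> 3.2)
logs$clock <- as.numeric(gsub("[^0-9.]", "", logs$clock))

# Winsorize extreme power readings at the 1st/99th percentiles
p <- quantile(logs$power, c(0.01, 0.99), na.rm = TRUE)
logs$power <- pmin(pmax(logs$power, p[1]), p[2])

# Impute missing CPU usage with the column median
logs$usage[is.na(logs$usage)] <- median(logs$usage, na.rm = TRUE)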
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sweden Consumer Survey: KI: Perceived Inflation Now: excl Extreme Values data was reported at 2.210 % in Jul 2018. This records a decrease from the previous figure of 2.550 % for Jun 2018. The series is updated monthly, with a median of 1.720 % over the period from Dec 2001 to Jul 2018, across 200 observations. The data reached an all-time high of 4.770 % in Jul 2008 and a record low of 0.000 % in Mar 2016. The series remains in active status in CEIC and is reported by the National Institute of Economic Research. The data is categorized under Global Database's Sweden – Table SE.H009: Consumer Survey: National Institute of Economic Research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set provides experimental results for the analysis of extreme values of trilateration localization error in wireless communication systems. The analysis is based upon the analytical model of trilateration localization error described and discussed in the manuscript titled "An Analytical Model of Trilateration Localization Error".
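Since the dataset concerns extremes of trilateration error, a generic least-squares trilateration sketch is given below to show how such an error sample arises; this is an illustration only, not the manuscript's analytical model.

trilaterate <- function(anchors, d) {
  # Linearize the range equations by subtracting the first anchor's equation
  A <- 2 * sweep(anchors[-1, , drop = FALSE], 2, anchors[1, ])
  b <- rowSums(anchors[-1, , drop = FALSE]^2) - sum(anchors[1, ]^2) + d[1]^2 - d[-1]^2
  qr.solve(A, b)
}

set.seed(1)
anchors <- rbind(c(0, 0), c(10, 0), c(0, 10))
target  <- c(3, 4)
errors <- replicate(10000, {
  d <- sqrt(colSums((t(anchors) - target)^2)) + rnorm(3, sd = 0.1)  # noisy ranges
  sqrt(sum((trilaterate(anchors, d) - target)^2))                   # position error
})
max(errors)  # one realization of the extreme localization error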
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: Tab. 4 Absolute air temperature extreme values during the observational period 1951-1965. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.745935 for more information.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Simulations of systems with random interactions, for different underlying distributions and realizations of the interaction.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the number of outliers in selected MNCH data items.
In the analysis of DNA sequences on related individuals, most methods strive to incorporate as much information as possible, with little or no attention paid to the issue of statistical significance. For example, a modern workstation can easily handle the computations needed to perform a large-scale genome-wide inheritance-by-descent (IBD) scan, but accurate assessment of the significance of that scan is often hindered by inaccurate approximations and computationally intensive simulation. To address these issues, we developed gLOD, a test of co-segregation that, for large samples, models chromosome-specific IBD statistics as a collection of stationary Gaussian processes. With this simple model, the parametric bootstrap yields an accurate and rapid assessment of significance: the genome-wide corrected P-value. Furthermore, we show that (i) under the null hypothesis, the limiting distribution of the gLOD is the standard Gumbel distribution; (ii) our parametric bootstrap simulator is approxi...
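A toy version of the bootstrap scheme the abstract sketches: model each chromosome's statistic as a stationary Gaussian AR(1) process, take the genome-wide maximum, and read the corrected P-value off the simulated null distribution. Every number below is hypothetical; this is not the authors' gLOD implementation.

set.seed(1)
genome_max <- function(n_chr = 22, len = 500, rho = 0.95) {
  # sqrt(1 - rho^2) rescales the AR(1) series to unit marginal variance
  max(replicate(n_chr,
    max(arima.sim(list(ar = rho), n = len) * sqrt(1 - rho^2))))
}

null_max <- replicate(2000, genome_max())  # parametric bootstrap
observed <- 4.6                            # a hypothetical scan maximum
mean(null_max >= observed)                 # genome-wide corrected P-value
# EVT predicts these null maxima are approximately Gumbel-distributed.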
This dataset is synthetically generated to mimic weather data for classification tasks. It includes various weather-related features and categorizes the weather into four types: Rainy, Sunny, Cloudy, and Snowy. This dataset is designed for practicing classification algorithms, data preprocessing, and outlier detection methods.
This dataset is useful for data scientists, students (especially beginners), and practitioners to investigate classification algorithms' performance, practice data preprocessing, feature engineering, and model evaluation, and test outlier detection methods. It provides opportunities for learning and experimenting with weather data analysis and machine learning techniques.
This dataset is synthetically produced and does not represent real-world weather data. It includes intentional outliers to provide opportunities for practicing outlier detection and handling. The values, ranges, and distributions may not accurately represent real-world conditions, and the data should primarily be used for educational and experimental purposes.
Anyone is free to share and use the data
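A minimal classification sketch on stand-in data of the kind described above; the feature names below are illustrative, not the dataset's actual columns.

library(rpart)

set.seed(1)
n <- 400
toy <- data.frame(
  temperature  = rnorm(n, 15, 10),
  humidity     = runif(n, 20, 100),
  weather_type = factor(sample(c("Rainy", "Sunny", "Cloudy", "Snowy"),
                               n, replace = TRUE))
)
fit <- rpart(weather_type ~ temperature + humidity, data = toy)
table(predicted = predict(fit, type = "class"), actual = toy$weather_type)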
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Captions:
Figure 1: Phase diagram depicting the shape parameter ξ of the accepted GEV distribution for the ER-ER multiplex network as a function of the IC inclusion probabilities (p_in) in both layers. Region B corresponds to the Weibull distribution; region A stands for undefined distributions. Network size N = 100 in each layer.
Figure 2: (Color online) Distribution of R_max for SF networks with average degree ⟨k⟩ = 4 for various IC inclusion probabilities (p_in). The histogram is fitted with normal (blue dotted line) and GEV (red solid line) distributions. Network size N = 500.
Figure 3: (Color online) Distribution of R_max for SF networks with average degree ⟨k⟩ = 6 for various IC inclusion probabilities (p_in). The histogram is fitted with normal (blue dotted line) and GEV (red solid line) distributions. Network size N = 500.
Table 1: Estimated parameters of the KS test for fitting GEV and normal distributions to R_max for different network sizes of the SF network, averaged over 5000 random realizations. Other parameters: inhibitory inclusion probability p_in = 0.5 and average degree ⟨k⟩ = 6.
Table 2: Estimated parameters of the KS test for fitting GEV and normal distributions to R_max for different inhibitory inclusion probabilities (p_in) of the SF-SF network over 5000 realizations. Other parameters: network size N = 100 in each layer and average degree ⟨k⟩ = 6.
Table 3: Estimated parameters of the KS test for fitting GEV and normal distributions to R_max for different inhibitory inclusion probabilities (p_in) of the ER-SF network over 5000 realizations. Other parameters: network size N = 100 in each layer and average degree ⟨k⟩ = 6.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A detailed introduction to the outlier data analysis dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Student absences have a substantial impact on students' future physical and mental health, as well as their academic progress. Numerous personal, familial, and social issues are among the causes of student absences. Any kind of absence from school should be minimized. Extremely high rates of student absences may indicate the abrupt commencement of a serious school health crisis or public health crisis, such as the spread of tuberculosis or COVID-19, which provides school health professionals with an early warning. We focus on the extreme values in absence data and apply extreme value theory (EVT) to describe their distribution. This study aims to predict extreme instances of student absences. Based on the predicted results, school health professionals can take preventative measures to reduce future excessive absences. Five statistical distributions were applied to individually characterize the extreme values. Our findings suggest that EVT is a useful tool for predicting extreme student absences, thereby aiding preventative measures in public health.
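A hedged peaks-over-threshold sketch on simulated daily absence counts; the paper compares five distributions, while this shows only a generalized Pareto fit via evd::fpot().

library(evd)

set.seed(1)
absences <- rnbinom(1000, size = 5, mu = 20)  # simulated daily absence counts
u <- quantile(absences, 0.95)                 # high threshold
fit <- fpot(absences, threshold = u)          # GPD fit to the exceedances
fit$estimate                                  # scale and shape
# Upper quantiles of the fitted GPD forecast extreme-absence days.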
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The detection of water quality indicators such as temperature, pH, turbidity, conductivity, and TDS involves five national standard methods. Chemically based measurement techniques may generate liquid residue, causing secondary pollution. A water quality monitoring and data analysis system can effectively address the issue that conventional methods require multiple pieces of equipment and repeated measurements. This paper analyzes the distribution characteristics of historical data from five sensors at a given time, displays them graphically in real time, and provides an early warning when standards are exceeded. Four water samples from different sections of the Li River were selected; relative to the national standard method, the average measurement errors for temperature, pH, TDS, conductivity, and turbidity are 0.98%, 2.23%, 2.92%, 3.05%, and 3.98%, respectively. The quartile method was further used to analyze outliers in over 100,000 records across five selected historical periods. Experimental results show the system is relatively stable when measuring temperature, pH, and TDS, with outlier proportions of 0.42%, 0.84%, and 1.24%; for turbidity and conductivity, the proportions are 3.11% and 2.92%. In an experiment comparing seven methods for filling outliers, the k-nearest-neighbor algorithm performed best. The analysis of data trends, outliers, means, and extreme values assists decision-making, such as updating and maintaining equipment, responding to extreme water quality situations, and enhancing regional water quality oversight.
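A minimal sketch of the quartile screen plus KNN filling described above; VIM::kNN() stands in for whichever KNN filler the system uses, and the columns are illustrative.

library(VIM)  # provides kNN() imputation

set.seed(1)
wq <- data.frame(temp = rnorm(1000, 20, 1), ph = rnorm(1000, 7, 0.2))
wq$ph[c(5, 99)] <- c(2.1, 13.5)  # inject two outliers

q <- quantile(wq$ph, c(0.25, 0.75)); iqr <- diff(q)
bad <- wq$ph < q[1] - 1.5 * iqr | wq$ph > q[2] + 1.5 * iqr
mean(bad)               # proportion of outliers, as reported above
wq$ph[bad] <- NA        # blank the outliers, then fill
wq <- kNN(wq, variable = "ph", k = 7, imp_var = FALSE)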
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List: rextreme.txt - R source code for fitting extreme value distributions
Description: This is a text file containing R-language source code for fitting extreme value distributions. These functions were originally written in S by Stuart Coles and converted to R by Alec Stephenson. For an explanation of how to use these functions, see the Appendix in: Stuart Coles, 2001. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London, UK.
These functions are included to document how the results in the paper were obtained. For other uses, it is recommended that the entire suite of functions for extreme value analysis be downloaded. The Extremes Toolkit includes this suite of functions, as well as a graphical user interface (currently available at: www.esig.ucar.edu/extremevalues/evtk.html).
gev.fit is an R function that estimates the parameters of the generalized extreme value distribution by the method of maximum likelihood.
gpd.fit is an R function that estimates the parameters of the generalized Pareto distribution by the method of maximum likelihood.
pp.fit is an R function that estimates the parameters of the generalized extreme value distribution, via the point process representation, by the method of maximum likelihood.
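A hedged usage sketch: the same gev.fit/gpd.fit/pp.fit interfaces also ship in the ismev package, which descends from the Coles/Stephenson code, and its example datasets come from Coles (2001).

# source("rextreme.txt")  # or, equivalently:
library(ismev)

data(portpirie)                # annual maximum sea levels (Coles, 2001)
gev.fit(portpirie[, 2])        # GEV parameters by maximum likelihood

data(rain)                     # daily rainfall series (Coles, 2001)
gpd.fit(rain, threshold = 30)  # GPD fit above a high threshold
pp.fit(rain, threshold = 30)   # point-process representation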