The following report outlines the workflow used to optimize your Find Outliers result.

Initial Data Assessment
There were 721 valid input features.
GRM properties: Min 0.0000, Max 157.0200, Mean 9.1692, Std. Dev. 8.4220.
There were 4 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis
The optimal fixed distance band was selected based on peak clustering found at 1894.5039 meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 248 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
There are 30 statistically significant high outlier features.
There are 7 statistically significant low outlier features.
There are 202 features that are part of statistically significant low clusters.
There are 9 features that are part of statistically significant high clusters.

Output
Pink output features are part of a cluster of high GRM values.
Light blue output features are part of a cluster of low GRM values.
Red output features represent high outliers within a cluster of low GRM values.
Blue output features represent low outliers within a cluster of high GRM values.
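The report applies an FDR correction for multiple testing. The tool's exact procedure is not given here; as an illustration only, the standard Benjamini-Hochberg step that such corrections resemble can be sketched as follows (not the tool's implementation):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of p-values significant under the BH FDR procedure."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)
    m = len(p)
    # BH rule: find the largest k with p_(k) <= (k/m) * alpha, reject all up to k
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[: k + 1]] = True
    return significant
```

Unlike a plain per-test threshold, the cutoff here adapts to the number of tests, which is why the report's significant-feature counts depend on the total feature count.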
The following report outlines the workflow used to optimize your Find Outliers result.

Initial Data Assessment
There were 137 valid input features.
There were 4 outlier locations; these will not be used to compute the polygon cell size.

Incident Aggregation
The polygon cell size was 49251.0000 meters.
The aggregation process resulted in 72 weighted areas.
Incident count properties: Min 1.0000, Max 21.0000, Mean 1.9028, Std. Dev. 2.4561.

Scale of Analysis
The optimal fixed distance band was selected based on peak clustering found at 94199.9365 meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 3 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
There are 2 statistically significant high outlier features.
There are 0 statistically significant low outlier features.
There are 0 features that are part of statistically significant low clusters.
There is 1 feature that is part of a statistically significant high cluster.

Output
Pink output features are part of a cluster of high values.
Light blue output features are part of a cluster of low values.
Red output features represent high outliers within a cluster of low values.
Blue output features represent low outliers within a cluster of high values.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "mnist-outlier"
📚 This dataset is an enriched version of the MNIST dataset. The workflow is described in the Medium article: Changes of Embeddings during Fine-Tuning of Transformers.
Explore the Dataset
The open source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here: https://huggingface.co/spaces/renumics/mnist-outlier.
Or you can explore it locally:… See the full description on the dataset page: https://huggingface.co/datasets/renumics/mnist-outlier.
The following report outlines the workflow used to optimize your Find Outliers result.

Initial Data Assessment
There were 1684 valid input features.
POVERTY properties: Min 0.0000, Max 91.8000, Mean 18.9902, Std. Dev. 12.7152.
There were 22 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis
The optimal fixed distance band was based on the average distance to 30 nearest neighbors: 3709.0000 meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 1155 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
There are 68 statistically significant high outlier features.
There are 84 statistically significant low outlier features.
There are 557 features that are part of statistically significant low clusters.
There are 446 features that are part of statistically significant high clusters.

Output
Pink output features are part of a cluster of high POVERTY values.
Light blue output features are part of a cluster of low POVERTY values.
Red output features represent high outliers within a cluster of low POVERTY values.
Blue output features represent low outliers within a cluster of high POVERTY values.
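This report derives its fixed distance band from the average distance to each feature's 30 nearest neighbors. Assuming plain Euclidean coordinates, that quantity can be sketched with a brute-force numpy computation (an illustration, not the tool's implementation; suitable only for modest feature counts):

```python
import numpy as np

def mean_knn_distance(coords, k=30):
    """Average distance from each feature to its k nearest neighbors."""
    pts = np.asarray(coords, dtype=float)
    k = min(k, len(pts) - 1)           # cannot have more neighbors than points
    # Full pairwise Euclidean distance matrix
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude each point's self-distance
    d.sort(axis=1)                     # ascending per row
    return float(d[:, :k].mean())
```

For large feature sets a spatial index (e.g. a k-d tree) would replace the O(n²) distance matrix, but the quantity computed is the same.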
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies, PIIE Policy Brief 17-29. If you use the data, please cite as: Djankov, Simeon. (2017). United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies. PIIE Policy Brief 17-29. Peterson Institute for International Economics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outliers are often present in large datasets of water quality monitoring time series. A method that combines the sliding window technique with the Dixon detection criterion for the automatic detection of outliers in time series data is limited by the empirical determination of sliding window sizes, so determining the optimal sliding window size in a principled way is meaningful research. This paper presents a new Monte Carlo Search Method (MCSM), based on random sampling, to optimize the size of the sliding window, taking full advantage of computational and statistical resources. The MCSM was applied in a case study to automatic monitoring data of water quality factors in order to test its validity and usefulness. Comparing the accuracy and efficiency of the MCSM shows that the new method is scientific and effective: at different sample sizes, the average accuracy is between 58.70% and 75.75%, and the average increase in computation time is between 17.09% and 45.53%. In the era of big data in environmental monitoring, the proposed method can meet the required accuracy of outlier detection while improving computational efficiency.
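The paper's Monte Carlo window-size search is not reproduced here, but the underlying sliding-window Dixon test it optimizes can be sketched as follows (critical values from the commonly tabulated two-sided 95% Dixon table for n ≤ 10; a simplified illustration, not the authors' code):

```python
import numpy as np

# Two-sided Dixon Q critical values at alpha = 0.05 for n = 3..10;
# larger windows would need an extended table.
Q_CRIT = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568,
          8: 0.526, 9: 0.493, 10: 0.466}

def dixon_outlier(window):
    """Return the index (within the window) of a Dixon outlier, or None."""
    x = np.sort(np.asarray(window, dtype=float))
    n = len(x)
    rng = x[-1] - x[0]
    if n not in Q_CRIT or rng == 0:
        return None
    q_low = (x[1] - x[0]) / rng       # test statistic for the suspect minimum
    q_high = (x[-1] - x[-2]) / rng    # test statistic for the suspect maximum
    qc = Q_CRIT[n]
    if q_high >= q_low and q_high > qc:
        return int(np.argmax(window))
    if q_low > q_high and q_low > qc:
        return int(np.argmin(window))
    return None

def sliding_window_outliers(series, window_size):
    """Indices in `series` flagged as outliers in any sliding window."""
    flagged = set()
    for start in range(len(series) - window_size + 1):
        hit = dixon_outlier(series[start:start + window_size])
        if hit is not None:
            flagged.add(start + hit)
    return sorted(flagged)
```

The MCSM in the paper then searches over `window_size` by random sampling, scoring each candidate against labeled data; the sketch above only shows the detector being tuned.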
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effect sizes were calculated using mean difference for burnt-unburnt study designs and mean change for before-after designs. Outliers, as defined in the methods section of the paper, were excluded prior to calculating effect sizes.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset builds upon "Financial Statement Data Sets" by incorporating several key improvements to enhance the accuracy and usability of US-GAAP financial data from SEC filings of U.S. exchange-listed companies. Drawing on submissions from January 2009 onward, the enhanced dataset aims to provide analysts with a cleaner, more consistent dataset by addressing common challenges found in the original data.
The source code for data extraction is available here
OSU_SnowCourse Summary: Manual snow course observations were collected over WY 2012-2014 from four paired forest-open sites chosen to span a broad elevation range. Study sites were located in the upper McKenzie (McK) River watershed, approximately 100 km east of Corvallis, Oregon, on the western slope of the Cascade Range and in the Middle Fork Willamette (MFW) watershed, located to the south of the McKenzie. The sites were designated based on elevation, with a range of 1110-1480 m. Distributed snow depth and snow water equivalent (SWE) observations were collected via monthly manual snow courses from 1 November through 1 April and bi-weekly thereafter. Snow courses spanned 500 m of forested terrain and 500 m of adjacent open terrain. Snow depth observations were collected approximately every 10 m and SWE was measured every 100 m along the snow courses with a federal snow sampler. These data are raw observations and have not been quality controlled in any way. Distance along the transect was estimated in the field. OSU_SnowDepth Summary: 10-minute snow depth observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meteorological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These data have undergone basic quality control. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes.
We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN flags for missing data to NA, and added site attributes such as site name and cover. First, because snow depth values in the raw data are negative (the sensor reports depth as a negative offset, with some correction to use the height of the sensor as zero), positive values are physically implausible and were replaced with NA. Second, the sign of the data was switched to make the depths positive. Then, the smooth.m (MATLAB) function was used to roughly smooth the data, with a moving window of 50 points. Third, outliers were removed: all values higher than the smoothed values + 10 were replaced with NA, and in some cases further single-point outliers were removed. OSU_Met Summary: Raw, 10-minute meteorological observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meteorological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These stations were deployed to collect numerous meteorological variables, of which snow depth and wind speed are included here. These data are raw datalogger output and have not been quality controlled in any way. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes.
We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN and 7999 flags for missing data to NA, and added site attributes such as site name and cover. OSU_Location Summary: Location Metadata for manual snow course observations and meteorological sensors. These data are compiled from GPS data for which the horizontal accuracy is unknown, and from processed hemispherical photographs. They have not been quality controlled in any way.
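The snow-depth cleaning steps described for OSU_SnowDepth (flip the sign, smooth with a 50-point moving window, mask values above smoothed + 10) can be sketched in Python, using a pandas rolling mean as a rough analogue of MATLAB's smooth.m (an illustration, not the original processing code):

```python
import numpy as np
import pandas as pd

def clean_snow_depth(raw, window=50, threshold=10.0):
    """Sketch of the described cleaning: flip sign, smooth with a centered
    moving window, and mask spikes exceeding smoothed + threshold."""
    depth = -pd.Series(raw, dtype=float)  # raw sensor values are negative
    smoothed = depth.rolling(window, center=True, min_periods=1).mean()
    # Values far above the smoothed curve become NaN (the NA flag)
    cleaned = depth.where(depth <= smoothed + threshold)
    return cleaned
```

The original workflow also removed some single-point outliers by hand, which no automated rule here reproduces.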
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater sample collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on reason(s) for considering a concentration as an outlier are included.
All the raw data are obtained from other publications as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.
Dataset   Species   Genes   Download
Plants    104       852     DOI 10.1186/2047-217X-3-17
Mammals   37        424     DOI 10.13012/C5BG2KWG
Insects   144       1478    http://esayyari.github.io/InsectsData
Cannon    78        213     DOI 10.5061/dryad.493b7
Rouse     26        393     DOI 10.5061/dryad.79dq1
Frogs     164       95      DOI 10.5061/dryad.12546.2
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.
This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:
ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)
ESM_2.py – Python script to calculate Z-scores from raw financial ratios
ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios
ESM_4.py – Python script for generating the correlation heatmap of the Z-scores
ESM_5.xlsx – Mahalanobis distance values for each firm
ESM_6.py – Python script to compute Mahalanobis distances
ESM_7.py – Python script to visualize Mahalanobis distances
ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)
ESM_9.py – Python script to compute mean Z-scores
ESM_10.xlsx – Re-standardized Z-scores based on firm-level means
ESM_11.py – Python script to re-standardize mean Z-scores
ESM_12.py – Python script to generate the hierarchical clustering dendrogram
All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Climate change has had a significant impact on the seasonal transition dates of Arctic tundra ecosystems, causing diverse variations between distinct land surface classes. However, the combined effect of multiple controls, as well as their individual effects on these dates, remains unclear at various scales and across diverse land surface classes. Here we quantified spatiotemporal variations of three seasonal transition dates (start of spring, maximum Normalized Difference Vegetation Index (NDVImax) day, end of fall) for five dominant land surface classes in ice-free Greenland and analyzed their drivers for current and future climate scenarios, respectively.

Methods
To quantify the seasonal transition dates, we used NDVI derived from Sentinel-2 MultiSpectral Instrument (Level-1C) images during 2016–2020 based on Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2). We performed an atmospheric correction (Yin et al., 2019) on the images before calculating NDVI. The months from May to October were set as the study period each year. The quality control process includes 3 steps: (i) clouds were masked according to the QA60 band; (ii) images were removed if the number of pixels with NDVI values outside the range of -1–1 exceeded 30% of the total pixels while extracting the median value of each date; (iii) NDVI outliers resulting from cloud mask errors (Coluzzi et al., 2018) and sporadic snow were deleted pixel by pixel. The NDVI outliers mentioned here appear as a sudden drop to almost zero in the growing season and do not form a sequence in this study (Komisarenko et al., 2022). To identify outliers, we iterated through every two consecutive NDVI values in the time series and calculated the difference between the second and first values for each pixel every year.
We defined anomalous NDVI differences as points outside the [10, 90] percentile threshold; if the NDVI difference is positive, the first NDVI value used to calculate the difference is the outlier, otherwise the second one is. Finally, 215 images were used to reflect seasonal transition dates in all 5 study periods of 2016–2020 after quality control. Each image was resampled to 32 m spatial resolution to match the resolution of the ArcticDEM data and SnowModel outputs. To detect seasonal transition dates, we used a double sigmoid model to fit the NDVI changes on the time series; the points where the curvature changes most rapidly on the fitted curve appear at the beginning, middle, and end of each season (Klosterman et al., 2014). The applicability of this phenology method in the Arctic has been demonstrated (Ma et al., 2022; Westergaard-Nielsen et al., 2013; Westergaard-Nielsen et al., 2017). We focused on 3 seasonal transition dates, i.e., SOS (start of spring), the NDVImax day, and EOF (end of fall). The NDVI values for some pixels are still below zero in spring and summer due to topographical shadow. We therefore set a quality control rule before calculating seasonal transition dates for each pixel: if the number of days with positive NDVI values from June to September is less than 60% of the total number of observed days, the pixel is not considered for subsequent calculations. As verification of the fitted dates, the seasonal transition dates in dry heaths and corresponding time-lapse photos acquired from the snow fence area are shown in Fig. 2. Snow cover extent is greatly reduced and vegetation is exposed with lower NDVI values on the SOS. All visible vegetation is green on the NDVImax day. On EOF, snow cover is partial and NDVI decreases to a value close to zero.
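The consecutive-difference outlier rule described above can be sketched in Python (percentile band and pair logic as stated; in the actual workflow the rule is applied per pixel and per year, so the function below is only an illustration):

```python
import numpy as np

def ndvi_outlier_indices(ndvi, lo=10, hi=90):
    """Flag indices of an NDVI time series using the stated rule:
    consecutive differences outside the [lo, hi] percentile band mark
    one member of the pair (the first if the difference is positive,
    otherwise the second)."""
    x = np.asarray(ndvi, dtype=float)
    diffs = np.diff(x)                       # second minus first, pairwise
    p_lo, p_hi = np.percentile(diffs, [lo, hi])
    flagged = set()
    for i, d in enumerate(diffs):
        if d < p_lo or d > p_hi:
            flagged.add(i if d > 0 else i + 1)
    return sorted(flagged)
```

A sudden cloud- or snow-induced drop produces a large negative difference (flagging the low value) followed by a large positive one (flagging the same low value again), so both sides of a spike point at the same index.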
DVF data statistics, available on explore.data.gouv.fr/immobilier. The files contain the number of sales and the average and median prices per m2.
- Total DVF statistics: statistics by geographical scale over the 10 available semesters.
- Monthly DVF statistics: statistics by geographical scale and by month.
## Description of processing
The code generates statistics from the land value request (DVF) data, aggregated at different scales, and their evolution over time (monthly). The following indicators were calculated monthly and over the entire available period (10 semesters):
* number of mutations (property transfers)
* average price per m2
* median price per m2
* breakdown of sale prices into brackets
for each property type:
* houses
* apartments
* houses + apartments
* commercial premises
and for each scale:
* nation
* department
* EPCI
* municipality
* cadastral section
The source data contain the following types of mutations: sale, sale before completion, sale of building land, tendering, expropriation, and exchange. We chose to keep only sales, sales before completion, and auctions for the statistics*. In addition, for simplicity, we kept only mutations that concern a single property (excluding dependencies)*. Our reasoning is as follows: 1. for a transfer that includes properties of several types (e.g. a house + a commercial premises), it is not possible to reconstruct the share of the land value allocated to each of the properties included; 2. for a transfer that includes several properties of the same type (e.g. X apartments), the total value of the transfer is not necessarily equal to X times the value of one apartment, especially when the properties are very different (area, work to be carried out, floor, etc.).
We had initially kept these properties, calculating the price per m2 of the mutation by treating the mutation's properties as a single property whose area is the sum of their surfaces, but this method, which ultimately concerned only a marginal quantity of properties, did not convince us for the final version. The price per m2 is calculated by dividing the land value of the mutation by the building surface area of the property concerned. We finally exclude mutations for which the price per m2 could not be calculated, as well as those whose price per m2 exceeds €100k (an arbitrary choice)*. We have not applied any other outlier restrictions, in order to stay faithful to the original data and to surface potential anomalies. Displaying the median on the site reduces the impact of outliers on the colour scales. *: The filters mentioned are applied for the calculation of statistics, but all mutations in the source files are displayed on the application at the parcel level.
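The price-per-m2 statistics described above can be sketched with pandas. The column names below (`valeur_fonciere`, `surface_reelle_bati`) follow the published DVF schema but should be treated as assumptions here, and the function is an illustration rather than the project's actual code:

```python
import pandas as pd

def price_stats(df, group_col):
    """Per-group sale count, mean, and median price per m2, applying the
    described filters (drop missing and > EUR 100k/m2)."""
    d = df.copy()
    d["prix_m2"] = d["valeur_fonciere"] / d["surface_reelle_bati"]
    d = d[d["prix_m2"].notna() & (d["prix_m2"] <= 100_000)]
    return d.groupby(group_col)["prix_m2"].agg(
        nb_mutations="count", prix_m2_moyen="mean", prix_m2_median="median")
```

Grouping by `commune`, `departement`, or a cadastral-section column then yields the per-scale tables; the median column is the one surfaced on the map, since it is robust to the remaining outliers.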
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LOF calculation time (seconds) comparison.
To perform accurate engineering predictions, this paper develops a method that combines Gaussian process regression (GPR) with possibilistic fuzzy c-means clustering (PFCM): GPR is used for the regression itself, and the corresponding prediction errors are used to determine the memberships of the training samples. On the basis of its memberships and the prediction errors of the clusters, the typicality of each training sample is computed and used to decide whether it is an outlier. In practice, the identified outliers are eliminated and the predictive model is built with the remaining training samples. Besides the model-construction method, the influence of key parameters on model accuracy is also investigated using two numerical problems. The results indicate that, compared with standard outlier detection approaches and plain Gaussian process regression, the proposed approach identifies outliers more precisely and generates more accurate predictions. To further assess its feasibility in actual engineering applications, a predictive model was developed to predict the inlet pressure of a nuclear control valve from its in-situ data. The findings show that the proposed approach outperforms Gaussian process regression: it reduces the detrimental impact of outliers and produces a more precise prediction model.
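The full GPR + PFCM typicality computation is beyond a short example, but its first ingredient, flagging training samples with large GPR prediction errors, can be sketched with scikit-learn (a simplified residual z-score stand-in, not the paper's method; the kernel and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gpr_outlier_mask(X, y, z_thresh=3.0, noise=1.0):
    """Flag training samples whose GPR residual exceeds z_thresh standard
    deviations of the residual distribution (no PFCM typicality step)."""
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                   alpha=noise, normalize_y=True)
    gpr.fit(X, y)
    resid = y - gpr.predict(X)
    z = np.abs(resid - resid.mean()) / resid.std()
    return z > z_thresh
```

The nonzero `alpha` (noise) keeps the GP from interpolating the outliers exactly, so corrupted samples retain large residuals; the paper replaces this global z-score with cluster-wise memberships and typicalities.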
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
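The dataset's scripts are in R; as an illustration of the overdispersion check they perform, here is a Python analogue computing the Pearson chi-square dispersion statistic from a Poisson fit (values well above 1 suggest overdispersion and motivate the negative binomial model; the function and setup are illustrative, not the dataset's code):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

def overdispersion_statistic(y, X):
    """Pearson chi-square divided by residual degrees of freedom for a
    Poisson regression fit; ~1 if the Poisson variance assumption holds."""
    model = PoissonRegressor(alpha=0.0).fit(X, y)  # unpenalized Poisson GLM
    mu = model.predict(X)
    pearson = np.sum((y - mu) ** 2 / mu)
    dof = len(y) - X.shape[1] - 1                  # parameters incl. intercept
    return pearson / dof
```

For negative binomial counts with variance mu + mu²/k, the statistic exceeds 1 by roughly the average of mu/k, which is what the check detects.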
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and code to calculate Probability-Density-Ranking (PDR) outliers and Most Probable Range (MPR)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.
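As background for the statistics discussed above, classical (Torgerson) MDS and one simple per-point fit ratio can be sketched as follows; the per-point statistic here (recovered squared norm over the Gram-matrix diagonal) is an illustration, not necessarily the statistics proposed in the paper:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS from a distance matrix D: k-dim configuration,
    overall eigenvalue-based goodness-of-fit, and a per-point fit ratio."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1]             # eigenvalues descending
    vals, vecs = vals[idx], vecs[:, idx]
    pos = np.clip(vals, 0, None)             # negative eigenvalues -> 0
    X = vecs[:, :k] * np.sqrt(pos[:k])       # configuration coordinates
    gof = pos[:k].sum() / pos.sum()          # overall goodness-of-fit
    point_fit = (X ** 2).sum(axis=1) / np.diag(B)  # per-point recovered norm
    return X, gof, point_fit
```

Points with a low `point_fit` are poorly represented in the k-dimensional map even when the overall goodness-of-fit is high, which is the kind of observation-level filtering the paper advocates.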
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
A common quality indicator for monitoring and comparing hospitals is based on death within 30 days of admission. An important use is to determine whether a hospital has higher or lower mortality than other hospitals, so the ability to identify such outliers correctly is essential. Two approaches for detection are: 1) calculating the ratio of observed to expected number of deaths (OE) per hospital, and 2) including all hospitals in a logistic regression (LR) comparing each hospital to a form of average over all hospitals. The aim of this study was to compare OE and LR with respect to correctly identifying 30-day mortality outliers. Modifications of the methods, i.e., the variance-corrected OE (OE-Faris), the bias-corrected LR (LR-Firth), and trimmed-mean variants of LR and LR-Firth, were also studied.

Materials and methods
To study the properties of OE and LR and their variants, we performed a simulation study by generating patient data from hospitals with known outlier status (low mortality, high mortality, non-outlier). Data from simulated scenarios with varying numbers of hospitals, hospital volumes, and mortality outlier statuses were analysed by the different methods and compared by level of significance (ability to falsely claim an outlier) and power (ability to reveal an outlier). Moreover, administrative data for patients with acute myocardial infarction (AMI), stroke, and hip fracture from Norwegian hospitals for 2012–2014 were analysed.

Results
None of the methods achieved the nominal (test) level of significance for both low and high mortality outliers. For low mortality outliers, the levels of significance were increased four- to fivefold for OE and OE-Faris. For high mortality outliers, OE and OE-Faris, LR 25% trimmed, and LR-Firth 10% and 25% trimmed maintained approximately the nominal level.
The methods agreed with respect to outlier status for 94.1% of the AMI hospitals, 98.0% of the stroke hospitals, and 97.8% of the hip fracture hospitals.

Conclusion
We recommend, on balance, LR-Firth 10% or 25% trimmed for detection of both low and high mortality outliers.
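The OE approach compares each hospital's observed deaths with its expected number. A minimal sketch using an exact two-sided binomial check (illustrative only; not the study's OE-Faris variance correction, and real models use patient-level expected probabilities rather than a single common risk):

```python
from scipy.stats import binom

def oe_outlier(observed, expected_prob_per_patient, n_patients, alpha=0.05):
    """OE ratio plus an exact binomial outlier label, assuming a common
    expected death probability per patient (a simplifying assumption)."""
    expected = n_patients * expected_prob_per_patient
    oe = observed / expected
    p_high = binom.sf(observed - 1, n_patients, expected_prob_per_patient)
    p_low = binom.cdf(observed, n_patients, expected_prob_per_patient)
    if p_high < alpha / 2:
        return oe, "high"
    if p_low < alpha / 2:
        return oe, "low"
    return oe, "non-outlier"
```

The study's LR alternative instead fits one logistic regression across all hospitals with hospital indicators, which pools information but, as the abstract notes, needs trimming or Firth correction to hold its significance level.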