The following report outlines the workflow used to optimize your Find Outliers result.

Initial Data Assessment
There were 721 valid input features.
GRM properties: Min 0.0000, Max 157.0200, Mean 9.1692, Std. Dev. 8.4220.
There were 4 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis
The optimal fixed distance band was selected based on peak clustering found at 1894.5039 meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 248 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
There are 30 statistically significant high outlier features.
There are 7 statistically significant low outlier features.
There are 202 features that are part of statistically significant low clusters.
There are 9 features that are part of statistically significant high clusters.

Output
Pink output features are part of a cluster of high GRM values.
Light blue output features are part of a cluster of low GRM values.
Red output features represent high outliers within a cluster of low GRM values.
Blue output features represent low outliers within a cluster of high GRM values.
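The report applies an FDR correction for multiple testing. The tool's exact procedure is not given here; as an illustration only, the standard Benjamini-Hochberg step that such corrections resemble can be sketched as follows (not the tool's implementation):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of p-values significant under the BH FDR procedure."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)
    m = len(p)
    # BH rule: find the largest k with p_(k) <= (k/m) * alpha, reject all up to k
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[: k + 1]] = True
    return significant
```

Unlike a plain per-test threshold, the cutoff here adapts to the number of tests, which is why the report's significant-feature counts depend on the total feature count.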
The following report outlines the workflow used to optimize your Find Outliers result.

Initial Data Assessment
There were 137 valid input features.
There were 4 outlier locations; these will not be used to compute the polygon cell size.

Incident Aggregation
The polygon cell size was 49251.0000 meters.
The aggregation process resulted in 72 weighted areas.
Incident count properties: Min 1.0000, Max 21.0000, Mean 1.9028, Std. Dev. 2.4561.

Scale of Analysis
The optimal fixed distance band was selected based on peak clustering found at 94199.9365 meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 3 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
There are 2 statistically significant high outlier features.
There are 0 statistically significant low outlier features.
There are 0 features that are part of statistically significant low clusters.
There is 1 feature that is part of a statistically significant high cluster.

Output
Pink output features are part of a cluster of high values.
Light blue output features are part of a cluster of low values.
Red output features represent high outliers within a cluster of low values.
Blue output features represent low outliers within a cluster of high values.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "mnist-outlier"
📚 This dataset is an enriched version of the MNIST dataset. The workflow is described in the Medium article: Changes of Embeddings during Fine-Tuning of Transformers.
Explore the Dataset
The open source data curation tool Renumics Spotlight allows you to explore this dataset. You can find a Hugging Face Space running Spotlight with this dataset here: https://huggingface.co/spaces/renumics/mnist-outlier.
Or you can explore it locally:… See the full description on the dataset page: https://huggingface.co/datasets/renumics/mnist-outlier.
The following report outlines the workflow used to optimize your Find Outliers result.

Initial Data Assessment
There were 1684 valid input features.
POVERTY properties: Min 0.0000, Max 91.8000, Mean 18.9902, Std. Dev. 12.7152.
There were 22 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis
The optimal fixed distance band was based on the average distance to 30 nearest neighbors: 3709.0000 meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 1155 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
There are 68 statistically significant high outlier features.
There are 84 statistically significant low outlier features.
There are 557 features that are part of statistically significant low clusters.
There are 446 features that are part of statistically significant high clusters.

Output
Pink output features are part of a cluster of high POVERTY values.
Light blue output features are part of a cluster of low POVERTY values.
Red output features represent high outliers within a cluster of low POVERTY values.
Blue output features represent low outliers within a cluster of high POVERTY values.
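This report derives its fixed distance band from the average distance to each feature's 30 nearest neighbors. Assuming plain Euclidean coordinates, that quantity can be sketched with a brute-force numpy computation (an illustration, not the tool's implementation; suitable only for modest feature counts):

```python
import numpy as np

def mean_knn_distance(coords, k=30):
    """Average distance from each feature to its k nearest neighbors."""
    pts = np.asarray(coords, dtype=float)
    k = min(k, len(pts) - 1)           # cannot have more neighbors than points
    # Full pairwise Euclidean distance matrix
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude each point's self-distance
    d.sort(axis=1)                     # ascending per row
    return float(d[:, :k].mean())
```

For large feature sets a spatial index (e.g. a k-d tree) would replace the O(n²) distance matrix, but the quantity computed is the same.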
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies, PIIE Policy Brief 17-29. If you use the data, please cite as: Djankov, Simeon. (2017). United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies. PIIE Policy Brief 17-29. Peterson Institute for International Economics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outliers are often present in large datasets of water quality monitoring time series. A method that combines the sliding window technique with the Dixon detection criterion for the automatic detection of outliers in time series data is limited by the empirical determination of sliding window sizes, so determining the optimal sliding window size in a principled way is meaningful research. This paper presents a new Monte Carlo Search Method (MCSM), based on random sampling, to optimize the size of the sliding window, taking full advantage of computational and statistical resources. The MCSM was applied in a case study to automatic monitoring data of water quality factors in order to test its validity and usefulness. Comparing the accuracy and efficiency of the MCSM shows that the new method is scientific and effective: at different sample sizes, the average accuracy is between 58.70% and 75.75%, and the average increase in computation time is between 17.09% and 45.53%. In the era of big data in environmental monitoring, the proposed method can meet the required accuracy of outlier detection while improving computational efficiency.
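The paper's Monte Carlo window-size search is not reproduced here, but the underlying sliding-window Dixon test it optimizes can be sketched as follows (critical values from the commonly tabulated two-sided 95% Dixon table for n ≤ 10; a simplified illustration, not the authors' code):

```python
import numpy as np

# Two-sided Dixon Q critical values at alpha = 0.05 for n = 3..10;
# larger windows would need an extended table.
Q_CRIT = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568,
          8: 0.526, 9: 0.493, 10: 0.466}

def dixon_outlier(window):
    """Return the index (within the window) of a Dixon outlier, or None."""
    x = np.sort(np.asarray(window, dtype=float))
    n = len(x)
    rng = x[-1] - x[0]
    if n not in Q_CRIT or rng == 0:
        return None
    q_low = (x[1] - x[0]) / rng       # test statistic for the suspect minimum
    q_high = (x[-1] - x[-2]) / rng    # test statistic for the suspect maximum
    qc = Q_CRIT[n]
    if q_high >= q_low and q_high > qc:
        return int(np.argmax(window))
    if q_low > q_high and q_low > qc:
        return int(np.argmin(window))
    return None

def sliding_window_outliers(series, window_size):
    """Indices in `series` flagged as outliers in any sliding window."""
    flagged = set()
    for start in range(len(series) - window_size + 1):
        hit = dixon_outlier(series[start:start + window_size])
        if hit is not None:
            flagged.add(start + hit)
    return sorted(flagged)
```

The MCSM in the paper then searches over `window_size` by random sampling, scoring each candidate against labeled data; the sketch above only shows the detector being tuned.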
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effect sizes were calculated using mean difference for burnt-unburnt study designs and mean change for before-after designs. Outliers, as defined in the methods section of the paper, were excluded prior to calculating effect sizes.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset builds upon "Financial Statement Data Sets" by incorporating several key improvements to enhance the accuracy and usability of US-GAAP financial data from SEC filings of U.S. exchange-listed companies. Drawing on submissions from January 2009 onward, the enhanced dataset aims to provide analysts with a cleaner, more consistent dataset by addressing common challenges found in the original data.
The source code for data extraction is available here
OSU_SnowCourse Summary: Manual snow course observations were collected over WY 2012-2014 from four paired forest-open sites chosen to span a broad elevation range. Study sites were located in the upper McKenzie (McK) River watershed, approximately 100 km east of Corvallis, Oregon, on the western slope of the Cascade Range and in the Middle Fork Willamette (MFW) watershed, located to the south of the McKenzie. The sites were designated based on elevation, with a range of 1110-1480 m. Distributed snow depth and snow water equivalent (SWE) observations were collected via monthly manual snow courses from 1 November through 1 April and bi-weekly thereafter. Snow courses spanned 500 m of forested terrain and 500 m of adjacent open terrain. Snow depth observations were collected approximately every 10 m and SWE was measured every 100 m along the snow courses with a federal snow sampler. These data are raw observations and have not been quality controlled in any way. Distance along the transect was estimated in the field. OSU_SnowDepth Summary: 10-minute snow depth observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meteorological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These data have undergone basic quality control. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes.
We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN flags for missing data to NA, and added site attributes such as site name and cover. First, because snow depth values in the raw data are negative (the sensor reports depth as a negative offset, with some correction to use the height of the sensor as zero), positive values are physically implausible and were replaced with NA. Second, the sign of the data was switched to make the depths positive. Then, the smooth.m (MATLAB) function was used to roughly smooth the data, with a moving window of 50 points. Third, outliers were removed: all values higher than the smoothed values + 10 were replaced with NA, and in some cases further single-point outliers were removed. OSU_Met Summary: Raw, 10-minute meteorological observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and generally the locations were paired, with a meteorological station deployed in the forest and in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These stations were deployed to collect numerous meteorological variables, of which snow depth and wind speed are included here. These data are raw datalogger output and have not been quality controlled in any way. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes.
We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN and 7999 flags for missing data to NA, and added site attributes such as site name and cover. OSU_Location Summary: Location Metadata for manual snow course observations and meteorological sensors. These data are compiled from GPS data for which the horizontal accuracy is unknown, and from processed hemispherical photographs. They have not been quality controlled in any way.
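The snow-depth cleaning steps described for OSU_SnowDepth (flip the sign, smooth with a 50-point moving window, mask values above smoothed + 10) can be sketched in Python, using a pandas rolling mean as a rough analogue of MATLAB's smooth.m (an illustration, not the original processing code):

```python
import numpy as np
import pandas as pd

def clean_snow_depth(raw, window=50, threshold=10.0):
    """Sketch of the described cleaning: flip sign, smooth with a centered
    moving window, and mask spikes exceeding smoothed + threshold."""
    depth = -pd.Series(raw, dtype=float)  # raw sensor values are negative
    smoothed = depth.rolling(window, center=True, min_periods=1).mean()
    # Values far above the smoothed curve become NaN (the NA flag)
    cleaned = depth.where(depth <= smoothed + threshold)
    return cleaned
```

The original workflow also removed some single-point outliers by hand, which no automated rule here reproduces.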
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater sample collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits of the concentration relations, which could substantially affect model predictions. Identified outliers were also excluded from loads that were calculated using the Beale ratio estimator. Notes on reason(s) for considering a concentration as an outlier are included.
All the raw data are obtained from other publications as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.
Dataset   Species   Genes   Download
Plants    104       852     DOI 10.1186/2047-217X-3-17
Mammals   37        424     DOI 10.13012/C5BG2KWG
Insects   144       1478    http://esayyari.github.io/InsectsData
Cannon    78        213     DOI 10.5061/dryad.493b7
Rouse     26        393     DOI 10.5061/dryad.79dq1
Frogs     164       95      DOI 10.5061/dryad.12546.2
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.
This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:
ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)
ESM_2.py – Python script to calculate Z-scores from raw financial ratios
ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios
ESM_4.py – Python script for generating the correlation heatmap of the Z-scores
ESM_5.xlsx – Mahalanobis distance values for each firm
ESM_6.py – Python script to compute Mahalanobis distances
ESM_7.py – Python script to visualize Mahalanobis distances
ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)
ESM_9.py – Python script to compute mean Z-scores
ESM_10.xlsx – Re-standardized Z-scores based on firm-level means
ESM_11.py – Python script to re-standardize mean Z-scores
ESM_12.py – Python script to generate the hierarchical clustering dendrogram
All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Climate change has had a significant impact on the seasonal transition dates of Arctic tundra ecosystems, causing diverse variations between distinct land surface classes. However, the combined effect of multiple controls, as well as their individual effects on these dates, remains unclear at various scales and across diverse land surface classes. Here we quantified spatiotemporal variations of three seasonal transition dates (start of spring, maximum Normalized Difference Vegetation Index (NDVImax) day, end of fall) for five dominant land surface classes in ice-free Greenland and analyzed their drivers for current and future climate scenarios, respectively.

Methods
To quantify the seasonal transition dates, we used NDVI derived from Sentinel-2 MultiSpectral Instrument (Level-1C) images during 2016–2020 based on Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2). We performed an atmospheric correction (Yin et al., 2019) on the images before calculating NDVI. The months from May to October were set as the study period each year. The quality control process includes 3 steps: (i) clouds were masked according to the QA60 band; (ii) images were removed if the number of pixels with NDVI values outside the range of -1–1 exceeded 30% of the total pixels while extracting the median value of each date; (iii) NDVI outliers resulting from cloud mask errors (Coluzzi et al., 2018) and sporadic snow were deleted pixel by pixel. The NDVI outliers mentioned here appear as a sudden drop to almost zero in the growing season and do not form a sequence in this study (Komisarenko et al., 2022). To identify outliers, we iterated through every two consecutive NDVI values in the time series and calculated the difference between the second and first values for each pixel every year.
We defined anomalous NDVI differences as points outside the [10, 90] percentile threshold; if the NDVI difference is positive, the first NDVI value used to calculate the difference is the outlier, otherwise the second one is. Finally, 215 images were used to reflect seasonal transition dates in all 5 study periods of 2016–2020 after quality control. Each image was resampled to 32 m spatial resolution to match the resolution of the ArcticDEM data and SnowModel outputs. To detect seasonal transition dates, we used a double sigmoid model to fit the NDVI changes on the time series; the points where the curvature changes most rapidly on the fitted curve appear at the beginning, middle, and end of each season (Klosterman et al., 2014). The applicability of this phenology method in the Arctic has been demonstrated (Ma et al., 2022; Westergaard-Nielsen et al., 2013; Westergaard-Nielsen et al., 2017). We focused on 3 seasonal transition dates, i.e., SOS (start of spring), the NDVImax day, and EOF (end of fall). The NDVI values for some pixels are still below zero in spring and summer due to topographical shadow. We therefore set a quality control rule before calculating seasonal transition dates for each pixel: if the number of days with positive NDVI values from June to September is less than 60% of the total number of observed days, the pixel is not considered for subsequent calculations. As verification of the fitted dates, the seasonal transition dates in dry heaths and corresponding time-lapse photos acquired from the snow fence area are shown in Fig. 2. Snow cover extent is greatly reduced and vegetation is exposed with lower NDVI values on the SOS. All visible vegetation is green on the NDVImax day. On EOF, snow cover is partial and NDVI decreases to a value close to zero.
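The consecutive-difference outlier rule described above can be sketched in Python (percentile band and pair logic as stated; in the actual workflow the rule is applied per pixel and per year, so the function below is only an illustration):

```python
import numpy as np

def ndvi_outlier_indices(ndvi, lo=10, hi=90):
    """Flag indices of an NDVI time series using the stated rule:
    consecutive differences outside the [lo, hi] percentile band mark
    one member of the pair (the first if the difference is positive,
    otherwise the second)."""
    x = np.asarray(ndvi, dtype=float)
    diffs = np.diff(x)                       # second minus first, pairwise
    p_lo, p_hi = np.percentile(diffs, [lo, hi])
    flagged = set()
    for i, d in enumerate(diffs):
        if d < p_lo or d > p_hi:
            flagged.add(i if d > 0 else i + 1)
    return sorted(flagged)
```

A sudden cloud- or snow-induced drop produces a large negative difference (flagging the low value) followed by a large positive one (flagging the same low value again), so both sides of a spike point at the same index.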
DVF data statistics, available on explore.data.gouv.fr/immobilier. The files contain the number of sales and the average and median prices per m2.
- Total DVF statistics: statistics by geographical scale over the 10 available semesters.
- Monthly DVF statistics: statistics by geographical scale and by month.
## Description of processing
The code generates statistics from the land value request (DVF) data, aggregated at different scales, and their evolution over time (monthly). The following indicators were calculated monthly and over the entire available period (10 semesters):
* number of mutations (property transfers)
* average price per m2
* median price per m2
* breakdown of sale prices into brackets
for each property type:
* houses
* apartments
* houses + apartments
* commercial premises
and for each scale:
* nation
* department
* EPCI
* municipality
* cadastral section
The source data contain the following types of mutations: sale, sale before completion, sale of building land, tendering, expropriation, and exchange. We chose to keep only sales, sales before completion, and auctions for the statistics*. In addition, for simplicity, we kept only mutations that concern a single property (excluding dependencies)*. Our reasoning is as follows: 1. for a transfer that includes properties of several types (e.g. a house + a commercial premises), it is not possible to reconstruct the share of the land value allocated to each of the properties included; 2. for a transfer that includes several properties of the same type (e.g. X apartments), the total value of the transfer is not necessarily equal to X times the value of one apartment, especially when the properties are very different (area, work to be carried out, floor, etc.).
We had initially kept these properties, calculating the price per m2 of the mutation by treating the mutation's properties as a single property whose area is the sum of their surfaces, but this method, which ultimately concerned only a marginal quantity of properties, did not convince us for the final version. The price per m2 is calculated by dividing the land value of the mutation by the building surface area of the property concerned. We finally exclude mutations for which the price per m2 could not be calculated, as well as those whose price per m2 exceeds €100k (an arbitrary choice)*. We have not applied any other outlier restrictions, in order to stay faithful to the original data and to surface potential anomalies. Displaying the median on the site reduces the impact of outliers on the colour scales. *: The filters mentioned are applied for the calculation of statistics, but all mutations in the source files are displayed on the application at the parcel level.
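The price-per-m2 statistics described above can be sketched with pandas. The column names below (`valeur_fonciere`, `surface_reelle_bati`) follow the published DVF schema but should be treated as assumptions here, and the function is an illustration rather than the project's actual code:

```python
import pandas as pd

def price_stats(df, group_col):
    """Per-group sale count, mean, and median price per m2, applying the
    described filters (drop missing and > EUR 100k/m2)."""
    d = df.copy()
    d["prix_m2"] = d["valeur_fonciere"] / d["surface_reelle_bati"]
    d = d[d["prix_m2"].notna() & (d["prix_m2"] <= 100_000)]
    return d.groupby(group_col)["prix_m2"].agg(
        nb_mutations="count", prix_m2_moyen="mean", prix_m2_median="median")
```

Grouping by `commune`, `departement`, or a cadastral-section column then yields the per-scale tables; the median column is the one surfaced on the map, since it is robust to the remaining outliers.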
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LOF calculation time (seconds) comparison.
To perform accurate engineering predictions, this paper develops a method that combines Gaussian process regression (GPR) with possibilistic fuzzy c-means clustering (PFCM): GPR is used for the regression itself, and the corresponding prediction errors are used to determine the memberships of the training samples. On the basis of its memberships and the prediction errors of the clusters, the typicality of each training sample is computed and used to decide whether it is an outlier. In practice, the identified outliers are eliminated and the predictive model is built with the remaining training samples. Besides the model-construction method, the influence of key parameters on model accuracy is also investigated using two numerical problems. The results indicate that, compared with standard outlier detection approaches and plain Gaussian process regression, the proposed approach identifies outliers more precisely and generates more accurate predictions. To further assess its feasibility in actual engineering applications, a predictive model was developed to predict the inlet pressure of a nuclear control valve from its in-situ data. The findings show that the proposed approach outperforms Gaussian process regression: it reduces the detrimental impact of outliers and produces a more precise prediction model.
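The full GPR + PFCM typicality computation is beyond a short example, but its first ingredient, flagging training samples with large GPR prediction errors, can be sketched with scikit-learn (a simplified residual z-score stand-in, not the paper's method; the kernel and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gpr_outlier_mask(X, y, z_thresh=3.0, noise=1.0):
    """Flag training samples whose GPR residual exceeds z_thresh standard
    deviations of the residual distribution (no PFCM typicality step)."""
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                   alpha=noise, normalize_y=True)
    gpr.fit(X, y)
    resid = y - gpr.predict(X)
    z = np.abs(resid - resid.mean()) / resid.std()
    return z > z_thresh
```

The nonzero `alpha` (noise) keeps the GP from interpolating the outliers exactly, so corrupted samples retain large residuals; the paper replaces this global z-score with cluster-wise memberships and typicalities.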
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
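The dataset's scripts are in R; as an illustration of the overdispersion check they perform, here is a Python analogue computing the Pearson chi-square dispersion statistic from a Poisson fit (values well above 1 suggest overdispersion and motivate the negative binomial model; the function and setup are illustrative, not the dataset's code):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

def overdispersion_statistic(y, X):
    """Pearson chi-square divided by residual degrees of freedom for a
    Poisson regression fit; ~1 if the Poisson variance assumption holds."""
    model = PoissonRegressor(alpha=0.0).fit(X, y)  # unpenalized Poisson GLM
    mu = model.predict(X)
    pearson = np.sum((y - mu) ** 2 / mu)
    dof = len(y) - X.shape[1] - 1                  # parameters incl. intercept
    return pearson / dof
```

For negative binomial counts with variance mu + mu²/k, the statistic exceeds 1 by roughly the average of mu/k, which is what the check detects.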
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and code to calculate Probability-Density-Ranking (PDR) outliers and Most Probable Range (MPR)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.
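As background for the statistics discussed above, classical (Torgerson) MDS and one simple per-point fit ratio can be sketched as follows; the per-point statistic here (recovered squared norm over the Gram-matrix diagonal) is an illustration, not necessarily the statistics proposed in the paper:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS from a distance matrix D: k-dim configuration,
    overall eigenvalue-based goodness-of-fit, and a per-point fit ratio."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1]             # eigenvalues descending
    vals, vecs = vals[idx], vecs[:, idx]
    pos = np.clip(vals, 0, None)             # negative eigenvalues -> 0
    X = vecs[:, :k] * np.sqrt(pos[:k])       # configuration coordinates
    gof = pos[:k].sum() / pos.sum()          # overall goodness-of-fit
    point_fit = (X ** 2).sum(axis=1) / np.diag(B)  # per-point recovered norm
    return X, gof, point_fit
```

Points with a low `point_fit` are poorly represented in the k-dimensional map even when the overall goodness-of-fit is high, which is the kind of observation-level filtering the paper advocates.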
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
A common quality indicator for monitoring and comparing hospitals is based on death within 30 days of admission. An important use is to determine whether a hospital has higher or lower mortality than other hospitals, so the ability to identify such outliers correctly is essential. Two approaches for detection are: 1) calculating the ratio of observed to expected number of deaths (OE) per hospital, and 2) including all hospitals in a logistic regression (LR) comparing each hospital to a form of average over all hospitals. The aim of this study was to compare OE and LR with respect to correctly identifying 30-day mortality outliers. Modifications of the methods, i.e., the variance-corrected OE (OE-Faris), the bias-corrected LR (LR-Firth), and trimmed-mean variants of LR and LR-Firth, were also studied.

Materials and methods
To study the properties of OE and LR and their variants, we performed a simulation study by generating patient data from hospitals with known outlier status (low mortality, high mortality, non-outlier). Data from simulated scenarios with varying numbers of hospitals, hospital volumes, and mortality outlier statuses were analysed by the different methods and compared by level of significance (ability to falsely claim an outlier) and power (ability to reveal an outlier). Moreover, administrative data for patients with acute myocardial infarction (AMI), stroke, and hip fracture from Norwegian hospitals for 2012–2014 were analysed.

Results
None of the methods achieved the nominal (test) level of significance for both low and high mortality outliers. For low mortality outliers, the levels of significance were increased four- to fivefold for OE and OE-Faris. For high mortality outliers, OE and OE-Faris, LR 25% trimmed, and LR-Firth 10% and 25% trimmed maintained approximately the nominal level.
The methods agreed with respect to outlier status for 94.1% of the AMI hospitals, 98.0% of the stroke hospitals, and 97.8% of the hip fracture hospitals.

Conclusion
We recommend, on balance, LR-Firth 10% or 25% trimmed for detection of both low and high mortality outliers.
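The OE approach compares each hospital's observed deaths with its expected number. A minimal sketch using an exact two-sided binomial check (illustrative only; not the study's OE-Faris variance correction, and real models use patient-level expected probabilities rather than a single common risk):

```python
from scipy.stats import binom

def oe_outlier(observed, expected_prob_per_patient, n_patients, alpha=0.05):
    """OE ratio plus an exact binomial outlier label, assuming a common
    expected death probability per patient (a simplifying assumption)."""
    expected = n_patients * expected_prob_per_patient
    oe = observed / expected
    p_high = binom.sf(observed - 1, n_patients, expected_prob_per_patient)
    p_low = binom.cdf(observed, n_patients, expected_prob_per_patient)
    if p_high < alpha / 2:
        return oe, "high"
    if p_low < alpha / 2:
        return oe, "low"
    return oe, "non-outlier"
```

The study's LR alternative instead fits one logistic regression across all hospitals with hospital indicators, which pools information but, as the abstract notes, needs trimming or Firth correction to hold its significance level.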