32 datasets found

Outlier Boundary SImulation across ML Data Cleaning Techniques
dataverse.harvard.edu
Updated Apr 11, 2025
Cite
Jie Li (2025). Outlier Boundary SImulation across ML Data Cleaning Techniques [Dataset]. http://doi.org/10.7910/DVN/GB3EFB
Unique identifier
https://doi.org/10.7910/DVN/GB3EFB
Dataset updated
Apr 11, 2025
Dataset provided by
Harvard Dataverse
Authors
Jie Li
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is a demonstration of the outlier boundary set up across different ML data cleaning techniques.
Data from: Boundary peeling: An outlier detection method
tandf.figshare.com
pdf
Updated Apr 11, 2025
Cite
Sheikh Arafat; Na Sun; Maria L. Weese; Waldyn G. Martinez (2025). Boundary peeling: An outlier detection method [Dataset]. http://doi.org/10.6084/m9.figshare.28776694.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.28776694.v1
Dataset updated
Apr 11, 2025
Dataset provided by
Taylor & Francis
Authors
Sheikh Arafat; Na Sun; Maria L. Weese; Waldyn G. Martinez
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Unsupervised outlier detection constitutes a crucial phase within data analysis and remains an open area of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce Boundary Peeling, an unsupervised outlier detection algorithm. Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines to flag outliers. The method is similar to convex hull peeling but well suited for high-dimensional data and has flexibility to adapt to different distributions. Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In unimodal and multimodal synthetic data simulations Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers. Boundary Peeling performs competitively or better in terms of correct classification, AUC, and processing time using semantically meaningful benchmark datasets.
Data from: Methodology to filter out outliers in high spatial density data...
scielo.figshare.com
jpeg
Updated Jun 4, 2023
Cite
Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
Available download formats: jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.14305658.v1
Dataset updated
Jun 4, 2023
Dataset provided by
SciELO journals
Authors
Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point into the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI respectively, compared to interpolation errors of raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
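The abstract above describes classifying each spatial point against the median of its neighbors within a radius. A minimal sketch of that local step follows. It is not the authors' filter (which also has global and anisotropic stages): the function name, the fractional `tolerance` rule, and the sample coordinates and values are all assumptions made for illustration.

```python
from math import hypot
from statistics import median


def local_median_filter(points, radius, tolerance):
    """Keep points whose value lies within a fractional `tolerance`
    of the median value of their neighbours inside `radius`.

    points: list of (x, y, value) tuples (hypothetical layout).
    """
    kept = []
    for i, (x, y, value) in enumerate(points):
        # Values of all other points no farther than `radius` away.
        neighbours = [
            v
            for j, (nx, ny, v) in enumerate(points)
            if j != i and hypot(nx - x, ny - y) <= radius
        ]
        if not neighbours:
            # Nothing to compare against, so the point is kept.
            kept.append((x, y, value))
            continue
        local_median = median(neighbours)
        if abs(value - local_median) <= tolerance * abs(local_median):
            kept.append((x, y, value))
    return kept


# Hypothetical yield readings on a small grid; one reading is a spike.
yield_points = [
    (0.0, 0.0, 10.0),
    (1.0, 0.0, 10.5),
    (0.0, 1.0, 9.8),
    (1.0, 1.0, 30.0),
    (2.0, 1.0, 10.2),
    (2.0, 0.0, 10.1),
]
filtered = local_median_filter(yield_points, radius=1.5, tolerance=0.25)
```

Using the median rather than the mean keeps a single spike from shifting the reference value its neighbours are compared against, which is presumably why the abstract names the median as the main statistical parameter.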
Data from: Outlier classification using autoencoders: application for...
osti.gov
Updated Jun 2, 2021
Cite
Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B. (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1882649-outlier-classification-using-autoencoders-application-fluctuation-driven-flows-fusion-plasmas
Dataset updated
Jun 2, 2021
Dataset provided by
Office of Science http://www.er.doe.gov/
United States Department of Energy http://energy.gov/
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
Authors
Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B.
Description
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Data from: Matching Map Recovery with an Unknown Number of Outliers
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Matching Map Recovery with an Unknown Number of Outliers [Dataset]. https://service.tib.eu/ldmservice/dataset/matching-map-recovery-with-an-unknown-number-of-outliers
Dataset updated
Dec 16, 2024
Description
The dataset used in the paper is a set of feature-vectors from two sets of d-dimensional noisy feature-vectors.
Data from: Privacy Preserving Outlier Detection through Random Nonlinear...
catalog.data.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1 more
Updated Apr 10, 2025
Cite
Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
Data from: Subtle limits to connectivity revealed by outlier loci within two...
data.niaid.nih.gov
dataone.org
+1 more
zip
Updated Feb 28, 2022
Cite
Adrien Tran Lu Y; Stephanie Ruault; Claire Daguin-Thiébaut; Jade Castel; Nicolas Bierne; Thomas Broquet; Patrick Wincker; Aude Perdereau; Sophie Arnaud-Haond; Pierre-Alexandre Gagnaire; Didier Jollivet; Stephane Hourdez; François Bonhomme (2022). Subtle limits to connectivity revealed by outlier loci within two divergent metapopulations of the deep-sea hydrothermal gastropod Ifremeria nautilei [Dataset]. http://doi.org/10.5061/dryad.ffbg79cwq
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.ffbg79cwq
Dataset updated
Feb 28, 2022
Dataset provided by
Ifremer
Sorbonne Université
Genoscope
Institute of Evolutionary Science of Montpellier
Authors
Adrien Tran Lu Y; Stephanie Ruault; Claire Daguin-Thiébaut; Jade Castel; Nicolas Bierne; Thomas Broquet; Patrick Wincker; Aude Perdereau; Sophie Arnaud-Haond; Pierre-Alexandre Gagnaire; Didier Jollivet; Stephane Hourdez; François Bonhomme
License
https://spdx.org/licenses/CC0-1.0.html
Description
Hydrothermal vents form archipelagos of ephemeral deep-sea habitats that raise interesting questions about the evolution and dynamics of the associated endemic fauna, constantly subject to extinction-recolonization processes. These metal-rich environments are coveted for the mineral resources they harbor, thus raising recent conservation concerns. The evolutionary fate and demographic resilience of hydrothermal species strongly depend on the degree of connectivity among and within their fragmented metapopulations. In the deep sea, however, assessing connectivity is difficult and usually requires indirect genetic approaches. Improved detection of fine-scale genetic connectivity is now possible based on genome-wide screening for genetic differentiation. Here, we explored population connectivity in the hydrothermal vent snail Ifremeria nautilei across its species range encompassing five distinct back-arc basins in the Southwest Pacific. The global analysis, based on 10 570 single nucleotide polymorphism (SNP) markers derived from double digest restriction-site associated DNA sequencing (ddRAD-seq), depicted two semi-isolated and homogeneous genetic clusters. Demo-genetic modeling suggests that these two groups began to diverge about 70 000 generations ago, but continue to exhibit weak and slightly asymmetrical gene flow. Furthermore, a careful analysis of outlier loci showed subtle limitations to connectivity between neighboring basins within both groups. This finding indicates that migration is not strong enough to totally counterbalance drift or local selection, hence questioning the potential for demographic resilience at this latter geographical scale. These results illustrate the potential of large genomic datasets to understand fine-scale connectivity patterns in hydrothermal vents and the deep sea. 
Methods: VCF datasets were generated “de novo” with Stacks V.2.52 from reads produced by the protocols used and provided in the manuscript. Sample-associated metadata were collected during field sampling.
Privacy Preserving Outlier Detection through Random Nonlinear Data...
data.nasa.gov
Updated Mar 31, 2025
Cite
nasa.gov (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
Dataset updated
Mar 31, 2025
Dataset provided by
NASA http://nasa.gov/
Description
Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic...
catalog.data.gov
data.usgs.gov
Updated Nov 23, 2024
Cite
U.S. Geological Survey (2024). Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic Unit Codes, 2008-2023 [Dataset]. https://catalog.data.gov/dataset/monthly-openet-image-collections-v2-0-summarized-by-12-digit-hydrologic-unit-codes-2008-20
Dataset updated
Nov 23, 2024
Dataset provided by
U.S. Geological Survey
Description
This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022).

Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset.

For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include:
- huc12: The 12-digit hydrologic unit code
- ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month calculated as the sum of daily ET interpolated between Landsat overpasses
- statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month
- year: 4-digit year
- month: 2-digit month
- count: Number of Landsat overpasses included in the ET calculation in the month
- et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12
- count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent

For the Ensemble data, these additional variables are included in the CSV files:
- et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD)
- et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD
- et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD
- et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD
- et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal

Below are the locations of each OpenET image collection used in this summary:
- DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0
- eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0
- geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0
- PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0
- SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0
- SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0
- Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
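The ensemble variables described above (et_mad, et_mad_count, et_mad_min, et_mad_max, et_sam) can be illustrated with a short sketch of median-absolute-deviation filtering. This is an assumption-laden illustration, not OpenET's implementation: the function name, the `threshold` multiplier of 2.0, and the six sample values are hypothetical, and the description does not state the cutoff actually used.

```python
from statistics import mean, median


def mad_filtered_ensemble(values, threshold=2.0):
    """Summarise model values after dropping MAD-based outliers.

    `threshold` (number of MADs from the median) is an assumed cutoff.
    """
    center = median(values)
    # Median absolute deviation of the values around their median.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No measurable spread by this statistic: keep every value.
        kept = list(values)
    else:
        kept = [v for v in values if abs(v - center) <= threshold * mad]
    return {
        "et_mad": mean(kept),
        "et_mad_count": len(kept),
        "et_mad_min": min(kept),
        "et_mad_max": max(kept),
        "et_sam": mean(values),  # simple mean, no outlier removal
    }


# Hypothetical monthly ET values (mm) from six models for one HUC12.
model_et = [80.0, 82.0, 85.0, 83.0, 81.0, 140.0]
summary = mad_filtered_ensemble(model_et)
```

Comparing `summary["et_mad"]` with `summary["et_sam"]` shows how far a single divergent model can pull the unfiltered mean.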
Data from: RADseq analyses reveal concordant Indian Ocean biogeographic and...
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
+2 more
zip
Updated May 6, 2019
Cite
Eva M. Salas; Giacomo Bernardi; Michael L. Berumen; Michelle Gaither; Luiz A. Rocha (2019). RADseq analyses reveal concordant Indian Ocean biogeographic and phylogeographic boundaries in the reef fish Dascyllus trimaculatus [Dataset]. http://doi.org/10.5061/dryad.bn457rr
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.bn457rr
Dataset updated
May 6, 2019
Dataset provided by
California Academy of Sciences
University of Central Florida
King Abdullah University of Science and Technology
University of California, Santa Cruz
Authors
Eva M. Salas; Giacomo Bernardi; Michael L. Berumen; Michelle Gaither; Luiz A. Rocha
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Red Sea, Indian Ocean
Description
Population genetic analysis is an important tool for estimating the degree of evolutionary connectivity in marine organisms. Here, we investigate the population structure of the three-spot damselfish Dascyllus trimaculatus in the Red Sea, Arabian Sea and Western Indian Ocean, using 1,174 single nucleotide polymorphisms (SNPs). Neutral loci revealed a signature of weak genetic differentiation between the Northwestern (Red Sea and Arabian Sea) and Western Indian Ocean biogeographic provinces. Loci potentially under selection (outlier loci) revealed a similar pattern but with a much stronger signal of genetic structure between regions. The Oman population appears to be genetically distinct from all other populations included in the analysis. While we could not clearly identify the mechanisms driving these patterns (isolation, adaptation, or both), the datasets indicate that population level divergences are largely concordant with biogeographic boundaries based on species composition. Our data can be used along with genetic connectivity of other species to identify the common genetic breaks that need to be considered for the conservation of biodiversity and evolutionary processes in the poorly studied Western Indian Ocean region.
Data from: ClinePlotR: Visualizing genomic clines and detecting outliers in...
search.dataone.org
datadryad.org
+1 more
Updated Apr 23, 2025
Cite
Bradley T. Martin; Tyler K. Chafin; Marlis R. Douglas; Michael E. Douglas (2025). ClinePlotR: Visualizing genomic clines and detecting outliers in R [Dataset]. http://doi.org/10.5061/dryad.b2rbnzsc8
Unique identifier
https://doi.org/10.5061/dryad.b2rbnzsc8
Dataset updated
Apr 23, 2025
Dataset provided by
Dryad Digital Repository
Authors
Bradley T. Martin; Tyler K. Chafin; Marlis R. Douglas; Michael E. Douglas
Time period covered
Jan 1, 2020
Description
Patterns of multi-locus differentiation (i.e., genomic clines) often extend broadly across hybrid zones and their quantification can help diagnose how species boundaries are shaped by adaptive processes, both intrinsic and extrinsic. In this sense, the transitioning of loci across admixed individuals can be contrasted as a function of the genome-wide trend, in turn allowing an expansion of clinal theory across a much wider array of biodiversity. However, computational tools that serve to interpret and consequently visualize ‘genomic clines’ are limited.

Here, we introduce the ClinePlotR R-package for visualizing genomic clines and detecting outlier loci using output generated by two popular software packages, bgc and Introgress.

ClinePlotR bundles both input generation (i.e., filtering datasets and creating specialized file formats) and output processing (e.g., MCMC thinning and burn-in) with functions that directly facilitate interpretation and hypothesis testing. Tools are also p...
Data from: Outlier SNP markers reveal fine-scale genetic structuring across...
zenodo.org
datasetcatalog.nlm.nih.gov
+3 more
Updated Jun 1, 2022
Cite
Ilaria Milano; Massimiliano Babbucci; Alessia Cariani; Miroslava Atanassova; Dorte Bekkevold; Gary R. Carvalho; Montserrat Espiñeira; Fabio Fiorentino; Germana Garofalo; Audrey J. Geffen; Einar E. Nielsen; Rob Ogden; Tomaso Patarnello; Marco Stagioni; Fausto Tinti; Luca Bargelloni (2022). Data from: Outlier SNP markers reveal fine-scale genetic structuring across European hake populations (Merluccius merluccius) [Dataset]. http://doi.org/10.5061/dryad.7bn22
Unique identifier
https://doi.org/10.5061/dryad.7bn22
Dataset updated
Jun 1, 2022
Dataset provided by
Zenodo http://zenodo.org/
Authors
Ilaria Milano; Massimiliano Babbucci; Alessia Cariani; Miroslava Atanassova; Dorte Bekkevold; Gary R. Carvalho; Montserrat Espiñeira; Fabio Fiorentino; Germana Garofalo; Audrey J. Geffen; Einar E. Nielsen; Rob Ogden; Tomaso Patarnello; Marco Stagioni; Fausto Tinti; Luca Bargelloni
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Shallow population structure is generally reported for most marine fish and explained as a consequence of high dispersal, connectivity and large population size. Targeted gene analyses and more recently genome-wide studies have challenged such view, suggesting that adaptive divergence might occur even when neutral markers provide genetic homogeneity across populations. Here, 381 SNPs located in transcribed regions were used to assess large- and fine-scale population structure in the European hake (Merluccius merluccius), a widely distributed demersal species of high priority for the European fishery. Analysis of 850 individuals from 19 locations across the entire distribution range showed evidence for several outlier loci, with significantly higher resolving power. While 299 putatively neutral SNPs confirmed the genetic break between basins (FCT = 0.016) and weak differentiation within basins, outlier loci revealed a dramatic divergence between Atlantic and Mediterranean populations (FCT range 0.275–0.705) and fine-scale significant population structure. Outlier loci separated North Sea and Northern Portugal populations from all other Atlantic samples and revealed a strong differentiation among Western, Central and Eastern Mediterranean geographical samples. Significant correlation of allele frequencies at outlier loci with seawater surface temperature and salinity supported the hypothesis that populations might be adapted to local conditions. Such evidence highlights the importance of integrating information from neutral and adaptive evolutionary patterns towards a better assessment of genetic diversity. Accordingly, the generated outlier SNP data could be used for tackling illegal practices in hake fishing and commercialization as well as to develop explicit spatial models for defining management units and stock boundaries.
Field, remote sensing, and modeling data used for Collins et al., Rockfall...
data.usgs.gov
s.cnmilf.com
+1 more
Updated Sep 3, 2024
Cite
Skye Corbett; Brian Collins; Elizabeth Horton (2024). Field, remote sensing, and modeling data used for Collins et al., Rockfall Kinematics from Massive Rock Cliffs: Outlier Boulders and Flyrock Resulting from the 2020 Whitney Portal, California Rockfalls [Dataset]. http://doi.org/10.5066/P93TJUXH
Unique identifier
https://doi.org/10.5066/P93TJUXH
Dataset updated
Sep 3, 2024
Dataset provided by
United States Geological Survey http://www.usgs.gov/
Authors
Skye Corbett; Brian Collins; Elizabeth Horton
License
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Time period covered
Jul 6, 2020 - Nov 14, 2020
Area covered
Whitney Portal, California
Description
This data release includes information used to support the manuscript "Rockfall kinematics from massive rock cliffs: outlier boulders and flyrock from Whitney Portal, California rockfalls". The included datasets and supplement include data that was collected and processed to investigate the kinematics of boulder trajectories and impacts to both other boulders and to existing trees on the talus slope beneath the source area cliffs. This data release includes four folders and one .csv file: 1) GIS Data – shapefile (.shp) of runout zone boundary, 2) RockyFor3d Model Data - .asc and .csv files necessary as input for RockyFor3d model, 3) Terrestrial Lidar- .txt file containing the XYZRGB point cloud collected post rockfall on July 6, 2020, 4) UAV Data- photos taken from UAV flight (.dng and .jpg), GPS data (.csv), processing report of the model (.pdf), and the Structure from Motion (SFM) point cloud (.txt), and (5) .csv file of the outlier boulder locations.
Data Sheet 1_Outliers and anomalies in training and testing datasets for...
figshare.com
pdf
Updated Jul 15, 2025
Cite
Yuriy Vasilev; Anastasia Pamova; Tatiana Bobrovskaya; Anton Vladzimirskyy; Olga Omelyanskaya; Elena Astapenko; Artem Kruchinkin; Novik Vladimir; Kirill Arzamasov (2025). Data Sheet 1_Outliers and anomalies in training and testing datasets for AI-powered morphometry—evidence from CT scans of the spleen.pdf [Dataset]. http://doi.org/10.3389/frai.2025.1607348.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/frai.2025.1607348.s001
Dataset updated
Jul 15, 2025
Dataset provided by
Frontiers
Authors
Yuriy Vasilev; Anastasia Pamova; Tatiana Bobrovskaya; Anton Vladzimirskyy; Olga Omelyanskaya; Elena Astapenko; Artem Kruchinkin; Novik Vladimir; Kirill Arzamasov
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Creating training and testing datasets for machine learning algorithms to measure linear dimensions of organs is a tedious task. There are no universally accepted methods for evaluating outliers or anomalies in such datasets. This can cause errors in machine learning and compromise the quality of end products. The goal of this study is to identify optimal methods for detecting organ anomalies and outliers in medical datasets designed to train and test neural networks in morphometrics.
Methods: A dataset was created containing linear measurements of the spleen obtained from CT scans. Labelling was performed by three radiologists. The total number of studies included in the sample was N = 197 patients. Outliers were assessed using visual methods (1.5 interquartile range; heat map; boxplot; histogram; scatter plot), machine learning algorithms (Isolation forest; Density-Based Spatial Clustering of Applications with Noise; K-nearest neighbors algorithm; Local outlier factor; One-class support vector machines; EllipticEnvelope; Autoencoders), and mathematical statistics (z-score; Grubbs' test; Rosner's test).
Results: We identified measurement errors, input errors, abnormal size values and non-standard shapes of the organ (sickle-shaped, round, triangular, additional lobules). The most effective methods included visual techniques (including boxplots and histograms) and machine learning algorithms such as OSVM, KNN and autoencoders. A total of 32 outlier anomalies were found.
Discussion: Curation of complex morphometric datasets must involve thorough mathematical and clinical analyses. Relying solely on mathematical statistics or machine learning methods appears inadequate.
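Two of the screening rules named in the description above, the 1.5 interquartile range rule and the z-score, are simple enough to sketch. The code below is an illustration only, not the study's pipeline: the function names, the z-score limit of 3.0, the choice of inclusive quartiles, and the sample measurements are assumptions, and the values are invented rather than taken from the dataset.

```python
from statistics import mean, quantiles, stdev


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Inclusive quartiles are one of several conventions; assumed here.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, limit=3.0):
    """Return values whose absolute z-score exceeds `limit` (assumed 3.0)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > limit]


# Hypothetical organ length measurements in millimetres.
lengths_mm = [98, 102, 105, 107, 110, 112, 115, 118, 121, 190]
flagged_by_iqr = iqr_outliers(lengths_mm)
flagged_by_z = zscore_outliers(lengths_mm)
```

On a sample this small the two rules need not agree: an extreme value inflates the standard deviation it is measured against, so the z-score rule can miss a point that the interquartile rule flags.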
Peel Watershed Land Management Units
hub.planyukon.ca
Updated Dec 8, 2023
Cite
Yukon Land Use Planning Council (2023). Peel Watershed Land Management Units [Dataset]. https://hub.planyukon.ca/datasets/peel-watershed-land-management-units-1
Dataset updated
Dec 8, 2023
Dataset authored and provided by
Yukon Land Use Planning Council
Description
NOTE: These boundaries are subject to slight revisions in 2024 as the park management plans are completed.
This version is as produced for the approved Plan of August 2019, except:
the boundaries have been slightly adjusted to match Yukon Government's Orders-in-Council (legal withdrawals of land);
the boundaries have been slightly adjusted to match updated planning region boundary corrections (YLUPC March 2023).
Surface and linear disturbance statistics for each LMU were calculated using GeoYukon's Surface disturbance layers published 2022-10-11.
The "Dist_years" attribute provides the range of years of imagery used to map disturbances in that LMU. Outlier years (i.e., those used for only one disturbance feature) were not included.
Attributes describing threshold levels were added as described in table 3.2.
The attribute "SD room before cautionary level" provides the amount of surface disturbance in km² within that LMU that can happen before the cautionary level is reached. The attribute "LD room before cautionary level" provides the amount of linear disturbance in km within that LMU that can happen before the cautionary level is reached. Neither attribute considers recovery, permits, reclamation etc. at this time. Negative values indicate the amount by which the cautionary level has been exceeded.
Published ~June 15, 2023
North Yukon Land Management Units
hub.planyukon.ca
Updated Dec 15, 2023
Cite
Yukon Land Use Planning Council (2023). North Yukon Land Management Units [Dataset]. https://hub.planyukon.ca/datasets/north-yukon-land-management-units-1
Dataset updated
Dec 15, 2023
Dataset authored and provided by
Yukon Land Use Planning Council
Description
This version is as produced for the Approved Plan of 2009, except:
the boundaries have been slightly adjusted to match Yukon Government's Orders-in-Council (legal withdrawals of land);
the boundaries have been slightly adjusted to match updated planning region boundary corrections (YLUPC March 2023);
the boundaries of the highway corridors (see section 5.4.1.1: 1000 m buffer of highway centerline) and the Community Area (see section 4.3: 5000 m buffer of the center of Old Crow in LMU 2A) were created and merged (or "unioned") with the LMUs;
the attribute "CE_Exempt" was added. The Community Area and Corridors exempt from the CE framework are flagged as a "1".
Surface and linear disturbance statistics for each LMU were calculated using GeoYukon's Surface disturbance layers published 2022-10-11.
The "Dist_years" attribute provides the range of years of imagery used to map disturbances in that LMU. Outlier years (i.e., those used for only one disturbance feature) were not included.
Attributes describing threshold levels were added as described in table 3.2.
The attribute "SD room before cautionary level" provides the amount of surface disturbance in km² within that LMU that can happen before the cautionary level is reached. The attribute "LD room before cautionary level" provides the amount of linear disturbance in km within that LMU that can happen before the cautionary level is reached. Neither attribute considers recovery, permits, reclamation etc. at this time. Negative values indicate the amount by which the cautionary level has been exceeded. LMUs marked "
Visualize A Space Time Cube in 3D
gemelo-digital-en-arcgis-gemelodigital.hub.arcgis.com
hub.arcgis.com
Updated Dec 3, 2020
Cite
Society for Conservation GIS (2020). Visualize A Space Time Cube in 3D [Dataset]. https://gemelo-digital-en-arcgis-gemelodigital.hub.arcgis.com/maps/acddde8dae114381889b436fa0ff4b2f
Dataset updated
Dec 3, 2020
Dataset authored and provided by
Society for Conservation GIS
Description
Stamp Out COVID-19
An apple a day keeps the doctor away.
Linda Angulo Lopez
December 3, 2020
https://theconversation.com/coronavirus-where-do-new-viruses-come-from-136105

SNAP participation rates were explored and analysed in ArcGIS Pro; the results can help decision makers set up further D-SNAP initiatives.
In the USA, foods are stored in every State and U.S. territory and may be used by state agencies or local disaster relief organizations to provide food to shelters or people who are in need.

US Food Stamp Program has been Extended
The Supplemental Nutrition Assistance Program, SNAP, is a State Organized Food Stamp Program in the USA and was put in place to help individuals and families during this exceptional time. State agencies may request to operate a Disaster Supplemental Nutrition Assistance Program (D-SNAP).
D-SNAP Interactive Dashboard
Almost all States have set up Food Relief Programs in response to COVID-19.
Scroll Down to Learn more about the SNAP Participation Analysis & Results

SNAP Participation Analysis
Initial results of yearly participation rates by geography show statistically significant trends. To get acquainted with the results, explore the following 3D Time Cube Map:
Visualize A Space Time Cube in 3D
https://arcg.is/1q8LLP

netCDF Results
WORKFLOW: a space-time cube was generated as a netCDF structure with the ArcGIS Pro Space-Time Mining Tool: Create a Space Time Cube from Defined Locations. Other tools were then used to incorporate the spatial and temporal aspects of the SNAP County Participation Rate Feature to reveal and render statistically significant trends about Nutrition Assistance in the USA.

Hot Spot Analysis
Explore the results in 2D or 3D.
2D Hot Spots
https://arcg.is/1Pu5WH0

2D Hot Spot Results
WORKFLOW: Hot Spot Analysis with the Hot Spot Analysis Tool shows that there are various trends across the USA; for instance, the Southeastern States have a mixture of consecutive, intensifying, and oscillating hot spots.
3D Hot Spots
https://arcg.is/1b41T4

3D Hot Spot Results
These trends over time are expanded in the above 3D Map. By inspecting the stacked columns you can see the trends over time that give rise to the overall Hot Spot Results.
Not all counties have significant trends; these are symbolized as Never Significant in the Space Time Cubes.

Space-Time Pattern Mining Analysis
The North-central areas of the USA have mostly diminishing cold spots.
2D Space-Time Mining
https://arcg.is/1PKPj0

2D Space Time Mining Results
WORKFLOW: Analysis with the Emerging Hot Spot Analysis Tool shows that there are various trends across the USA; for instance, the South-Eastern States have a mixture of consecutive, intensifying, and oscillating hot spots.

Results Show
The USA has counties with persistently malnourished populations that depend on food aid.
3D Space-Time Mining
https://arcg.is/01fTWf

3D Space Time Mining Results
In addition to obvious planning for consistent Hot-Hot Spot Areas, areas oscillating between Hot-Cold and/or Cold-Hot Spots can be identified for further analysis to mitigate the upward trend in food insecurity in the USA since 2009, which has become even worse since the outbreak of the COVID-19 pandemic.

After Notes:
(i) The Johns Hopkins University has an Interactive Dashboard of the Evolution of the COVID-19 Pandemic.
Coronavirus COVID-19 (2019-nCoV)
(ii) Since March 2020, in response to COVID-19, SNAP has had to extend its benefits to help people in need. The Food Relief is coordinated within States and by local and voluntary organizations to provide nutrition assistance to those most affected by a disaster or emergency.
Visit SNAP's Interactive Dashboard
Food Relief has been extended; reach out to your state SNAP office if you are in need.
(iii) Follow these Steps to build an ArcGIS Pro StoryMap:
Step 1: [Get Data][Open An ArcGIS Pro Project][Run a Hot Spot Analysis][Review analysis parameters][Interpret the results][Run an Outlier Analysis][Interpret the results]
Step 2: [Open the Space-Time Pattern Mining 2 Map][Create a space-time cube][Visualize a space-time cube in 2D][Visualize a space-time cube in 3D][Run a Local Outlier Analysis][Visualize a Local Outlier Analysis in 3D]
Step 3: [Communicate Analysis][Identify your Audience & Takeaways][Create an Outline][Find Images][Prepare Maps & Scenes][Create a New Story][Add Story Elements][Add Maps & Scenes][Review the Story][Publish & Share]

A submission for the Esri MOOC
Spatial Data Science: The New Frontier in Analytics
Linda Angulo Lopez
Lauren Bennett . Shannon Kalisky . Flora Vale . Alberto Nieto . Atma Mani . Kevin Johnston . Orhun Aydin . Ankita Bakshi . Vinay Viswambharan . Jennifer Bell & Nick Giner
Data from: Localizing FST outliers on a QTL map reveals evidence for large...
zenodo.org
search.dataone.org
+2 more
txt
Updated May 28, 2022
Cite
Sara Via; Gina Conte; Casey Mason-Foley; Kelly Mills (2022). Data from: Localizing FST outliers on a QTL map reveals evidence for large genomic regions of reduced gene exchange during speciation-with-gene-flow [Dataset]. http://doi.org/10.5061/dryad.9cf75
Available download formats: txt
Unique identifier
https://doi.org/10.5061/dryad.9cf75
Dataset updated
May 28, 2022
Dataset provided by
Zenodo http://zenodo.org/
Authors
Sara Via; Gina Conte; Casey Mason-Foley; Kelly Mills
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Populations that maintain phenotypic divergence in sympatry typically show a mosaic pattern of genomic divergence, requiring a corresponding mosaic of genomic isolation (reduced gene flow). However, mechanisms that could produce the genomic isolation required for divergence-with-gene-flow have barely been explored, apart from the traditional localized effects of selection and reduced recombination near centromeres or inversions. By localizing FST outliers from a genome scan of wild pea aphid host races on a Quantitative Trait Locus (QTL) map of key traits, we test the hypothesis that between-population recombination and gene exchange are reduced over large 'divergence hitchhiking' (DH) regions. As expected under divergence hitchhiking, our map confirms that QTL and divergent markers cluster together in multiple large genomic regions. Under divergence hitchhiking, the nonoutlier markers within these regions should show signs of reduced gene exchange relative to nonoutlier markers in genomic regions where ongoing gene flow is expected. We use this predicted difference among nonoutliers to perform a critical test of divergence hitchhiking. Results show that nonoutlier markers within clusters of FST outliers and QTL resolve the genetic population structure of the two host races nearly as well as the outliers themselves, while nonoutliers outside DH regions reveal no population structure, as expected if they experience more gene flow. These results provide clear evidence for divergence hitchhiking, a mechanism that may dramatically facilitate the process of speciation-with-gene-flow. They also show the power of integrating genome scans with genetic analyses of the phenotypic traits involved in local adaptation and population divergence.
Per-Cloud Pixelated Map Result Tables (Machine Readable)
dataverse.harvard.edu
Updated Feb 2, 2019
Cite
Catherine Zucker (2019). Per-Cloud Pixelated Map Result Tables (Machine Readable) [Dataset]. http://doi.org/10.7910/DVN/74Y5KU
Unique identifier
https://doi.org/10.7910/DVN/74Y5KU
Dataset updated
Feb 2, 2019
Dataset provided by
Harvard Dataverse
Authors
Catherine Zucker
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A machine readable version of the pixelated map results for each cloud listed in Table 1. The results for each cloud are listed in a separate file, labeled by cloud name. For each model parameter, we report the 16th, 50th, and 84th percentile of the samples from our dynesty chain, which should be regarded as the statistical uncertainties. An additional systematic uncertainty of 5% should be added to the distances. The column headings are as follows:
'name' is the cloud coincident with the sightline
'l' is the Galactic longitude of the sightline (in degrees)
'b' is the Galactic latitude of the sightline (in degrees)
'n' is the normalization parameter
'f' is the foreground extinction parameter (in mag)
'm' is the cloud distance modulus parameter (in mag)
'd' is the cloud distance (derived from m) in pc
'p' is the outlier fraction parameter
'sfore' is the foreground smoothing parameter
'sback' is the background smoothing parameter
See Section 3.2 for a complete description of the model parameters.
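As a rough illustration of how 'd' relates to 'm' and how the three reported percentiles translate into asymmetric error bars, here is a minimal sketch. It uses the standard distance modulus relation d = 10^(m/5 + 1) pc; the function names and the sample percentile values are hypothetical, and reporting the 5% systematic term separately (rather than combining it with the statistical errors) is an assumption.

```python
def modulus_to_distance_pc(m):
    """Convert a distance modulus m (mag) to a distance in parsecs."""
    return 10 ** (m / 5 + 1)

def summarize_percentiles(p16, p50, p84):
    """Return (median, lower_error, upper_error) from three percentiles."""
    return p50, p50 - p16, p84 - p50

# Hypothetical 16th/50th/84th percentile values of 'm' for one sightline (mag).
m16, m50, m84 = 7.9, 8.0, 8.1

# The conversion is monotonic, so percentiles of m map to percentiles of d.
d16, d50, d84 = (modulus_to_distance_pc(m) for m in (m16, m50, m84))
d_med, d_lo, d_hi = summarize_percentiles(d16, d50, d84)

# Systematic uncertainty of 5% of the distance, kept as a separate term (assumption).
d_sys = 0.05 * d_med
print(f"d = {d_med:.0f} -{d_lo:.0f} +{d_hi:.0f} (stat) +/- {d_sys:.0f} (sys) pc")
```

The tables already provide 'd' directly, so the conversion is only needed when working from the 'm' column.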
GIS Shapefile - GIS Shapefile, Assessments and Taxation Database, MD...
search.dataone.org
portal.edirepository.org
Updated Apr 5, 2019
+ more versions
Cite
Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne; Morgan Grove (2019). GIS Shapefile - GIS Shapefile, Assessments and Taxation Database, MD Property View 2003, Baltimore City [Dataset]. https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-bes%2F349%2F610
Dataset updated
Apr 5, 2019
Dataset provided by
Long Term Ecological Research Network http://www.lternet.edu/
Authors
Cary Institute Of Ecosystem Studies; Jarlath O'Neil-Dunne; Morgan Grove
Time period covered
Jan 1, 2003 - Jan 1, 2004
Description
AT_2003_BACI_1
File Geodatabase Feature Class
Tags: There are no tags for this item.
Summary: There is no summary for this item.
Description: MD Property View 2003 A&T Database. For more information on the A&T Database refer to the enclosed documentation. This layer was edited to remove spatial outliers in the A&T Database. Spatial outliers are those points that were not geocoded and as a result fell outside of the Baltimore City Boundary; 416 spatial outliers were removed from this layer. The field BLOCKLOT2 can be used to join this layer with the Baltimore City parcel layer.
Credits: There are no credits for this item.
Use limitations: There are no access and use limitations for this item.
Extent: West -76.713418, East -76.526031, North 39.374429, South 39.197452
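The spatial-outlier screen described above can be approximated with a bounding-box check against the listed extent. This is a simplified sketch, not the procedure used to build the layer: the original edit tested points against the Baltimore City boundary itself, whereas a rectangle is only a coarse stand-in, and the function names and sample points are hypothetical.

```python
# Extent values copied from the layer metadata (decimal degrees).
WEST, EAST = -76.713418, -76.526031
SOUTH, NORTH = 39.197452, 39.374429

def in_extent(lon, lat):
    """Return True if the point lies inside the layer's bounding box."""
    return WEST <= lon <= EAST and SOUTH <= lat <= NORTH

def split_spatial_outliers(points):
    """Split (lon, lat) points into (kept, outliers) using the bounding box."""
    kept = [p for p in points if in_extent(*p)]
    outliers = [p for p in points if not in_extent(*p)]
    return kept, outliers

# Hypothetical points: one inside the extent, one at (0, 0) as an ungeocoded record.
points = [(-76.61, 39.29), (0.0, 0.0)]
kept, outliers = split_spatial_outliers(points)
print(len(kept), len(outliers))
```

A point-in-polygon test against the actual city boundary would be needed to reproduce the reported count of 416 removed points.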