78 datasets found

Data from: Outlier detection in cylindrical data based on Mahalanobis...
tandf.figshare.com
text/x-tex
Updated Jan 2, 2025
Cite
Prashant S. Dhamale; Akanksha S. Kashikar (2025). Outlier detection in cylindrical data based on Mahalanobis distance [Dataset]. http://doi.org/10.6084/m9.figshare.24092089.v1
Available download formats: text/x-tex
Unique identifier
https://doi.org/10.6084/m9.figshare.24092089.v1
Dataset updated
Jan 2, 2025
Dataset provided by
Taylor & Francis
Authors
Prashant S. Dhamale; Akanksha S. Kashikar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cylindrical data are bivariate data formed from the combination of circular and linear variables. Identifying outliers is a crucial step in any data analysis work. This paper proposes a new distribution-free procedure to detect outliers in cylindrical data using the Mahalanobis distance concept. The use of Mahalanobis distance incorporates the correlation between the components of the cylindrical distribution, which had not been accounted for in the earlier papers on outlier detection in cylindrical data. The threshold for declaring an observation to be an outlier can be obtained via parametric or non-parametric bootstrap, depending on whether the underlying distribution is known or unknown. The performance of the proposed method is examined via extensive simulations from the Johnson-Wehrly distribution. The proposed method is applied to two real datasets, and the outliers are identified in those datasets.
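The idea described above can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: it assumes each cylindrical observation (angle theta, linear value x) is embedded as the vector (cos theta, sin theta, x), and that the threshold is taken from the non-parametric bootstrap distribution of the largest squared Mahalanobis distance. The helper names (`embed`, `mahalanobis_sq`, `bootstrap_threshold`) are hypothetical.

```python
import math
import random

def embed(theta, x):
    # Assumption: represent the angle by its cosine and sine so that
    # ordinary covariance makes sense for the circular component.
    return [math.cos(theta), math.sin(theta), x]

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def covariance(rows, mu):
    n, d = len(rows), len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of every point from the sample mean."""
    mu = mean_vector(points)
    cov = covariance(points, mu)
    out = []
    for p in points:
        diff = [p[i] - mu[i] for i in range(len(mu))]
        z = solve(cov, diff)  # z = cov^{-1} @ diff
        out.append(sum(d * zi for d, zi in zip(diff, z)))
    return out

def bootstrap_threshold(points, n_boot=200, level=0.95, seed=0):
    """Quantile of the bootstrap distribution of the maximum distance."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]
        maxima.append(max(mahalanobis_sq(sample)))
    maxima.sort()
    return maxima[min(int(level * n_boot), n_boot - 1)]
```

An observation would then be flagged when its squared distance exceeds the bootstrap threshold. When the underlying distribution is known, the resampling step could instead draw from the fitted model (a parametric bootstrap), which is not shown here.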
Outlier Set Two-step Method (OSTI)
orda.shef.ac.uk
application/x-rar
Updated Jul 1, 2025
Cite
Amal Sarfraz; Abigail Birnbaum; Flannery Dolan; Jonathan Lamontagne; Lyudmila Mihaylova; Charles Rouge (2025). Outlier Set Two-step Method (OSTI) [Dataset]. http://doi.org/10.15131/shef.data.28227974.v3
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.15131/shef.data.28227974.v3
Dataset updated
Jul 1, 2025
Dataset provided by
The University of Sheffield
Authors
Amal Sarfraz; Abigail Birnbaum; Flannery Dolan; Jonathan Lamontagne; Lyudmila Mihaylova; Charles Rouge
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These files are supplements to the paper titled 'A Robust Two-step Method for Detection of Outlier Sets'. This paper identifies and addresses the need for a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls "outlier sets", while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets based on cluster weights below a predetermined threshold. Second, OSTI measures the inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8,000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which outlier sets identified correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods, to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
Data from: Functional Outlier Detection for Density-Valued Data with...
tandf.figshare.com
txt
Updated Feb 26, 2024
Cite
Xinyi Lei; Zhicheng Chen; Hui Li (2024). Functional Outlier Detection for Density-Valued Data with Application to Robustify Distribution-to-Distribution Regression [Dataset]. http://doi.org/10.6084/m9.figshare.21926087.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21926087.v1
Dataset updated
Feb 26, 2024
Dataset provided by
Taylor & Francis
Authors
Xinyi Lei; Zhicheng Chen; Hui Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Distributional data analysis, concerned with the statistical analysis of data objects consisting of random probability distributions in the framework of functional data analysis (FDA), has received considerable interest in recent years and is increasingly applied in various fields including engineering. Outlier detection and robustness are of great practical interest; however, these aspects remain unexplored for distributional data. To this end, this study focuses on density-valued outlier detection and its application in robust distributional regression. Specifically, we propose a transformation-based approach for single-dataset outlying density detection with an emphasis on converting the less detectable shape outliers to easily detectable magnitude outliers. We also propose a distributional regression-based approach for detecting the abnormal associations of the density-valued two-tuples associated with two datasets. Then, the proposed outlier detection methods are applied to robustify a distribution-to-distribution regression method used in engineering, and we develop a robust estimator for the regression operator by downweighting the detected outliers. The proposed methods are validated and evaluated via extensive simulation studies. The relevant results reveal the superiority of our method over other competitors in distributional outlier detection. A case study in structural health monitoring demonstrates the great potential of our proposal in engineering applications. Supplementary materials for this article are available online.
Anomaly Detection in High-Dimensional Data
tandf.figshare.com
txt
Updated May 30, 2023
Cite
Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles (2023). Anomaly Detection in High-Dimensional Data [Dataset]. http://doi.org/10.6084/m9.figshare.12844508.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12844508.v2
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.
Gender_Classification_Dataset
kaggle.com
Updated Jun 19, 2024
Cite
Sameh Raouf (2024). Gender_Classification_Dataset [Dataset]. https://www.kaggle.com/datasets/samehraouf/gender-classification-dataset/suggestions
Available format: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant).
Dataset updated
Jun 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sameh Raouf
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Title: Gender Classification Dataset

Description: This dataset contains anonymized information on height, weight, age, and gender of 10,000 individuals. The data is equally distributed between males and females, with 5,000 samples for each gender. The purpose of this dataset is to provide a comprehensive sample for studies and analyses related to physical attributes and demographics.

Content: The CSV file contains the following columns:

Gender: The gender of the individual (Male/Female)
Height: The height of the individual in centimeters
Weight: The weight of the individual in kilograms
Age: The age of the individual in years

License: This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license. This means you are free to share the data, provided that you attribute the source, do not use it for commercial purposes, and do not distribute modified versions of the data.

Usage:

This dataset can be used for:
- Analyzing the distribution of height, weight, and age across genders
- Developing and testing machine learning models for predicting physical attributes
- Educational purposes in statistics and data science courses
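As a starting point for the first use case, the sketch below groups one numeric column by gender and averages it. The column names (Gender, Height, Weight, Age) come from the description above; the four sample rows are made up for illustration and are not taken from the dataset, and `mean_by_gender` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Illustrative rows only; the real file has 10,000 records.
SAMPLE_CSV = """Gender,Height,Weight,Age
Male,180,80,30
Female,165,60,28
Male,170,70,40
Female,155,50,32
"""

def mean_by_gender(csv_text, column):
    """Return {gender: mean of the given numeric column}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["Gender"]].append(float(row[column]))
    return {gender: mean(values) for gender, values in groups.items()}

print(mean_by_gender(SAMPLE_CSV, "Height"))
```

To run this on the downloaded file, the `io.StringIO` wrapper can be swapped for an open file handle.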
Replication data for: Linear Models with Outliers: Choosing between...
datasearch.gesis.org
dataverse.harvard.edu
+1 more
Updated Jan 22, 2020
Cite
Harden, Jeffrey; Desmarais, Bruce (2020). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. https://datasearch.gesis.org/dataset/httpsdataverse.unc.eduoai--hdl1902.2911608
Dataset updated
Jan 22, 2020
Dataset provided by
Odum Institute Dataverse Network
Authors
Harden, Jeffrey; Desmarais, Bruce
Description
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data
plos.figshare.com
doc
Updated Jun 1, 2023
Cite
Nysia I. George; John F. Bowyer; Nathaniel M. Crabtree; Ching-Wei Chang (2023). An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data [Dataset]. http://doi.org/10.1371/journal.pone.0125224
Available download formats: doc
Unique identifier
https://doi.org/10.1371/journal.pone.0125224
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Nysia I. George; John F. Bowyer; Nathaniel M. Crabtree; Ching-Wei Chang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that utilizes a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data, and implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.
ROcD-nGoM: A River-Ocean Coupled Database for the Northern Gulf of Mexico
data.niaid.nih.gov
zenodo.org
Updated Nov 8, 2024
+ more versions
Cite
Shuang Zhang (2024). ROcD-nGoM: A River-Ocean Coupled Database for the Northern Gulf of Mexico [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7967051
Dataset updated
Nov 8, 2024
Dataset provided by
Bailey Armos
Shuang Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Gulf of Mexico (Gulf of America)
Description
This is a River-Ocean Coupled Database for the Northern Gulf of Mexico (nGoM). This database contains river chemistry and discharge (Q) data for 54 rivers, streams, and bayous entering the nGoM. It has both the raw observations (ConcAve) and daily concentration (ConcDay) and flux estimates from the USGS WRTDS model. The database also contains 17 chemical and physical ocean parameters from the Gulf of Mexico Coastal Ocean Observing System (GCOOS) and the MODIS and SeaWiFS satellite sensors. The data is provided in two formats: (1) raw with outlier flags and (2) cleaned and time averaged. The raw format provides the daily concentrations of the USGS and GCOOS data and monthly data for the satellite product. Each flagged parameter has two outlier flags, which indicate whether a value falls in the highest or lowest 0.5% of all the data. They are called "outlier_99.5" for the 0.5% of data on the high end of the distribution and "outlier_0.05" for the 0.5% of the data on the low end of the distribution. The USGS river data has outlier flags for "ConcAve", "ConcDay", and "Q". The ConcDay range designation is defined by the distribution of the actual raw measurements (ConcAve). The ocean data has outlier flags for each parameter except wind/current direction. The time averaged format includes monthly, seasonal, and yearly values of all the parameters. These values are calculated following the removal of the outliers in the raw data. The R script used to create the river portion of the database is also included as ROcDnGoM_river_script.
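The two-sided percentile flagging described above can be illustrated with a short Python sketch. It is not the database's own R script: the linear-interpolation percentile rule and the function names are assumptions, and only the flag names "outlier_99.5" and "outlier_0.05" are taken from the description.

```python
def percentile(sorted_values, q):
    """Percentile by linear interpolation; q is in [0, 100]."""
    if not sorted_values:
        raise ValueError("no data")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def flag_outliers(values, low_q=0.5, high_q=99.5):
    """Attach low-end and high-end outlier flags to every value."""
    ordered = sorted(values)
    low_cut = percentile(ordered, low_q)
    high_cut = percentile(ordered, high_q)
    return [
        {"value": v, "outlier_0.05": v < low_cut, "outlier_99.5": v > high_cut}
        for v in values
    ]

flags = flag_outliers(list(range(1, 1001)))
print(sum(f["outlier_0.05"] for f in flags), sum(f["outlier_99.5"] for f in flags))
```

Dropping every row where either flag is true before averaging mirrors the "cleaned and time averaged" format in spirit, though the exact cut-off convention used by the database may differ.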
Data from: Leave-One-Out Kernel Density Estimates for Outlier Detection
tandf.figshare.com
txt
Updated May 31, 2023
Cite
Sevvandi Kandanaarachchi; Rob J Hyndman (2023). Leave-One-Out Kernel Density Estimates for Outlier Detection [Dataset]. http://doi.org/10.6084/m9.figshare.16942936.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.16942936.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Sevvandi Kandanaarachchi; Rob J Hyndman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This article introduces lookout, a new approach to detect outliers using leave-one-out kernel density estimates and extreme value theory. Outlier detection methods that use kernel density estimates generally employ a user defined parameter to determine the bandwidth. Lookout uses persistent homology to construct a bandwidth suitable for outlier detection without any user input. We demonstrate the effectiveness of lookout on an extensive data repository by comparing its performance with other outlier detection methods based on extreme value theory. Furthermore, we introduce outlier persistence, a useful concept that explores the birth and the cessation of outliers with changing bandwidth and significance levels. The R package lookout implements this algorithm. Supplementary files for this article are available online.
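To make the leave-one-out idea concrete, the following Python sketch computes, for each point, a Gaussian kernel density estimate built from all the other points. It is a simplified illustration and not the lookout algorithm: the bandwidth is fixed by hand here, whereas lookout selects it via persistent homology and then applies extreme value theory, neither of which is shown. The function names are hypothetical.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    """Ordinary kernel density estimate at x."""
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (
        len(data) * bandwidth
    )

def leave_one_out_kde(data, bandwidth):
    """Density at each point, estimated without that point itself."""
    out = []
    for i, x in enumerate(data):
        others = data[:i] + data[i + 1:]
        out.append(kde(x, others, bandwidth))
    return out

data = [0.0, 0.1, -0.1, 0.2, -0.2, 5.0]
loo = leave_one_out_kde(data, bandwidth=0.5)
print(loo.index(min(loo)))  # index of the point with the lowest density
```

Leaving the point out matters because an isolated observation otherwise contributes its own kernel mass to the estimate at its location, which hides how unusual it is.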
BOREALIS Power Analysis Code and Data
data.niaid.nih.gov
zenodo.org
Updated Nov 22, 2022
Cite
Jenkinson. W Garrett (2022). BOREALIS Power Analysis Code and Data [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7343135
Dataset updated
Nov 22, 2022
Dataset provided by
Jenkinson. W Garrett
Klee, Eric W
Oliver, Gavin R
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This contains the code and data necessary to rerun the power analysis used in testing BOREALIS.

Borealis is an R library performing outlier analysis for count-based bisulfite sequencing data. It detects outlier methylated CpG sites from bisulfite sequencing (BS-seq). The core of Borealis is modeling Beta-Binomial distributions. This can be useful for rare disease diagnoses.
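For readers unfamiliar with the Beta-Binomial model, the Python sketch below evaluates its probability mass function and an upper-tail probability for a methylated-read count. It only illustrates the distribution itself; it is not Borealis code, and the function names and the parameter values in the example call are hypothetical.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for k methylated reads out of n, with Beta(alpha, beta) mixing."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(
        log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    )

def upper_tail(k, n, alpha, beta):
    """P(K >= k): small values suggest an unusually high methylated count."""
    return sum(beta_binomial_pmf(j, n, alpha, beta) for j in range(k, n + 1))

# Hypothetical site: 18 of 20 reads methylated, cohort centred near 10% methylation.
print(upper_tail(18, 20, alpha=1.0, beta=9.0))
```

In practice alpha and beta would be estimated per CpG site from the reference samples; that estimation step is outside the scope of this sketch.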
Find Outliers Percent of households with income below the Federal Poverty...
uscssi.hub.arcgis.com
Updated Dec 5, 2021
Cite
Spatial Sciences Institute (2021). Find Outliers Percent of households with income below the Federal Poverty Level [Dataset]. https://uscssi.hub.arcgis.com/maps/USCSSI::find-outliers-percent-of-households-with-income-below-the-federal-poverty-level
Dataset updated
Dec 5, 2021
Dataset authored and provided by
Spatial Sciences Institute
Description
The following report outlines the workflow used to optimize your Find Outliers result:

Initial Data Assessment
There were 1684 valid input features.
POVERTY Properties: Min 0.0000; Max 91.8000; Mean 18.9902; Std. Dev. 12.7152
There were 22 outlier locations; these will not be used to compute the optimal fixed distance band.

Scale of Analysis
The optimal fixed distance band was based on the average distance to 30 nearest neighbors: 3709.0000 Meters.

Outlier Analysis
Creating the random reference distribution with 499 permutations.
There are 1155 output features statistically significant based on a FDR correction for multiple testing and spatial dependence.
There are 68 statistically significant high outlier features.
There are 84 statistically significant low outlier features.
There are 557 features part of statistically significant low clusters.
There are 446 features part of statistically significant high clusters.

Output
Pink output features are part of a cluster of high POVERTY values.
Light Blue output features are part of a cluster of low POVERTY values.
Red output features represent high outliers within a cluster of low POVERTY values.
Blue output features represent low outliers within a cluster of high POVERTY values.
Data from: Male responses to sperm competition risk when rivals vary in...
researchdata.edu.au
search.dataone.org
+1 more
Updated 2019
Cite
Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences (2019). Data from: Male responses to sperm competition risk when rivals vary in their number and familiarity [Dataset]. http://doi.org/10.5061/DRYAD.M097580
Unique identifier
https://doi.org/10.5061/DRYAD.M097580
Dataset updated
2019
Dataset provided by
The University of Western Australia
DRYAD
Authors
Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences
Description
Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

Lymbery et al 2018 Full dataset
Contains all the data used in the statistical analyses for the associated manuscript. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles.
File: Lymbery et al Full Dataset.xlsx

Lymbery et al 2018 Reduced dataset 1
Contains data used in the attached manuscript following the removal of three outliers for the purposes of data distribution, as described in the associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles.
File: Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx

Lymbery et al 2018 Reduced dataset 2
Contains the data used in the statistical analyses for the associated manuscript, after the removal of all outliers stated in the manuscript and associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles.
File: Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx

Lymbery et al 2018 R Script
Contains all the R code used for statistical analysis in this manuscript, with annotations to aid interpretation.
Data for "PTP over wide area networks with offset measurement outlier...
zenodo.org
Updated May 22, 2025
Cite
Víctor Vázquez; Carlos Megías; Carmen Vélez; Héctor Esteban; Javier Díaz; Eduardo Ros (2025). Data for "PTP over wide area networks with offset measurement outlier filtering" [Dataset]. http://doi.org/10.5281/zenodo.14231423
Unique identifier
https://doi.org/10.5281/zenodo.14231423
Dataset updated
May 22, 2025
Dataset provided by
Zenodo
Authors
Víctor Vázquez; Carlos Megías; Carmen Vélez; Héctor Esteban; Javier Díaz; Eduardo Ros
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data collected in the Time-based Technologies and Networks Laboratory of the University of Granada when researching packet-based time synchronization over Wide-Area Networks.

This dataset includes results from the different experiments carried out with the ptp4l PTP client and chrony NTP client for time transfer over long distances, including our custom approach using offset estimation filters. All experiments are based on the case of time transfer between the Spanish UTC designated institute at the Real Instituto y Observatorio de la Armada (ROA) in San Fernando, Cádiz; and the University of Granada in Granada. Apart from the field tests, data from experiments using network emulation on a Calnex Paragon-X and stability data from the NIC used (an Intel XXV710-DA2T) are also included.

The dataset is divided as follows:

intel-xxv710da2t-phm-100k.csv contains the time error between a free-running Intel XXV710-DA2T and our common-view-disciplined passive hydrogen maser (PHM). Measurements were made using 1 PPS signals and a frequency counter, and are specified in seconds.

delay.tar contains delay measurements between UGR and ROA labs over the Andalusian CICA network using the GNSS common-view technique for accurate time synchronization between the measuring nodes. The CSV files included in this TAR are in long format, containing three columns (timestamp, metric and value). The metrics included are:

owd_s2c (float64): One-way delay from ROA to UGR, in seconds.

owd_c2s (float64): One-way delay from UGR to ROA, in seconds.

chrony.tar contains experimental data from chrony NTP synchronization between ROA (server) and UGR (client). The CSV files included in this TAR are in long format, containing three columns (timestamp, metric and value). The metrics included are:

offset_hw (float64): Time error observation between the server and client using a frequency counter and 1 PPS signals, in seconds.

Please note that, for the network emulation tests, this comparison is made directly between the server and the client; for the field tests, it is made between the client and the common-view-disciplined PHM instead.

ptp4l.tar contains experimental data from ptp4l PTP synchronization between ROA (server) and UGR (client). These CSV files are also in long format (timestamp, metric and value). The metrics included are:

offset_hw (float64): Time error observation between the server and client using a frequency counter and 1 PPS signals, in seconds.

offset_sw (float64): Time error estimation given by the ptp4l client and used as input for the clock servo, in seconds.

offset_raw (float64): Raw time error estimation given by the ptp4l client, in seconds.

delay_raw (float64): Raw one-way downwards delay estimation given by the ptp4l client, in seconds.

delay_filt (float64): Filtered one-way downwards delay estimation given by the ptp4l client, in seconds.

freq (float64): Frequency adjustment applied to the local oscillator by the ptp4l client.

Please note that, when using the offset filters, the offset_raw metric contains the raw estimations that are the input of the filter, while offset_sw is the filtered estimation produced as output of the filter. If no offset filter is used, offset_raw and offset_sw have the same value.
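Because the CSV files are in long format, a common first step is to pivot them so that each timestamp has one record with all its metrics. The Python sketch below does this and checks the stated property that offset_raw and offset_sw coincide when no offset filter is used. It assumes a header row with the column names timestamp, metric and value; the sample rows, the timestamp format, and the function names are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; real timestamps and values come from the ptp4l CSV files.
SAMPLE = """timestamp,metric,value
1700000000,offset_raw,1.5e-07
1700000000,offset_sw,1.5e-07
1700000001,offset_raw,-2.0e-07
1700000001,offset_sw,-2.0e-07
"""

def long_to_wide(csv_text):
    """Return {timestamp: {metric: value}} from long-format CSV text."""
    wide = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        wide[row["timestamp"]][row["metric"]] = float(row["value"])
    return dict(wide)

def offsets_match(wide):
    """True when offset_raw equals offset_sw wherever both are present."""
    return all(
        rec["offset_raw"] == rec["offset_sw"]
        for rec in wide.values()
        if "offset_raw" in rec and "offset_sw" in rec
    )

wide = long_to_wide(SAMPLE)
print(offsets_match(wide))
```

For runs that did use an offset filter, the same wide records make it easy to compare the filter input (offset_raw) with its output (offset_sw) sample by sample.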

File names contain metadata about the actual experiment performed, and have the following structure:
Data_Sheet_1_The hazards of dealing with response time outliers.pdf
frontiersin.figshare.com
pdf
Updated Aug 24, 2023
Cite
Ivan I. Vankov (2023). Data_Sheet_1_The hazards of dealing with response time outliers.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2023.1220281.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fpsyg.2023.1220281.s001
Dataset updated
Aug 24, 2023
Dataset provided by
Frontiers
Authors
Ivan I. Vankov
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The presence of outliers in response times can affect statistical analyses and lead to incorrect interpretation of the outcome of a study. Therefore, it is a widely accepted practice to try to minimize the effect of outliers by preprocessing the raw data. There exist numerous methods for handling outliers and researchers are free to choose among them. In this article, we use computer simulations to show that serious problems arise from this flexibility. Choosing between alternative ways for handling outliers can result in the inflation of p-values and the distortion of confidence intervals and measures of effect size. Using Bayesian parameter estimation and probability distributions with heavier tails eliminates the need to deal with response times outliers, but at the expense of opening another source of flexibility.
Data from: Batch effects in a multi-year sequencing study: false biological...
zenodo.org
data.niaid.nih.gov
+1 more
bin, vcf
Updated May 29, 2022
Cite
Deborah M. Leigh; Heidi E.L. Lischer; Christine Grossen; Lukas F. Keller (2022). Data from: Batch effects in a multi-year sequencing study: false biological trends due to changes in read lengths [Dataset]. http://doi.org/10.5061/dryad.8vm8d
Available download formats: bin, vcf
Unique identifier
https://doi.org/10.5061/dryad.8vm8d
Dataset updated
May 29, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Deborah M. Leigh; Heidi E.L. Lischer; Christine Grossen; Lukas F. Keller
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects, technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomisation of sample groups across batches. However, in long-term or multi-year studies where data are added incrementally, full randomisation is impossible and batch effects may be a common feature. Here we present a case study where false signals of selection were detected due to a batch effect in a multi-year study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in non-random distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multi-year high-throughput sequencing studies.
Gender, Age, and Emotion Detection from Voice
kaggle.com
Updated May 29, 2021
Cite
Rohit Zaman (2021). Gender, Age, and Emotion Detection from Voice [Dataset]. https://www.kaggle.com/datasets/rohitzaman/gender-age-and-emotion-detection-from-voice/suggestions
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
May 29, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rohit Zaman
Description
Context

Our target was to predict gender, age and emotion from audio. We found labeled audio datasets from Mozilla and RAVDESS. Using the R programming language, we extracted 20 statistical features and then added the labels to form these datasets. Audio files were collected from "Mozilla Common Voice" and the "Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)".

Content

The datasets contain 20 feature columns and 1 column denoting the label. The 20 statistical features were extracted through frequency spectrum analysis using the R programming language. They are:
1) meanfreq - The mean frequency (in kHz) is a pitch measure that assesses the center of the distribution of power across frequencies.
2) sd - The standard deviation of frequency describes the dispersion of the frequencies relative to their mean and is calculated as the square root of the variance.
3) median - The median frequency (in kHz) is the middle value of the sorted frequencies.
4) Q25 - The first quartile (in kHz), referred to as Q1, is the median of the lower half of the data set. About 25 percent of the values are below Q1, and about 75 percent are above Q1.
5) Q75 - The third quartile (in kHz), referred to as Q3, is the median of the upper half of the data set. About 75 percent of the values are below Q3.
6) IQR - The interquartile range (in kHz) is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles.
7) skew - The skewness measures the lack of symmetry in the data distribution.
8) kurt - The kurtosis measures how much the tails of the distribution differ from the tails of a normal distribution. It reflects how prone the distribution is to outliers.
9) sp.ent - The spectral entropy is a measure of signal irregularity computed from the normalized spectral power of the signal.
10) sfm - The spectral flatness or tonality coefficient, also known as Wiener entropy, is a measure used in digital signal processing to characterize an audio spectrum. It is usually measured in decibels and quantifies how noise-like, as opposed to tone-like, a sound is.
11) mode - The mode frequency is the most frequently observed value in a data set.
12) centroid - The spectral centroid is a metric used to describe a spectrum in digital signal processing. It indicates where the spectrum's center of mass is located.
13) meanfun - The meanfun is the average of the fundamental frequency measured across the acoustic signal.
14) minfun - The minfun is the minimum fundamental frequency measured across the acoustic signal.
15) maxfun - The maxfun is the maximum fundamental frequency measured across the acoustic signal.
16) meandom - The meandom is the average of the dominant frequency measured across the acoustic signal.
17) mindom - The mindom is the minimum of the dominant frequency measured across the acoustic signal.
18) maxdom - The maxdom is the maximum of the dominant frequency measured across the acoustic signal.
19) dfrange - The dfrange is the range of the dominant frequency measured across the acoustic signal.
20) modindx - The modindx is the modulation index, which expresses the degree of frequency modulation numerically as the ratio of the frequency deviation to the frequency of the modulating signal for a pure tone modulation.
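Several of the spectral descriptors listed above can be illustrated with a short sketch. This is not the R code used to build the dataset: it is a NumPy approximation that treats the normalized amplitude spectrum as a probability distribution over frequency, and the function name `spectral_features`, the quantile rule and the small constant added to the power spectrum are assumptions made for illustration.

```python
import numpy as np


def spectral_features(signal, sample_rate):
    """Compute a subset of the listed descriptors from a 1-D audio signal."""
    # One-sided amplitude spectrum and the matching frequency axis in kHz.
    amplitude = np.abs(np.fft.rfft(signal))
    freq_khz = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate) / 1000.0

    # Normalize so the spectrum behaves like a probability mass function.
    weights = amplitude / amplitude.sum()
    cumulative = np.cumsum(weights)

    mean_freq = np.sum(freq_khz * weights)
    sd = np.sqrt(np.sum(weights * (freq_khz - mean_freq) ** 2))

    def quantile(q):
        # First frequency bin at which the cumulative weight reaches q.
        return freq_khz[np.searchsorted(cumulative, q)]

    q25, median, q75 = quantile(0.25), quantile(0.50), quantile(0.75)

    # Standardized third and fourth moments of the spectral distribution.
    skew = np.sum(weights * (freq_khz - mean_freq) ** 3) / sd ** 3
    kurt = np.sum(weights * (freq_khz - mean_freq) ** 4) / sd ** 4

    # Spectral entropy: Shannon entropy of the weights, scaled to [0, 1].
    nonzero = weights[weights > 0]
    sp_ent = -np.sum(nonzero * np.log(nonzero)) / np.log(len(weights))

    # Spectral flatness: geometric mean over arithmetic mean of the power
    # spectrum (the tiny offset avoids log(0) and is an assumption here).
    power = amplitude ** 2 + 1e-12
    sfm = np.exp(np.mean(np.log(power))) / np.mean(power)

    return {
        "meanfreq": mean_freq, "sd": sd, "median": median,
        "Q25": q25, "Q75": q75, "IQR": q75 - q25,
        "skew": skew, "kurt": kurt, "sp.ent": sp_ent, "sfm": sfm,
    }


# Usage on a synthetic 1 kHz tone sampled at 8 kHz for one second.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = spectral_features(np.sin(2 * np.pi * 1000 * t), sample_rate)
print(tone["median"], tone["IQR"], tone["sfm"])
```

Here the flatness is returned as a ratio between 0 and 1 rather than in decibels; values near 0 indicate a tone-like spectrum and values near 1 a noise-like one.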

Acknowledgements

Gender and Age Audio Data Source: https://commonvoice.mozilla.org/en
Emotion Audio Data Source: https://smartlaboratory.org/ravdess/
Data from: Outlier analyses to test for local adaptation to breeding grounds...
zenodo.org
data.niaid.nih.gov
+1 more
bin, txt, zip
Updated May 31, 2022
+ more versions
Cite
Anna Tigano; Allison J. Shultz; Scott V. Edwards; Gregory J. Robertson; Vicki L. Friesen (2022). Data from: Outlier analyses to test for local adaptation to breeding grounds in a migratory arctic seabird [Dataset]. http://doi.org/10.5061/dryad.7182c
Explore at:
bin, txt, zip (available download formats)
Unique identifier
https://doi.org/10.5061/dryad.7182c
Dataset updated
May 31, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anna Tigano; Allison J. Shultz; Scott V. Edwards; Gregory J. Robertson; Vicki L. Friesen
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Arctic
Description
Investigating the extent (or the existence) of local adaptation is crucial to understanding how populations adapt. When experiments or fitness measurements are difficult or impossible to perform in natural populations, genomic techniques allow us to investigate local adaptation through the comparison of allele frequencies and outlier loci along environmental clines. The thick-billed murre (Uria lomvia) is a highly philopatric colonial arctic seabird that occupies a significant environmental gradient, shows marked phenotypic differences among colonies, and has large effective population sizes. To test whether thick-billed murres from five colonies along the eastern Canadian Arctic coast show genomic signatures of local adaptation to their breeding grounds, we analyzed geographic variation in genome-wide markers mapped to a newly assembled thick-billed murre reference genome. We used outlier analyses to detect loci putatively under selection, and clustering analyses to investigate patterns of differentiation based on 2220 genomewide single nucleotide polymorphisms (SNPs) and 137 outlier SNPs. We found no evidence of population structure among colonies using all loci but found population structure based on outliers only, where birds from the two northernmost colonies (Minarets and Prince Leopold) grouped with birds from the southernmost colony (Gannet), and birds from Coats and Akpatok were distinct from all other colonies. Although results from our analyses did not support local adaptation along the latitudinal cline of breeding colonies, outlier loci grouped birds from different colonies according to their non-breeding distributions, suggesting that outliers may be informative about adaptation and/or demographic connectivity associated with their migration patterns or nonbreeding grounds.
Data from: Constraints on the FST–heterozygosity outlier approach
zenodo.org
data.niaid.nih.gov
+2 more
txt, zip
Updated May 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sarah P. Flanagan; Adam G. Jones (2022). Data from: Constraints on the FST–heterozygosity outlier approach [Dataset]. http://doi.org/10.5061/dryad.785bn
Explore at:
zip, txt (available download formats)
Unique identifier
https://doi.org/10.5061/dryad.785bn
Dataset updated
May 28, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sarah P. Flanagan; Adam G. Jones
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The FST-heterozygosity outlier approach has been a popular method for identifying loci under balancing and positive selection since Beaumont and Nichols first proposed it in 1996 and recommended its use for studies sampling a large number of independent populations (at least 10). Since then, their program FDIST2 and a user-friendly program optimized for large datasets, LOSITAN, have been used widely in the population genetics literature, often without the requisite number of samples. We observed empirical datasets whose distributions could not be reconciled with the confidence intervals generated by the null coalescent island model. Here, we use forward-in-time simulations to investigate circumstances under which the FST-heterozygosity outlier approach performs poorly for next-generation single-nucleotide polymorphism (SNP) datasets. Our results show that samples involving few independent populations, particularly when migration rates are low, result in distributions of the FST-heterozygosity relationship that are not described by the null model implemented in LOSITAN. In addition, even under favorable conditions LOSITAN rarely provides confidence intervals that precisely fit SNP data, making the associated p-values only roughly valid at best. We present an alternative method, implemented in a new R package named fsthet, which uses the raw empirical data to generate smoothed outlier plots for the FST-heterozygosity relationship.
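The relationship described above can be sketched from raw allele frequencies. The following is an illustrative simplification and not the algorithm implemented in fsthet, LOSITAN or FDIST2: it assumes biallelic SNPs, equally weighted populations and Nei's estimator FST = (H_T - H_S) / H_T, and it flags loci whose FST exceeds an empirical quantile within bins of total heterozygosity. The function names, bin count and quantile are hypothetical choices.

```python
import numpy as np


def fst_and_heterozygosity(freqs):
    """Per-locus FST and total heterozygosity H_T.

    freqs: array of shape (n_loci, n_pops) holding the frequency of one
    allele of a biallelic SNP in each population.
    """
    freqs = np.asarray(freqs, dtype=float)
    p_bar = freqs.mean(axis=1)
    h_t = 2.0 * p_bar * (1.0 - p_bar)                  # expected total heterozygosity
    h_s = (2.0 * freqs * (1.0 - freqs)).mean(axis=1)   # mean within-population value
    # FST is undefined for monomorphic loci (H_T == 0); report NaN there.
    with np.errstate(divide="ignore", invalid="ignore"):
        fst = np.where(h_t > 0, (h_t - h_s) / h_t, np.nan)
    return fst, h_t


def flag_outliers(fst, h_t, n_bins=5, upper=0.95):
    """Flag loci above the `upper` empirical FST quantile of their H_T bin."""
    fst = np.asarray(fst, dtype=float)
    h_t = np.asarray(h_t, dtype=float)
    flags = np.zeros(fst.shape, dtype=bool)
    valid = ~np.isnan(fst)
    # H_T of a biallelic locus lies in [0, 0.5].
    edges = np.linspace(0.0, 0.5, n_bins + 1)
    bin_index = np.clip(np.digitize(h_t, edges) - 1, 0, n_bins - 1)
    for b in range(n_bins):
        in_bin = valid & (bin_index == b)
        if in_bin.sum() < 2:
            continue  # too few loci to estimate a quantile
        cutoff = np.quantile(fst[in_bin], upper)
        flags[in_bin] = fst[in_bin] > cutoff
    return flags


# Hypothetical frequencies for three loci in two populations.
fst, h_t = fst_and_heterozygosity([[0.5, 0.5], [0.0, 1.0], [0.2, 0.4]])
print(fst, h_t)
```

Because the cutoffs come from the data themselves, roughly the top (1 - `upper`) fraction of loci in each bin is flagged by construction, so flagged loci are candidates for follow-up rather than statistically validated targets of selection.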
Data from: Outlier SNP markers reveal fine-scale genetic structuring across...
zenodo.org
data.niaid.nih.gov
+2 more
Updated Jun 1, 2022
Cite
Ilaria Milano; Massimiliano Babbucci; Alessia Cariani; Miroslava Atanassova; Dorte Bekkevold; Gary R. Carvalho; Montserrat Espiñeira; Fabio Fiorentino; Germana Garofalo; Audrey J. Geffen; Einar E. Nielsen; Rob Ogden; Tomaso Patarnello; Marco Stagioni; Fausto Tinti; Luca Bargelloni (2022). Data from: Outlier SNP markers reveal fine-scale genetic structuring across European hake populations (Merluccius merluccius) [Dataset]. http://doi.org/10.5061/dryad.7bn22
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.7bn22
Dataset updated
Jun 1, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ilaria Milano; Massimiliano Babbucci; Alessia Cariani; Miroslava Atanassova; Dorte Bekkevold; Gary R. Carvalho; Montserrat Espiñeira; Fabio Fiorentino; Germana Garofalo; Audrey J. Geffen; Einar E. Nielsen; Rob Ogden; Tomaso Patarnello; Marco Stagioni; Fausto Tinti; Luca Bargelloni
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Shallow population structure is generally reported for most marine fish and explained as a consequence of high dispersal, connectivity and large population size. Targeted gene analyses and more recently genome-wide studies have challenged such view, suggesting that adaptive divergence might occur even when neutral markers provide genetic homogeneity across populations. Here, 381 SNPs located in transcribed regions were used to assess large- and fine-scale population structure in the European hake (Merluccius merluccius), a widely distributed demersal species of high priority for the European fishery. Analysis of 850 individuals from 19 locations across the entire distribution range showed evidence for several outlier loci, with significantly higher resolving power. While 299 putatively neutral SNPs confirmed the genetic break between basins (FCT = 0.016) and weak differentiation within basins, outlier loci revealed a dramatic divergence between Atlantic and Mediterranean populations (FCT range 0.275–0.705) and fine-scale significant population structure. Outlier loci separated North Sea and Northern Portugal populations from all other Atlantic samples and revealed a strong differentiation among Western, Central and Eastern Mediterranean geographical samples. Significant correlation of allele frequencies at outlier loci with seawater surface temperature and salinity supported the hypothesis that populations might be adapted to local conditions. Such evidence highlights the importance of integrating information from neutral and adaptive evolutionary patterns towards a better assessment of genetic diversity. Accordingly, the generated outlier SNP data could be used for tackling illegal practices in hake fishing and commercialization as well as to develop explicit spatial models for defining management units and stock boundaries.
The Social Cost of Carbon: Trends, Outliers and Catastrophes [Dataset]
data.niaid.nih.gov
xls, zip
Updated Nov 25, 2009
Cite
Richard S.J. Tol (2009). The Social Cost of Carbon: Trends, Outliers and Catastrophes [Dataset] [Dataset]. http://doi.org/10.7910/DVN/LGIF0V
Explore at:
xls, zip (available download formats)
Unique identifier
https://doi.org/10.7910/DVN/LGIF0V
Dataset updated
Nov 25, 2009
Dataset provided by
Economic and Social Research Institute, Dublin
Authors
Richard S.J. Tol
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Global
Description
211 estimates of the social cost of carbon are included in a meta-analysis. The results confirm that a lower discount rate implies a higher estimate; and that higher estimates are found in the gray literature. It is also found that there is a downward trend in the economic impact estimates of the climate; that the Stern Review’s estimates of the social cost of carbon is an outlier; and that the right tail of the distribution is fat. There is a fair chance that the annual climate liability exceeds the annual income of many people.


