40 datasets found

f
Data from: Error and anomaly detection for intra-participant time-series...
tandf.figshare.com
xlsx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5189002
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
David R. Mullineaux; Gareth Irwin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.
g
Replication data for: Linear Models with Outliers: Choosing between...
datasearch.gesis.org
dataverse.harvard.edu
+1more
Updated Jan 22, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harden, Jeffrey; Desmarais, Bruce (2020). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. https://datasearch.gesis.org/dataset/httpsdataverse.unc.eduoai--hdl1902.2911608
Explore at:
Dataset updated
Jan 22, 2020
Dataset provided by
Odum Institute Dataverse Network
Authors
Harden, Jeffrey; Desmarais, Bruce
Description
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
f
Data_Sheet_1_The hazards of dealing with response time outliers.pdf
frontiersin.figshare.com
pdf
Updated Aug 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ivan I. Vankov (2023). Data_Sheet_1_The hazards of dealing with response time outliers.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2023.1220281.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fpsyg.2023.1220281.s001
Dataset updated
Aug 24, 2023
Dataset provided by
Frontiers
Authors
Ivan I. Vankov
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The presence of outliers in response times can affect statistical analyses and lead to incorrect interpretation of the outcome of a study. Therefore, it is a widely accepted practice to try to minimize the effect of outliers by preprocessing the raw data. There exist numerous methods for handling outliers and researchers are free to choose among them. In this article, we use computer simulations to show that serious problems arise from this flexibility. Choosing between alternative ways for handling outliers can result in the inflation of p-values and the distortion of confidence intervals and measures of effect size. Using Bayesian parameter estimation and probability distributions with heavier tails eliminates the need to deal with response times outliers, but at the expense of opening another source of flexibility.
f
Data after outlier processing.
plos.figshare.com
txt
Updated Dec 22, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qian Yang; Xueli Wang; Xianbing Cao; Shuai Liu; Feng Xie; Yumei Li (2023). Data after outlier processing. [Dataset]. http://doi.org/10.1371/journal.pone.0295674.s002
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0295674.s002
Dataset updated
Dec 22, 2023
Dataset provided by
PLOS ONE
Authors
Qian Yang; Xueli Wang; Xianbing Cao; Shuai Liu; Feng Xie; Yumei Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Physical fitness is a key element of a healthy life, and being overweight or lacking physical exercise will lead to health problems. Therefore, assessing an individual’s physical health status from a non-medical, cost-effective perspective is essential. This paper aimed to evaluate the national physical health status through national physical examination data, selecting 12 indicators to divide the physical health status into four levels: excellent, good, pass, and fail. The existing challenge lies in the fact that most literature on physical fitness assessment mainly focuses on the two major groups of sports athletes and school students. Unfortunately, there is no reasonable index system has been constructed. The evaluation method has limitations and cannot be applied to other groups. This paper builds a reasonable health indicator system based on national physical examination data, breaks group restrictions, studies national groups, and hopes to use machine learning models to provide helpful health suggestions for citizens to measure their physical status. We analyzed the significance of the selected indicators through nonparametric tests and exploratory statistical analysis. We used seven machine learning models to obtain the best multi-classification model for the physical fitness test level. Comprehensive research showed that MLP has the best classification effect, with macro-precision reaching 74.4% and micro-precision reaching 72.8%. Furthermore, the recall rates are also above 70%, and the Hamming loss is the smallest, i.e., 0.272. The practical implications of these findings are significant. Individuals can use the classification model to understand their physical fitness level and status, exercise appropriately according to the measurement indicators, and adjust their lifestyle, which is an important aspect of health management.
Data from: Outlier classification using autoencoders: application for...
osti.gov
Updated Jun 2, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/SKEHRJ
Dataset updated
Jun 2, 2021
Dataset provided by
Office of Sciencehttp://www.er.doe.gov/
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
Description
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower as when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Chemical outlier dataset
zenodo.org
bin
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mario Lovric; Mario Lovric (2020). Chemical outlier dataset [Dataset]. http://doi.org/10.5281/zenodo.1167835
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1167835
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mario Lovric; Mario Lovric
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The objects are numbered. The Y-variable are boiling points. Other features are structural features of molecules. In the outlier column the outliers are assigned with a value of 1.

The data is derived from a published chemical dataset on boiling point measurements [1] and from public data [2]. Features were generated by means of the RDKit Python library [3]. The dataset was infused with known outliers (~5%) based on significant structural differences, i.e. polar and non-polar molecules.

Cherqaoui D., Villemin D. Use of a Neural Network to determine the Boiling Point of Alkanes. J CHEM SOC FARADAY TRANS. 1994;90(1):97–102.

https://pubchem.ncbi.nlm.nih.gov/

RDKit: Open-source cheminformatics; http://www.rdkit.org
Gender_Classification_Dataset
kaggle.com
Updated Jun 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sameh Raouf (2024). Gender_Classification_Dataset [Dataset]. https://www.kaggle.com/datasets/samehraouf/gender-classification-dataset/suggestions
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 19, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sameh Raouf
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Title: Gender Classification Dataset

Description: This dataset contains anonymized information on height, weight, age, and gender of 10,000 individuals. The data is equally distributed between males and females, with 5,000 samples for each gender. The purpose of this dataset is to provide a comprehensive sample for studies and analyses related to physical attributes and demographics.

Content: The CSV file contains the following columns:

Gender: The gender of the individual (Male/Female) Height: The height of the individual in centimeters Weight: The weight of the individual in kilograms Age: The age of the individual in years

License: This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license. This means you are free to share the data, provided that you attribute the source, do not use it for commercial purposes, and do not distribute modified versions of the data.

Usage:

This dataset can be used for: - Analyzing the distribution of height, weight, and age across genders - Developing and testing machine learning models for predicting physical attributes - Educational purposes in statistics and data science courses
d
Data from: QST FST comparisons with unbalanced half-sib designs
search.dataone.org
data.niaid.nih.gov
+2more
Updated Apr 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kimberly J. Gilbert; Michael C. Whitlock (2025). QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5061/dryad.rm574
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.rm574
Dataset updated
Apr 4, 2025
Dataset provided by
Dryad Digital Repository
Authors
Kimberly J. Gilbert; Michael C. Whitlock
Time period covered
Jun 20, 2020
Description
QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
d
Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic...
catalog.data.gov
data.usgs.gov
Updated Nov 23, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic Unit Codes, 2008-2023 [Dataset]. https://catalog.data.gov/dataset/monthly-openet-image-collections-v2-0-summarized-by-12-digit-hydrologic-unit-codes-2008-20
Explore at:
Dataset updated
Nov 23, 2024
Dataset provided by
U.S. Geological Survey
Description
This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022). Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset. For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include: -huc12: The 12-digit hydrologic unit code -ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month calculated as the sum of daily ET interpolated between Landsat overpasses -statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month -year: 4-digit year -month: 2-digit month -count: Number of Landsat overpasses included in the ET calculation in the month -et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12 -count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent For the Ensemble data, these additional variables are included in the CSV files: -et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD) -et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD -et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD -et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD -et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal Below are the locations of each OpenET image collection used in this summary: DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0 eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0 geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0 PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0 SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0 SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0 Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
f
Data from: Leave-One-Out Kernel Density Estimates for Outlier Detection
tandf.figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sevvandi Kandanaarachchi; Rob J Hyndman (2023). Leave-One-Out Kernel Density Estimates for Outlier Detection [Dataset]. http://doi.org/10.6084/m9.figshare.16942936.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.16942936.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Sevvandi Kandanaarachchi; Rob J Hyndman
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This article introduces lookout, a new approach to detect outliers using leave-one-out kernel density estimates and extreme value theory. Outlier detection methods that use kernel density estimates generally employ a user defined parameter to determine the bandwidth. Lookout uses persistent homology to construct a bandwidth suitable for outlier detection without any user input. We demonstrate the effectiveness of lookout on an extensive data repository by comparing its performance with other outlier detection methods based on extreme value theory. Furthermore, we introduce outlier persistence, a useful concept that explores the birth and the cessation of outliers with changing bandwidth and significance levels. The R package lookout implements this algorithm. Supplementary files for this article are available online.
n
Anolis carolinensis character displacement SNP
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Jan 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Douglas Crawford (2023). Anolis carolinensis character displacement SNP [Dataset]. http://doi.org/10.5061/dryad.qbzkh18ks
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.qbzkh18ks
Dataset updated
Jan 27, 2023
Dataset provided by
University of Miami
Authors
Douglas Crawford
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Here are six files that provide details for all 44,120 identified single nucleotide polymorphisms (SNPs) or the 215 outlier SNPs associated with the evolution of rapid character displacement among replicate islands with (2Spp) and without competition (1Spp) between two Anolis species. On 2Spp islands, A. carolinensis occurs higher in trees and have evolved larger toe pads. Among 1Spp and 2Spp island populations, we identify 44,120 SNPs, with 215-outlier SNPs with improbably large FST values, low nucleotide variation, greater linkage than expected, and these SNPs are enriched for animal walking behavior. Thus, we conclude that these 215-outliers are evolving by natural selection in response to the phenotypic convergent evolution of character displacement. There are two, non-mutually exclusive perspective of these nucleotide variants. One is character displacement is convergent: all 215 outlier SNPs are shared among 3 out of 5 2Spp island and 24% of outlier SNPS are shared among all five out of five 2Spp island. Second, character displacement is genetically redundant because the allele frequencies in one or more 2Spp are similar to 1Spp islands: among one or more 2Spp islands 33% of outlier SNPS are within the range of 1Spp MiAF and 76% of outliers are more similar to 1Spp island than mean MiAF of 2Spp islands. Focusing on convergence SNP is scientifically more robust, yet it distracts from the perspective of multiple genetic solutions that enhances the rate and stability of adaptive change. The six files include: a description of eight islands, details of 94 individuals, and four files on SNPs. The four SNP files include the VCF files for 94 individuals with 44KSNPs and two files (Excel sheet/tab-delimited file) with FST, p-values and outlier status for all 44,120 identified single nucleotide polymorphisms (SNPs) associated with the evolution of rapid character displacement. The sixth file is a detailed file on the 215 outlier SNPs. Complete sequence data is available at Bioproject PRJNA833453, which including samples not included in this study. The 94 individuals used in this study are described in “Supplemental_Sample_description.txt” Methods Anoles and genomic DNA: Tissue or DNA for 160 Anolis carolinensis and 20 A. sagrei samples were provided by the Museum of Comparative Zoology at Harvard University (Table S2). Samples were previously used to examine evolution of character displacement in native A. carolinensis following invasion by A. sagrei onto man-made spoil islands in Mosquito Lagoon Florida (Stuart et al. 2014). One hundred samples were genomic DNAs, and 80 samples were tissues (terminal tail clip, Table S2). Genomic DNA was isolated from 80 of 160 A. carolinensis individuals (MCZ, Table S2) using a custom SPRI magnetic bead protocol (Psifidi et al. 2015). Briefly, after removing ethanol, tissues were placed in 200 ul of GH buffer (25 mM Tris- HCl pH 7.5, 25 mM EDTA, , 2M GuHCl Guanidine hydrochloride, G3272 SIGMA, 5 mM CaCl2, 0.5% v/v Triton X-100, 1% N-Lauroyl-Sarcosine) with 5% per volume of 20 mg/ml proteinase K (10 ul/200 ul GH) and digested at 55º C for at least 2 hours. After proteinase K digestion, 100 ul of 0.1% carboxyl-modified Sera-Mag Magnetic beads (Fisher Scientific) resuspended in 2.5 M NaCl, 20% PEG were added and allowed to bind the DNA. Beads were subsequently magnetized and washed twice with 200 ul 70% EtOH, and then DNA was eluted in 100 ul 0.1x TE (10 mM Tris, 0.1 mM EDTA). All DNA samples were gel electrophoresed to ensure high molecular mass and quantified by spectrophotometry and fluorescence using Biotium AccuBlueTM High Sensitivity dsDNA Quantitative Solution according to manufacturer’s instructions. Genotyping-by-sequencing (GBS) libraries were prepared using a modified protocol after Elshire et al. (Elshire et al. 2011). Briefly, high-molecular-weight genomic DNA was aliquoted and digested using ApeKI restriction enzyme. Digests from each individual sample were uniquely barcoded, pooled, and size selected to yield insert sizes between 300-700 bp (Borgstrom et al. 2011). Pooled libraries were PCR amplified (15 cycles) using custom primers that extend into the genomic DNA insert by 3 bases (CTG). Adding 3 extra base pairs systematically reduces the number of sequenced GBS tags, ensuring sufficient sequencing depth. The final library had a mean size of 424 bp ranging from 188 to 700 bp . Anolis SNPs: Pooled libraries were sequenced on one lane on the Illumina HiSeq 4000 in 2x150 bp paired-end configuration, yielding approximately 459 million paired-end reads ( ~138 Gb). The medium Q-Score was 42 with the lower 10% Q-Scores exceeding 32 for all 150 bp. The initial library contained 180 individuals with 8,561,493 polymorphic sites. Twenty individuals were Anolis sagrei, and two individuals (Yan 1610 & Yin 1411) clustered with A. sagrei and were not used to define A. carolinesis’ SNPs. Anolis carolinesis reads were aligned to the Anolis carolinensis genome (NCBI RefSeq accession number:/GCF_000090745.1_AnoCar2.0). Single nucleotide polymorphisms (SNPs) for A. carolinensis were called using the GBeaSy analysis pipeline (Wickland et al. 2017) with the following filter settings: minimum read length of 100 bp after barcode and adapter trimming, minimum phred-scaled variant quality of 30 and minimum read depth of 5. SNPs were further filtered by requiring SNPs to occur in > 50% of individuals, and 66 individuals were removed because they had less than 70% of called SNPs. These filtering steps resulted in 51,155 SNPs among 94 individuals. Final filtering among 94 individuals required all sites to be polymorphic (with fewer individuals, some sites were no longer polymorphic) with a maximum of 2 alleles (all are bi-allelic), minimal allele frequency 0.05, and He that does not exceed HWE (FDR <0.01). SNPs with large He were removed (2,280 SNPs). These SNPs with large significant heterozygosity may result from aligning paralogues (different loci), and thus may not represent polymorphisms. No SNPs were removed with low He (due to possible demography or other exceptions to HWE). After filtering, 94 individual yielded 44,120 SNPs. Thus, the final filtered SNP data set was 44K SNPs from 94 indiviuals. Statistical Analyses: Eight A. carolinensis populations were analyzed: three populations from islands with native species only (1Spp islands) and 5 populations from islands where A. carolinesis co-exist with A. sagrei (2Spp islands, Table 1, Table S1). Most analyses pooled the three 1Spp islands and contrasted these with the pooled five 2Spp islands. Two approaches were used to define SNPs with unusually large allele frequency differences between 1Spp and 2Spp islands: 1) comparison of FST values to random permutations and 2) a modified FDIST approach to identify outlier SNPs with large and statistically unlikely FST values. Random Permutations: FST values were calculated in VCFTools (version 4.2, (Danecek et al. 2011)) where the p-value per SNP were defined by comparing FST values to 1,000 random permutations using a custom script (below). Basically, individuals and all their SNPs were randomly assigned to one of eight islands or to 1Spp versus 2Spp groups. The sample sizes (55 for 2Spp and 39 for 1Spp islands) were maintained. FST values were re-calculated for each 1,000 randomizations using VCFTools. Modified FDIST: To identify outlier SNPs with statistically large FST values, a modified FDIST (Beaumont and Nichols 1996) was implemented in Arlequin (Excoffier et al. 2005). This modified approach applies 50,000 coalescent simulations using hierarchical population structure, in which demes are arranged into k groups of d demes and in which migration rates between demes are different within and between groups. Unlike the finite island models, which have led to large frequencies of false positive because populations share different histories (Lotterhos and Whitlock 2014), the hierarchical island model avoids these false positives by avoiding the assumption of similar ancestry (Excoffier et al. 2009). References Beaumont, M. A. and R. A. Nichols. 1996. Evaluating loci for use in the genetic analysis of population structure. P Roy Soc B-Biol Sci 263:1619-1626. Borgstrom, E., S. Lundin, and J. Lundeberg. 2011. Large scale library generation for high throughput sequencing. PLoS One 6:e19119. Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635. Cingolani, P., A. Platts, L. Wang le, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6:80-92. Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and G. Genomes Project Analysis. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158. Earl, D. A. and B. M. vonHoldt. 2011. Structure Harvester: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genet Resour 4:359-361. Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379. Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611-2620. Excoffier, L., T. Hofer, and M. Foll. 2009. Detecting loci under selection in a hierarchically structured population. Heredity 103:285-298. Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software package for population genetics data analysis.
d
Sliding window constrained fault-tolerant filtering of compressor vibration...
search.dataone.org
data.niaid.nih.gov
+1more
Updated Feb 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaolin Hu; Xianxi Chen; Guo Xi Sun (2025). Sliding window constrained fault-tolerant filtering of compressor vibration data [Dataset]. http://doi.org/10.5061/dryad.pc866t20z
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.pc866t20z
Dataset updated
Feb 5, 2025
Dataset provided by
Dryad Digital Repository
Authors
Shaolin Hu; Xianxi Chen; Guo Xi Sun
Description
This paper presents a sliding window constrained fault-tolerant filtering method for sampling data in petrochemical instrumentation. The method requires the design of an appropriate sliding window width based on the time series, as well as the expansion of both ends of the series. By utilizing a sliding window constraint function, the method produces a smoothed estimate for the current moment within the window. As the window advances, a series of smoothed estimates of the original sampled data is generated. Subsequently, the original series is subtracted from this smoothed estimate to create a new series that represents the differences between the two. This difference series is then subjected to an additional smoothing estimation process, and the resulting smoothed estimates are employed to compensate for the smoothed estimates of original sampled series. The experimental results indicate that, compared with sliding mean filtering, sliding median filtering, and Savitzky-Golay filtering,..., , , # Sliding window constrained fault-tolerant filtering of compressor vibration data

https://doi.org/10.5061/dryad.pc866t20z

Description of the data and file structure

Data type

Files containing â€˜fdata1case1â€™ in the file represents the case "1" of the location of the outlier in the measured data "1", and so on;

Files containing â€˜fwavedataâ€™ in the file name are wave signals with outliers;

Files containing â€˜fwave2dataâ€™ in the file name are polynomial signals with outliers;

Files containing â€˜normaldataâ€™ in the file name are normal measured data;

Files containing â€˜normalwavedataâ€™ in the file name are normal wave signals;

Files containing â€˜normalwave2dataâ€™ in the file name are normal polynomial signals;

Files containing â€˜ftffilteredâ€™ in the file name indicate that the data have been processed by sliding-window constrained error-tolerant filtering;

Files containing â€˜sgfilteredâ€™ in the file name indicate data after Savitzky-Golay filtering...
g
ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...
elki-project.github.io
explore.openaire.eu
+2more
Updated Sep 2, 2011
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Erich Schubert; Arthur Zimek (2011). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6355684
Dataset updated
Sep 2, 2011
Dataset provided by
TU Dortmund University
University of Southern Denmark, Denmark
Authors
Erich Schubert; Arthur Zimek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
Predictive Validity Data Set
figshare.com
txt
Updated Dec 18, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Antonio Abeyta (2022). Predictive Validity Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.17030021.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.17030021.v1
Dataset updated
Dec 18, 2022
Dataset provided by
Figsharehttp://figshare.com/
Authors
Antonio Abeyta
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE Scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants who were not required to take the GRE to be reviewed for admission. Fifty-seven more records were removed because they did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores were representative of the new scoring system and therefore were not able to be compared to the older scores based on raw score. After removal of these 96 records from our analyses, a total of 420 student records remained which included students that were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency in the participants, we removed 100 additional records so that our analyses only considered students that had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed for a final data set of 286 (see Outliers below). Outliers We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers. ROUT detected 39 outliers that were removed before statistical analysis was performed. Sample See detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test for normality regarding outcomes in the sample. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences between mean GRE scores, percentiles, and undergraduate GPA and candidacy exam results. Other variables were also observed such as gender, race, ethnicity, and citizenship status within the samples. Predictive Metrics. The input variables used in this study were GPA and scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests. Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took for each student to earn their doctoral degree, or the student’s candidacy examination result.
c
Health and wellbeing statistics (GP-level, England): Missing data and...
data.catchmentbasedapproach.org
Updated Apr 9, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Rivers Trust (2021). Health and wellbeing statistics (GP-level, England): Missing data and potential outliers [Dataset]. https://data.catchmentbasedapproach.org/datasets/42421dd003cb4f8a879077c802cc8ef1
Explore at:
Dataset updated
Apr 9, 2021
Dataset authored and provided by
The Rivers Trust
Area covered

Description
SUMMARYTo be viewed in combination with the following datasets and their accompanying metadata:- Levels of obesity, inactivity and associated illnesses (England): Summary-Levels of obesity and inactivity related illnesses (physical and mental illnesses): Summary -Levels of obesity and inactivity related illnesses (physical illnesses only): Summary-Obesity in adults (ages 18+)-Depression in adults (aged 18+)-Asthma (in persons of all ages)-Cancer (in persons of all ages)-Chronic kidney disease (in adults aged 18+)-Coronary heart disease (in persons of all ages)-Diabetes mellitus (in persons aged 17+)-Hypertension (in persons of all ages)-Stroke and transient ischaemic attack (in persons of all ages)For the aforementioned analyses, GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset identifies areas where data from 2019/20 was used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution.DATA SOURCESThis dataset was produced using:Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.COPYRIGHT NOTICEThe reproduction of this data must be accompanied by the following statement:© Ribble Rivers Trust 2021. Produced using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.CaBA HEALTH & WELLBEING EVIDENCE BASEThis dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
d
Morphological data quantifying sexual dimorphism of Anolis carolinensis in...
datadryad.org
data.niaid.nih.gov
+1more
zip
Updated Aug 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thor Veen; Yoel Stuart; Ambika Kamath; William Sherwin (2021). Morphological data quantifying sexual dimorphism of Anolis carolinensis in presence and absence of congener [Dataset]. http://doi.org/10.5061/dryad.d51c5b03v
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.d51c5b03v
Dataset updated
Aug 31, 2021
Dataset provided by
Dryad
Authors
Thor Veen; Yoel Stuart; Ambika Kamath; William Sherwin
Time period covered
2021
Description
README file included.
z
Controlled Anomalies Time Series (CATS) Dataset
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Jul 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8338435
Explore at:
csv, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8338435
Dataset updated
Jul 11, 2024
Dataset provided by
Solenix Engineering GmbH
Authors
Patrick Fleith; Patrick Fleith
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

Multivariate (17 variables) including sensors reading and control signals. It simulates the operational behaviour of an arbitrary complex system including:

4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.

3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.

5 million timestamps. Sensors readings are at 1Hz sampling frequency.

1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.

4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).

200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.

Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.

Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.

Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed itself is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithm to trace back anomalies to the right root cause channel.

Affected channels. In addition to the knowledge of the root cause channel in which the anomaly first developed itself, we provide information of channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems which may point out to the anomalous channels (root cause and affected).

Obvious anomalies. The simulated anomalies have been designed to be "easy" to be detected for human eyes (i.e., there are very large spikes or oscillations), hence also detectable for most algorithms. It makes this synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable to detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.

Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.

Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.

No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

Change Log

Version 2

Metadata: we include a metadata.csv with information about:

Anomaly categories

Root cause channel (signal in which the anomaly is first visible)

Affected channel (signal in which the anomaly might propagate) through coupled system dynamics

Removal of anomaly overlaps: version 1 contained anomalies which overlapped with each other resulting in only 190 distinct anomalous segments. Now, there are no more anomaly overlaps.

Two data files: CSV and parquet for convenience.

[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

About Solenix

Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
f
Data from: S1 Data set -
plos.figshare.com
xlsx
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nassim Dehouche; Sorawit Viravan; Ubolrat Santawat; Nungruethai Torsuwan; Sakuna Taijan; Atthakorn Intharakosum; Yongyut Sirivatanauksorn (2023). S1 Data set - [Dataset]. http://doi.org/10.1371/journal.pone.0288239.s012
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0288239.s012
Dataset updated
Jul 13, 2023
Dataset provided by
PLOS ONE
Authors
Nassim Dehouche; Sorawit Viravan; Ubolrat Santawat; Nungruethai Torsuwan; Sakuna Taijan; Atthakorn Intharakosum; Yongyut Sirivatanauksorn
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundThe typical hospital Length of Stay (LOS) distribution is known to be right-skewed, to vary considerably across Diagnosis Related Groups (DRGs), and to contain markedly high values, in significant proportions. These very long stays are often considered outliers, and thin-tailed statistical distributions are assumed. However, resource consumption and planning occur at the level of medical specialty departments covering multiple DRGs, and when considered at this decision-making scale, extreme LOS values represent a significant component of the distribution of LOS (the right tail) that determines many of its statistical properties.ObjectiveTo build actionable statistical models of LOS for resource planning at the level of healthcare units.MethodsThrough a study of 46, 364 electronic health records over four medical specialty departments (Pediatrics, Obstetrics/Gynecology, Surgery, and Rehabilitation Medicine) in the largest hospital in Thailand (Siriraj Hospital in Bangkok), we show that the distribution of LOS exhibits a tail behavior that is consistent with a subexponential distribution. We analyze some empirical properties of such a distribution that are of relevance to cost and resource planning, notably the concentration of resource consumption among a minority of admissions/patients, an increasing residual LOS, where the longer a patient has been admitted, the longer they would be expected to remain admitted, and a slow convergence of the Law of Large Numbers, making empirical estimates of moments (e.g. mean, variance) unreliable.ResultsWe propose a novel Beta-Geometric model that shows a good fit with observed data and reproduces these empirical properties of LOS. Finally, we use our findings to make practical recommendations regarding the pricing and management of LOS.
f
Data from: Tests for Scale Changes Based on Pairwise Differences
tandf.figshare.com
zip
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carina Gerstenberger; Daniel Vogel; Martin Wendler (2023). Tests for Scale Changes Based on Pairwise Differences [Dataset]. http://doi.org/10.6084/m9.figshare.8295638.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.8295638.v3
Dataset updated
Jun 3, 2023
Dataset provided by
Taylor & Francis
Authors
Carina Gerstenberger; Daniel Vogel; Martin Wendler
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In many applications it is important to know whether the amount of fluctuation in a series of observations changes over time. In this article, we investigate different tests for detecting changes in the scale of mean-stationary time series. The classical approach, based on the CUSUM test applied to the squared centered observations, is very vulnerable to outliers and impractical for heavy-tailed data, which leads us to contemplate test statistics based on alternative, less outlier-sensitive scale estimators. It turns out that the tests based on Gini’s mean difference (the average of all pairwise distances) and generalized Qn estimators (sample quantiles of all pairwise distances) are very suitable candidates. They improve upon the classical test not only under heavy tails or in the presence of outliers, but also under normality. We use recent results on the process convergence of U-statistics and U-quantiles for dependent sequences to derive the limiting distribution of the test statistics and propose estimators for the long-run variance. We show the consistency of the tests and demonstrate the applicability of the new change-point detection methods at two real-life data examples from hydrology and finance. Supplementary materials for this article are available online.
f
Data from: DELINEATION OF HOMOGENEOUS ZONES BASED ON GEOSTATISTICAL MODELS...
scielo.figshare.com
jpeg
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DANILO PEREIRA BARBOSA; EDUARDO LEONEL BOTTEGA; DOMINGOS SÁRVIO MAGALHÃES VALENTE; NERILSON TERRA SANTOS; WELLINGTON DONIZETE GUIMARÃES (2023). DELINEATION OF HOMOGENEOUS ZONES BASED ON GEOSTATISTICAL MODELS ROBUST TO OUTLIERS [Dataset]. http://doi.org/10.6084/m9.figshare.8986808.v1
Explore at:
jpegAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.8986808.v1
Dataset updated
May 30, 2023
Dataset provided by
SciELO journals
Authors
DANILO PEREIRA BARBOSA; EDUARDO LEONEL BOTTEGA; DOMINGOS SÁRVIO MAGALHÃES VALENTE; NERILSON TERRA SANTOS; WELLINGTON DONIZETE GUIMARÃES
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT Measures of the apparent electrical conductivity (ECa) of soil are used in many studies as indicators of spatial variability in physicochemical characteristics of production fields. Based on these measures, management zones (MZs) are delineated to improve agricultural management. However, these measures include outliers. The presence or incorrect identification and exclusion of outliers affect the variogram function and result in unreliable parameter estimates. Thus, the aim of this study was to model ECa data with outliers using methods based on robust approximation theory and model-based geostatistics to delineate MZs. Robust estimators developed by Cressie-Hawkins, Genton and MAD Dowd were tested. The Cressie-Hawkins semivariance estimator was selected, followed by the semivariogram cubic fit using Akaike information criterion (AIC). The robust kriging with an external drift plug-in was applied to fitted estimates, and the fuzzy k-means classifier was applied to the resulting ECa kriging map. Models with multiple MZs were evaluated using fuzzy k-means, and a map with two MZs was selected based on the fuzzy performance index (FPI), modified partition entropy (MPE) and Fukuyama-Sugeno and Xie-Beni indices. The defined MZs were validated based on differences between the ECa means using mixed linear models. The independent errors model was chosen for validation based on its AIC value. Thus, the results demonstrate that it is possible to delineate an MZ map without outlier exclusion, evidencing the efficacy of this methodology.

Facebook

Twitter

Click to copy link

Link copied

Cite

David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002

Data from: Error and anomaly detection for intra-participant time-series data

Explore at:

xlsxAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.5189002

Dataset updated

Jun 1, 2023

Dataset provided by

Taylor & Francis

Authors

David R. Mullineaux; Gareth Irwin

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.

Clear search

Close search

Google apps

Main menu

Data from: Error and anomaly detection for intra-participant time-series...

Replication data for: Linear Models with Outliers: Choosing between...

Data_Sheet_1_The hazards of dealing with response time outliers.pdf

Data after outlier processing.

Data from: Outlier classification using autoencoders: application for...

Chemical outlier dataset

Gender_Classification_Dataset

Data from: QST FST comparisons with unbalanced half-sib designs

Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic...

Data from: Leave-One-Out Kernel Density Estimates for Outlier Detection

Anolis carolinensis character displacement SNP

Sliding window constrained fault-tolerant filtering of compressor vibration...

Description of the data and file structure

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

Predictive Validity Data Set

Health and wellbeing statistics (GP-level, England): Missing data and...

Morphological data quantifying sexual dimorphism of Anolis carolinensis in...

Controlled Anomalies Time Series (CATS) Dataset

Data from: S1 Data set -

Data from: Tests for Scale Changes Based on Pairwise Differences

Data from: DELINEATION OF HOMOGENEOUS ZONES BASED ON GEOSTATISTICAL MODELS...

Data from: Error and anomaly detection for intra-participant time-series data