52 datasets found

Replication data for: Linear Models with Outliers: Choosing between...
dataverse.harvard.edu
dataverse-staging.rdmc.unc.edu
+1 more
pdf +1
Updated Aug 10, 2011
Cite
Harvard Dataverse (2011). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. http://doi.org/10.7910/DVN/JJLJKZ
Explore at:
Available download formats: text/plain; charset=us-ascii (5482), text/plain; charset=us-ascii (3590), pdf (198705)
Unique identifier
https://doi.org/10.7910/DVN/JJLJKZ
Dataset updated
Aug 10, 2011
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
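The contrast between conditional-mean and conditional-median estimation can be illustrated in the simplest possible setting, an intercept-only model, where OLS reduces to the sample mean and median regression reduces to the sample median. The sketch below is not the authors' selection test; the data values and helper names are made up for illustration.

```python
from statistics import mean, median

def squared_loss(values, c):
    """Sum of squared residuals around a constant c (the OLS criterion)."""
    return sum((v - c) ** 2 for v in values)

def absolute_loss(values, c):
    """Sum of absolute residuals around a constant c (the median-regression criterion)."""
    return sum(abs(v - c) for v in values)

# Hypothetical sample, then the same sample with its last value replaced by an outlier.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = clean[:-1] + [50.0]

# How far each estimate moves when the outlier is introduced.
mean_shift = mean(contaminated) - mean(clean)
median_shift = median(contaminated) - median(clean)

print(f"mean shifts by {mean_shift}, median shifts by {median_shift}")
```

In this toy sample the mean moves substantially while the median does not move at all, which is the sensitivity to heavy tails that the description refers to.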
Data from: Error and anomaly detection for intra-participant time-series...
tandf.figshare.com
xlsx
Updated Jun 1, 2023
Cite
David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.5189002
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
David R. Mullineaux; Gareth Irwin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identifying errors or anomalous values, collectively considered outliers, assists in exploring data and, once the outliers are removed, improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected in two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
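The stage 1 idea (flagging cycles that are spatial outliers at any single time point using the median absolute deviation) might be sketched as follows. This is not the supplied Matlab code: the function name, the fixed `threshold` multiplier, and the sample cycles are assumptions for illustration, whereas the description says the authors scale by significance levels of the t-statistic.

```python
from statistics import median

def mad_outlier_cycles(cycles, threshold=3.0):
    """Return indices of cycles with at least one time point lying more than
    `threshold` scaled MADs from the pointwise median across cycles.

    `threshold` is a hypothetical fixed multiplier, not the authors' scaling.
    """
    n_points = len(cycles[0])
    flagged = set()
    for t in range(n_points):
        column = [cycle[t] for cycle in cycles]
        med = median(column)
        mad = median([abs(v - med) for v in column])
        if mad == 0:
            continue  # no spread at this time point, nothing to flag
        scaled_mad = 1.4826 * mad  # consistency constant for normally distributed data
        for i, v in enumerate(column):
            if abs(v - med) / scaled_mad > threshold:
                flagged.add(i)
    return sorted(flagged)

# Hypothetical cycles: five strides sampled at three time points each.
cycles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [1.0, 2.0, 9.0],   # anomalous value at the last time point
    [1.05, 2.05, 3.05],
]
flagged = mad_outlier_cycles(cycles)
print(flagged)
```

A whole cycle is flagged as soon as one of its time points is extreme, mirroring the description's removal of cycles rather than individual samples.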
Outlier Detection and Feature Correlation
kaggle.com
zip
Updated Apr 18, 2025
Cite
Mahdi Mashayekhi (2025). Outlier Detection and Feature Correlation [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/outlier-detection-and-feature-correlation
Explore at:
Available download formats: zip (1094301 bytes)
Dataset updated
Apr 18, 2025
Authors
Mahdi Mashayekhi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This synthetic machine learning dataset is designed to help practitioners and students explore essential data preprocessing techniques, such as:

Outlier detection and handling

Handling missing (NaN) values

Understanding and resolving multicollinearity (high VIF)

Feature selection and engineering

It includes:

5000 records

25 numerical features named feature_1 to feature_25

Binary target variable named target (0 or 1)

~15% of injected outliers: certain values in selected columns deviate significantly from the mean to simulate anomalies.

~10% missing values (NaNs): randomly distributed to simulate real-world data imperfections.
Number of outlier years (>2 standard errors above or below the mean) per...
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Rebecca Chaplin-Kramer; Melvin R. George (2023). Number of outlier years (>2 standard errors above or below the mean) per time period. [Dataset]. http://doi.org/10.1371/journal.pone.0057723.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0057723.t004
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Rebecca Chaplin-Kramer; Melvin R. George
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of outlier years (>2 standard errors above or below the mean) per time period.
Results from stationary unit tests performed with 40 low-cost CatLog GPS...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jun 18, 2015
Cite
Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline (2015). Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected from for two antenna positions, three fix intervals programs and four habitat types. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001919478
Dataset updated
Jun 18, 2015
Authors
Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline
Description
Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected for two antenna positions, three fix-interval programs and four habitat types.
Outlier classification using autoencoders: application for fluctuation...
osti.gov
dataverse.harvard.edu
Updated Jun 2, 2021
Cite
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
Unique identifier
https://doi.org/10.7910/DVN/SKEHRJ
Dataset updated
Jun 2, 2021
Dataset provided by
Office of Science (http://www.er.doe.gov/)
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
Description
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Data from: Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit...
catalog.data.gov
data.usgs.gov
+1 more
Updated Nov 21, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic Unit Codes, 2008-2023 [Dataset]. https://catalog.data.gov/dataset/monthly-openet-image-collections-v2-0-summarized-by-12-digit-hydrologic-unit-codes-2008-20
Dataset updated
Nov 21, 2025
Dataset provided by
U.S. Geological Survey
Description
This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022).

Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset.

For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include:
- huc12: The 12-digit hydrologic unit code
- ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month calculated as the sum of daily ET interpolated between Landsat overpasses
- statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month
- year: 4-digit year
- month: 2-digit month
- count: Number of Landsat overpasses included in the ET calculation in the month
- et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12
- count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent

For the Ensemble data, these additional variables are included in the CSV files:
- et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD)
- et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD
- et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD
- et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD
- et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal

Below are the locations of each OpenET image collection used in this summary:
- DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0
- eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0
- geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0
- PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0
- SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0
- SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0
- Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
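How the ensemble variables (et_mad, et_mad_count, et_mad_min, et_mad_max, et_sam) relate to the six model values can be sketched as below. This is an illustration under assumptions, not the OpenET implementation: the `threshold` multiplier, the handling of a zero MAD, the function name, and the sample ET values are all hypothetical, since the description does not give the cutoff used.

```python
from statistics import mean, median

def mad_ensemble(model_et, threshold=2.0):
    """Summarise one month of ET values from several models.

    Values farther than `threshold` * MAD from the median are treated as
    outliers. `threshold` is a hypothetical choice, not taken from the source.
    """
    values = list(model_et.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        kept = values  # assumption: with no spread, keep every model
    else:
        kept = [v for v in values if abs(v - med) <= threshold * mad]
    return {
        "et_mad": mean(kept),          # ensemble mean after outlier filtering
        "et_mad_count": len(kept),     # number of models retained
        "et_mad_min": min(kept),
        "et_mad_max": max(kept),
        "et_sam": mean(values),        # simple mean without outlier removal
    }

# Made-up monthly ET values (mm) for the six models named in the description.
sample = {
    "DisALEXI": 80.0, "eeMETRIC": 82.0, "geeSEBAL": 78.0,
    "PT-JPL": 81.0, "SIMS": 79.0, "SSEBop": 120.0,
}
summary = mad_ensemble(sample)
print(summary)
```

With these made-up numbers one model is excluded, so et_mad and et_sam differ noticeably, which is the situation the two separate columns are meant to expose.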
The mean and standard deviation TPR for the anomaly detection algorithms.
plos.figshare.com
xls
Updated Jun 10, 2023
+ more versions
Cite
Firuz Kamalov; Hana Sulieman; David Santandreu Calonge (2023). The mean and standard deviation TPR for the anomaly detection algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0254340.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254340.t002
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Firuz Kamalov; Hana Sulieman; David Santandreu Calonge
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The results represent experiments on four datasets based on 20 simulated experiments. The proposed method (NewAlgo) produces the best overall results.
Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...
zenodo.org
zip
Updated May 7, 2025
Cite
o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15337959
Dataset updated
May 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
o; o
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
May 4, 2025
Description
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

ESM_2.py – Python script to calculate Z-scores from raw financial ratios

ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

ESM_5.xlsx – Mahalanobis distance values for each firm

ESM_6.py – Python script to compute Mahalanobis distances

ESM_7.py – Python script to visualize Mahalanobis distances

ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

ESM_9.py – Python script to compute mean Z-scores

ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

ESM_11.py – Python script to re-standardize mean Z-scores

ESM_12.py – Python script to generate the hierarchical clustering dendrogram

All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
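The Mahalanobis distance computation mentioned above can be sketched in a dependency-free form for the two-variable case. This is not the ESM_6.py script, whose contents are not shown here; the function name and the sample points are assumptions, and a real analysis with more than two ratios would normally use a linear-algebra library.

```python
from math import sqrt
from statistics import mean

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample centroid,
    using the sample covariance matrix (n - 1 denominator)."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(d2))
    return distances

# Hypothetical standardized values for two ratios across six firms.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (4.0, 4.0)]
distances = mahalanobis_2d(points)
most_extreme = distances.index(max(distances))
print(most_extreme, [round(d, 3) for d in distances])
```

Because the distance accounts for the correlation between the two variables, a point is judged by how unusual it is relative to the joint spread rather than by each variable separately.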
The 12 outliers identified in the Tonga dataset.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 1, 2017
Cite
Mayfield, Anderson B.; Dempsey, Alexandra C.; Chen, Chii-Shiarng (2017). The 12 outliers identified in the Tonga dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001760878
Dataset updated
Nov 1, 2017
Authors
Mayfield, Anderson B.; Dempsey, Alexandra C.; Chen, Chii-Shiarng
Area covered
Tonga
Description
Gene expression data have been presented as non-normalized (2^-Ct × 10^9) in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student’s t-test, p < 0.05) between the outlier and non-outlier averages for a parameter (instead using normalized gene expression data), the lower of the two values has been underlined. All samples hosted Symbiodinium of clade C only unless noted otherwise. The mean Mahalanobis distance did not differ between Pocillopora damicornis and P. acuta (Student’s t-test, p > 0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.
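The back-calculation of Ct mentioned above might look like the following sketch. It assumes the non-normalized expression value is 2 raised to the power of minus Ct, multiplied by 10 to the 9th power, which is a reading of the formula in the description rather than something confirmed by the dataset files; the function names are hypothetical.

```python
from math import log2

SCALE = 1e9  # assumed multiplier from the expression formula

def expression_from_ct(ct):
    """Non-normalized expression value for a given threshold cycle (Ct)."""
    return 2.0 ** (-ct) * SCALE

def ct_from_expression(value):
    """Invert the formula above to recover the raw Ct value."""
    return -log2(value / SCALE)

# Round trip with a made-up Ct of 25 cycles.
value = expression_from_ct(25.0)
recovered = ct_from_expression(value)
print(value, recovered)
```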
Median and Mean Values of the Parameters of Heroin and Control Participants...
datasetcatalog.nlm.nih.gov
figshare.com
+1 more
Updated Nov 12, 2014
Cite
González-Vallejo, Claudia; Cheng, Jiuqing (2014). Median and Mean Values of the Parameters of Heroin and Control Participants in [30]. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001223884
Dataset updated
Nov 12, 2014
Authors
González-Vallejo, Claudia; Cheng, Jiuqing
Description
Note: Superscripts S and L index small and large payoffs. Numbers in parentheses are the IQR for medians and the standard deviation for means; means were computed without outliers.
Weather Anomalies in the United States
kaggle.com
zip
Updated Nov 22, 2022
Cite
The Devastator (2022). Weather Anomalies in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-anomalies-in-the-united-states
Explore at:
Available download formats: zip (98365651 bytes)
Dataset updated
Nov 22, 2022
Authors
The Devastator
Area covered
United States
Description
Weather Anomalies in the United States

Outliers from 1964-2013

By Carl V. Lewis [source]

About this dataset

Historical Weather Outliers in the United States, 1964-2013: This dataset contains historical weather outliers in the United States from 1964 to 2013. The data includes the reporting station ID, name, min/max temperature, as well as degree coordinates of the recorded weather. The original weather data was collected from NOAA.

Each entry in this dataset represents a report from a weather station with high or low temperatures that were historical outliers within that month, averaged over time. This table's columns contain data that was collected from NOAA as well as data that was calculated using Enigma's assortment of weather data. The direct source of the information is identified in the description of the column.

Columns: date_str, degrees_from_mean, longitude, latitude, max_temp, min_temp, station_name, type

How to use the dataset

This dataset contains historical weather outliers in the United States from 1964 to 2013. The data includes the station ID, name, minimum and maximum temperatures, as well as degree coordinates of the recorded weather.

To use this dataset, simply download it and open it in a text editor or spreadsheet program. The data is organized by columns, with each column representing a different piece of information. Here is a brief explanation of each column:

date_str: The date of the weather report.

degrees_from_mean: The number of degrees that the temperature was above or below the historical mean for that month.

longitude: The longitude of the weather station.

latitude: The latitude of the weather station.

max_temp: The maximum temperature reported by the weather station.

min_temp: The minimum temperature reported by the weather station.

station_name: The name of the weather station.

type: The type of outlier, either high or low.

Research Ideas

Plotting the locations of outliers on a map of the US

Identifying weather patterns associated with outliers

Determining which areas of the US are most vulnerable to extreme weather events

Acknowledgements

This dataset was originally published by Enigma.io Analysis.

License

Unknown License - Please check the dataset description for more information.

Columns

File: weather-anomalies-1964-2013.csv

| Column name | Description |
|:------------------|:----------------------------------------------------------------------------------------------------|
| date_str | The date of the weather anomaly. (Date) |
| degrees_from_mean | The number of degrees that the temperature was above or below the monthly mean temperature. (Float) |
| longitude | The longitude of the weather station where the anomaly was recorded. (Float) |
| latitude | The latitude of the weather station where the anomaly was recorded. (Float) |
| max_temp | The maximum temperature recorded at the weather station on the date of the anomaly. (Float) |
| min_temp | The minimum temperature recorded at the weather station on the date of the anomaly. (Float) |
| station_name | The name of the weather station where the anomaly was recorded. (String) |
| type | The type of anomaly, either high or low temperature. (String) |
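Reading the documented columns and picking out the most extreme anomaly of each type could be sketched as follows. The column names come from the description; the sample rows, the date format, the literal `type` values "high" and "low", and the helper names are assumptions, so check them against the real file before relying on this.

```python
import csv
import io

# Made-up rows using the documented column names; real values will differ.
SAMPLE = """date_str,degrees_from_mean,longitude,latitude,max_temp,min_temp,station_name,type
1964-01-01,12.5,-80.2,25.8,30.0,21.0,STATION A,high
1964-01-02,-15.0,-104.9,39.7,-5.0,-20.0,STATION B,low
1964-01-03,18.0,-112.0,33.4,35.0,20.0,STATION C,high
"""

FLOAT_COLUMNS = ("degrees_from_mean", "longitude", "latitude", "max_temp", "min_temp")

def load_anomalies(handle):
    """Parse rows from an open CSV handle, converting the Float columns."""
    rows = []
    for row in csv.DictReader(handle):
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        rows.append(row)
    return rows

def most_extreme(rows, kind):
    """Return the row of the given type with the largest absolute deviation."""
    candidates = [r for r in rows if r["type"] == kind]
    return max(candidates, key=lambda r: abs(r["degrees_from_mean"]))

rows = load_anomalies(io.StringIO(SAMPLE))
hottest = most_extreme(rows, "high")
coldest = most_extreme(rows, "low")
print(hottest["station_name"], coldest["station_name"])
```

To use the downloaded file instead of the in-memory sample, pass `open("weather-anomalies-1964-2013.csv", newline="")` to `load_anomalies`.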

Acknowledgements

If you use this dataset in your research, please credit Carl V. Lewis.
Effect sizes calculated using MD and MC, excluding outliers
dro.deakin.edu.au
researchdata.edu.au
txt
Updated Nov 7, 2024
Cite
Don Driscoll (2024). Effect sizes calculated using MD and MC, excluding outliers [Dataset]. http://doi.org/10.26187/deakin.26264351.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.26187/deakin.26264351.v1
Dataset updated
Nov 7, 2024
Dataset provided by
Deakin University (http://www.deakin.edu.au/)
Authors
Don Driscoll
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Effect sizes calculated using mean difference for burnt-unburnt study designs and mean change for before-after designs. Outliers, as defined in the methods section of the paper, were excluded prior to calculating effect sizes.
Thyroid Disease Unsupervised Anomaly Detection
kaggle.com
zip
Updated May 16, 2021
Cite
LIFR (2021). Thyroid Disease Unsupervised Anomaly Detection [Dataset]. https://www.kaggle.com/zhonglifr/thyroid-disease-unsupervised-anomaly-detection
Explore at:
Available download formats: zip (85118 bytes)
Dataset updated
May 16, 2021
Authors
LIFR
Description
Context

"This is a dataset originally from the UCI Thyroid Disease Data Set. Then it was modified for unsupervised anomaly detection by Goldstein Markus et al. in 2015."

Content

This dataset has 16 categorical attributes, 5 numerical attributes, and 1 target attribute, for 22 attributes in total.

1) Variable descriptions for the categorical attributes (age is continuous but is listed with this group):
age: continuous.
sex: categorical, M, F.
on thyroxine: categorical, f, t.
query on thyroxine: categorical, f, t.
on antithyroid medication: categorical, f, t.
sick: categorical, f, t.
pregnant: categorical, f, t.
thyroid surgery: categorical, f, t.
I131 treatment: categorical, f, t.
query hypothyroid: categorical, f, t.
query hyperthyroid: categorical, f, t.
lithium: categorical, f, t.
goitre: categorical, f, t.
tumor: categorical, f, t.
hypopituitary: categorical, f, t.
psych: categorical, f, t.
For the sake of convenience, age is normalised into (0,1), and all the categorical variables are mapped as follows: {"M" -> 0, "F" -> 1} or {"f" -> 0, "t" -> 1}.

2) Variable descriptions for the numerical attributes:
TSH: continuous.
T3: continuous.
TT4: continuous.
T4U: continuous.
FTI: continuous.

3) Variable description for the target attribute:
outlier_label (target): categorical, o, n.
For the target attribute (outlier_label), "o" means outlier and "n" means normal. Note that the last column in the file is empty and should be removed.
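The categorical mappings described above can be applied with a small helper like the one below. The M/F and f/t encodings are taken from the description; the numeric encoding of the target label, the dictionary keys used for the record, and the function name are assumptions made for illustration.

```python
SEX_MAP = {"M": 0, "F": 1}        # from the description
BINARY_MAP = {"f": 0, "t": 1}     # from the description
LABEL_MAP = {"n": 0, "o": 1}      # assumption: numeric encoding of the target

def encode_record(record):
    """Convert one raw record (dict of attribute name to value) to numbers."""
    encoded = {}
    for name, value in record.items():
        if name == "sex":
            encoded[name] = SEX_MAP[value]
        elif name == "outlier_label":
            encoded[name] = LABEL_MAP[value]
        elif value in BINARY_MAP:
            encoded[name] = BINARY_MAP[value]
        else:
            encoded[name] = float(value)  # continuous attributes pass through
    return encoded

# Hypothetical record with a subset of the attributes.
raw = {
    "age": 0.45, "sex": "F", "on thyroxine": "t",
    "pregnant": "f", "TSH": 0.002, "outlier_label": "o",
}
encoded = encode_record(raw)
print(encoded)
```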

Acknowledgements

As stated by the original research paper [1]: "The thyroid dataset is another dataset from UCI machine learning repository in the medical domain. The raw patient measurements contain categorical attributes as well as missing values such that it was preprocessed in order to apply neural networks [2], also known as the “annthyroid” dataset. We make also use of this preprocessing, resulting in 21 dimensions. Normal instances (healthy non-hypothyroid patients) were taken from the training and test datasets. From the test set, we sampled 250 outliers from the two disease classes (subnormal function and hyperfunction) resulting in a new dataset containing 6,916 records with 3.61% anomalies."

Reference

[1] Goldstein M, Uchida S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PloS one, 2016, 11(4): e0152173.
[2] Schiffmann W, Joost M, Werner R. Synthesis and performance analysis of multilayer neural network architectures. 1992.
[3] Goldstein, Markus, 2015, "annthyroid-unsupervised-ad.tab", Unsupervised Anomaly Detection Benchmark, https://doi.org/10.7910/DVN/OPQMVF/CJURKL, Harvard Dataverse, V1, UNF:6:jJUwpBJ4iBlQto8WT6zsUg== [fileUNF]
Data from: QST FST comparisons with unbalanced half-sib designs
search.dataone.org
data.niaid.nih.gov
+1 more
Updated Apr 4, 2025
+ more versions
Cite
Kimberly J. Gilbert; Michael C. Whitlock (2025). QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5061/dryad.rm574
Unique identifier
https://doi.org/10.5061/dryad.rm574
Dataset updated
Apr 4, 2025
Dataset provided by
Dryad Digital Repository
Authors
Kimberly J. Gilbert; Michael C. Whitlock
Time period covered
Jun 20, 2020
Description
QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
Mean, standard deviation (after discarding of the outliers) and threshold...
plos.figshare.com
xls
Updated May 30, 2023
Cite
Giulia Di Lullo; Francesca Ieva; Renato Longhi; Anna Maria Paganoni; Maria Pia Protti (2023). Mean, standard deviation (after discarding of the outliers) and threshold for IFN-γ and IL-5 concentration at day 14 in the un-stimulated wells. [Dataset]. http://doi.org/10.1371/journal.pone.0042340.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0042340.t001
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Giulia Di Lullo; Francesca Ieva; Renato Longhi; Anna Maria Paganoni; Maria Pia Protti
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
(a) n.p., not performed.
Anolis carolinensis character displacement SNP
data.niaid.nih.gov
search.dataone.org
+1 more
zip
Updated Jan 27, 2023
Cite
Douglas Crawford (2023). Anolis carolinensis character displacement SNP [Dataset]. http://doi.org/10.5061/dryad.qbzkh18ks
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.qbzkh18ks
Dataset updated
Jan 27, 2023
Dataset provided by
University of Miami
Authors
Douglas Crawford
License
https://spdx.org/licenses/CC0-1.0.html
Description
Here are six files that provide details for all 44,120 identified single nucleotide polymorphisms (SNPs) or the 215 outlier SNPs associated with the evolution of rapid character displacement among replicate islands with (2Spp) and without competition (1Spp) between two Anolis species. On 2Spp islands, A. carolinensis occurs higher in trees and have evolved larger toe pads. Among 1Spp and 2Spp island populations, we identify 44,120 SNPs, with 215-outlier SNPs with improbably large FST values, low nucleotide variation, greater linkage than expected, and these SNPs are enriched for animal walking behavior. Thus, we conclude that these 215-outliers are evolving by natural selection in response to the phenotypic convergent evolution of character displacement. There are two, non-mutually exclusive perspective of these nucleotide variants. One is character displacement is convergent: all 215 outlier SNPs are shared among 3 out of 5 2Spp island and 24% of outlier SNPS are shared among all five out of five 2Spp island. Second, character displacement is genetically redundant because the allele frequencies in one or more 2Spp are similar to 1Spp islands: among one or more 2Spp islands 33% of outlier SNPS are within the range of 1Spp MiAF and 76% of outliers are more similar to 1Spp island than mean MiAF of 2Spp islands. Focusing on convergence SNP is scientifically more robust, yet it distracts from the perspective of multiple genetic solutions that enhances the rate and stability of adaptive change. The six files include: a description of eight islands, details of 94 individuals, and four files on SNPs. The four SNP files include the VCF files for 94 individuals with 44KSNPs and two files (Excel sheet/tab-delimited file) with FST, p-values and outlier status for all 44,120 identified single nucleotide polymorphisms (SNPs) associated with the evolution of rapid character displacement. The sixth file is a detailed file on the 215 outlier SNPs. 
Complete sequence data are available at Bioproject PRJNA833453, which includes samples not included in this study. The 94 individuals used in this study are described in “Supplemental_Sample_description.txt”.
Methods
Anoles and genomic DNA: Tissue or DNA for 160 Anolis carolinensis and 20 A. sagrei samples was provided by the Museum of Comparative Zoology at Harvard University (Table S2). Samples were previously used to examine the evolution of character displacement in native A. carolinensis following invasion by A. sagrei onto man-made spoil islands in Mosquito Lagoon, Florida (Stuart et al. 2014). One hundred samples were genomic DNAs, and 80 samples were tissues (terminal tail clip, Table S2). Genomic DNA was isolated from 80 of 160 A. carolinensis individuals (MCZ, Table S2) using a custom SPRI magnetic bead protocol (Psifidi et al. 2015). Briefly, after removing ethanol, tissues were placed in 200 ul of GH buffer (25 mM Tris-HCl pH 7.5, 25 mM EDTA, 2 M GuHCl [guanidine hydrochloride, G3272 SIGMA], 5 mM CaCl2, 0.5% v/v Triton X-100, 1% N-Lauroyl-Sarcosine) with 5% per volume of 20 mg/ml proteinase K (10 ul/200 ul GH) and digested at 55 °C for at least 2 hours. After proteinase K digestion, 100 ul of 0.1% carboxyl-modified Sera-Mag Magnetic beads (Fisher Scientific) resuspended in 2.5 M NaCl, 20% PEG were added and allowed to bind the DNA. Beads were subsequently magnetized and washed twice with 200 ul 70% EtOH, and then DNA was eluted in 100 ul 0.1x TE (10 mM Tris, 0.1 mM EDTA). All DNA samples were gel electrophoresed to ensure high molecular mass and quantified by spectrophotometry and fluorescence using Biotium AccuBlue™ High Sensitivity dsDNA Quantitative Solution according to the manufacturer’s instructions. Genotyping-by-sequencing (GBS) libraries were prepared using a protocol modified from Elshire et al. (2011). Briefly, high-molecular-weight genomic DNA was aliquoted and digested using ApeKI restriction enzyme.
Digests from each individual sample were uniquely barcoded, pooled, and size selected to yield insert sizes between 300 and 700 bp (Borgstrom et al. 2011). Pooled libraries were PCR amplified (15 cycles) using custom primers that extend into the genomic DNA insert by 3 bases (CTG). Adding 3 extra base pairs systematically reduces the number of sequenced GBS tags, ensuring sufficient sequencing depth. The final library had a mean size of 424 bp, ranging from 188 to 700 bp.
Anolis SNPs: Pooled libraries were sequenced on one lane of the Illumina HiSeq 4000 in 2x150 bp paired-end configuration, yielding approximately 459 million paired-end reads (~138 Gb). The median Q-Score was 42, with the lower 10% of Q-Scores exceeding 32 for all 150 bp. The initial library contained 180 individuals with 8,561,493 polymorphic sites. Twenty individuals were Anolis sagrei, and two individuals (Yan 1610 & Yin 1411) clustered with A. sagrei and were not used to define A. carolinensis SNPs. Anolis carolinensis reads were aligned to the Anolis carolinensis genome (NCBI RefSeq accession number: GCF_000090745.1_AnoCar2.0). Single nucleotide polymorphisms (SNPs) for A. carolinensis were called using the GBeaSy analysis pipeline (Wickland et al. 2017) with the following filter settings: minimum read length of 100 bp after barcode and adapter trimming, minimum phred-scaled variant quality of 30, and minimum read depth of 5. SNPs were further filtered by requiring SNPs to occur in > 50% of individuals, and 66 individuals were removed because they had less than 70% of called SNPs. These filtering steps resulted in 51,155 SNPs among 94 individuals. Final filtering among the 94 individuals required all sites to be polymorphic (with fewer individuals, some sites were no longer polymorphic) with a maximum of 2 alleles (all are bi-allelic), a minimum allele frequency of 0.05, and He that does not exceed HWE (FDR < 0.01). SNPs with large He were removed (2,280 SNPs).
These SNPs with large, significant heterozygosity may result from aligning paralogues (different loci), and thus may not represent polymorphisms. No SNPs were removed for low He (due to possible demography or other exceptions to HWE). After filtering, the 94 individuals yielded 44,120 SNPs. Thus, the final filtered SNP data set was 44K SNPs from 94 individuals.
Statistical Analyses: Eight A. carolinensis populations were analyzed: three populations from islands with the native species only (1Spp islands) and five populations from islands where A. carolinensis co-exists with A. sagrei (2Spp islands, Table 1, Table S1). Most analyses pooled the three 1Spp islands and contrasted these with the pooled five 2Spp islands. Two approaches were used to define SNPs with unusually large allele frequency differences between 1Spp and 2Spp islands: 1) comparison of FST values to random permutations and 2) a modified FDIST approach to identify outlier SNPs with large and statistically unlikely FST values.
Random Permutations: FST values were calculated in VCFtools (version 4.2; Danecek et al. 2011), where the p-value per SNP was defined by comparing FST values to 1,000 random permutations using a custom script (below). Basically, individuals and all their SNPs were randomly assigned to one of eight islands or to 1Spp versus 2Spp groups. The sample sizes (55 for 2Spp and 39 for 1Spp islands) were maintained. FST values were re-calculated for each of the 1,000 randomizations using VCFtools.
Modified FDIST: To identify outlier SNPs with statistically large FST values, a modified FDIST (Beaumont and Nichols 1996) was implemented in Arlequin (Excoffier et al. 2005). This modified approach applies 50,000 coalescent simulations using hierarchical population structure, in which demes are arranged into k groups of d demes and in which migration rates between demes are different within and between groups.
Unlike the finite island models, which have led to large frequencies of false positives because populations share different histories (Lotterhos and Whitlock 2014), the hierarchical island model avoids these false positives by avoiding the assumption of similar ancestry (Excoffier et al. 2009).
References
Beaumont, M. A. and R. A. Nichols. 1996. Evaluating loci for use in the genetic analysis of population structure. P Roy Soc B-Biol Sci 263:1619-1626.
Borgstrom, E., S. Lundin, and J. Lundeberg. 2011. Large scale library generation for high throughput sequencing. PLoS One 6:e19119.
Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.
Cingolani, P., A. Platts, L. L. Wang, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6:80-92.
Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158.
Earl, D. A. and B. M. vonHoldt. 2011. Structure Harvester: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genet Resour 4:359-361.
Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379.
Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611-2620.
Excoffier, L., T. Hofer, and M. Foll. 2009. Detecting loci under selection in a hierarchically structured population. Heredity 103:285-298.
Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software package for population genetics data analysis.
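The random-permutation step described in the Methods can be illustrated with a minimal sketch. This is not the authors' custom script and it does not call VCFtools: `simple_fst` is a hypothetical stand-in that uses a plain (HT - HS) / HT estimate for one bi-allelic SNP, genotypes are assumed to be coded as 0, 1, or 2 alternate alleles, and the +1 correction in the p-value is an assumption rather than something stated in the description. Only the group sizes (55 and 39) and the 1,000 permutations come from the text.

```python
import random


def allele_freq(genotypes):
    """Alternate-allele frequency from diploid genotypes coded 0, 1, 2."""
    return sum(genotypes) / (2 * len(genotypes))


def simple_fst(group_a, group_b):
    """Stand-in FST = (HT - HS) / HT for one bi-allelic SNP and two groups.

    Assumption: this is a simplified estimator, not the one VCFtools reports.
    """
    pa, pb = allele_freq(group_a), allele_freq(group_b)
    na, nb = len(group_a), len(group_b)
    p_total = (pa * na + pb * nb) / (na + nb)
    ht = 2 * p_total * (1 - p_total)  # expected heterozygosity, pooled
    if ht == 0:
        return 0.0  # monomorphic site: no differentiation to measure
    # Sample-size-weighted mean of within-group expected heterozygosity.
    hs = (na * 2 * pa * (1 - pa) + nb * 2 * pb * (1 - pb)) / (na + nb)
    return (ht - hs) / ht


def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Shuffle individuals between groups, keeping the group sizes fixed."""
    rng = random.Random(seed)
    observed = simple_fst(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    na = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if simple_fst(pooled[:na], pooled[na:]) >= observed:
            hits += 1
    # Assumed +1 correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical genotypes: 55 individuals (2Spp) and 39 individuals (1Spp).
    two_spp = [2] * 40 + [1] * 15
    one_spp = [0] * 30 + [1] * 9
    fst, p = permutation_p_value(two_spp, one_spp)
    print(f"FST={fst:.3f}, p={p:.4f}")
```

In a real analysis the shuffle would reassign whole individuals (all of their SNPs together), as the description states, and FST would be recomputed for every SNP on each permutation.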
Data from: Sliding window constrained fault-tolerant filtering of compressor...
search.dataone.org
data.niaid.nih.gov
+1more
Updated Feb 5, 2025
Shaolin Hu; Xianxi Chen; Guo Xi Sun (2025). Sliding window constrained fault-tolerant filtering of compressor vibration data [Dataset]. http://doi.org/10.5061/dryad.pc866t20z
Unique identifier
https://doi.org/10.5061/dryad.pc866t20z
Dataset updated
Feb 5, 2025
Dataset provided by
Dryad Digital Repository
Authors
Shaolin Hu; Xianxi Chen; Guo Xi Sun
Description
This paper presents a sliding window constrained fault-tolerant filtering method for sampling data in petrochemical instrumentation. The method requires the design of an appropriate sliding window width based on the time series, as well as the expansion of both ends of the series. By utilizing a sliding window constraint function, the method produces a smoothed estimate for the current moment within the window. As the window advances, a series of smoothed estimates of the original sampled data is generated. Subsequently, the original series is subtracted from this smoothed estimate to create a new series that represents the differences between the two. This difference series is then subjected to an additional smoothing estimation process, and the resulting smoothed estimates are employed to compensate for the smoothed estimates of original sampled series. The experimental results indicate that, compared with sliding mean filtering, sliding median filtering, and Savitzky-Golay filtering, ...

# Sliding window constrained fault-tolerant filtering of compressor vibration data

https://doi.org/10.5061/dryad.pc866t20z

Description of the data and file structure

Data type

Files containing ‘fdata1case1’ in the file name represent outlier-location case "1" in measured data set "1", and so on;

Files containing ‘fwavedata’ in the file name are wave signals with outliers;

Files containing ‘fwave2data’ in the file name are polynomial signals with outliers;

Files containing ‘normaldata’ in the file name are normal measured data;

Files containing ‘normalwavedata’ in the file name are normal wave signals;

Files containing ‘normalwave2data’ in the file name are normal polynomial signals;

Files containing ‘ftffiltered’ in the file name indicate that the data have been processed by sliding-window constrained error-tolerant filtering;

Files containing â€˜sgfilteredâ€™ in the file name indicate data after Savitzky-Golay filtering...
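The two-stage procedure outlined in the description (smooth the series, smooth the differences, then compensate) can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper's sliding window constraint function is not given in this excerpt, so a sliding median is swapped in as the smoother; expanding both ends by repeating the edge values, the odd window width, and the sign convention of the compensation are likewise assumptions. The function names are hypothetical.

```python
from statistics import median


def sliding_smooth(series, width):
    """Sliding-window smoother with both ends of the series expanded.

    Assumption: a median stands in for the paper's constraint function,
    and the ends are expanded by repeating the first and last values.
    """
    if not series:
        return []
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(series))]


def two_stage_filter(series, width):
    """Smooth, smooth the differences, then compensate the first estimate."""
    smoothed = sliding_smooth(series, width)
    # Difference series: smoothed estimate minus original samples.
    diff = [s - x for s, x in zip(smoothed, series)]
    diff_smoothed = sliding_smooth(diff, width)
    # Assumed compensation: remove the smoothed difference from the estimate.
    return [s - d for s, d in zip(smoothed, diff_smoothed)]


if __name__ == "__main__":
    # Hypothetical vibration samples with one outlier at index 3.
    samples = [1.0, 1.0, 1.0, 50.0, 1.0, 1.0, 1.0]
    print(two_stage_filter(samples, width=3))
```

With a median smoother, an isolated spike is suppressed in the first pass and the difference series contributes no correction, so the second stage mainly matters for smoothers that distort the underlying trend.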
Gene ontology enrichment analysis based on outlier windows for high mean FST...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Dec 20, 2012
Saelao, Perot; Stevens, Kristian A.; Langley, Charles H.; Begun, David J.; Cardeno, Charis M.; Pool, John E.; Corbett-Detig, Russell B.; Emerson, J. J.; Sugino, Ryuichi P.; Duchen, Pablo; Crepeau, Marc W. (2012). Gene ontology enrichment analysis based on outlier windows for high mean FST for African population comparisons. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001160013
Dataset updated
Dec 20, 2012
Authors
Saelao, Perot; Stevens, Kristian A.; Langley, Charles H.; Begun, David J.; Cardeno, Charis M.; Pool, John E.; Corbett-Detig, Russell B.; Emerson, J. J.; Sugino, Ryuichi P.; Duchen, Pablo; Crepeau, Marc W.
Description
Listed are GO categories with P<0.05 and outlier genes >1. Full results are given in Table S16.
Morphological data quantifying sexual dimorphism of Anolis carolinensis in...
datadryad.org
data.niaid.nih.gov
zip
Updated Aug 31, 2021
Thor Veen; Yoel Stuart; Ambika Kamath; William Sherwin (2021). Morphological data quantifying sexual dimorphism of Anolis carolinensis in presence and absence of congener [Dataset]. http://doi.org/10.5061/dryad.d51c5b03v
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.d51c5b03v
Dataset updated
Aug 31, 2021
Dataset provided by
Dryad
Authors
Thor Veen; Yoel Stuart; Ambika Kamath; William Sherwin
Time period covered
Aug 17, 2021
Description
README file included.


