52 datasets found
  1. Replication data for: Linear Models with Outliers: Choosing between...

    • dataverse.harvard.edu
    • dataverse-staging.rdmc.unc.edu
    • +1 more
    pdf +1
    Updated Aug 10, 2011
    Cite
    Harvard Dataverse (2011). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. http://doi.org/10.7910/DVN/JJLJKZ
    Available download formats: text/plain; charset=us-ascii (5482), text/plain; charset=us-ascii (3590), pdf (198705)
    Dataset updated
    Aug 10, 2011
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
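
    As a rough illustration of the contrast described above, the sketch below fits a one-predictor line by OLS and by median (least absolute deviations) regression on made-up data containing a single outlier. The data, the function names `ols_fit` and `median_fit`, and the brute-force search over pairs of points are all assumptions for illustration. This is not the authors' replication code and it does not implement their proposed hypothesis test.

```python
from itertools import combinations


def ols_fit(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def median_fit(xs, ys):
    """Return (intercept, slope) minimising the sum of absolute residuals.

    For a single predictor, some optimal line passes through two data
    points, so a brute-force search over all pairs is enough for small n.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical data: y = 2x + 1 exactly, except for one gross outlier.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[-1] = 100.0

ols_intercept, ols_slope = ols_fit(xs, ys)
mr_intercept, mr_slope = median_fit(xs, ys)
print("OLS slope:", ols_slope)
print("Median-regression slope:", mr_slope)
```

    On this toy data the median fit recovers the underlying line while the OLS slope is pulled toward the outlier. Real analyses would normally use a dedicated quantile-regression routine rather than this quadratic-time search.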

  2. Data from: Error and anomaly detection for intra-participant time-series...

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    David R. Mullineaux; Gareth Irwin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Identifying errors or anomalous values, collectively considered outliers, assists in exploring data, and removing outliers can improve statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected in two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
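
    The stage 1 idea above (flagging cycles that are spatial outliers at any time point using the median absolute deviation) can be sketched roughly as follows. This is not the supplied Matlab code: the function name `mad_outlier_cycles`, the fixed `threshold` of 3.0, and the 1.4826 consistency factor are assumptions standing in for the t-statistic-based scaling the description mentions, and the sample cycles are invented.

```python
from statistics import median


def mad_outlier_cycles(cycles, threshold=3.0):
    """Return indices of cycles flagged as outliers at any time point.

    cycles: list of equal-length lists, one list of samples per cycle.
    threshold: hypothetical cut-off on the scaled MAD distance.
    """
    flagged = set()
    n_points = len(cycles[0])
    for t in range(n_points):
        values = [cycle[t] for cycle in cycles]
        centre = median(values)
        mad = median(abs(v - centre) for v in values)
        if mad == 0:
            continue  # no spread at this time point, nothing to scale by
        for index, value in enumerate(values):
            # 1.4826 makes the MAD comparable to a standard deviation
            # under normality (an assumption, not from the source).
            if abs(value - centre) / (1.4826 * mad) > threshold:
                flagged.add(index)
    return sorted(flagged)


# Invented data: five short cycles, the fourth has one aberrant sample.
example_cycles = [
    [1.00, 2.00, 3.00],
    [1.10, 2.10, 3.10],
    [0.90, 1.90, 2.90],
    [1.00, 2.00, 9.00],
    [1.05, 2.05, 3.05],
]
flagged_cycles = mad_outlier_cycles(example_cycles)
print(flagged_cycles)
```

    As the description stresses, any such threshold would need to be tuned per data set, and flagged cycles inspected before removal.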

  3. Outlier Detection and Feature Correlation

    • kaggle.com
    zip
    Updated Apr 18, 2025
    Cite
    Mahdi Mashayekhi (2025). Outlier Detection and Feature Correlation [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/outlier-detection-and-feature-correlation
    Available download formats: zip (1094301 bytes)
    Dataset updated
    Apr 18, 2025
    Authors
    Mahdi Mashayekhi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This synthetic machine learning dataset is designed to help practitioners and students explore essential data preprocessing techniques, such as:

    Outlier detection and handling

    Handling missing (NaN) values

    Understanding and resolving multicollinearity (high VIF)

    Feature selection and engineering

    It includes:

    5000 records

    25 numerical features named feature_1 to feature_25

    Binary target variable named target (0 or 1)

    ~15% of injected outliers: certain values in selected columns deviate significantly from the mean to simulate anomalies.

    ~10% missing values (NaNs): randomly distributed to simulate real-world data imperfections.

  4. Number of outlier years (>2 standard errors above or below the mean) per...

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Rebecca Chaplin-Kramer; Melvin R. George (2023). Number of outlier years (>2 standard errors above or below the mean) per time period. [Dataset]. http://doi.org/10.1371/journal.pone.0057723.t004
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rebecca Chaplin-Kramer; Melvin R. George
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Number of outlier years (>2 standard errors above or below the mean) per time period.

  5. Results from stationary unit tests performed with 40 low-cost CatLog GPS...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 18, 2015
    Cite
    Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline (2015). Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected from for two antenna positions, three fix intervals programs and four habitat types. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001919478
    Dataset updated
    Jun 18, 2015
    Authors
    Poulle, Marie-Lazarine; Forin-Wiart, Marie-Amélie; Sirguey, Pascal; Hubert, Pauline
    Description

    Results from stationary unit tests performed with 40 low-cost CatLog GPS data loggers: the fix success rate (FSR) ± standard deviation (SD), mean time of the fix acquisition (μFAT), root mean square of the location errors (LERMS), mean location error (μLE), median location error (mLE), percentage of fixes with LE < 10 m, the mean number of outliers per unit (N outliers) and root mean square of the location errors after the removal of outliers (LERMS without outliers) for positional fixes collected for two antenna positions, three fix interval programs and four habitat types.

  6. Outlier classification using autoencoders: application for fluctuation...

    • osti.gov
    • dataverse.harvard.edu
    Updated Jun 2, 2021
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

  7. Data from: Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 21, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic Unit Codes, 2008-2023 [Dataset]. https://catalog.data.gov/dataset/monthly-openet-image-collections-v2-0-summarized-by-12-digit-hydrologic-unit-codes-2008-20
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022). Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset.

    For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include:

    • huc12: The 12-digit hydrologic unit code
    • ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month, calculated as the sum of daily ET interpolated between Landsat overpasses
    • statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month
    • year: 4-digit year
    • month: 2-digit month
    • count: Number of Landsat overpasses included in the ET calculation in the month
    • et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12
    • count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent

    For the Ensemble data, these additional variables are included in the CSV files:

    • et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD)
    • et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD
    • et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD
    • et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD
    • et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal

    Below are the locations of each OpenET image collection used in this summary:

    • DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0
    • eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0
    • geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0
    • PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0
    • SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0
    • SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0
    • Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
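
    To make the relationship between the ensemble variables concrete, here is a minimal sketch of how et_mad, et_mad_count, et_mad_min, et_mad_max and et_sam could be derived from six model values for one HUC12 and month. The function name `ensemble_summary`, the cut-off multiplier `k`, and the ET values are assumptions for illustration. The exact filtering rule used by OpenET (Melton and others, 2022) is not given in this description and may differ.

```python
from statistics import mean, median


def ensemble_summary(model_et, k=2.0):
    """Summarise per-model ET values with a simple MAD outlier filter.

    model_et: dict mapping model name to ET in millimeters.
    k: hypothetical multiplier; values further than k * MAD from the
       median are treated as outliers.
    """
    values = list(model_et.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        kept = values  # no spread to scale by, so keep every model
    else:
        kept = [v for v in values if abs(v - centre) <= k * mad]
    return {
        "et_mad": mean(kept),
        "et_mad_count": len(kept),
        "et_mad_min": min(kept),
        "et_mad_max": max(kept),
        "et_sam": mean(values),  # simple mean, no outlier removal
    }


# Invented monthly ET values (mm) for one HUC12.
example_et = {
    "DisALEXI": 50.0,
    "eeMETRIC": 52.0,
    "geeSEBAL": 51.0,
    "PT-JPL": 49.0,
    "SIMS": 90.0,
    "SSEBop": 48.0,
}
summary = ensemble_summary(example_et)
print(summary)
```

    In this toy case the SIMS value is dropped, so et_mad is computed from five models while et_sam still averages all six.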

  8. The mean and standard deviation TPR for the anomaly detection algorithms.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    + more versions
    Cite
    Firuz Kamalov; Hana Sulieman; David Santandreu Calonge (2023). The mean and standard deviation TPR for the anomaly detection algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0254340.t002
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Firuz Kamalov; Hana Sulieman; David Santandreu Calonge
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The results represent experiments on four datasets based on 20 simulated experiments. The proposed method (NewAlgo) produces the best overall results.

  9. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
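
    For readers unfamiliar with the distance computed in this workflow, the sketch below calculates Mahalanobis distances for observations described by two variables, using only the Python standard library. It is not one of the ESM scripts: the function name `mahalanobis_2d`, the restriction to two variables (so the covariance matrix can be inverted by hand), and the ratio values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean


def mahalanobis_2d(xs, ys):
    """Return the Mahalanobis distance of each (x, y) pair from the mean.

    Uses the sample covariance matrix (n - 1 denominator) and the closed
    form for the inverse of a 2x2 matrix.
    """
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(d2))
    return distances


# Invented values of two hypothetical ratios for six firms; the last
# firm breaks the pattern followed by the others.
ratio_a = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0]
ratio_b = [0.5, 0.6, 0.45, 0.55, 0.5, 0.1]
distances = mahalanobis_2d(ratio_a, ratio_b)
most_unusual = distances.index(max(distances))
print(most_unusual, [round(d, 3) for d in distances])
```

    Because the distance accounts for the correlation between variables, it can single out an observation whose combination of values is unusual even when each value alone is less striking. With more than two variables a linear-algebra library would be the practical choice.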

  10. The 12 outliers identified in the Tonga dataset.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 1, 2017
    Cite
    Mayfield, Anderson B.; Dempsey, Alexandra C.; Chen, Chii-Shiarng (2017). The 12 outliers identified in the Tonga dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001760878
    Dataset updated
    Nov 1, 2017
    Authors
    Mayfield, Anderson B.; Dempsey, Alexandra C.; Chen, Chii-Shiarng
    Area covered
    Tonga
    Description

    Gene expression data have been presented as non-normalized (2^-Ct × 10^9) in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student’s t-test, p < 0.05) between the outlier and non-outlier averages for a parameter (instead using normalized gene expression data), the lower of the two values has been underlined. All samples hosted Symbiodinium of clade C only unless noted otherwise. The mean Mahalanobis distance did not differ between Pocillopora damicornis and P. acuta (Student’s t-test, p > 0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.

  11. Median and Mean Values of the Parameters of Heroin and Control Participants...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    • +1 more
    Updated Nov 12, 2014
    Cite
    González-Vallejo, Claudia; Cheng, Jiuqing (2014). Median and Mean Values of the Parameters of Heroin and Control Participants in [30]. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001223884
    Dataset updated
    Nov 12, 2014
    Authors
    González-Vallejo, Claudia; Cheng, Jiuqing
    Description

    Note: Superscripts S and L index small and large payoffs. Numbers in parentheses are IQR for medians and standard deviation for means; means computed without outliers. Median and Mean Values of the Parameters of Heroin and Control Participants in [30].

  12. Weather Anomalies in the United States

    • kaggle.com
    zip
    Updated Nov 22, 2022
    Cite
    The Devastator (2022). Weather Anomalies in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-anomalies-in-the-united-states
    Available download formats: zip (98365651 bytes)
    Dataset updated
    Nov 22, 2022
    Authors
    The Devastator
    Area covered
    United States
    Description

    Weather Anomalies in the United States

    Outliers from 1964-2013

    By Carl V. Lewis [source]

    About this dataset

    Historical Weather Outliers in the United States, 1964-2013: This dataset contains historical weather outliers in the United States from 1964 to 2013. The data includes the reporting station ID, name, min/max temperature, as well as degree coordinates of the recorded weather. The original weather data was collected from NOAA.

    Each entry in this dataset represents a report from a weather station with high or low temperatures that were historical outliers within that month, averaged over time. This table's columns contain data that was collected from NOAA as well as data that was calculated using Enigma's assortment of weather data. The direct source of the information is identified in the description of the column.

    Columns: date_str, degrees_from_mean, longitude, latitude, max_temp, min_temp, station_name, type

    How to use the dataset

    This dataset contains historical weather outliers in the United States from 1964 to 2013. The data includes the station ID, name, minimum and maximum temperatures, as well as degree coordinates of the recorded weather.

    To use this dataset, simply download it and open it in a text editor or spreadsheet program. The data is organized by columns, with each column representing a different piece of information. Here is a brief explanation of each column:

    • date_str: The date of the weather report.
    • degrees_from_mean: The number of degrees that the temperature was above or below the historical mean for that month.
    • longitude: The longitude of the weather station.
    • latitude: The latitude of the weather station.
    • max_temp: The maximum temperature reported by the weather station.
    • min_temp: The minimum temperature reported by the weather station.
    • station_name: The name of the weather station.
    • type: The type of outlier, either high or low
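
    A minimal sketch of reading these columns with Python's standard `csv` module follows. The column names come from the list above, but the sample rows, the date format, the literal `high`/`low` strings in the type column, and the helper name `load_outliers` are assumptions for illustration. The real file should be inspected before relying on them.

```python
import csv
import io

# Columns from the dataset description that hold numeric values.
NUMERIC_COLUMNS = [
    "degrees_from_mean", "longitude", "latitude", "max_temp", "min_temp",
]

# Invented rows standing in for weather-anomalies-1964-2013.csv.
SAMPLE_CSV = """date_str,degrees_from_mean,longitude,latitude,max_temp,min_temp,station_name,type
1964-01-01,12.5,-84.4,33.6,25.0,10.0,STATION A,high
1964-01-02,-15.0,-93.2,44.9,-20.0,-30.0,STATION B,low
1964-01-03,11.0,-118.4,33.9,35.0,20.0,STATION C,high
"""


def load_outliers(handle):
    """Read outlier rows from an open text file, converting numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return rows


rows = load_outliers(io.StringIO(SAMPLE_CSV))
# Assumes the type column uses the literal strings "high" and "low".
high_outliers = [row for row in rows if row["type"] == "high"]
print(len(rows), len(high_outliers))
```

    With the downloaded file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="")`, where `path` points at the extracted CSV.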

    Research Ideas

    • Plotting the locations of outliers on a map of the US
    • Identifying weather patterns associated with outliers
    • Determining which areas of the US are most vulnerable to extreme weather events

    Acknowledgements

    This dataset was originally published by Enigma.io Analysis.


    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: weather-anomalies-1964-2013.csv

    | Column name       | Description                                                                                         |
    |:------------------|:----------------------------------------------------------------------------------------------------|
    | date_str          | The date of the weather anomaly. (Date)                                                             |
    | degrees_from_mean | The number of degrees that the temperature was above or below the monthly mean temperature. (Float) |
    | longitude         | The longitude of the weather station where the anomaly was recorded. (Float)                        |
    | latitude          | The latitude of the weather station where the anomaly was recorded. (Float)                         |
    | max_temp          | The maximum temperature recorded at the weather station on the date of the anomaly. (Float)         |
    | min_temp          | The minimum temperature recorded at the weather station on the date of the anomaly. (Float)         |
    | station_name      | The name of the weather station where the anomaly was recorded. (String)                            |
    | type              | The type of anomaly, either high or low temperature. (String)                                       |

    Acknowledgements

    If you use this dataset in your research, please credit Carl V. Lewis.

  13. Effect sizes calculated using MD and MC, excluding outliers

    • dro.deakin.edu.au
    • researchdata.edu.au
    txt
    Updated Nov 7, 2024
    Cite
    Don Driscoll (2024). Effect sizes calculated using MD and MC, excluding outliers [Dataset]. http://doi.org/10.26187/deakin.26264351.v1
    Available download formats: txt
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Deakin University (http://www.deakin.edu.au/)
    Authors
    Don Driscoll
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Effect sizes calculated using mean difference for burnt-unburnt study designs and mean change for before-after designs. Outliers, as defined in the methods section of the paper, were excluded prior to calculating effect sizes.

  14. Thyroid Disease Unsupervised Anomaly Detection

    • kaggle.com
    zip
    Updated May 16, 2021
    Cite
    LIFR (2021). Thyroid Disease Unsupervised Anomaly Detection [Dataset]. https://www.kaggle.com/zhonglifr/thyroid-disease-unsupervised-anomaly-detection
    Available download formats: zip (85118 bytes)
    Dataset updated
    May 16, 2021
    Authors
    LIFR
    Description

    Context

    "This is a dataset originally from the UCI Thyroid Disease Data Set. Then it was modified for unsupervised anomaly detection by Goldstein Markus et al. in 2015."

    Content

    This dataset has 16 categorical attributes, 5 numerical attributes, and 1 target attribute, for 22 attributes in total.

    1) Variable descriptions for the categorical attributes: age: continuous. sex: categorical, M, F. on thyroxine: categorical, f, t. query on thyroxine: categorical, f, t. on antithyroid medication: categorical, f, t. sick: categorical, f, t. pregnant: categorical, f, t. thyroid surgery: categorical, f, t. I131 treatment: categorical, f, t. query hypothyroid: categorical, f, t. query hyperthyroid: categorical, f, t. lithium: categorical, f, t. goitre: categorical, f, t. tumor: categorical, f, t. hypopituitary: categorical, f, t. psych: categorical, f, t. For convenience, age is normalised into (0,1), and all the categorical variables are mapped as follows: {"M" -> 0, "F" -> 1} or {"f" -> 0, "t" -> 1}.

    2) Variable descriptions for the numerical attributes: TSH: continuous. T3: continuous. TT4: continuous. T4U: continuous. FTI: continuous.

    3) Variable description for the target attribute: outlier_label (target): categorical, o, n. For the target attribute (outlier_label), "o" means outlier and "n" means normal. The last column in the file is empty and should be removed.
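
    The mappings described above can be sketched as a small encoding helper. The `BINARY_MAP` entries follow the description. The numeric coding of the target (`LABEL_MAP`), the helper name `encode_record`, and the example record values are assumptions for illustration, since the description only states that "o" means outlier and "n" means normal.

```python
# Mapping stated in the description: M/F and f/t become 0/1.
BINARY_MAP = {"M": 0, "F": 1, "f": 0, "t": 1}

# Assumption: code normal as 0 and outlier as 1 for downstream use.
LABEL_MAP = {"n": 0, "o": 1}


def encode_record(record):
    """Return a copy of the record with categorical values made numeric."""
    encoded = {}
    for name, value in record.items():
        if name == "outlier_label":
            encoded[name] = LABEL_MAP[value]
        elif value in BINARY_MAP:
            encoded[name] = BINARY_MAP[value]
        else:
            encoded[name] = float(value)  # continuous attributes
    return encoded


# Invented record using a few of the attribute names listed above.
example_record = {
    "age": 0.45,
    "sex": "F",
    "on thyroxine": "f",
    "TSH": 0.0013,
    "outlier_label": "o",
}
encoded_record = encode_record(example_record)
print(encoded_record)
```

    For unsupervised anomaly detection the encoded outlier_label would typically be held out during fitting and used only to evaluate the detector afterwards.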

    Acknowledgements

    As stated by the original research paper [1]: "The thyroid dataset is another dataset from UCI machine learning repository in the medical domain. The raw patient measurements contain categorical attributes as well as missing values such that it was preprocessed in order to apply neural networks [2], also known as the “annthyroid” dataset. We make also use of this preprocessing, resulting in 21 dimensions. Normal instances (healthy non-hypothyroid patients) were taken from the training and test datasets. From the test set, we sampled 250 outliers from the two disease classes (subnormal function and hyperfunction) resulting in a new dataset containing 6,916 records with 3.61% anomalies."

    Reference

    [1] Goldstein M, Uchida S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data[J]. PloS one, 2016, 11(4): e0152173.
    [2] Schiffmann W, Joost M, Werner R. Synthesis and performance analysis of multilayer neural network architectures[J]. 1992.
    [3] Goldstein, Markus, 2015, "annthyroid-unsupervised-ad.tab", Unsupervised Anomaly Detection Benchmark, https://doi.org/10.7910/DVN/OPQMVF/CJURKL, Harvard Dataverse, V1, UNF:6:jJUwpBJ4iBlQto8WT6zsUg== [fileUNF]

  15. Data from: QST FST comparisons with unbalanced half-sib designs

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 4, 2025
    + more versions
    Cite
    Kimberly J. Gilbert; Michael C. Whitlock (2025). QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5061/dryad.rm574
    Dataset updated
    Apr 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kimberly J. Gilbert; Michael C. Whitlock
    Time period covered
    Jun 20, 2020
    Description

    QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.

  16. Mean, standard deviation (after discarding of the outliers) and threshold...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Giulia Di Lullo; Francesca Ieva; Renato Longhi; Anna Maria Paganoni; Maria Pia Protti (2023). Mean, standard deviation (after discarding of the outliers) and threshold for IFN-γ and IL-5 concentration at day 14 in the un-stimulated wells. [Dataset]. http://doi.org/10.1371/journal.pone.0042340.t001
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Giulia Di Lullo; Francesca Ieva; Renato Longhi; Anna Maria Paganoni; Maria Pia Protti
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    n.p., not performed.

  17. Anolis carolinensis character displacement SNP

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Jan 27, 2023
    Cite
    Douglas Crawford (2023). Anolis carolinensis character displacement SNP [Dataset]. http://doi.org/10.5061/dryad.qbzkh18ks
    Available download formats: zip
    Dataset updated
    Jan 27, 2023
    Dataset provided by
    University of Miami
    Authors
    Douglas Crawford
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Here are six files that provide details for all 44,120 identified single nucleotide polymorphisms (SNPs) or the 215 outlier SNPs associated with the evolution of rapid character displacement among replicate islands with (2Spp) and without (1Spp) competition between two Anolis species. On 2Spp islands, A. carolinensis occurs higher in trees and has evolved larger toe pads. Among 1Spp and 2Spp island populations, we identify 44,120 SNPs, of which 215 are outlier SNPs with improbably large FST values, low nucleotide variation, and greater linkage than expected; these SNPs are enriched for animal walking behavior. Thus, we conclude that these 215 outliers are evolving by natural selection in response to the phenotypic convergent evolution of character displacement. There are two non-mutually exclusive perspectives on these nucleotide variants. First, character displacement is convergent: all 215 outlier SNPs are shared among three of the five 2Spp islands, and 24% of outlier SNPs are shared among all five 2Spp islands. Second, character displacement is genetically redundant because the allele frequencies on one or more 2Spp islands are similar to those on 1Spp islands: among one or more 2Spp islands, 33% of outlier SNPs are within the range of 1Spp MiAF, and 76% of outliers are more similar to 1Spp islands than to the mean MiAF of 2Spp islands. Focusing on convergent SNPs is scientifically more robust, yet it distracts from the perspective of multiple genetic solutions that enhance the rate and stability of adaptive change. The six files include a description of eight islands, details of 94 individuals, and four files on SNPs. The four SNP files include the VCF files for 94 individuals with 44K SNPs and two files (Excel sheet/tab-delimited file) with FST, p-values, and outlier status for all 44,120 identified SNPs associated with the evolution of rapid character displacement. The sixth file is a detailed file on the 215 outlier SNPs.
    Complete sequence data are available at BioProject PRJNA833453, which includes samples not included in this study. The 94 individuals used in this study are described in “Supplemental_Sample_description.txt”.

    Methods

    Anoles and genomic DNA: Tissue or DNA for 160 Anolis carolinensis and 20 A. sagrei samples were provided by the Museum of Comparative Zoology at Harvard University (Table S2). Samples were previously used to examine evolution of character displacement in native A. carolinensis following invasion by A. sagrei onto man-made spoil islands in Mosquito Lagoon, Florida (Stuart et al. 2014). One hundred samples were genomic DNAs, and 80 samples were tissues (terminal tail clip, Table S2). Genomic DNA was isolated from 80 of 160 A. carolinensis individuals (MCZ, Table S2) using a custom SPRI magnetic bead protocol (Psifidi et al. 2015). Briefly, after removing ethanol, tissues were placed in 200 ul of GH buffer (25 mM Tris-HCl pH 7.5, 25 mM EDTA, 2 M guanidine hydrochloride [GuHCl; G3272, SIGMA], 5 mM CaCl2, 0.5% v/v Triton X-100, 1% N-Lauroyl-Sarcosine) with 5% per volume of 20 mg/ml proteinase K (10 ul/200 ul GH) and digested at 55 °C for at least 2 hours. After proteinase K digestion, 100 ul of 0.1% carboxyl-modified Sera-Mag Magnetic beads (Fisher Scientific) resuspended in 2.5 M NaCl, 20% PEG were added and allowed to bind the DNA. Beads were subsequently magnetized and washed twice with 200 ul 70% EtOH, and then DNA was eluted in 100 ul 0.1x TE (10 mM Tris, 0.1 mM EDTA). All DNA samples were gel electrophoresed to ensure high molecular mass and quantified by spectrophotometry and fluorescence using Biotium AccuBlue™ High Sensitivity dsDNA Quantitative Solution according to the manufacturer’s instructions. Genotyping-by-sequencing (GBS) libraries were prepared using a modified protocol after Elshire et al. (2011). Briefly, high-molecular-weight genomic DNA was aliquoted and digested using ApeKI restriction enzyme.
    Digests from each individual sample were uniquely barcoded, pooled, and size selected to yield insert sizes between 300-700 bp (Borgstrom et al. 2011). Pooled libraries were PCR amplified (15 cycles) using custom primers that extend into the genomic DNA insert by 3 bases (CTG). Adding 3 extra base pairs systematically reduces the number of sequenced GBS tags, ensuring sufficient sequencing depth. The final library had a mean size of 424 bp, ranging from 188 to 700 bp.

    Anolis SNPs: Pooled libraries were sequenced on one lane on the Illumina HiSeq 4000 in 2x150 bp paired-end configuration, yielding approximately 459 million paired-end reads (~138 Gb). The median Q-Score was 42, with the lower 10% Q-Scores exceeding 32 for all 150 bp. The initial library contained 180 individuals with 8,561,493 polymorphic sites. Twenty individuals were Anolis sagrei, and two individuals (Yan 1610 & Yin 1411) clustered with A. sagrei and were not used to define A. carolinensis SNPs. Anolis carolinensis reads were aligned to the Anolis carolinensis genome (NCBI RefSeq accession number: GCF_000090745.1_AnoCar2.0). Single nucleotide polymorphisms (SNPs) for A. carolinensis were called using the GBeaSy analysis pipeline (Wickland et al. 2017) with the following filter settings: minimum read length of 100 bp after barcode and adapter trimming, minimum phred-scaled variant quality of 30, and minimum read depth of 5. SNPs were further filtered by requiring SNPs to occur in > 50% of individuals, and 66 individuals were removed because they had less than 70% of called SNPs. These filtering steps resulted in 51,155 SNPs among 94 individuals. Final filtering among 94 individuals required all sites to be polymorphic (with fewer individuals, some sites were no longer polymorphic) with a maximum of 2 alleles (all are bi-allelic), a minimum allele frequency of 0.05, and He that does not exceed HWE (FDR < 0.01). SNPs with large He were removed (2,280 SNPs).
    These SNPs with large, significant heterozygosity may result from aligning paralogues (different loci), and thus may not represent polymorphisms. No SNPs with low He were removed (due to possible demography or other exceptions to HWE). After filtering, the 94 individuals yielded 44,120 SNPs. Thus, the final filtered SNP data set was 44K SNPs from 94 individuals.

    Statistical Analyses: Eight A. carolinensis populations were analyzed: three populations from islands with the native species only (1Spp islands) and five populations from islands where A. carolinensis co-exists with A. sagrei (2Spp islands, Table 1, Table S1). Most analyses pooled the three 1Spp islands and contrasted these with the pooled five 2Spp islands. Two approaches were used to define SNPs with unusually large allele frequency differences between 1Spp and 2Spp islands: 1) comparison of FST values to random permutations and 2) a modified FDIST approach to identify outlier SNPs with large and statistically unlikely FST values.

    Random Permutations: FST values were calculated in VCFTools (version 4.2; Danecek et al. 2011), where the p-value per SNP was defined by comparing FST values to 1,000 random permutations using a custom script (below). Basically, individuals and all their SNPs were randomly assigned to one of eight islands or to 1Spp versus 2Spp groups. The sample sizes (55 for 2Spp and 39 for 1Spp islands) were maintained. FST values were re-calculated for each of the 1,000 randomizations using VCFTools.

    Modified FDIST: To identify outlier SNPs with statistically large FST values, a modified FDIST (Beaumont and Nichols 1996) was implemented in Arlequin (Excoffier et al. 2005). This modified approach applies 50,000 coalescent simulations using hierarchical population structure, in which demes are arranged into k groups of d demes and in which migration rates between demes are different within and between groups.
    Unlike the finite island models, which have led to large frequencies of false positives because populations share different histories (Lotterhos and Whitlock 2014), the hierarchical island model avoids these false positives by avoiding the assumption of similar ancestry (Excoffier et al. 2009).

    References

    Beaumont, M. A. and R. A. Nichols. 1996. Evaluating loci for use in the genetic analysis of population structure. P Roy Soc B-Biol Sci 263:1619-1626.
    Borgstrom, E., S. Lundin, and J. Lundeberg. 2011. Large scale library generation for high throughput sequencing. PLoS One 6:e19119.
    Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.
    Cingolani, P., A. Platts, L. Wang le, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6:80-92.
    Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158.
    Earl, D. A. and B. M. vonHoldt. 2011. Structure Harvester: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genet Resour 4:359-361.
    Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379.
    Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611-2620.
    Excoffier, L., T. Hofer, and M. Foll. 2009. Detecting loci under selection in a hierarchically structured population. Heredity 103:285-298.
    Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software package for population genetics data analysis.
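
The random-permutation step in the Methods above can be illustrated with a short sketch. This is not the authors' custom script and does not call VCFTools. It assumes a simple two-group Wright's FST, (H_T - H_S) / H_T, computed from allele counts rather than the estimator VCFTools reports, and it uses simulated genotypes. The function names, the simulated data, and the +1 correction in the p-value are all assumptions made for illustration. The group sizes (55 and 39) and the 1,000 permutations follow the description.

```python
import numpy as np


def fst_two_groups(genotypes: np.ndarray, in_group_a: np.ndarray) -> np.ndarray:
    """Per-SNP Wright's FST = (H_T - H_S) / H_T for two groups.

    genotypes: (n_individuals, n_snps) array of alternate-allele counts (0, 1, 2).
    in_group_a: boolean mask of length n_individuals marking the first group.
    """
    n_a, n_b = in_group_a.sum(), (~in_group_a).sum()
    p_a = genotypes[in_group_a].mean(axis=0) / 2.0
    p_b = genotypes[~in_group_a].mean(axis=0) / 2.0
    p_t = (n_a * p_a + n_b * p_b) / (n_a + n_b)
    # Size-weighted mean within-group heterozygosity and total heterozygosity.
    h_s = (n_a * 2 * p_a * (1 - p_a) + n_b * 2 * p_b * (1 - p_b)) / (n_a + n_b)
    h_t = 2 * p_t * (1 - p_t)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(h_t > 0, (h_t - h_s) / h_t, 0.0)


def permutation_p_values(genotypes, in_group_a, n_perm=1000, seed=0):
    """Shuffle group labels among individuals; group sizes stay fixed."""
    rng = np.random.default_rng(seed)
    observed = fst_two_groups(genotypes, in_group_a)
    exceed = np.zeros(genotypes.shape[1], dtype=int)
    for _ in range(n_perm):
        shuffled = rng.permutation(in_group_a)
        exceed += fst_two_groups(genotypes, shuffled) >= observed
    # The +1 correction keeps p-values above zero (an assumption, not from the source).
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated data: 94 individuals (55 "2Spp", 39 "1Spp") and 200 SNPs.
sim_rng = np.random.default_rng(1)
is_2spp = np.arange(94) < 55
genotypes = sim_rng.binomial(2, 0.3, size=(94, 200))
genotypes[is_2spp, 0] = 2    # SNP 0 is fixed for the alternate allele on 2Spp islands
genotypes[~is_2spp, 0] = 0   # and absent on 1Spp islands
fst, p = permutation_p_values(genotypes, is_2spp, n_perm=1000)
print(fst[0], p[0])
```

Shuffling the label vector, rather than the genotypes, keeps each individual's SNPs together, which matches the statement that individuals and all their SNPs were reassigned as a unit.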

  18. Data from: Sliding window constrained fault-tolerant filtering of compressor...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 5, 2025
    Cite
    Shaolin Hu; Xianxi Chen; Guo Xi Sun (2025). Sliding window constrained fault-tolerant filtering of compressor vibration data [Dataset]. http://doi.org/10.5061/dryad.pc866t20z
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Shaolin Hu; Xianxi Chen; Guo Xi Sun
    Description

    This paper presents a sliding window constrained fault-tolerant filtering method for sampling data in petrochemical instrumentation. The method requires the design of an appropriate sliding window width based on the time series, as well as the expansion of both ends of the series. By utilizing a sliding window constraint function, the method produces a smoothed estimate for the current moment within the window. As the window advances, a series of smoothed estimates of the original sampled data is generated. Subsequently, the original series is subtracted from this smoothed estimate to create a new series that represents the differences between the two. This difference series is then subjected to an additional smoothing estimation process, and the resulting smoothed estimates are employed to compensate for the smoothed estimates of the original sampled series. The experimental results indicate that, compared with sliding mean filtering, sliding median filtering, and Savitzky-Golay filtering, ...

    Sliding window constrained fault-tolerant filtering of compressor vibration data

    https://doi.org/10.5061/dryad.pc866t20z

    Description of the data and file structure

    Data type

    Files containing ‘fdata1case1’ in the file name represent case "1" of the outlier location in measured data set "1", and so on;

    Files containing ‘fwavedata’ in the file name are wave signals with outliers;

    Files containing ‘fwave2data’ in the file name are polynomial signals with outliers;

    Files containing ‘normaldata’ in the file name are normal measured data;

    Files containing ‘normalwavedata’ in the file name are normal wave signals;

    Files containing ‘normalwave2data’ in the file name are normal polynomial signals;

    Files containing ‘ftffiltered’ in the file name indicate that the data have been processed by sliding-window constrained error-tolerant filtering;

    Files containing ‘sgfiltered’ in the file name indicate data after Savitzky-Golay filtering...
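
The two-pass procedure outlined in the abstract above (smooth the series, smooth the difference series, then compensate) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' method: the sliding window constraint function is not given in this listing, so a plain sliding median stands in for it, and both ends of the series are extended by reflection. The function names, the window width of 7, and the test signal are hypothetical.

```python
import numpy as np


def sliding_median(x: np.ndarray, width: int) -> np.ndarray:
    """Sliding median over an odd window; both ends are extended by reflection."""
    if width < 3 or width % 2 == 0:
        raise ValueError("width must be an odd integer >= 3")
    half = width // 2
    padded = np.pad(x, half, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1)


def two_pass_filter(x, width: int = 7) -> np.ndarray:
    """Smooth the series, smooth the difference series, then add the two."""
    x = np.asarray(x, dtype=float)
    first_pass = sliding_median(x, width)          # smoothed estimate of the samples
    difference = x - first_pass                    # difference series
    correction = sliding_median(difference, width) # smoothed difference series
    return first_pass + correction                 # compensated estimate


# Hypothetical wave signal with two isolated outliers.
t = np.linspace(0.0, 2.0 * np.pi, 200)
clean = np.sin(t)
noisy = clean.copy()
noisy[[40, 120]] += 5.0
filtered = two_pass_filter(noisy, width=7)
print(np.max(np.abs(noisy - clean)), np.max(np.abs(filtered - clean)))
```

Because a median ignores isolated spikes, the difference series carries the outliers while its smoothed version does not, so the compensation step restores slow structure lost in the first pass without reintroducing the spikes.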

  19. Gene ontology enrichment analysis based on outlier windows for high mean FST...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Dec 20, 2012
    Cite
    Saelao, Perot; Stevens, Kristian A.; Langley, Charles H.; Begun, David J.; Cardeno, Charis M.; Pool, John E.; Corbett-Detig, Russell B.; Emerson, J. J.; Sugino, Ryuichi P.; Duchen, Pablo; Crepeau, Marc W. (2012). Gene ontology enrichment analysis based on outlier windows for high mean FST for African population comparisons. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001160013
    Dataset updated
    Dec 20, 2012
    Authors
    Saelao, Perot; Stevens, Kristian A.; Langley, Charles H.; Begun, David J.; Cardeno, Charis M.; Pool, John E.; Corbett-Detig, Russell B.; Emerson, J. J.; Sugino, Ryuichi P.; Duchen, Pablo; Crepeau, Marc W.
    Description

    Listed are GO categories with P<0.05 and outlier genes >1. Full results are given in Table S16.

  20. Morphological data quantifying sexual dimorphism of Anolis carolinensis in...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 31, 2021
    Cite
    Thor Veen; Yoel Stuart; Ambika Kamath; William Sherwin (2021). Morphological data quantifying sexual dimorphism of Anolis carolinensis in presence and absence of congener [Dataset]. http://doi.org/10.5061/dryad.d51c5b03v
    Available download formats: zip
    Dataset updated
    Aug 31, 2021
    Dataset provided by
    Dryad
    Authors
    Thor Veen; Yoel Stuart; Ambika Kamath; William Sherwin
    Time period covered
    Aug 17, 2021
    Description

    README file included.
