86 datasets found
  1. Data from: Valid Inference Corrected for Outlier Removal

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.

  2. kubernetes-reformatted-remove-outliers

    • huggingface.co
    Updated Aug 12, 2024
    Cite
    Ishaan Sehgal (2024). kubernetes-reformatted-remove-outliers [Dataset]. https://huggingface.co/datasets/ishaansehgal99/kubernetes-reformatted-remove-outliers
    Available format: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2024
    Authors
    Ishaan Sehgal
    Description

    ishaansehgal99/kubernetes-reformatted-remove-outliers dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Data from: Methodology to filter out outliers in high spatial density data...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Available download formats: jpeg
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point into the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI respectively, compared to interpolation errors of raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.

  4. Outliers Practice model

    • kaggle.com
    zip
    Updated Jun 24, 2022
    Cite
    Riaz Ansari (2022). Outliers Practice model [Dataset]. https://www.kaggle.com/datasets/muhammadriazansari/outliers-practice-model/code
    Available download formats: zip (179619 bytes)
    Dataset updated
    Jun 24, 2022
    Authors
    Riaz Ansari
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Riaz Ansari

    Released under CC0: Public Domain


  5. Ex. Normal Distribution & ZScore - Outlier Removal

    • kaggle.com
    zip
    Updated Nov 11, 2023
    Cite
    Panagiotis Prassas (2023). Ex. Normal Distribution & ZScore - Outlier Removal [Dataset]. https://www.kaggle.com/datasets/panagiotisprassas/ex-normal-distribution-and-zscore-outlier-removal
    Available download formats: zip (147098 bytes)
    Dataset updated
    Nov 11, 2023
    Authors
    Panagiotis Prassas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Panagiotis Prassas

    Released under Apache 2.0


  6. Data for Filtering Organized 3D Point Clouds for Bin Picking Applications

    • datasets.ai
    • catalog.data.gov
    0, 34, 47
    Updated Apr 10, 2024
    + more versions
    Cite
    National Institute of Standards and Technology (2024). Data for Filtering Organized 3D Point Clouds for Bin Picking Applications [Dataset]. https://datasets.ai/datasets/data-for-filtering-organized-3d-point-clouds-for-bin-picking-applications
    Available download formats: 0, 34, 47
    Dataset updated
    Apr 10, 2024
    Dataset authored and provided by
    National Institute of Standards and Technology
    Description

    Contains scans of a bin filled with different parts (screws, nuts, rods, spheres, sprockets). For each part type, an RGB image and an organized 3D point cloud obtained with a structured light sensor are provided. In addition, an unorganized 3D point cloud representing an empty bin and a small Matlab script to read the files are provided. The 3D data contain many outliers and were used to demonstrate a new filtering technique.

  7. Outlier Detection and Removal

    • kaggle.com
    zip
    Updated Jul 28, 2022
    Cite
    Prashant S Parhad (2022). Outlier Detection and Removal [Dataset]. https://www.kaggle.com/datasets/prashantsparhad/outlier-detection-and-removal
    Available download formats: zip (3741 bytes)
    Dataset updated
    Jul 28, 2022
    Authors
    Prashant S Parhad
    Description

    Dataset

    This dataset was created by Prashant S Parhad


  8. Timings and statistical data of point model by our method.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang (2023). Timings and statistical data of point model by our method. [Dataset]. http://doi.org/10.1371/journal.pone.0201280.t001
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timings and statistical data of point model by our method.

  9. Outlier classification using autoencoders: application for fluctuation...

    • osti.gov
    • dataverse.harvard.edu
    Updated Jun 2, 2021
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid or invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

  10. Sample(s) removed as outliers in each iteration of MFMW-outlier for all the...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Yuk Yee Leung; Chun Qi Chang; Yeung Sam Hung (2023). Sample(s) removed as outliers in each iteration of MFMW-outlier for all the six microarray datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0046700.t003
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuk Yee Leung; Chun Qi Chang; Yeung Sam Hung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample(s) removed as outliers in each iteration of MFMW-outlier for all the six microarray datasets.

  11. Numbers of detected cgSNP differences between German isolates before removal...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 1, 2025
    Cite
    Weber, Michael; Semmler, Torsten; Marcordes, Sandra; Brangsch, Hanka; Calvelage, Sten; Linde, Jörg; Höper, Dirk; Barth, Stefanie A.; Busch, Anne; Wolf, Silver A. (2025). Numbers of detected cgSNP differences between German isolates before removal of outlier SAMEA5164947 (original), after outlier removal (outlier removed) and after filtering for recombination sites (recombination-adjusted). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002085632
    Dataset updated
    Apr 1, 2025
    Authors
    Weber, Michael; Semmler, Torsten; Marcordes, Sandra; Brangsch, Hanka; Calvelage, Sten; Linde, Jörg; Höper, Dirk; Barth, Stefanie A.; Busch, Anne; Wolf, Silver A.
    Description

    Numbers of detected cgSNP differences between German isolates before removal of outlier SAMEA5164947 (original), after outlier removal (outlier removed) and after filtering for recombination sites (recombination-adjusted).

  12. Pressure and processed water levels from the Time Series Station Spiekeroog,...

    • search.dataone.org
    • doi.pangaea.de
    Updated Jan 6, 2018
    + more versions
    Cite
    Holinde, Lars; Badewien, Thomas H; Freund, Jan A; Stanev, Emil V; Zielinski, Oliver (2018). Pressure and processed water levels from the Time Series Station Spiekeroog, 2005-2011 [Dataset]. http://doi.org/10.1594/PANGAEA.843740
    Dataset updated
    Jan 6, 2018
    Dataset provided by
    PANGAEA Data Publisher for Earth and Environmental Science
    Authors
    Holinde, Lars; Badewien, Thomas H; Freund, Jan A; Stanev, Emil V; Zielinski, Oliver
    Time period covered
    Jan 1, 2005 - Dec 31, 2011
    Description

    The quality of water level time series data varies strongly, with periods of high- and low-quality sensor data. In this paper we present the processing steps used to generate high-quality water level data from water pressure measured at the Time Series Station (TSS) Spiekeroog. The TSS is positioned in a tidal inlet between the islands of Spiekeroog and Langeoog in the East Frisian Wadden Sea (southern North Sea). The processing steps cover sensor drift, outlier identification, interpolation of data gaps, and quality control. A central step is the removal of outliers. For this process an absolute threshold of 0.25 m/10 min was selected, which still preserves the water level increase and decrease during extreme events, as shown during the quality control process. A second important feature of data processing is the interpolation of gappy data, which is accomplished with a high certainty of generating trustworthy data. Applying these methods, a 10-year dataset (December 2002-December 2012) of water level information at the TSS was processed, resulting in a seven-year time series (2005-2011).
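
    As a rough illustration of a rate-of-change threshold like the 0.25 m/10 min value quoted above, the sketch below flags samples that jump too far from the last accepted sample. It is not the authors' procedure: the function name, the 10-minute sampling assumption, the rule that the first sample is trusted, and the scaling of the allowed jump by the number of skipped samples are all assumptions made for this example.

```python
def remove_jump_outliers(levels, max_step=0.25):
    """Replace samples that jump too far from the last accepted sample with None.

    levels   : water levels in metres, assumed to be sampled every 10 minutes
    max_step : largest plausible change per 10-minute step (0.25 m in the text)
    """
    cleaned = []
    last_good = None          # last accepted water level
    steps_since_good = 0      # number of 10-minute steps since that level
    for value in levels:
        steps_since_good += 1
        # Assumption: the allowed change grows linearly with the elapsed time.
        allowed = max_step * steps_since_good
        if last_good is None or abs(value - last_good) <= allowed:
            cleaned.append(value)
            last_good = value
            steps_since_good = 0
        else:
            cleaned.append(None)  # leave a gap for later interpolation
    return cleaned


series = [1.00, 1.10, 1.20, 2.50, 1.40, 1.50]
print(remove_jump_outliers(series))
```

    The gaps marked with None correspond to the "gappy data" that the description says are interpolated in a later step.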

  13. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.no
    • +1 more
    Updated Jul 29, 2024
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. http://doi.org/10.18710/FGVLKS
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer study (NOWAC). It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in NOWAC blood samples on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.

  14. Heart Diseases Dataset

    • kaggle.com
    zip
    Updated Feb 6, 2023
    + more versions
    Cite
    Sally Ahmed (2023). Heart Diseases Dataset [Dataset]. https://www.kaggle.com/datasets/sallyahmed/heart-diseases-dataset/discussion
    Available download formats: zip (3277670 bytes)
    Dataset updated
    Feb 6, 2023
    Authors
    Sally Ahmed
    Description

    Dataset

    This dataset was created by Sally Ahmed


  15. Performance analysis of our algorithm on 3D models.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang (2023). Performance analysis of our algorithm on 3D models. [Dataset]. http://doi.org/10.1371/journal.pone.0201280.t002
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xiaojuan Ning; Fan Li; Ge Tian; Yinghui Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance analysis of our algorithm on 3D models.

  16. Data from: Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 21, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Monthly OpenET Image Collections (v2.0) Summarized by 12-Digit Hydrologic Unit Codes, 2008-2023 [Dataset]. https://catalog.data.gov/dataset/monthly-openet-image-collections-v2-0-summarized-by-12-digit-hydrologic-unit-codes-2008-20
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all 6 individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022).

    Data are available in this data release in two different formats: comma-separated values (CSV) and parquet, a high-performance format that is optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUCs for easier access of regional data, and the single parquet file provides convenient access to the entire dataset.

    For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include:

    - huc12: The 12-digit hydrologic unit code
    - ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month calculated as the sum of daily ET interpolated between Landsat overpasses
    - statistic: Max, mean, median, or min. Statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month
    - year: 4-digit year
    - month: 2-digit month
    - count: Number of Landsat overpasses included in the ET calculation in the month
    - et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12
    - count_coverage_pct: Integer percentage of the HUC12 with count data, which can be different than the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent

    For the Ensemble data, these additional variables are included in the CSV files:

    - et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD)
    - et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD
    - et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD
    - et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD
    - et_sam: A simple arithmetic mean (across the 6 models) of actual ET average without outlier removal

    Below are the locations of each OpenET image collection used in this summary:

    - DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0
    - eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0
    - geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0
    - PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0
    - SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0
    - SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0
    - Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
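
    To make the Ensemble variables more concrete, the following sketch shows one generic way a MAD-based filter could produce values shaped like et_mad, et_mad_count, et_mad_min, et_mad_max, and et_sam from six model estimates. It is only an illustration: the function name, the cutoff of 2 MADs, the handling of a zero MAD, and the sample numbers are assumptions, and the actual OpenET procedure (Melton and others, 2022) may differ.

```python
from statistics import mean, median


def mad_filtered_ensemble(values, cutoff=2.0):
    """Summarize model ET values after dropping MAD-based outliers.

    values : ET estimates (mm) from the individual models for one HUC12 and month
    cutoff : number of MADs from the median beyond which a value is dropped
             (assumed here; not taken from the dataset description)
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad > 0:
        kept = [v for v in values if abs(v - med) <= cutoff * mad]
    else:
        # Assumption: with zero spread around the median, keep every value.
        kept = list(values)
    return {
        "et_mad": mean(kept),        # mean after outlier filtering
        "et_mad_count": len(kept),   # number of models retained
        "et_mad_min": min(kept),
        "et_mad_max": max(kept),
        "et_sam": mean(values),      # simple mean without outlier removal
    }


# Hypothetical monthly ET values (mm) from six models; the last one is an outlier.
summary = mad_filtered_ensemble([80.0, 82.0, 85.0, 88.0, 90.0, 150.0])
print(summary)
```

    In this made-up example the 150 mm estimate is dropped, so et_mad is computed from five models while et_sam still averages all six.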

  17. A new fast filtering algorithm for a 3D point cloud based on RGB-D...

    • figshare.com
    tiff
    Updated Jun 4, 2023
    Cite
    Chaochuan Jia; Ting Yang; Chuanjiang Wang; Binghui Fan; Fugui He (2023). A new fast filtering algorithm for a 3D point cloud based on RGB-D information [Dataset]. http://doi.org/10.1371/journal.pone.0220253
    Available download formats: tiff
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Chaochuan Jia; Ting Yang; Chuanjiang Wang; Binghui Fan; Fugui He
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A point cloud that is obtained by an RGB-D camera will inevitably be affected by outliers that do not belong to the surface of the object, which is due to the different viewing angles, light intensities, and reflective characteristics of the object surface and the limitations of the sensors. An effective and fast outlier removal method based on RGB-D information is proposed in this paper. This method aligns the color image to the depth image, and the color mapping image is converted to an HSV image. Then, the optimal segmentation threshold of the V image that is calculated by using the Otsu algorithm is applied to segment the color mapping image into a binary image, which is used to extract the valid point cloud from the original point cloud with outliers. The robustness of the proposed method to the noise types, light intensity and contrast is evaluated by using several experiments; additionally, the method is compared with other filtering methods and applied to independently developed foot scanning equipment. The experimental results show that the proposed method can remove all types of outliers quickly and effectively.
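
    The Otsu step mentioned above can be sketched in plain Python on a flat list of V-channel values. This is a generic Otsu implementation for illustration, not the paper's code: the function name, the 0-255 integer range, the sample pixel values, and the choice to treat pixels above the threshold as the "valid" side of the mask are all assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    pixels : iterable of integer intensities in [0, levels - 1]
    Class 0 holds pixels <= t, class 1 holds pixels > t.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # intensity sum of class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break             # class 1 empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical V-channel values: a dark group and a bright group.
v_channel = [10, 12, 11, 13, 200, 210, 205, 198]
t = otsu_threshold(v_channel)
mask = [v > t for v in v_channel]  # binary image used to select points
print(t, mask)
```

    In a pipeline like the one described, the binary mask would then be used to keep only the points of the depth-aligned cloud whose pixels fall on the selected side of the threshold.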

  18. SKY Vaults Stable Coin Backing Per Most Active Vault (Outliers removed)

    • dune.com
    Updated Oct 1, 2025
    Cite
    dhr (2025). SKY Vaults Stable Coin Backing Per Most Active Vault (Outliers removed) [Dataset]. https://dune.com/discover/content/relevant?q=author:dhr&resource-type=queries
    Dataset updated
    Oct 1, 2025
    Authors
    dhr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: SKY Vaults Stable Coin Backing Per Most Active Vault (Outliers removed)

  19. Data from: A Multi-Objective Genetic Algorithm for Outlier Removal

    • acs.figshare.com
    xlsx
    Updated Jun 11, 2023
    Cite
    Oren E. Nahum; Abraham Yosipof; Hanoch Senderowitz (2023). A Multi-Objective Genetic Algorithm for Outlier Removal [Dataset]. http://doi.org/10.1021/acs.jcim.5b00515.s001
    Available download formats: xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Oren E. Nahum; Abraham Yosipof; Hanoch Senderowitz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Quantitative structure activity relationship (QSAR) or quantitative structure property relationship (QSPR) models are developed to correlate activities for sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function was added to the algorithm (termed “preservation”), forcing it to remove certain compounds with low probability only. This option is highly useful when specific compounds should preferably be kept in the final data set either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.

  20. Data from: Male responses to sperm competition risk when rivals vary in...

    • researchdata.edu.au
    • datadryad.org
    Updated 2019
    Cite
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences (2019). Data from: Male responses to sperm competition risk when rivals vary in their number and familiarity [Dataset]. http://doi.org/10.5061/DRYAD.M097580
    Dataset updated
    2019
    Dataset provided by
    The University of Western Australia
    DRYAD
    Authors
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences
    Description

    Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

    Lymbery et al 2018 Full dataset: Contains all the data used in the statistical analyses for the associated manuscript. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles. File: Lymbery et al Full Dataset.xlsx

    Lymbery et al 2018 Reduced dataset 1: Contains data used in the attached manuscript following the removal of three outliers for the purposes of data distribution, as described in the associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles. File: Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx

    Lymbery et al 2018 Reduced dataset 2: Contains the data used in the statistical analyses for the associated manuscript, after the removal of all outliers stated in the manuscript and associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles. File: Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx

    Lymbery et al 2018 R Script: Contains all the R code used for statistical analysis in this manuscript, with annotations to aid interpretation.
