34 datasets found
  1. Data from: Methodology to filter out outliers in high spatial density data to improve maps reliability

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Explore at:
    jpeg (available download formats)
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point into the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI respectively, compared to interpolation errors of raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
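    The local filtering idea described above (compare each point's value against the median of its neighbours within a search radius) can be sketched briefly. This is a minimal illustration under assumed column layout, radius, and relative-deviation threshold, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_local_outliers(x, y, value, radius=10.0, max_rel_dev=0.3):
    """Flag points whose value deviates strongly from the median of their
    neighbours within `radius`. Radius and threshold are illustrative."""
    points = np.column_stack([x, y])
    tree = cKDTree(points)
    keep = np.ones(len(value), dtype=bool)
    for i, neighbours in enumerate(tree.query_ball_point(points, r=radius)):
        others = [j for j in neighbours if j != i]
        if not others:
            continue  # isolated point: nothing to compare against
        local_median = np.median(value[np.array(others)])
        if local_median != 0 and abs(value[i] - local_median) / abs(local_median) > max_rel_dev:
            keep[i] = False  # local outlier relative to its neighbourhood
    return keep

# Synthetic yield-like data with a few gross errors injected.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, 500), rng.uniform(0, 100, 500)
value = rng.normal(10.0, 1.0, 500)
value[:10] *= 5
mask = filter_local_outliers(x, y, value)
print(f"kept {mask.sum()} of {len(mask)} points")
```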

  2. Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
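    The nested-loop idea with randomized scan order and the simple pruning rule can be sketched as follows. This is a toy illustration of the approach described in the abstract (score = average distance to the k nearest neighbours seen so far), not the paper's optimized implementation.

```python
import numpy as np

def top_n_distance_outliers(X, k=5, n_outliers=10, seed=0):
    """Nested-loop distance-based outlier search with randomization and
    a simple pruning rule, in the spirit of the method described above."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))        # random scan order makes pruning effective
    cutoff = 0.0                           # weakest score among the current top-n outliers
    top = []                               # list of (score, original index)
    for i in order:
        neighbours = np.full(k, np.inf)    # k smallest distances found so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = np.linalg.norm(X[i] - X[j])
            if d < neighbours.max():
                neighbours[neighbours.argmax()] = d
                # pruning rule: once the running score cannot exceed the cutoff,
                # this point can no longer enter the top-n outliers
                if np.isfinite(neighbours).all() and neighbours.mean() < cutoff:
                    pruned = True
                    break
        if not pruned:
            top.append((neighbours[np.isfinite(neighbours)].mean(), int(i)))
            top.sort(reverse=True)
            top = top[:n_outliers]
            if len(top) == n_outliers:
                cutoff = top[-1][0]
    return top

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(8, 1, (5, 3))])  # 5 planted outliers
for score, idx in top_n_distance_outliers(X, k=5, n_outliers=5):
    print(f"index {idx}: score {score:.2f}")
```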

  3. Multi-Domain Outlier Detection Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 31, 2022
    Cite
    Hannah Kerner; Hannah Kerner; Umaa Rebbapragada; Umaa Rebbapragada; Kiri Wagstaff; Kiri Wagstaff; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha (2022). Multi-Domain Outlier Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.6400786
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hannah Kerner; Hannah Kerner; Umaa Rebbapragada; Umaa Rebbapragada; Kiri Wagstaff; Kiri Wagstaff; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha; Steven Lu; Bryce Dubayah; Eric Huff; Raymond Francis; Jake Lee; Vinay Raman; Sakshum Kulshrestha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:

    1. Astrophysics - detecting anomalous observations in the Dark Energy Survey (DES) catalog (data type: feature vectors)
    2. Planetary science - selecting novel geologic targets for follow-up observation onboard the Mars Science Laboratory (MSL) rover (data type: grayscale images)
    3. Earth science: detecting anomalous samples in satellite time series corresponding to ground-truth observations of maize crops (data type: time series/feature vectors)
    4. Fashion-MNIST/MNIST: benchmark task to detect anomalous MNIST images among Fashion-MNIST images (data type: grayscale images)

    Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
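    The fit/score/label split maps directly onto the usual detector workflow: train on the fit set, score the score set, and evaluate against the label set. A minimal sketch, with synthetic feature vectors standing in for one domain's files and scikit-learn's IsolationForest as an arbitrary choice of detector (neither is part of the dataset's own tooling):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Hypothetical arrays standing in for one domain's "fit", "score" and label files.
rng = np.random.default_rng(0)
fit_X = rng.normal(0, 1, (1000, 8))                       # used only to fit the detector
score_X = np.vstack([rng.normal(0, 1, (180, 8)),
                     rng.normal(5, 1, (20, 8))])          # samples to be scored
labels = np.array([0] * 180 + [1] * 20)                   # 1 = outlier in this domain

detector = IsolationForest(random_state=0).fit(fit_X)
# score_samples is higher for inliers, so negate it to get an anomaly score.
anomaly_score = -detector.score_samples(score_X)
print(f"ROC AUC against the label set: {roc_auc_score(labels, anomaly_score):.3f}")
```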

    To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:

    Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.

  4. Identifying outliers in asset pricing data with a new weighted forward search estimator

    • datasetcatalog.nlm.nih.gov
    Updated Feb 5, 2020
    Cite
    Aronne, Alexandre; Bressan, Aureliano Angel; Grossi, Luigi (2020). Identifying outliers in asset pricing data with a new weighted forward search estimator [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000459853
    Explore at:
    Dataset updated
    Feb 5, 2020
    Authors
    Aronne, Alexandre; Bressan, Aureliano Angel; Grossi, Luigi
    Description

    ABSTRACT The purpose of this work is to present the Weighted Forward Search (FSW) method for the detection of outliers in asset pricing data. This new estimator, which is based on an algorithm that downweights the most anomalous observations of the dataset, is tested using both simulated and empirical asset pricing data. The impact of outliers on the estimation of asset pricing models is assessed under different scenarios, and the results are evaluated with associated statistical tests based on this new approach. Our proposal generates an alternative procedure for robust estimation of portfolio betas, allowing for the comparison between concurrent asset pricing models. The algorithm, which is both efficient and robust to outliers, is used to provide robust estimates of the models’ parameters in a comparison with traditional econometric estimation methods usually used in the literature. In particular, the precision of the alphas is highly increased when the Forward Search (FS) method is used. We use Monte Carlo simulations, and also the well-known dataset of equity factor returns provided by Prof. Kenneth French, consisting of the 25 Fama-French portfolios on the United States of America equity market using single and three-factor models, on monthly and annual basis. Our results indicate that the marginal rejection of the Fama-French three-factor model is influenced by the presence of outliers in the portfolios, when using monthly returns. In annual data, the use of robust methods increases the rejection level of null alphas in the Capital Asset Pricing Model (CAPM) and the Fama-French three-factor model, with more efficient estimates in the absence of outliers and consistent alphas when outliers are present.
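    The core idea, downweighting or discarding the most anomalous observations before estimating alphas and betas, can be illustrated with a generic iterative trimming loop on a simulated CAPM-style regression. This is only a toy robust-estimation sketch, not the Weighted Forward Search estimator described in the abstract.

```python
import numpy as np

def trimmed_ols(X, y, trim_frac=0.05, n_iter=5):
    """Toy robust OLS: repeatedly fit, then exclude the observations with the
    largest absolute residuals before refitting (not the FSW algorithm)."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = np.abs(y - X @ beta)
        cutoff = np.quantile(resid[keep], 1 - trim_frac)
        keep = resid <= cutoff
    return beta, keep

rng = np.random.default_rng(0)
market = rng.normal(0.01, 0.04, 240)                 # simulated monthly market excess returns
y = 0.0 + 1.2 * market + rng.normal(0, 0.02, 240)    # true alpha = 0, beta = 1.2
y[:5] += 0.5                                         # a handful of gross outliers
X = np.column_stack([np.ones_like(market), market])

ols_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
robust_beta, kept = trimmed_ols(X, y)
print("plain OLS alpha/beta:", np.round(ols_beta, 3))
print("trimmed alpha/beta:  ", np.round(robust_beta, 3), f"({kept.sum()} obs kept)")
```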

  5. Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods

    • dataverse.harvard.edu
    • dataverse-staging.rdmc.unc.edu
    • +1more
    pdf +1
    Updated Aug 10, 2011
    Cite
    Harvard Dataverse (2011). Replication data for: Linear Models with Outliers: Choosing between Conditional-Mean and Conditional-Median Methods [Dataset]. http://doi.org/10.7910/DVN/JJLJKZ
    Explore at:
    text/plain; charset=us-ascii (5482), text/plain; charset=us-ascii (3590), pdf (198705) (available download formats)
    Dataset updated
    Aug 10, 2011
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
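    Both estimators discussed here are available in statsmodels, so the comparison is easy to reproduce on simulated heavy-tailed data; QuantReg with q=0.5 conditions the median of the dependent variable on the regressors, i.e. median regression. A minimal sketch:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
# Heavy-tailed errors (Student's t with 2 df) make OLS unstable.
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=200) * 3
X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
mr_fit = sm.QuantReg(y, X).fit(q=0.5)   # median regression (MR)

print("OLS slope:", round(ols_fit.params[1], 3))
print("MR  slope:", round(mr_fit.params[1], 3))
```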

  6. Building and updating software datasets: an empirical assessment

    • zenodo.org
    zip
    Updated Aug 19, 2024
    Cite
    Juan Andrés Carruthers; Juan Andrés Carruthers (2024). Building and updating software datasets: an empirical assessment [Dataset]. http://doi.org/10.5281/zenodo.11395573
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan Andrés Carruthers; Juan Andrés Carruthers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".

    Data collected

    The data generated for the study can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprises three files, "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects, respectively. Here is a description of the datasets:

    • qualitas: includes code metrics and repository metrics from the projects in the release 20130901r of the Qualitas Corpus.
    • currentSample: includes code metrics and repository metrics from a recent sample collected with our sampling procedure.
    • qualitasUpdated: includes code metrics and repository metrics from an updated version of the Qualitas Corpus applying our maintenance procedure.

    Plot graphics

    To plot the results and graphics in the article there is a Jupyter Notebook "Experiment.ipynb". It is initially configured to use the data in "datasets" folder.

    Replication Kit

    For replication purposes, the datasets containing recent projects from GitHub can be re-generated. To do so, install the dependencies listed in the "requirements.txt" file in the virtual environment, add GitHub tokens to the "./token" file, re-define (or leave as is) the paths declared in the constants (variables written in caps) in the main method, and finally run the "main.py" script. The source code scanner SourceMeter for Windows is already included in the project. If a new release becomes available or if the tool needs to run on a different OS, it can be replaced in the "./Sourcemeter/tool" directory.

    The script comprises 5 steps:

    1. Project retrieval from GitHub: first, the sampling frame with projects complying with specific quality criteria is retrieved from GitHub's API.
    2. Create samples: with the sampling frame retrieved, the current samples are selected (currentSample and qualitasUpdated). In the case of qualitasUpdated, it is important to have first the "sample.csv" file inside the qualitas folder of the dataset originally created for the study. This file contains the metadata of the projects in Qualitas Corpus.
    3. Project download and analysis: when all the samples are selected from the sampling frame (currentSample and qualitasUpdated), the repositories are downloaded and scanned with SourceMeter. In the cases in which the analysis is not possible, the projects are replaced with another one with similar size.
    4. Outlier detection: once the datasets are collected, it is necessary to manually look for possible outliers in the code metrics under study. The notebook "Experiment.ipynb" has specific sections dedicated to this ("Outlier detection (Section 4.2.2)"); a generic sketch of such a screening is shown after this list.
    5. Outlier replacement: when the outliers are detected, in the same notebook there is also a section for outlier replacement ("Replace Outliers") where the outliers' url have to be listed to find the appropriate replacement.
    • If required, the metrics from the Qualitas Corpus can also be re-generated. First, download the release 20130901r from its official webpage. Second, decompress the downloaded .tar files. Third, make sure that the compressed files with source code from the projects (.java files) are placed in the "compressed" folder; in some cases it is necessary to read the "QC_README" file in the project's folder. Finally, run the "Generate metrics for the Qualitas Corpus (QC) dataset" part of the original main script.
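    A generic sketch of the kind of screening performed in step 4, using a simple IQR rule on one code metric; the column name and fences are illustrative assumptions, not the study's actual (manual) procedure.

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return rows whose `column` value falls outside the IQR fences."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < low) | (df[column] > high)]

# Hypothetical class-metric table standing in for class.csv
classes = pd.DataFrame({
    "project": ["a", "a", "b", "b", "c"],
    "LOC": [120, 150, 90, 20000, 130],   # one implausibly large class
})
print(iqr_outliers(classes, "LOC"))
```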
  7. Outliers and similarity in APOGEE - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Nov 2, 2017
    Cite
    (2017). Outliers and similarity in APOGEE - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b624b506-541b-5a09-b615-14b8e202c468
    Explore at:
    Dataset updated
    Nov 2, 2017
    Description

    In this work we apply and expand on a recently introduced outlier detection algorithm that is based on an unsupervised random forest. We use the algorithm to calculate a similarity measure for stellar spectra from the Apache Point Observatory Galactic Evolution Experiment (APOGEE). We show that the similarity measure traces non-trivial physical properties and contains information about complex structures in the data. We use it for visualization and clustering of the dataset, and discuss its ability to find groups of highly similar objects, including spectroscopic twins. Using the similarity matrix to search the dataset for objects allows us to find objects that are impossible to find using their best fitting model parameters. This includes extreme objects for which the models fail, and rare objects that are outside the scope of the model. We use the similarity measure to detect outliers in the dataset, and find a number of previously unknown Be-type stars, spectroscopic binaries, carbon rich stars, young stars, and a few that we cannot interpret. Our work further demonstrates the potential for scientific discovery when combining machine learning methods with modern survey data. Cone search capability for table J/MNRAS/476/2117/apogeenn (Nearest neighbors APOGEE IDs)
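    The similarity idea (two spectra are similar if an unsupervised forest tends to route them to the same leaves) can be sketched with scikit-learn's RandomTreesEmbedding. This is a generic illustration on toy data, not the authors' pipeline or their specific outlier statistic.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
# Toy stand-in for spectra: 300 ordinary samples plus a few odd ones.
X = np.vstack([rng.normal(0, 1, (300, 20)), rng.normal(4, 1, (5, 20))])

embed = RandomTreesEmbedding(n_estimators=200, random_state=0).fit(X)
Z = embed.transform(X)                                   # sparse one-hot leaf membership
similarity = (Z @ Z.T).toarray() / embed.n_estimators    # fraction of trees sharing a leaf

# Outlier score: average similarity to all other samples (low = isolated).
avg_sim = (similarity.sum(axis=1) - 1.0) / (len(X) - 1)
print("most isolated samples:", np.argsort(avg_sim)[:5])
```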

  8. Oxygen Exposure for Benthic Megafauna near San Diego

    • kaggle.com
    Updated Feb 16, 2023
    Cite
    The Devastator (2023). Oxygen Exposure for Benthic Megafauna near San [Dataset]. https://www.kaggle.com/datasets/thedevastator/oxygen-exposure-for-benthic-megafauna-near-san-d/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Oxygen Exposure for Benthic Megafauna near San Diego

    Spatially Varying Environmental Risk


    About this dataset

    This dataset offers an in-depth look at environmental conditions in the waters off the Southern California coast. It provides information on spatial risk variation, including oxygen exposure levels, depths, and habitat criteria for the 53 species of benthic and epibenthic megafauna recorded during the three-year study. These data give insight into aquatic life dynamics and could support improved management strategies for protecting these species. Given the role coastal waters play in the planet's ecosystem, a better understanding of their condition could contribute to long-term marine sustainability. Ultimately, this dataset may help answer questions about how ocean life is responding to intense human activity and its effects on today's seaside communities.


    How to use the dataset

    • Download and install the dataset: The dataset contains two .csv files, each containing data from the three-year study on oxygen exposure for benthic and epibenthic megafauna off the coast of San Diego in Southern California. Download these two files to your computer and save them for further analysis.

    • Familiarize yourself with the datasets: Each file includes very detailed information about a particular variable related to the study (for example, SpeciesMetadata contains species-level information on 53 species of benthic and epibenthic megafauna). Read through each data sheet carefully in order to gain a better understanding of what's included in each column.

    • Clean up any outliers or missing values: Once you understand which columns are important for your analysis, you can begin cleaning up any outliers or missing values that may be present in your dataset. This is an important step as it will help ensure that further analysis is performed accurately.

    • Choose an appropriate visualization method: depending on what type of results you want to show from your analysis, choose an appropriate visualization method (e.g., bar plot, scatterplot). Also consider whether adding labelling, such as colour by category, would improve the legibility of the figures produced during the exploratory data analysis stage.

    • Choose a statistical test suitable for this type of project: once your visuals have been produced, interpret the results using statistical tests appropriate to the number of categorical variables in the data set (e.g., a t-test or ANOVA). Understand key outputs such as p-values so that the analysis can conclude whether there are significant differences between treatments when comparing distributions among the samples or populations studied. Also check that the sample size provides adequate statistical power for the chosen test. A minimal sketch of these cleaning and testing steps follows this list.
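    A minimal sketch of the cleaning and testing steps above; the column names (oxygen_umol_kg, depth_zone), the robust z-score threshold, and the toy values are hypothetical stand-ins for whatever is in the two CSV files.

```python
import pandas as pd
from scipy import stats

# Hypothetical frame standing in for ROVObservationData.csv
df = pd.DataFrame({
    "depth_zone": ["shallow"] * 6 + ["deep"] * 6,
    "oxygen_umol_kg": [80, 85, 83, 90, 400, 82, 20, 25, 22, 18, 24, 21],  # one gross outlier
})

def robust_z(s: pd.Series) -> pd.Series:
    """Median/MAD-based z-score, less sensitive to the outliers being hunted."""
    mad = (s - s.median()).abs().median()
    return (s - s.median()) / (1.4826 * mad)

# Clean up outliers within each depth zone.
z = df.groupby("depth_zone")["oxygen_umol_kg"].transform(robust_z)
clean = df[z.abs() <= 3.5]

# Two-sample t-test comparing oxygen exposure between depth zones.
shallow = clean.loc[clean.depth_zone == "shallow", "oxygen_umol_kg"]
deep = clean.loc[clean.depth_zone == "deep", "oxygen_umol_kg"]
t, p = stats.ttest_ind(shallow, deep, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```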

    Research Ideas

    • Comparing the effects of different environmental factors (depth, temperature, salinity etc.) on depth-specific distributions of oxygen and benthic megafauna.
    • Identifying and mapping vulnerable areas for benthic species based on environmental factors and oxygen exposure patterns.
    • Developing models to predict underlying spatial risk variables for endangered species to inform conservation efforts in the study area

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and the original data source.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: ROVObservationData.csv

    File: SpeciesMetadata.csv


  9. The 12 outliers identified in the Tonga dataset

    • datasetcatalog.nlm.nih.gov
    Updated Nov 1, 2017
    Cite
    Mayfield, Anderson B.; Dempsey, Alexandra C.; Chen, Chii-Shiarng (2017). The 12 outliers identified in the Tonga dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001760878
    Explore at:
    Dataset updated
    Nov 1, 2017
    Authors
    Mayfield, Anderson B.; Dempsey, Alexandra C.; Chen, Chii-Shiarng
    Description

    Gene expression data have been presented as non-normalized values (2^-Ct × 10^9) in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student's t-test, p < 0.05) between the outlier and non-outlier averages for a parameter (in this case using normalized gene expression data), the lower of the two values has been underlined. All samples hosted Symbiodinium of clade C only unless noted otherwise. The mean Mahalanobis distance did not differ between Pocillopora damicornis and P. acuta (Student's t-test, p > 0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. "." = missing data.
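    Both flagging criteria used here (a per-parameter |z| > 2.5 rule and Mahalanobis distances computed over all parameters jointly) are easy to reproduce generically; the sketch below runs on a made-up matrix, not the actual Tonga data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (60, 4))          # 60 samples x 4 parameters (hypothetical)
X[3] += 6                              # plant one aberrant sample

# Per-parameter z-scores: |z| > 2.5 marks an aberrant value for that parameter.
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print("samples with any |z| > 2.5:", np.unique(np.where(np.abs(z) > 2.5)[0]))

# Mahalanobis distance over all parameters jointly.
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - X.mean(axis=0)
maha = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
print("largest Mahalanobis distances:", np.argsort(maha)[-3:])
```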

  10. Data from: QST FST comparisons with unbalanced half-sib designs

    • borealisdata.ca
    Updated May 20, 2021
    + more versions
    Cite
    Kimberly J. Gilbert; Michael C. Whitlock (2021). Data from: QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5683/SP2/9PBQES
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 20, 2021
    Dataset provided by
    Borealis
    Authors
    Kimberly J. Gilbert; Michael C. Whitlock
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract: QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) explicitly allow for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.

    Usage notes:

    • SourceCode_DamModel (DamModel_WorkingCopy.R): source code used when doing type I error testing of the balanced or unbalanced half-sib dam model.
    • SireModel_WorkingCopy: source code used when doing type I error testing of the unbalanced half-sib sire model.
    • TypeI_ErrorTest_DamBalanced: R code to run the error testing of the balanced half-sib dam model over 1000 replicate datasets.
    • TypeI_ErrorTest_DamUnbalanced: R code to run the error testing of the unbalanced half-sib dam model over 1000 replicate datasets.
    • TypeI_ErrorTest_SireUnbalanced: R code to run the error testing of the unbalanced half-sib sire model over 1000 replicate datasets.
    • NemoReplicates: zipped file containing the 1000 simulated replicate datasets from Nemo used for type I error testing.

  11. Student Admission Records

    • kaggle.com
    Updated Nov 8, 2024
    Cite
    Zeeshan Ahmad (2024). Student Admission Records [Dataset]. https://www.kaggle.com/datasets/zeeshier/student-admission-records
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2024
    Dataset provided by
    Kaggle
    Authors
    Zeeshan Ahmad
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains 157 rows of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.

    The dataset simulates a university admission record system, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw data, offering hands-on experience in data wrangling.

    The dataset contains the following columns:

    Name: Student's first name (Pakistani names).
    Age: Age of the student (some outliers and missing values).
    Gender: Gender (Male/Female).
    Admission Test Score: Score obtained in the admission test (includes outliers and missing values).
    High School Percentage: Student's high school final score percentage (includes outliers and missing values).
    City: City of residence in Pakistan.
    Admission Status: Whether the student was accepted or rejected.
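    A minimal sketch of the kind of cleaning this dataset is designed for, assuming the columns listed above; the CSV file name and the plausibility bounds are guesses, so adjust them to the actual Kaggle download.

```python
import pandas as pd

# File name is an assumption; change it to match the downloaded file.
df = pd.read_csv("student_admission_records.csv")

df = df.drop_duplicates()

# Impute numeric gaps with the median, a reasonable default for skewed scores.
for col in ["Age", "Admission Test Score", "High School Percentage"]:
    df[col] = df[col].fillna(df[col].median())

# Remove or clip obviously unrealistic values (illustrative bounds).
df = df[df["Age"].between(15, 60)]
df["High School Percentage"] = df["High School Percentage"].clip(0, 100)

print(df.describe(include="all"))
```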

  12. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Jun 26, 2023
    Cite
    Siavash Mirarab; Uyen Mai (2023). TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees [Dataset]. http://doi.org/10.6076/D1HC71
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    University of California, San Diego
    Authors
    Siavash Mirarab; Uyen Mai
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.

    Methods

    All the raw data are obtained from other publications as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.

    Dataset   Species   Genes   Download
    Plants    104       852     DOI 10.1186/2047-217X-3-17
    Mammals   37        424     DOI 10.13012/C5BG2KWG
    Insects   144       1478    http://esayyari.github.io/InsectsData
    Cannon    78        213     DOI 10.5061/dryad.493b7
    Rouse     26        393     DOI 10.5061/dryad.79dq1
    Frogs     164       95      DOI 10.5061/dryad.12546.2
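    TreeShrink's exact polynomial-time k-shrink solution is more involved than can be shown here; the sketch below only illustrates the underlying intuition with a greedy heuristic on a tiny weighted tree (repeatedly remove the leaf whose removal shrinks the diameter most). It is not the published algorithm.

```python
from collections import defaultdict

def farthest(adj, start, banned):
    """Return (node, distance) farthest from `start`, ignoring banned leaves."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if dist > best[1]:
            best = (node, dist)
        for nxt, w in adj[node]:
            if nxt != parent and nxt not in banned:
                stack.append((nxt, node, dist + w))
    return best

def diameter(adj, banned=frozenset()):
    """Tree diameter via the classic two-sweep trick."""
    start = next(n for n in adj if n not in banned)
    a, _ = farthest(adj, start, banned)
    _, d = farthest(adj, a, banned)
    return d

# Tiny example: leaf "L4" sits on an artificially long branch.
edges = [("A", "L1", 1), ("A", "L2", 1), ("A", "B", 1),
         ("B", "L3", 1), ("B", "L4", 10)]
adj = defaultdict(list)
for u, v, w in edges:
    adj[u].append((v, w)); adj[v].append((u, w))
leaves = {n for n in adj if len(adj[n]) == 1}

removed = []
for _ in range(2):  # try shrinking with up to k = 2 leaf removals
    base = diameter(adj, frozenset(removed))
    gains = {leaf: base - diameter(adj, frozenset(removed + [leaf]))
             for leaf in leaves - set(removed)}
    best_leaf = max(gains, key=gains.get)
    removed.append(best_leaf)
    print(f"removing {best_leaf} shrinks the diameter by {gains[best_leaf]:.1f}")
```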

  13. White Wine Quality

    • kaggle.com
    Updated Sep 28, 2020
    + more versions
    Cite
    Piyush Agnihotri (2020). White Wine Quality [Dataset]. https://www.kaggle.com/datasets/piyushagni5/white-wine-quality/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 28, 2020
    Dataset provided by
    Kaggle
    Authors
    Piyush Agnihotri
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, refer to [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
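    A minimal sketch of the outlier-detection idea mentioned above, using the semicolon-separated white-wine file as distributed by UCI and scikit-learn's IsolationForest as one of several reasonable detectors:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# winequality-white.csv is the semicolon-separated file from the UCI repository.
wine = pd.read_csv("winequality-white.csv", sep=";")
features = wine.drop(columns="quality")

detector = IsolationForest(contamination=0.05, random_state=0).fit(features)
wine["outlier"] = detector.predict(features) == -1   # -1 marks flagged samples

# Do flagged wines coincide with the rare very poor / excellent quality scores?
print(wine.groupby("outlier")["quality"].describe())
```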

    Content

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests):
    1 - fixed acidity
    2 - volatile acidity
    3 - citric acid
    4 - residual sugar
    5 - chlorides
    6 - free sulfur dioxide
    7 - total sulfur dioxide
    8 - density
    9 - pH
    10 - sulphates
    11 - alcohol

    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

    Acknowledgements

    This dataset is also available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality). To obtain both datasets, i.e. the red and white vinho verde wine samples from the north of Portugal, please visit the link above.

    Please include this citation if you plan to use this database:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Inspiration

    We Kagglers can apply several machine-learning algorithms to determine which physicochemical properties make a wine 'good'!

    Relevant papers

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

  14. Richmond County Wave Data (Données sur les vagues du comté de Richmond)

    • catalogue.cioosatlantic.ca
    erddap, html
    Updated Aug 29, 2025
    Cite
    Centre for Marine Applied Research (CMAR) (2025). Données sur les vagues du comté de Richmond [Dataset]. https://catalogue.cioosatlantic.ca/dataset/ca-cioos_46820ade-42f1-3506-b185-29148018bc59
    Explore at:
    erddap, html (available download formats)
    Dataset updated
    Aug 29, 2025
    Dataset provided by
    CMAR
    Authors
    Centre for Marine Applied Research (CMAR)
    Time period covered
    Jul 8, 2020 - Present
    Area covered
    Variables measured
    Sea State
    Description

    The Centre for Marine Applied Research (CMAR) provides high-resolution data on ocean variables along the Nova Scotia coast through its Coastal Monitoring Program. The program was launched by the Nova Scotia Department of Fisheries and Aquaculture (NSDFA) in 2015; in 2019, CMAR assumed responsibility for the program and expanded its scope and mandate. Through the wave branch of the program, CMAR processes and publishes wave data collected by NSDFA. The data are measured by Acoustic Doppler Current Profilers (ADCPs) mounted on the seafloor for 1 to 3 months. An ADCP uses sound to measure wave variables, including height, period, and direction. NSDFA uses several ADCP instrument models, including the Sentinel V20, Sentinel V50, Sentinel V100, and Workhorse Sentinel 600 kHz. The data are processed using the Velocity and WavesMon4 software, then compiled and formatted for publication with CMAR's waves R package. An automated "gross range" test was applied to the data to identify statistical outliers. Each data point received a flag value of "pass", "suspect/of interest", or "fail". Observations flagged as "pass" passed the test (i.e., were not considered outliers) and may be included in analyses. Observations flagged as "fail" were considered outliers and should be excluded from most analyses. Observations flagged as "suspect/of interest" were either of poor quality or highlight an unusual event, and should be carefully reviewed before being included in analyses. The flags should be used only as a guide, and data users are responsible for evaluating data quality before use in any analysis. The "Nova Scotia Current and Wave Data: Deployment Information" dataset on this portal shows all locations with available wave data collected through CMAR's Coastal Monitoring Program (https://data.novascotia.ca/fishing-and-aquaculture/nova-scotia-current-and-wave-deployment-for/uban-q9i2/about_data). Summary reports for each county are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Data collection and retrieval are ongoing. Datasets and reports may be revised pending ongoing data collection and analyses. If you have accessed Coastal Monitoring Program data, CMAR would appreciate your anonymous feedback: https://docs.google.com/forms/d/e/1faipqlse3td6umrsvvknql13vvmjipckci2ctonjsgn7_g-4c-tktuw/Viewform. Please acknowledge the Centre for Marine Applied Research in any published material that uses these data. Contact info@cmar.ca for more information.
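    The automated gross range test described above assigns pass / suspect / fail flags from fixed bounds. A minimal generic sketch follows; the column name and thresholds are illustrative, not CMAR's configured values.

```python
import pandas as pd

def gross_range_flag(values, fail_low, fail_high, suspect_low, suspect_high):
    """Assign "pass", "suspect/of interest" or "fail" from fixed bounds,
    in the spirit of the automated gross range test described above."""
    flags = pd.Series("pass", index=values.index)
    flags[(values < suspect_low) | (values > suspect_high)] = "suspect/of interest"
    flags[(values < fail_low) | (values > fail_high)] = "fail"
    return flags

waves = pd.DataFrame({"significant_wave_height_m": [0.4, 1.2, 2.5, 9.0, 25.0]})
waves["flag"] = gross_range_flag(waves["significant_wave_height_m"],
                                 fail_low=0.0, fail_high=20.0,
                                 suspect_low=0.0, suspect_high=8.0)
print(waves)
```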

  15. Yarmouth County Wave Data (Données sur les vagues du comté de Yarmouth)

    • catalogue.cioosatlantic.ca
    erddap, html
    Updated Aug 29, 2025
    Cite
    Centre for Marine Applied Research (CMAR) (2025). Données sur les vagues du comté de Yarmouth [Dataset]. https://catalogue.cioosatlantic.ca/dataset/ca-cioos_1fb1fbdb-c876-3f3c-8f14-bc0a7113a680
    Explore at:
    erddap, html (available download formats)
    Dataset updated
    Aug 29, 2025
    Dataset provided by
    CMAR
    Authors
    Centre for Marine Applied Research (CMAR)
    Time period covered
    Oct 5, 2018 - Present
    Area covered
    Variables measured
    Sea State
    Description

    The Centre for Marine Applied Research (CMAR) provides high-resolution data on ocean variables along the Nova Scotia coast through its Coastal Monitoring Program. The program was launched by the Nova Scotia Department of Fisheries and Aquaculture (NSDFA) in 2015; in 2019, CMAR assumed responsibility for the program and expanded its scope and mandate. Through the wave branch of the program, CMAR processes and publishes wave data collected by NSDFA. The data are measured by Acoustic Doppler Current Profilers (ADCPs) mounted on the seafloor for 1 to 3 months. An ADCP uses sound to measure wave variables, including height, period, and direction. NSDFA uses several ADCP instrument models, including the Sentinel V20, Sentinel V50, Sentinel V100, and Workhorse Sentinel 600 kHz. The data are processed using the Velocity and WavesMon4 software, then compiled and formatted for publication with CMAR's waves R package. An automated "gross range" test was applied to the data to identify statistical outliers. Each data point received a flag value of "pass", "suspect/of interest", or "fail". Observations flagged as "pass" passed the test (i.e., were not considered outliers) and may be included in analyses. Observations flagged as "fail" were considered outliers and should be excluded from most analyses. Observations flagged as "suspect/of interest" were either of poor quality or highlight an unusual event, and should be carefully reviewed before being included in analyses. The flags should be used only as a guide, and data users are responsible for evaluating data quality before use in any analysis. The "Nova Scotia Current and Wave Data: Deployment Information" dataset on this portal shows all locations with available wave data collected through CMAR's Coastal Monitoring Program (https://data.novascotia.ca/fishing-and-aquaculture/nova-scotia-current-and-wave-deployment-for/uban-q9i2/about_data). Summary reports for each county are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Data collection and retrieval are ongoing. Datasets and reports may be revised pending ongoing data collection and analyses. If you have accessed Coastal Monitoring Program data, CMAR would appreciate your anonymous feedback: https://docs.google.com/forms/d/e/1faipqlse3td6umrsvvknql13vvmjipckci2ctonjsgn7_g-4c-tktuw/Viewform. Please acknowledge the Centre for Marine Applied Research in any published material that uses these data. Contact info@cmar.ca for more information.

  16. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement. Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including downy mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing. Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA). The dataset was examined using visuals such as scatter plots and histograms. The data were examined for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology. Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.

    Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development. The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
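    A minimal Keras sketch of a small CNN with the dropout and L2 regularization mentioned here; the architecture, image size, and binary healthy-vs-diseased setup are assumptions for illustration, not the project's actual model.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Small binary classifier (healthy vs. diseased); 128x128 RGB inputs assumed.
model = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Rescaling(1.0 / 255),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                          # dropout regularization
    layers.Dense(64, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 penalty
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```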

    7. Model Training. During training, the model was fed the prepared dataset over a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

    8. Model Evaluation. Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both training and test datasets.

    Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion. Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project together with the methods used to address them.

    10. Conclusion. A recap of the project's key learnings; the project's importance for early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.

    11. References. Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository: https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  17. Guysborough County Wave Data (Données sur les vagues du comté de Guysborough)

    • catalogue.cioosatlantic.ca
    erddap, html
    Updated Aug 29, 2025
    Cite
    Centre for Marine Applied Research (CMAR) (2025). Données sur les vagues du comté de Guysborough [Dataset]. https://catalogue.cioosatlantic.ca/dataset/ca-cioos_73730e8d-252a-3f85-8be4-4c525f69fb95
    Explore at:
    html, erddap (available download formats)
    Dataset updated
    Aug 29, 2025
    Dataset provided by
    CMAR
    Authors
    Centre for Marine Applied Research (CMAR)
    Time period covered
    Jul 9, 2014 - Present
    Area covered
    Variables measured
    Sea State
    Description

    The Centre for Marine Applied Research (CMAR) provides high-resolution data on ocean variables along the Nova Scotia coast through its Coastal Monitoring Program. The program was launched by the Nova Scotia Department of Fisheries and Aquaculture (NSDFA) in 2015; in 2019, CMAR assumed responsibility for the program and expanded its scope and mandate. Through the wave branch of the program, CMAR processes and publishes wave data collected by NSDFA. The data are measured by Acoustic Doppler Current Profilers (ADCPs) mounted on the seafloor for 1 to 3 months. An ADCP uses sound to measure wave variables, including height, period, and direction. NSDFA uses several ADCP instrument models, including the Sentinel V20, Sentinel V50, Sentinel V100, and Workhorse Sentinel 600 kHz. The data are processed using the Velocity and WavesMon4 software, then compiled and formatted for publication with CMAR's waves R package. An automated "gross range" test was applied to the data to identify statistical outliers. Each data point received a flag value of "pass", "suspect/of interest", or "fail". Observations flagged as "pass" passed the test (i.e., were not considered outliers) and may be included in analyses. Observations flagged as "fail" were considered outliers and should be excluded from most analyses. Observations flagged as "suspect/of interest" were either of poor quality or highlight an unusual event, and should be carefully reviewed before being included in analyses. The flags should be used only as a guide, and data users are responsible for evaluating data quality before use in any analysis. The "Nova Scotia Current and Wave Data: Deployment Information" dataset on this portal shows all locations with available wave data collected through CMAR's Coastal Monitoring Program (https://data.novascotia.ca/fishing-and-aquaculture/nova-scotia-current-and-wave-deployment-for/uban-q9i2/about_data). Summary reports for each county are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Data collection and retrieval are ongoing. Datasets and reports may be revised pending ongoing data collection and analyses. If you have accessed Coastal Monitoring Program data, CMAR would appreciate your anonymous feedback: https://docs.google.com/forms/d/e/1faipqlse3td6umrsvvknql13vvmjipckci2ctonjsgn7_g-4c-tktuw/Viewform. Please acknowledge the Centre for Marine Applied Research in any published material that uses these data. Contact info@cmar.ca for more information.

  18. Titanic: A Voyage into the Past

    • kaggle.com
    Updated Nov 20, 2023
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Asher Mehfooz
    Description

    Dataset Overview

    The Titanic dataset is a widely used benchmark dataset for machine learning and data science tasks. It contains information about passengers who boarded the RMS Titanic in 1912, including their age, sex, social class, and whether they survived the sinking of the ship. The dataset is divided into two main parts:

    Train.csv: This file contains information about 891 passengers who were used to train machine learning models. It includes the following features:

    PassengerId: A unique identifier for each passenger
    Survived: Whether the passenger survived (1) or not (0)
    Pclass: The passenger's social class (1 = Upper, 2 = Middle, 3 = Lower)
    Name: The passenger's name
    Sex: The passenger's sex (Male or Female)
    Age: The passenger's age
    SibSp: The number of siblings or spouses aboard the ship
    Parch: The number of parents or children aboard the ship
    Ticket: The passenger's ticket number
    Fare: The passenger's fare
    Cabin: The passenger's cabin number
    Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

    Test.csv: This file contains information about 418 passengers who were not used to train machine learning models. It includes the same features as train.csv, but does not include the Survived label. The goal of machine learning models is to predict whether or not each passenger in the test.csv file survived.

    Data Preparation

    Before using the Titanic dataset for machine learning tasks, it is important to perform some data preparation steps. These steps may include:

    Handling missing values: Some of the features in the dataset have missing values. These values can be imputed or removed, depending on the specific task.
    Encoding categorical variables: Some of the features in the dataset are categorical variables, such as Pclass, Sex, and Embarked. These variables need to be encoded numerically before they can be used by machine learning algorithms.
    Scaling numerical variables: Some of the features in the dataset are numerical variables, such as Age and Fare. These variables may need to be scaled to ensure that they are on the same scale.
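    A minimal pandas/scikit-learn sketch of the three preparation steps listed above (imputation, categorical encoding, scaling), assuming train.csv has been downloaded to the working directory:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.read_csv("train.csv")
numeric = ["Age", "Fare", "SibSp", "Parch"]
categorical = ["Pclass", "Sex", "Embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

X = preprocess.fit_transform(train)
y = train["Survived"]
print(X.shape, y.value_counts().to_dict())
```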

    Data Visualization

    Data visualization can be a useful tool for exploring the Titanic dataset and gaining insights into the data. Some common data visualization techniques that can be used with the Titanic dataset include:

    Histograms: Histograms can be used to visualize the distribution of numerical variables, such as Age and Fare.
    Scatter plots: Scatter plots can be used to visualize the relationship between two numerical variables.
    Box plots: Box plots can be used to visualize the distribution of a numerical variable across different categories, such as Pclass and Sex.

    Machine Learning Tasks

    The Titanic dataset can be used for a variety of machine learning tasks, including:

    Classification: The most common task is to use the train.csv file to train a machine learning model to predict whether or not each passenger in the test.csv file survived.
    Regression: The dataset can also be used to train a machine learning model to predict the fare of a passenger based on their other features.
    Anomaly detection: The dataset can also be used to identify anomalies, such as passengers who are outliers in terms of their age, social class, or other features.

  19. Data from: HacDivSel: Two new methods (haplotype-based and outlier-based) for the detection of divergent selection in pairs of populations

    • datasetcatalog.nlm.nih.gov
    Updated Apr 19, 2017
    Cite
    Carvajal-Rodríguez, Antonio (2017). HacDivSel: Two new methods (haplotype-based and outlier-based) for the detection of divergent selection in pairs of populations [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001791737
    Explore at:
    Dataset updated
    Apr 19, 2017
    Authors
    Carvajal-Rodríguez, Antonio
    Description

    The detection of genomic regions involved in local adaptation is an important topic in current population genetics. There are several detection strategies available depending on the kind of genetic and demographic information at hand. A common drawback is the high risk of false positives. In this study we introduce two complementary methods for the detection of divergent selection from populations connected by migration. Both methods have been developed with the aim of being robust to false positives. The first method combines haplotype information with inter-population differentiation (FST). Evidence of divergent selection is concluded only when both the haplotype pattern and the FST value support it. The second method is developed for independently segregating markers i.e. there is no haplotype information. In this case, the power to detect selection is attained by developing a new outlier test based on detecting a bimodal distribution. The test computes the FST outliers and then assumes that those of interest would have a different mode. We demonstrate the utility of the two methods through simulations and the analysis of real data. The simulation results showed power ranging from 60–95% in several of the scenarios whilst the false positive rate was controlled below the nominal level. The analysis of real samples consisted of phased data from the HapMap project and unphased data from intertidal marine snail ecotypes. The results illustrate that the proposed methods could be useful for detecting locally adapted polymorphisms. The software HacDivSel implements the methods explained in this manuscript.
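    The intuition behind the second method (loci under divergent selection should form a separate mode in the FST distribution) can be illustrated generically with a two-component mixture fit on simulated per-locus FST values. This is only a sketch of the bimodality idea, not HacDivSel's actual test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated per-locus FST: a neutral bulk plus a small, clearly shifted group.
fst = np.concatenate([rng.beta(2, 30, 980), rng.beta(20, 30, 20)]).reshape(-1, 1)

gm1 = GaussianMixture(n_components=1, random_state=0).fit(fst)
gm2 = GaussianMixture(n_components=2, random_state=0).fit(fst)
print("BIC, 1 component: ", round(gm1.bic(fst), 1))
print("BIC, 2 components:", round(gm2.bic(fst), 1))   # lower BIC = preferred model

# Loci assigned to the higher-mean component are the candidate outliers.
upper = np.argmax(gm2.means_.ravel())
candidates = np.where(gm2.predict(fst) == upper)[0]
print(f"{len(candidates)} candidate outlier loci")
```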

  20. Data from: Wine Quality

    • kaggle.com
    Updated Jul 9, 2018
    Cite
    Raj Parmar (2018). Wine Quality [Dataset]. https://www.kaggle.com/datasets/rajyellow46/wine-quality/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raj Parmar
    Description

    Data Set Information:

    The dataset was downloaded from the UCI Machine Learning Repository.

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For details, see the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

    The two datasets were combined and a few values were randomly removed.

    Attribute Information:

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests):
    1 - fixed acidity
    2 - volatile acidity
    3 - citric acid
    4 - residual sugar
    5 - chlorides
    6 - free sulfur dioxide
    7 - total sulfur dioxide
    8 - density
    9 - pH
    10 - sulphates
    11 - alcohol

    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

    Acknowledgements:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
