Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT The considerable volume of data generated by sensors in the field contains systematic errors, so excluding these errors is extremely important to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filtering process could help decrease the nugget effect and improve the characterization of spatial variability in high-density sampling data. We created a filter composed of a global analysis and an anisotropic local analysis of the data, the latter considering the values in each point's neighborhood. For that purpose, we used the median as the main statistical parameter to classify a given spatial point in the data set, taking into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil apparent electrical conductivity (ECa), and a sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability characterization within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % for corn yield, soil ECa, and SVI, respectively, compared to the interpolation errors of the raw data sets. The filter excluded local outliers, which considerably reduced the nugget effects and the estimation error of the interpolated data. The proposed methodology outperformed two other outlier-removal methodologies from the literature.
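A minimal sketch of this kind of neighbourhood-median test (with invented coordinates, radius, and threshold; the paper's actual filter also includes a global and an anisotropic stage) might look like:

```python
import math
import statistics

def median_filter_outliers(points, radius, max_dev):
    """Local median filter sketch: a point is flagged as an outlier when
    its value deviates from the median of its neighbours within `radius`
    by more than `max_dev`. Each point is (x, y, value)."""
    flags = []
    for i, (xi, yi, vi) in enumerate(points):
        neigh = [v for j, (x, y, v) in enumerate(points)
                 if j != i and math.hypot(x - xi, y - yi) <= radius]
        if not neigh:
            flags.append(False)   # no neighbourhood to judge against
            continue
        flags.append(abs(vi - statistics.median(neigh)) > max_dev)
    return flags
```

On a small grid of yield-like values, only the point whose value is far from its neighbourhood median is flagged.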
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
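The nested-loop-with-pruning idea can be sketched as follows (a simplified reconstruction, not the authors' code; `k`, `n_out`, and the distance function are illustrative parameters):

```python
import math
import random

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def top_outliers(data, k, n_out, dist):
    """Randomized nested-loop search for the n_out points with the largest
    distance to their k-th nearest neighbour. A candidate is pruned as soon
    as its running k-NN distance falls below the weakest score in the
    current top list, which is what gives near-linear behaviour on
    randomly ordered data."""
    data = list(data)
    random.shuffle(data)          # random order makes pruning effective
    top = []                      # (score, point) pairs, sorted ascending
    cutoff = 0.0
    for p in data:
        knn = []                  # distances to the k nearest seen so far
        pruned = False
        for q in data:
            if q is p:
                continue
            d = dist(p, q)
            if len(knn) < k:
                knn.append(d)
                knn.sort()
            elif d < knn[-1]:
                knn[-1] = d
                knn.sort()
            if len(knn) == k and knn[-1] <= cutoff:
                pruned = True     # cannot enter the top list; stop early
                break
        if not pruned:
            top.append((knn[-1], p))
            top.sort()
            top = top[-n_out:]
            if len(top) == n_out:
                cutoff = top[0][0]
    return [p for _, p in reversed(top)]
```

Because pruning only discards points whose score provably cannot exceed the current cutoff, the result is exact regardless of the shuffle order.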
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:
Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:
Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.
ABSTRACT The purpose of this work is to present the Weighted Forward Search (FSW) method for the detection of outliers in asset pricing data. This new estimator, which is based on an algorithm that downweights the most anomalous observations of the dataset, is tested using both simulated and empirical asset pricing data. The impact of outliers on the estimation of asset pricing models is assessed under different scenarios, and the results are evaluated with associated statistical tests based on this new approach. Our proposal generates an alternative procedure for robust estimation of portfolio betas, allowing for the comparison between concurrent asset pricing models. The algorithm, which is both efficient and robust to outliers, is used to provide robust estimates of the models’ parameters in a comparison with traditional econometric estimation methods usually used in the literature. In particular, the precision of the alphas is highly increased when the Forward Search (FS) method is used. We use Monte Carlo simulations, and also the well-known dataset of equity factor returns provided by Prof. Kenneth French, consisting of the 25 Fama-French portfolios on the United States of America equity market, using single- and three-factor models on a monthly and an annual basis. Our results indicate that the marginal rejection of the Fama-French three-factor model is influenced by the presence of outliers in the portfolios when using monthly returns. In annual data, the use of robust methods increases the rejection level of null alphas in the Capital Asset Pricing Model (CAPM) and the Fama-French three-factor model, with more efficient estimates in the absence of outliers and consistent alphas when outliers are present.
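The forward-search idea of growing a clean subset one observation at a time, so that anomalous points enter last, can be illustrated with a toy univariate version (this is not the FSW estimator itself, which operates in a regression setting):

```python
def forward_search_order(x, m0=None):
    """Toy univariate Forward Search: start from the half of the data
    closest to the median, then repeatedly add the observation nearest
    the current subset mean. Gross outliers enter the subset last,
    which is how they are detected."""
    med = sorted(x)[len(x) // 2]
    by_dist = sorted(range(len(x)), key=lambda i: abs(x[i] - med))
    m = m0 if m0 is not None else len(x) // 2
    order = by_dist[:m]          # initial clean subset (indices)
    inside = set(order)
    while len(inside) < len(x):
        mean = sum(x[i] for i in inside) / len(inside)
        j = min((i for i in range(len(x)) if i not in inside),
                key=lambda i: abs(x[i] - mean))
        inside.add(j)            # forward step: admit the closest point
        order.append(j)
    return order
```

Monitoring how the subset mean jumps as the last observations enter is the usual way such a search flags outliers.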
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
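For simple regression, median (least-absolute-deviation) regression can be illustrated by brute force (a hypothetical toy, not the proposed hypothesis test; applied work would use a quantile-regression routine):

```python
def lad_line(xs, ys):
    """Least-absolute-deviation (median-regression) fit for simple
    regression, by brute force over lines through pairs of points
    (an exact LAD solution always passes through at least two points).
    Illustrative sketch only."""
    best = None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue
            b = (ys[j] - ys[i]) / (xs[j] - xs[i])   # slope
            a = ys[i] - b * xs[i]                   # intercept
            loss = sum(abs(ys[k] - (a + b * xs[k])) for k in range(n))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best[1], best[2]
```

With one heavy-tailed error in otherwise linear data, the LAD fit recovers the underlying line, whereas an OLS fit would be pulled toward the outlier.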
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".
The data generated for the study can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprises three files, "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects, respectively. Here is a description of the datasets:
To plot the results and graphics in the article, there is a Jupyter Notebook, "Experiment.ipynb". It is initially configured to use the data in the "datasets" folder.
For replication purposes, the datasets containing recent projects from GitHub can be re-generated. To do so, install the dependencies from the "requirements.txt" file in the virtual environment, add GitHub tokens in the "./token" file, re-define or keep as-is the paths declared in the constants (variables written in caps) in the main method, and finally run the "main.py" script. The source code scanner SourceMeter for Windows is already installed in the project. If a new release becomes available, or if the tool needs to be run on a different OS, it can be replaced in the "./Sourcemeter/tool" directory.
The script comprises 5 steps:
In this work we apply and expand on a recently introduced outlier detection algorithm that is based on an unsupervised random forest. We use the algorithm to calculate a similarity measure for stellar spectra from the Apache Point Observatory Galactic Evolution Experiment (APOGEE). We show that the similarity measure traces non-trivial physical properties and contains information about complex structures in the data. We use it for visualization and clustering of the dataset, and discuss its ability to find groups of highly similar objects, including spectroscopic twins. Using the similarity matrix to search the dataset for objects allows us to find objects that are impossible to find using their best fitting model parameters. This includes extreme objects for which the models fail, and rare objects that are outside the scope of the model. We use the similarity measure to detect outliers in the dataset, and find a number of previously unknown Be-type stars, spectroscopic binaries, carbon-rich stars, young stars, and a few that we cannot interpret. Our work further demonstrates the potential for scientific discovery when combining machine learning methods with modern survey data. Cone search capability for table J/MNRAS/476/2117/apogeenn (Nearest neighbors APOGEE IDs)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset captures an in-depth look at the environmental conditions of the underwater world off Southern California's coast. It provides invaluable information related to spatial risk variation, such as oxygen exposure levels, depths, and habitat criteria for 53 species of benthic and epibenthic megafauna recorded during the three-year study. These data will provide insight into aquatic life dynamics and may help generate improved management strategies for protecting these vital species. Moreover, given the role that the oceans play within our planet's fragile ecosystem, a proper understanding of their condition could lead to greater marine sustainability in the long term. Ultimately, this dataset may help answer questions about how ocean life is responding to intense human activity and its effects on today's seaside communities.
1) Download the dataset: The dataset contains two .csv files, each containing data from the three-year study on oxygen exposure for benthic and epibenthic megafauna off the coast of San Diego in Southern California. Download both files to your computer and save them for further analysis.
2) Familiarize yourself with the datasets: Each file includes very detailed information about a particular variable related to the study (for example, SpeciesMetadata contains species-level information on the 53 species of benthic and epibenthic megafauna). Read through each data sheet carefully to gain a better understanding of what is included in each column.
3) Clean up any outliers or missing values: Once you understand which columns are important for your analysis, begin cleaning up any outliers or missing values present in your dataset. This is an important step, as it helps ensure that further analysis is performed accurately.
4) Choose an appropriate visualization method: Depending on the results you want to show from your analysis, choose an appropriate visualization method (e.g., bar plot, scatterplot). Also consider whether labeling, such as coloring by category, would improve the legibility of the figures you produce during the exploratory data analysis stage.
5) Choose a statistical test suitable for this type of project: Once all your visuals have been produced, interpret the results using statistical tests chosen according to how many categorical variables are present in the data set (e.g., t-test or ANOVA). Understand key outputs such as p-values so the experiment can effectively conclude whether there are significant differences between treatments when comparing distributions among the samples/populations studied here. Be sure to choose sample sizes and tests that provide adequate statistical power.
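The outlier-cleaning step above is often done with Tukey's IQR fences; a pure-Python sketch with made-up numbers:

```python
def iqr_bounds(values, k=1.5):
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are treated
    as outliers. Quartiles are computed by linear interpolation."""
    s = sorted(values)
    def q(p):
        idx = p * (len(s) - 1)
        lo = int(idx)
        return s[lo] + (idx - lo) * (s[min(lo + 1, len(s) - 1)] - s[lo])
    q1, q3 = q(0.25), q(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values, not taken from the actual dataset.
data = [4.1, 3.9, 4.0, 4.2, 4.1, 12.0, 3.8]
lo, hi = iqr_bounds(data)
cleaned = [v for v in data if lo <= v <= hi]
```

Whether a flagged value should actually be dropped (rather than inspected) depends on the variable and the analysis.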
- Comparing the effects of different environmental factors (depth, temperature, salinity etc.) on depth-specific distributions of oxygen and benthic megafauna.
- Identifying and mapping vulnerable areas for benthic species based on environmental factors and oxygen exposure patterns.
- Developing models to predict underlying spatial risk variables for endangered species to inform conservation efforts in the study area
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ROVObservationData.csv
File: SpeciesMetadata.csv
If you use this dataset in your research, please credit the original authors.
Gene expression data have been presented as non-normalized (2^−Ct × 10^9) values in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student’s t-test, p < 0.05) between the outlier and non-outlier averages for a parameter (instead using normalized gene expression data), the lower of the two values has been underlined. All samples hosted Symbiodinium of clade C only unless noted otherwise. The mean Mahalanobis distance did not differ between Pocillopora damicornis and P. acuta (Student’s t-test, p > 0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach (1) by allowing for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
Usage notes
SourceCode_DamModel: Source code used when doing type I error testing of the balanced or unbalanced half-sib dam model (DamModel_WorkingCopy.R).
SireModel_WorkingCopy: Source code used when doing type I error testing of the unbalanced half-sib sire model.
TypeI_ErrorTest_DamBalanced: R code to run the error testing of the balanced half-sib dam model over 1000 replicate datasets.
TypeI_ErrorTest_DamUnbalanced: R code to run the error testing of the unbalanced half-sib dam model over 1000 replicate datasets.
TypeI_ErrorTest_SireUnbalanced: R code to run the error testing of the unbalanced half-sib sire model over 1000 replicate datasets.
NemoReplicates: Zipped file containing the 1000 simulated replicate datasets from Nemo used for type I error testing.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains 157 rows of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.
The dataset simulates a university admission record system, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw data, offering hands-on experience in data wrangling.
The dataset contains the following columns:
- Name: Student's first name (Pakistani names).
- Age: Age of the student (some outliers and missing values).
- Gender: Gender (Male/Female).
- Admission Test Score: Score obtained in the admission test (includes outliers and missing values).
- High School Percentage: Student's high school final score percentage (includes outliers and missing values).
- City: City of residence in Pakistan.
- Admission Status: Whether the student was accepted or rejected.
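A sketch of typical cleaning steps for such a table (hypothetical rows; the real file's names and values differ):

```python
# Hypothetical rows mirroring the columns above.
rows = [
    {"Name": "Ali", "Age": 19, "Admission Test Score": 78.0},
    {"Name": "Ali", "Age": 19, "Admission Test Score": 78.0},    # duplicate
    {"Name": "Sara", "Age": None, "Admission Test Score": 85.0}, # missing Age
    {"Name": "Hamza", "Age": 250, "Admission Test Score": 60.0}, # unrealistic Age
]

# 1. Drop exact duplicates while preserving order.
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Treat impossible ages as missing, then impute with the median age.
for r in deduped:
    if r["Age"] is not None and not (15 <= r["Age"] <= 60):
        r["Age"] = None
ages = sorted(r["Age"] for r in deduped if r["Age"] is not None)
median_age = ages[len(ages) // 2]
for r in deduped:
    if r["Age"] is None:
        r["Age"] = median_age
```

The plausible-age range and the choice of median imputation are assumptions for illustration; in practice they should be justified by the data.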
https://spdx.org/licenses/CC0-1.0.html
Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.
Methods: All the raw data are obtained from other publications, as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.
Dataset | Species | Genes | Download
Plants | 104 | 852 | DOI 10.1186/2047-217X-3-17
Mammals | 37 | 424 | DOI 10.13012/C5BG2KWG
Insects | 144 | 1478 | http://esayyari.github.io/InsectsData
Cannon | 78 | 213 | DOI 10.5061/dryad.493b7
Rouse | 26 | 393 | DOI 10.5061/dryad.79dq1
Frogs | 164 | 95 | DOI 10.5061/dryad.12546.2
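The k-shrink objective can be illustrated for k = 1 with a brute-force toy over pairwise leaf distances (the paper's polynomial-time algorithm is far more efficient; this is only to show what is being optimized):

```python
def best_leaf_to_drop(dist):
    """For leaves with pairwise path distances `dist` (a dict mapping
    frozenset({a, b}) -> length), find the single leaf whose removal
    shrinks the tree diameter the most, i.e. the k = 1 case of the
    k-shrink problem. Brute-force sketch."""
    leaves = set()
    for pair in dist:
        leaves |= pair
    def diameter(keep):
        return max(d for pair, d in dist.items() if pair <= keep)
    full = diameter(leaves)
    best = min(leaves, key=lambda x: diameter(leaves - {x}))
    return best, full - diameter(leaves - {best})
```

A leaf on an artificially long branch dominates the diameter, so removing it gives the largest reduction.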
http://opendatacommons.org/licenses/dbcl/1.0/
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, refer to [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10)
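One simple way to surface the "few excellent or poor wines" mentioned above is a modified z-score per feature (a generic sketch with invented numbers, not the method of Cortez et al.):

```python
def robust_z_flags(rows, threshold=3.5):
    """Flag samples whose modified z-score (based on median and MAD)
    exceeds `threshold` on any feature. Rows are tuples of numeric
    physicochemical features; data here would be e.g. (fixed acidity,
    volatile acidity, ...)."""
    n_feat = len(rows[0])
    flags = [False] * len(rows)
    for f in range(n_feat):
        col = [r[f] for r in rows]
        med = sorted(col)[len(col) // 2]
        mad = sorted(abs(v - med) for v in col)[len(col) // 2]
        if mad == 0:
            continue              # feature carries no spread to judge by
        for i, v in enumerate(col):
            if abs(0.6745 * (v - med) / mad) > threshold:
                flags[i] = True
    return flags
```

The 0.6745 factor makes the MAD comparable to a standard deviation under normality; 3.5 is a commonly cited default cutoff.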
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality. To get both datasets (red and white Vinho Verde wine samples from the north of Portugal), please visit the link above.
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
We Kagglers can apply several machine-learning algorithms to determine which physicochemical properties make a wine 'good'!
The Center for Marine Applied Research (CMAR) provides high-resolution data on ocean variables along the Nova Scotia coast through its Coastal Monitoring Program. The program was launched by the Nova Scotia Department of Fisheries and Aquaculture (NSDFA) in 2015; in 2019, CMAR assumed responsibility for the program and expanded its scope and mandate. Through the wave branch of the program, CMAR processes and publishes wave data collected by NSDFA. The data are measured by acoustic Doppler current profilers (ADCPs) mounted on the seafloor for 1 to 3 months. An ADCP uses sound to measure wave variables, including height, period, and direction. NSDFA uses several ADCP instrument models, including the Sentinel V20, Sentinel V50, Sentinel V100, and Workhorse Sentinel 600 kHz. The data are processed using the Velocity and WavesMon4 software, then compiled and formatted for publication with CMAR's waves R package. An automated "gross range" test was applied to the data to identify statistical outliers. Each data point received a flag value of "pass", "suspect/of interest", or "fail". Observations flagged as "pass" passed the test (i.e., were not considered outliers) and may be included in analyses. Observations flagged as "fail" were considered outliers and should be excluded from most analyses. Observations flagged as "suspect/of interest" were either of poor quality or highlight an unusual event, and should be carefully examined before being included in analyses. The flags should be used only as a guide, and data users are responsible for assessing data quality before use in any analysis. The dataset "Nova Scotia Current and Wave Data: Deployment Information" on this portal shows all locations with available wave data collected through CMAR's Coastal Monitoring Program (https://data.novascotia.ca/fishing-and-aquaculture/nova-scotia-current-and-wave-deployment-for/uban-q9i2/about_data). Summary reports for each county are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Data collection and retrieval are ongoing; datasets and reports may be revised pending ongoing data collection and analyses. If you have accessed Coastal Monitoring Program data, CMAR would appreciate your anonymous feedback: https://docs.google.com/forms/d/e/1faipqlse3td6umrsvvknql13vvmjipckci2ctonjsgn7_g-4c-tktuw/Viewform. Please acknowledge the Center for Marine Applied Research in any published material that uses these data. Contact info@cmar.ca for more information.
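The automated "gross range" test described above can be sketched as a simple threshold check (the threshold values here are placeholders, not CMAR's operational limits):

```python
def gross_range_flags(values, fail_lo, fail_hi, suspect_lo, suspect_hi):
    """Gross-range QC: each observation gets 'pass', 'suspect/of interest',
    or 'fail'. Values outside the hard (fail) limits are flagged 'fail';
    values inside the hard limits but outside the narrower suspect limits
    are flagged 'suspect/of interest'."""
    flags = []
    for v in values:
        if v < fail_lo or v > fail_hi:
            flags.append("fail")
        elif v < suspect_lo or v > suspect_hi:
            flags.append("suspect/of interest")
        else:
            flags.append("pass")
    return flags
```

As the description notes, the flags are a guide only; "suspect/of interest" observations deserve manual review before inclusion.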
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: The "Cucumber Disease Detection" project aims to develop a machine learning model for the automatic detection of diseases in cucumber plants. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of images of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images were gathered from agricultural areas using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA) The dataset was examined using visualizations such as scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
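The train-test split step can be sketched as follows (the ratio and seed are illustrative, not the project's actual settings):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle indices and split `rows` into train and test sets.
    A fixed seed keeps the split reproducible across runs."""
    idx = list(range(len(rows)))
    rng = random.Random(seed)
    rng.shuffle(idx)
    cut = int(len(rows) * (1 - test_frac))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test
```

For image classification, a stratified split (preserving the healthy/diseased ratio in both sets) is usually preferable to a plain random split.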
Model Development The CNN architecture is defined by its layers, the number of units per layer, and the activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training During training, the model was fed the prepared dataset over a number of epochs. The loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.
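The early-stopping logic mentioned above can be sketched framework-independently (the patience value is illustrative):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch (0-based) at which training would halt: stop once
    the validation loss has not improved for `patience` epochs. The best
    checkpoint is the one saved at the epoch of the lowest loss."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch          # halt here; restore weights from best_epoch
    return len(val_losses) - 1
```

Deep learning frameworks ship equivalents (e.g. an early-stopping callback), but the underlying rule is this simple.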
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
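The listed metrics follow directly from confusion-matrix counts (the counts below are illustrative, not the project's results):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts
    (true/false positives and negatives) for a binary classifier."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)        # of predicted diseased, how many were
    recall = tp / (tp + fn)           # of truly diseased, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For multi-class disease labels, the same quantities are computed per class and then macro- or micro-averaged.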
Results and Discussion Key project findings include model performance and disease-detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project, along with the methods used to solve them.
Conclusion A recap of the project's key learnings; the project's importance to early disease detection in agriculture is highlighted; future enhancements and potential research directions are suggested.
References Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
**Dataset Overview** The Titanic dataset is a widely used benchmark dataset for machine learning and data science tasks. It contains information about passengers who boarded the RMS Titanic in 1912, including their age, sex, social class, and whether they survived the sinking of the ship. The dataset is divided into two main parts:
train.csv: This file contains information about the 891 passengers used to train machine learning models. It includes the following features:

- PassengerId: A unique identifier for each passenger
- Survived: Whether the passenger survived (1) or not (0)
- Pclass: The passenger's social class (1 = Upper, 2 = Middle, 3 = Lower)
- Name: The passenger's name
- Sex: The passenger's sex (male or female)
- Age: The passenger's age
- SibSp: The number of siblings or spouses aboard the ship
- Parch: The number of parents or children aboard the ship
- Ticket: The passenger's ticket number
- Fare: The fare the passenger paid
- Cabin: The passenger's cabin number
- Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

test.csv: This file contains information about 418 passengers held out from training. It includes the same features as train.csv, but does not include the Survived label. The goal of a machine learning model is to predict whether each passenger in test.csv survived.
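To make the schema concrete, here is a tiny synthetic sample with the same columns as train.csv (the rows below are illustrative, not taken from the real file):

```python
import pandas as pd

# Three synthetic rows following the train.csv schema described above.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss Anna"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# The Survived column is the label a model would learn to predict.
survival_rate = train["Survived"].mean()
```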
**Data Preparation** Before using the Titanic dataset for machine learning tasks, it is important to perform some data preparation steps. These steps may include:
- Handling missing values: Some of the features in the dataset have missing values. These values can be imputed or removed, depending on the specific task.
- Encoding categorical variables: Some of the features, such as Pclass, Sex, and Embarked, are categorical. These variables need to be encoded numerically before they can be used by machine learning algorithms.
- Scaling numerical variables: Some of the features, such as Age and Fare, are numerical. These variables may need to be scaled so that they are on a comparable scale.
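A minimal sketch of these three steps with pandas, on a small synthetic sample; the imputation and encoding choices here are illustrative defaults, not the only reasonable ones:

```python
import pandas as pd

# Small synthetic sample with the kinds of gaps found in the real data.
df = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", "S", None],
})

# Handle missing values: median for Age, most frequent value for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode categorical variables as numeric indicator columns.
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)

# Scale numerical variables to zero mean and unit variance (z-score).
for col in ["Age", "Fare"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```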
**Data Visualization**

Data visualization can be a useful tool for exploring the Titanic dataset and gaining insights into the data. Some common data visualization techniques that can be used with the Titanic dataset include:
- Histograms: visualize the distribution of numerical variables, such as Age and Fare.
- Scatter plots: visualize the relationship between two numerical variables.
- Box plots: visualize the distribution of a numerical variable across different categories, such as Pclass and Sex.
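For example, the three plot types above can be produced with matplotlib; the sample values are synthetic, and the figure is written to a file since no display is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, not a window
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative synthetic sample, not the real passenger records.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.07, 11.13, 30.07],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["Age"], bins=5)               # histogram: Age distribution
axes[0].set_title("Age")
axes[1].scatter(df["Age"], df["Fare"])        # scatter: Age vs. Fare
axes[1].set_title("Age vs. Fare")
df.boxplot(column="Fare", by="Pclass", ax=axes[2])  # box plot: Fare by class
fig.savefig("titanic_eda.png")
```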
**Machine Learning Tasks**

The Titanic dataset can be used for a variety of machine learning tasks, including:
- Classification: The most common task is to use the train.csv file to train a machine learning model to predict whether each passenger in the test.csv file survived.
- Regression: The dataset can also be used to train a model to predict a passenger's fare based on their other features.
- Anomaly detection: The dataset can also be used to identify anomalies, such as passengers who are outliers in terms of their age, social class, or other features.
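For the classification task, a deliberately simple rule-based baseline (predict survival from Sex alone) is often used as a sanity check before training a real model. A sketch on a synthetic labeled sample:

```python
import pandas as pd

# Synthetic labeled sample standing in for train.csv.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})

# Baseline rule: predict that female passengers survived, males did not.
pred = (train["Sex"] == "female").astype(int)
accuracy = (pred == train["Survived"]).mean()  # fraction of correct predictions
```

Any trained model should beat this kind of one-feature rule; if it does not, the pipeline likely has a bug.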
The detection of genomic regions involved in local adaptation is an important topic in current population genetics. There are several detection strategies available depending on the kind of genetic and demographic information at hand. A common drawback is the high risk of false positives. In this study we introduce two complementary methods for the detection of divergent selection from populations connected by migration. Both methods have been developed with the aim of being robust to false positives. The first method combines haplotype information with inter-population differentiation (FST). Evidence of divergent selection is concluded only when both the haplotype pattern and the FST value support it. The second method is developed for independently segregating markers, i.e., when no haplotype information is available. In this case, the power to detect selection is attained by developing a new outlier test based on detecting a bimodal distribution. The test computes the FST outliers and then assumes that those of interest would have a different mode. We demonstrate the utility of the two methods through simulations and the analysis of real data. The simulation results showed power ranging from 60–95% in several of the scenarios whilst the false positive rate was controlled below the nominal level. The analysis of real samples consisted of phased data from the HapMap project and unphased data from intertidal marine snail ecotypes. The results illustrate that the proposed methods could be useful for detecting locally adapted polymorphisms. The software HacDivSel implements the methods explained in this manuscript.
Data Set Information:
The dataset was downloaded from the UCI Machine Learning Repository.
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine; for details, see the reference [Cortez et al., 2009]. Due to privacy and logistics issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, it is not certain that all input variables are relevant, so it could be interesting to test feature selection methods.
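One simple way to flag such wines is a z-score rule on the quality score. The scores below are made up, and the 1.8 threshold is tuned to this tiny sample rather than being a recommended default:

```python
import pandas as pd

# Hypothetical quality scores: mostly average wines, one poor, one excellent.
quality = pd.Series([5, 6, 5, 6, 5, 7, 6, 5, 6, 3, 9])

# Standardize the scores and flag those far from the bulk of the distribution.
z = (quality - quality.mean()) / quality.std()
outliers = quality[z.abs() > 1.8]
```

On the real data, a distance- or density-based outlier detector over the physicochemical inputs would be a more principled choice than thresholding the label itself.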
The two datasets were combined and a few values were randomly removed.
Attribute Information:
For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Acknowledgements:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.