Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT The considerable volume of data generated by sensors in the field contains systematic errors, so excluding these errors is extremely important to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filtering process could help decrease the nugget effect and improve the characterization of spatial variability in high-density sampling data. We created a filter composed of a global analysis and an anisotropic local analysis of the data, the latter considering the values in each point's neighborhood. For that purpose, we used the median as the main statistical parameter to classify a given spatial point in the data set, taking into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil apparent electrical conductivity (ECa), and a sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability characterization within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % for corn yield, soil ECa, and SVI, respectively, compared to the interpolation errors of the raw data sets. The filter excluded local outliers, which considerably reduced the nugget effects and the estimation error of the interpolated data. The proposed methodology outperformed two other outlier-removal methodologies from the literature.
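A minimal sketch of this kind of neighbourhood-median test (with invented coordinates, radius, and threshold; the paper's actual filter also includes a global and an anisotropic stage) might look like:

```python
import math
import statistics

def median_filter_outliers(points, radius, max_dev):
    """Local median filter sketch: a point is flagged as an outlier when
    its value deviates from the median of its neighbours within `radius`
    by more than `max_dev`. Each point is (x, y, value)."""
    flags = []
    for i, (xi, yi, vi) in enumerate(points):
        neigh = [v for j, (x, y, v) in enumerate(points)
                 if j != i and math.hypot(x - xi, y - yi) <= radius]
        if not neigh:
            flags.append(False)   # no neighbourhood to judge against
            continue
        flags.append(abs(vi - statistics.median(neigh)) > max_dev)
    return flags
```

On a small grid of yield-like values, only the point whose value is far from its neighbourhood median is flagged.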
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
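The nested-loop-with-pruning idea can be sketched as follows (a simplified reconstruction, not the authors' code; `k`, `n_out`, and the distance function are illustrative parameters):

```python
import math
import random

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def top_outliers(data, k, n_out, dist):
    """Randomized nested-loop search for the n_out points with the largest
    distance to their k-th nearest neighbour. A candidate is pruned as soon
    as its running k-NN distance falls below the weakest score in the
    current top list, which is what gives near-linear behaviour on
    randomly ordered data."""
    data = list(data)
    random.shuffle(data)          # random order makes pruning effective
    top = []                      # (score, point) pairs, sorted ascending
    cutoff = 0.0
    for p in data:
        knn = []                  # distances to the k nearest seen so far
        pruned = False
        for q in data:
            if q is p:
                continue
            d = dist(p, q)
            if len(knn) < k:
                knn.append(d)
                knn.sort()
            elif d < knn[-1]:
                knn[-1] = d
                knn.sort()
            if len(knn) == k and knn[-1] <= cutoff:
                pruned = True     # cannot enter the top list; stop early
                break
        if not pruned:
            top.append((knn[-1], p))
            top.sort()
            top = top[-n_out:]
            if len(top) == n_out:
                cutoff = top[0][0]
    return [p for _, p in reversed(top)]
```

Because pruning only discards points whose score provably cannot exceed the current cutoff, the result is exact regardless of the shuffle order.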
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:
Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:
Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.
ABSTRACT The purpose of this work is to present the Weighted Forward Search (FSW) method for the detection of outliers in asset pricing data. This new estimator, which is based on an algorithm that downweights the most anomalous observations of the dataset, is tested using both simulated and empirical asset pricing data. The impact of outliers on the estimation of asset pricing models is assessed under different scenarios, and the results are evaluated with associated statistical tests based on this new approach. Our proposal generates an alternative procedure for robust estimation of portfolio betas, allowing for the comparison between concurrent asset pricing models. The algorithm, which is both efficient and robust to outliers, is used to provide robust estimates of the models’ parameters in a comparison with traditional econometric estimation methods usually used in the literature. In particular, the precision of the alphas is highly increased when the Forward Search (FS) method is used. We use Monte Carlo simulations, and also the well-known dataset of equity factor returns provided by Prof. Kenneth French, consisting of the 25 Fama-French portfolios on the United States of America equity market, using single- and three-factor models on a monthly and an annual basis. Our results indicate that the marginal rejection of the Fama-French three-factor model is influenced by the presence of outliers in the portfolios when using monthly returns. In annual data, the use of robust methods increases the rejection level of null alphas in the Capital Asset Pricing Model (CAPM) and the Fama-French three-factor model, with more efficient estimates in the absence of outliers and consistent alphas when outliers are present.
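The forward-search idea of growing a clean subset one observation at a time, so that anomalous points enter last, can be illustrated with a toy univariate version (this is not the FSW estimator itself, which operates in a regression setting):

```python
def forward_search_order(x, m0=None):
    """Toy univariate Forward Search: start from the half of the data
    closest to the median, then repeatedly add the observation nearest
    the current subset mean. Gross outliers enter the subset last,
    which is how they are detected."""
    med = sorted(x)[len(x) // 2]
    by_dist = sorted(range(len(x)), key=lambda i: abs(x[i] - med))
    m = m0 if m0 is not None else len(x) // 2
    order = by_dist[:m]          # initial clean subset (indices)
    inside = set(order)
    while len(inside) < len(x):
        mean = sum(x[i] for i in inside) / len(inside)
        j = min((i for i in range(len(x)) if i not in inside),
                key=lambda i: abs(x[i] - mean))
        inside.add(j)            # forward step: admit the closest point
        order.append(j)
    return order
```

Monitoring how the subset mean jumps as the last observations enter is the usual way such a search flags outliers.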
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
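For simple regression, median (least-absolute-deviation) regression can be illustrated by brute force (a hypothetical toy, not the proposed hypothesis test; applied work would use a quantile-regression routine):

```python
def lad_line(xs, ys):
    """Least-absolute-deviation (median-regression) fit for simple
    regression, by brute force over lines through pairs of points
    (an exact LAD solution always passes through at least two points).
    Illustrative sketch only."""
    best = None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue
            b = (ys[j] - ys[i]) / (xs[j] - xs[i])   # slope
            a = ys[i] - b * xs[i]                   # intercept
            loss = sum(abs(ys[k] - (a + b * xs[k])) for k in range(n))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best[1], best[2]
```

With one heavy-tailed error in otherwise linear data, the LAD fit recovers the underlying line, whereas an OLS fit would be pulled toward the outlier.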
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".
The data generated for the study can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprises three files, "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects, respectively. Here is a description of the datasets:
To plot the results and graphics in the article, there is a Jupyter Notebook, "Experiment.ipynb". It is initially configured to use the data in the "datasets" folder.
For replication purposes, the datasets containing recent projects from GitHub can be re-generated. To do so, install the dependencies from the "requirements.txt" file in the virtual environment, add GitHub tokens in the "./token" file, re-define or keep as-is the paths declared in the constants (variables written in caps) in the main method, and finally run the "main.py" script. The source code scanner SourceMeter for Windows is already installed in the project. If a new release becomes available, or if the tool needs to be run on a different OS, it can be replaced in the "./Sourcemeter/tool" directory.
The script comprises 5 steps:
In this work we apply and expand on a recently introduced outlier detection algorithm that is based on an unsupervised random forest. We use the algorithm to calculate a similarity measure for stellar spectra from the Apache Point Observatory Galactic Evolution Experiment (APOGEE). We show that the similarity measure traces non-trivial physical properties and contains information about complex structures in the data. We use it for visualization and clustering of the dataset, and discuss its ability to find groups of highly similar objects, including spectroscopic twins. Using the similarity matrix to search the dataset for objects allows us to find objects that are impossible to find using their best fitting model parameters. This includes extreme objects for which the models fail, and rare objects that are outside the scope of the model. We use the similarity measure to detect outliers in the dataset, and find a number of previously unknown Be-type stars, spectroscopic binaries, carbon-rich stars, young stars, and a few that we cannot interpret. Our work further demonstrates the potential for scientific discovery when combining machine learning methods with modern survey data. Cone search capability for table J/MNRAS/476/2117/apogeenn (Nearest neighbors APOGEE IDs)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset captures an in-depth look at the environmental conditions of the underwater world off Southern California's coast. It provides invaluable information related to spatial risk variation, such as oxygen exposure levels, depths, and habitat criteria for 53 species of benthic and epibenthic megafauna recorded during the three-year study. These data will provide insight into aquatic life dynamics and may help generate improved management strategies for protecting these vital species. Moreover, given the role that the oceans play within our planet's fragile ecosystem, a proper understanding of their condition could lead to greater marine sustainability in the long term. Ultimately, this dataset may help answer questions about how ocean life is responding to intense human activity and its effects on today's seaside communities.
1) Download the dataset: The dataset contains two .csv files, each containing data from the three-year study on oxygen exposure for benthic and epibenthic megafauna off the coast of San Diego in Southern California. Download both files to your computer and save them for further analysis.
2) Familiarize yourself with the datasets: Each file includes very detailed information about a particular variable related to the study (for example, SpeciesMetadata contains species-level information on the 53 species of benthic and epibenthic megafauna). Read through each data sheet carefully to gain a better understanding of what is included in each column.
3) Clean up any outliers or missing values: Once you understand which columns are important for your analysis, begin cleaning up any outliers or missing values present in your dataset. This is an important step, as it helps ensure that further analysis is performed accurately.
4) Choose an appropriate visualization method: Depending on the results you want to show from your analysis, choose an appropriate visualization method (e.g., bar plot, scatterplot). Also consider whether labeling, such as coloring by category, would improve the legibility of the figures you produce during the exploratory data analysis stage.
5) Choose a statistical test suitable for this type of project: Once all your visuals have been produced, interpret the results using statistical tests chosen according to how many categorical variables are present in the data set (e.g., t-test or ANOVA). Understand key outputs such as p-values so the experiment can effectively conclude whether there are significant differences between treatments when comparing distributions among the samples/populations studied here. Be sure to choose sample sizes and tests that provide adequate statistical power.
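The outlier-cleaning step above is often done with Tukey's IQR fences; a pure-Python sketch with made-up numbers:

```python
def iqr_bounds(values, k=1.5):
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are treated
    as outliers. Quartiles are computed by linear interpolation."""
    s = sorted(values)
    def q(p):
        idx = p * (len(s) - 1)
        lo = int(idx)
        return s[lo] + (idx - lo) * (s[min(lo + 1, len(s) - 1)] - s[lo])
    q1, q3 = q(0.25), q(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values, not taken from the actual dataset.
data = [4.1, 3.9, 4.0, 4.2, 4.1, 12.0, 3.8]
lo, hi = iqr_bounds(data)
cleaned = [v for v in data if lo <= v <= hi]
```

Whether a flagged value should actually be dropped (rather than inspected) depends on the variable and the analysis.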
- Comparing the effects of different environmental factors (depth, temperature, salinity etc.) on depth-specific distributions of oxygen and benthic megafauna.
- Identifying and mapping vulnerable areas for benthic species based on environmental factors and oxygen exposure patterns.
- Developing models to predict underlying spatial risk variables for endangered species to inform conservation efforts in the study area
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ROVObservationData.csv
File: SpeciesMetadata.csv
If you use this dataset in your research, please credit the original authors.
Gene expression data have been presented as non-normalized (2^−Ct × 10^9) values in all but the last six rows; this allows for the back-calculation of the raw threshold cycle (Ct) values so that interested individuals can readily estimate the typical range of expression of each gene. Values representing aberrant levels for a particular parameter (z-score > 2.5) have been highlighted in bold. When there was a statistically significant difference (Student’s t-test, p < 0.05) between the outlier and non-outlier averages for a parameter (instead using normalized gene expression data), the lower of the two values has been underlined. All samples hosted Symbiodinium of clade C only unless noted otherwise. The mean Mahalanobis distance did not differ between Pocillopora damicornis and P. acuta (Student’s t-test, p > 0.05). SA = surface area. GCP = genome copy proportion. Ma Dis = Mahalanobis distance. “.” = missing data.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach (1) by allowing for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
Usage notes
SourceCode_DamModel: Source code used when doing type I error testing of the balanced or unbalanced half-sib dam model (DamModel_WorkingCopy.R).
SireModel_WorkingCopy: Source code used when doing type I error testing of the unbalanced half-sib sire model.
TypeI_ErrorTest_DamBalanced: R code to run the error testing of the balanced half-sib dam model over 1000 replicate datasets.
TypeI_ErrorTest_DamUnbalanced: R code to run the error testing of the unbalanced half-sib dam model over 1000 replicate datasets.
TypeI_ErrorTest_SireUnbalanced: R code to run the error testing of the unbalanced half-sib sire model over 1000 replicate datasets.
NemoReplicates: Zipped file containing the 1000 simulated replicate datasets from Nemo used for type I error testing.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains 157 rows of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.
The dataset simulates a university admission record system, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw data, offering hands-on experience in data wrangling.
The dataset contains the following columns:
- Name: Student's first name (Pakistani names).
- Age: Age of the student (some outliers and missing values).
- Gender: Gender (Male/Female).
- Admission Test Score: Score obtained in the admission test (includes outliers and missing values).
- High School Percentage: Student's high school final score percentage (includes outliers and missing values).
- City: City of residence in Pakistan.
- Admission Status: Whether the student was accepted or rejected.
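A sketch of typical cleaning steps for such a table (hypothetical rows; the real file's names and values differ):

```python
# Hypothetical rows mirroring the columns above.
rows = [
    {"Name": "Ali", "Age": 19, "Admission Test Score": 78.0},
    {"Name": "Ali", "Age": 19, "Admission Test Score": 78.0},    # duplicate
    {"Name": "Sara", "Age": None, "Admission Test Score": 85.0}, # missing Age
    {"Name": "Hamza", "Age": 250, "Admission Test Score": 60.0}, # unrealistic Age
]

# 1. Drop exact duplicates while preserving order.
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Treat impossible ages as missing, then impute with the median age.
for r in deduped:
    if r["Age"] is not None and not (15 <= r["Age"] <= 60):
        r["Age"] = None
ages = sorted(r["Age"] for r in deduped if r["Age"] is not None)
median_age = ages[len(ages) // 2]
for r in deduped:
    if r["Age"] is None:
        r["Age"] = median_age
```

The plausible-age range and the choice of median imputation are assumptions for illustration; in practice they should be justified by the data.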
https://spdx.org/licenses/CC0-1.0.html
Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.
Methods: All the raw data are obtained from other publications, as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.
Dataset | Species | Genes | Download
Plants | 104 | 852 | DOI 10.1186/2047-217X-3-17
Mammals | 37 | 424 | DOI 10.13012/C5BG2KWG
Insects | 144 | 1478 | http://esayyari.github.io/InsectsData
Cannon | 78 | 213 | DOI 10.5061/dryad.493b7
Rouse | 26 | 393 | DOI 10.5061/dryad.79dq1
Frogs | 164 | 95 | DOI 10.5061/dryad.12546.2
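The k-shrink objective can be illustrated for k = 1 with a brute-force toy over pairwise leaf distances (the paper's polynomial-time algorithm is far more efficient; this is only to show what is being optimized):

```python
def best_leaf_to_drop(dist):
    """For leaves with pairwise path distances `dist` (a dict mapping
    frozenset({a, b}) -> length), find the single leaf whose removal
    shrinks the tree diameter the most, i.e. the k = 1 case of the
    k-shrink problem. Brute-force sketch."""
    leaves = set()
    for pair in dist:
        leaves |= pair
    def diameter(keep):
        return max(d for pair, d in dist.items() if pair <= keep)
    full = diameter(leaves)
    best = min(leaves, key=lambda x: diameter(leaves - {x}))
    return best, full - diameter(leaves - {best})
```

A leaf on an artificially long branch dominates the diameter, so removing it gives the largest reduction.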
http://opendatacommons.org/licenses/dbcl/1.0/
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, refer to [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10)
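One simple way to surface the "few excellent or poor wines" mentioned above is a modified z-score per feature (a generic sketch with invented numbers, not the method of Cortez et al.):

```python
def robust_z_flags(rows, threshold=3.5):
    """Flag samples whose modified z-score (based on median and MAD)
    exceeds `threshold` on any feature. Rows are tuples of numeric
    physicochemical features; data here would be e.g. (fixed acidity,
    volatile acidity, ...)."""
    n_feat = len(rows[0])
    flags = [False] * len(rows)
    for f in range(n_feat):
        col = [r[f] for r in rows]
        med = sorted(col)[len(col) // 2]
        mad = sorted(abs(v - med) for v in col)[len(col) // 2]
        if mad == 0:
            continue              # feature carries no spread to judge by
        for i, v in enumerate(col):
            if abs(0.6745 * (v - med) / mad) > threshold:
                flags[i] = True
    return flags
```

The 0.6745 factor makes the MAD comparable to a standard deviation under normality; 3.5 is a commonly cited default cutoff.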
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality. To get both datasets (red and white Vinho Verde wine samples from the north of Portugal), please visit the link above.
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
We Kagglers can apply several machine-learning algorithms to determine which physicochemical properties make a wine 'good'!
The Center for Marine Applied Research (CMAR) provides high-resolution data on ocean variables along the Nova Scotia coast through its Coastal Monitoring Program. The program was launched by the Nova Scotia Department of Fisheries and Aquaculture (NSDFA) in 2015; in 2019, CMAR assumed responsibility for the program and expanded its scope and mandate. Through the wave branch of the program, CMAR processes and publishes wave data collected by NSDFA. The data are measured by acoustic Doppler current profilers (ADCPs) mounted on the seafloor for 1 to 3 months. An ADCP uses sound to measure wave variables, including height, period, and direction. NSDFA uses several ADCP instrument models, including the Sentinel V20, Sentinel V50, Sentinel V100, and Workhorse Sentinel 600 kHz. The data are processed using the Velocity and WavesMon4 software, then compiled and formatted for publication with CMAR's waves R package. An automated "gross range" test was applied to the data to identify statistical outliers. Each data point received a flag value of "pass", "suspect/of interest", or "fail". Observations flagged as "pass" passed the test (i.e., were not considered outliers) and may be included in analyses. Observations flagged as "fail" were considered outliers and should be excluded from most analyses. Observations flagged as "suspect/of interest" were either of poor quality or highlight an unusual event, and should be carefully examined before being included in analyses. The flags should be used only as a guide, and data users are responsible for assessing data quality before use in any analysis. The dataset "Nova Scotia Current and Wave Data: Deployment Information" on this portal shows all locations with available wave data collected through CMAR's Coastal Monitoring Program (https://data.novascotia.ca/fishing-and-aquaculture/nova-scotia-current-and-wave-deployment-for/uban-q9i2/about_data). Summary reports for each county are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Data collection and retrieval are ongoing; datasets and reports may be revised pending ongoing data collection and analyses. If you have accessed Coastal Monitoring Program data, CMAR would appreciate your anonymous feedback: https://docs.google.com/forms/d/e/1faipqlse3td6umrsvvknql13vvmjipckci2ctonjsgn7_g-4c-tktuw/Viewform. Please acknowledge the Center for Marine Applied Research in any published material that uses these data. Contact info@cmar.ca for more information.
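The automated "gross range" test described above can be sketched as a simple threshold check (the threshold values here are placeholders, not CMAR's operational limits):

```python
def gross_range_flags(values, fail_lo, fail_hi, suspect_lo, suspect_hi):
    """Gross-range QC: each observation gets 'pass', 'suspect/of interest',
    or 'fail'. Values outside the hard (fail) limits are flagged 'fail';
    values inside the hard limits but outside the narrower suspect limits
    are flagged 'suspect/of interest'."""
    flags = []
    for v in values:
        if v < fail_lo or v > fail_hi:
            flags.append("fail")
        elif v < suspect_lo or v > suspect_hi:
            flags.append("suspect/of interest")
        else:
            flags.append("pass")
    return flags
```

As the description notes, the flags are a guide only; "suspect/of interest" observations deserve manual review before inclusion.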
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: The "Cucumber Disease Detection" project aims to develop a machine learning model for the automatic detection of diseases in cucumber plants. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of images of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images were gathered from agricultural areas using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA) The dataset was examined using visualizations such as scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
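The train-test split step can be sketched as follows (the ratio and seed are illustrative, not the project's actual settings):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle indices and split `rows` into train and test sets.
    A fixed seed keeps the split reproducible across runs."""
    idx = list(range(len(rows)))
    rng = random.Random(seed)
    rng.shuffle(idx)
    cut = int(len(rows) * (1 - test_frac))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test
```

For image classification, a stratified split (preserving the healthy/diseased ratio in both sets) is usually preferable to a plain random split.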
Model Development The CNN architecture is defined by its layers, the number of units per layer, and the activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training During training, the model was fed the prepared dataset over a number of epochs. The loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.
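The early-stopping logic mentioned above can be sketched framework-independently (the patience value is illustrative):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch (0-based) at which training would halt: stop once
    the validation loss has not improved for `patience` epochs. The best
    checkpoint is the one saved at the epoch of the lowest loss."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch          # halt here; restore weights from best_epoch
    return len(val_losses) - 1
```

Deep learning frameworks ship equivalents (e.g. an early-stopping callback), but the underlying rule is this simple.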
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
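The listed metrics follow directly from confusion-matrix counts (the counts below are illustrative, not the project's results):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts
    (true/false positives and negatives) for a binary classifier."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)        # of predicted diseased, how many were
    recall = tp / (tp + fn)           # of truly diseased, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For multi-class disease labels, the same quantities are computed per class and then macro- or micro-averaged.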
Results and Discussion Key project findings include model performance and disease-detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project, along with the methods used to solve them.
Conclusion A recap of the project's key learnings; the project's importance to early disease detection in agriculture is highlighted; future enhancements and potential research directions are suggested.
References Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
**Dataset Overview** The Titanic dataset is a widely used benchmark dataset for machine learning and data science tasks. It contains information about passengers who boarded the RMS Titanic in 1912, including their age, sex, social class, and whether they survived the sinking of the ship. The dataset is divided into two main parts:
train.csv: This file contains information about the 891 passengers used to train machine learning models. It includes the following features:

- PassengerId: A unique identifier for each passenger
- Survived: Whether the passenger survived (1) or not (0)
- Pclass: The passenger's social class (1 = Upper, 2 = Middle, 3 = Lower)
- Name: The passenger's name
- Sex: The passenger's sex (male or female)
- Age: The passenger's age
- SibSp: The number of siblings or spouses aboard the ship
- Parch: The number of parents or children aboard the ship
- Ticket: The passenger's ticket number
- Fare: The fare the passenger paid
- Cabin: The passenger's cabin number
- Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

test.csv: This file contains information about 418 passengers held out from training. It includes the same features as train.csv, but does not include the Survived label. The goal of a machine learning model is to predict whether each passenger in test.csv survived.
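To make the schema concrete, here is a tiny synthetic sample with the same columns as train.csv (the rows below are illustrative, not taken from the real file):

```python
import pandas as pd

# Three synthetic rows following the train.csv schema described above.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss Anna"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# The Survived column is the label a model would learn to predict.
survival_rate = train["Survived"].mean()
```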
**Data Preparation** Before using the Titanic dataset for machine learning tasks, it is important to perform some data preparation steps. These steps may include:
- Handling missing values: Some of the features in the dataset have missing values. These values can be imputed or removed, depending on the specific task.
- Encoding categorical variables: Some of the features, such as Pclass, Sex, and Embarked, are categorical. These variables need to be encoded numerically before they can be used by machine learning algorithms.
- Scaling numerical variables: Some of the features, such as Age and Fare, are numerical. These variables may need to be scaled so that they are on a comparable scale.
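A minimal sketch of these three steps with pandas, on a small synthetic sample; the imputation and encoding choices here are illustrative defaults, not the only reasonable ones:

```python
import pandas as pd

# Small synthetic sample with the kinds of gaps found in the real data.
df = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", "S", None],
})

# Handle missing values: median for Age, most frequent value for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode categorical variables as numeric indicator columns.
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)

# Scale numerical variables to zero mean and unit variance (z-score).
for col in ["Age", "Fare"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```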
**Data Visualization**

Data visualization can be a useful tool for exploring the Titanic dataset and gaining insights into the data. Some common data visualization techniques that can be used with the Titanic dataset include:
- Histograms: visualize the distribution of numerical variables, such as Age and Fare.
- Scatter plots: visualize the relationship between two numerical variables.
- Box plots: visualize the distribution of a numerical variable across different categories, such as Pclass and Sex.
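For example, the three plot types above can be produced with matplotlib; the sample values are synthetic, and the figure is written to a file since no display is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, not a window
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative synthetic sample, not the real passenger records.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.07, 11.13, 30.07],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["Age"], bins=5)               # histogram: Age distribution
axes[0].set_title("Age")
axes[1].scatter(df["Age"], df["Fare"])        # scatter: Age vs. Fare
axes[1].set_title("Age vs. Fare")
df.boxplot(column="Fare", by="Pclass", ax=axes[2])  # box plot: Fare by class
fig.savefig("titanic_eda.png")
```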
**Machine Learning Tasks**

The Titanic dataset can be used for a variety of machine learning tasks, including:
- Classification: The most common task is to use the train.csv file to train a machine learning model to predict whether each passenger in the test.csv file survived.
- Regression: The dataset can also be used to train a model to predict a passenger's fare based on their other features.
- Anomaly detection: The dataset can also be used to identify anomalies, such as passengers who are outliers in terms of their age, social class, or other features.
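For the classification task, a deliberately simple rule-based baseline (predict survival from Sex alone) is often used as a sanity check before training a real model. A sketch on a synthetic labeled sample:

```python
import pandas as pd

# Synthetic labeled sample standing in for train.csv.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})

# Baseline rule: predict that female passengers survived, males did not.
pred = (train["Sex"] == "female").astype(int)
accuracy = (pred == train["Survived"]).mean()  # fraction of correct predictions
```

Any trained model should beat this kind of one-feature rule; if it does not, the pipeline likely has a bug.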
The detection of genomic regions involved in local adaptation is an important topic in current population genetics. There are several detection strategies available depending on the kind of genetic and demographic information at hand. A common drawback is the high risk of false positives. In this study we introduce two complementary methods for the detection of divergent selection from populations connected by migration. Both methods have been developed with the aim of being robust to false positives. The first method combines haplotype information with inter-population differentiation (FST). Evidence of divergent selection is concluded only when both the haplotype pattern and the FST value support it. The second method is developed for independently segregating markers, i.e., when no haplotype information is available. In this case, the power to detect selection is attained by developing a new outlier test based on detecting a bimodal distribution. The test computes the FST outliers and then assumes that those of interest would have a different mode. We demonstrate the utility of the two methods through simulations and the analysis of real data. The simulation results showed power ranging from 60–95% in several of the scenarios whilst the false positive rate was controlled below the nominal level. The analysis of real samples consisted of phased data from the HapMap project and unphased data from intertidal marine snail ecotypes. The results illustrate that the proposed methods could be useful for detecting locally adapted polymorphisms. The software HacDivSel implements the methods explained in this manuscript.
Data Set Information:
The dataset was downloaded from the UCI Machine Learning Repository.
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine; for details, see the reference [Cortez et al., 2009]. Due to privacy and logistics issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, it is not certain that all input variables are relevant, so it could be interesting to test feature selection methods.
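One simple way to flag such wines is a z-score rule on the quality score. The scores below are made up, and the 1.8 threshold is tuned to this tiny sample rather than being a recommended default:

```python
import pandas as pd

# Hypothetical quality scores: mostly average wines, one poor, one excellent.
quality = pd.Series([5, 6, 5, 6, 5, 7, 6, 5, 6, 3, 9])

# Standardize the scores and flag those far from the bulk of the distribution.
z = (quality - quality.mean()) / quality.std()
outliers = quality[z.abs() > 1.8]
```

On the real data, a distance- or density-based outlier detector over the physicochemical inputs would be a more principled choice than thresholding the label itself.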
The two datasets were combined and a few values were randomly removed.
Attribute Information:
For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Acknowledgements:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.