Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid or invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that best characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases into the analysis. By removing the outliers identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
**Inputs related to the analysis, for additional reference:**

1. Why do we need customer segmentation? Every customer is unique and can be targeted in different ways, which is where customer segmentation plays an important role. Segmentation helps us understand customer profiles and can inform cross-sell, upsell, activation, and acquisition strategies.

2. What is RFM segmentation? RFM segmentation is an acronym for recency, frequency, and monetary based segmentation. Recency captures when a customer last ordered: the number of days since the last purchase. For a website or an app, this could be interpreted as the last visit day or the last login time. Frequency is the number of purchases in a given period, which could be 3 months, 6 months, or 1 year; it tells us how often customers use the company's product, and the bigger the value, the more engaged the customer. Alternatively, it can be defined as the average duration between two transactions. Monetary is the total amount of money a customer spent in that given period, so big spenders (e.g., MVP or VIP customers) are differentiated from other customers. (A worked RFM computation on the transaction data appears after the dataset description below.)

3. What is LTV, and how is it defined? In the current retail world, almost every retailer promotes a subscription, and this is further used to understand the customer lifetime. Retailers can manage these customers better if they know which ones have high lifetime value. Customer lifetime value (LTV) can be defined as the monetary value of a customer relationship, based on the present value of the projected future cash flows from that relationship. LTV is an important concept in that it encourages firms to shift their focus from quarterly profits to the long-term health of their customer relationships. It is also an important metric because it represents an upper limit on spending to acquire new customers, and for this reason it is a key element in calculating the payback of advertising spend in marketing mix modelling.

4. Why do we need to predict customer lifetime value? LTV is an important building block in campaign design and marketing mix management. Although targeting models can help identify the right customers to target, LTV analysis can quantify the expected outcome of targeting in terms of revenues and profits. LTV also matters because other major metrics and decision thresholds can be derived from it. For example, LTV is naturally an upper limit on the spending to acquire a customer, and the sum of the LTVs of all of a brand's customers, known as the customer equity, is a major metric for business valuations. As with many other problems in marketing analytics and algorithmic marketing, LTV modelling can be approached from descriptive, predictive, and prescriptive perspectives.

5. How does Next Purchase Day help retailers? Our objective is to analyse when a customer will purchase products in the future, so that we can build strategy and design marketing campaigns accordingly:
a. Group 1: customers who will purchase in more than 60 days
b. Group 2: customers who will purchase in 30-60 days
c. Group 3: customers who will purchase in 0-30 days

6. What is cohort analysis, and how is it helpful? A cohort is a group of users who share a common characteristic, identified in this report by an Analytics dimension. For example, all users with the same acquisition date belong to the same cohort. The cohort analysis report lets you isolate and analyze cohort behaviour. In e-commerce, cohort analysis means monitoring your customers' behaviour based on common traits they share (the first product they bought, when they became customers, etc.) to find patterns and tailor marketing activities for the group.
Transaction data has been provided for the period 1st Jan 2019 to 31st Dec 2019. The data sets below have been provided.

Online_Sales.csv: actual orders data (point-of-sale data) at transaction level, with the variables below.
CustomerID: customer unique ID
Transaction_ID: transaction unique ID
Transaction_Date: date of transaction
Product_SKU: SKU ID - unique ID for the product
Product_Description: product description
Product_Cateogry: product category
Quantity: number of items ordered
Avg_Price: price per unit quantity
Delivery_Charges: charges for delivery
Coupon_Status: whether a discount coupon was applied

Customers_Data.csv: customer demographics.
CustomerID: customer unique ID
Gender: gender of customer
Location: location of customer
Tenure_Months: tenure in months

Discount_Coupon.csv: discount coupons given for different categories in different months.
Month: month in which the discount coupon applies
Product_Category: product categor...
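To make the RFM definition above concrete, here is a minimal Python/pandas sketch that derives recency, frequency, and monetary values per customer from a transaction table shaped like Online_Sales.csv. The snapshot date and the quartile-based scoring are illustrative assumptions, not part of the original brief.

```python
import pandas as pd

# Load the transaction-level sales data (columns as described above).
orders = pd.read_csv("Online_Sales.csv", parse_dates=["Transaction_Date"])

# Revenue per line item; Avg_Price is the price per unit quantity.
orders["Revenue"] = orders["Avg_Price"] * orders["Quantity"]

# Snapshot date: one day after the last transaction in the data (assumption).
snapshot = orders["Transaction_Date"].max() + pd.Timedelta(days=1)

# Recency: days since last purchase; Frequency: number of distinct
# transactions; Monetary: total spend in the period.
rfm = orders.groupby("CustomerID").agg(
    Recency=("Transaction_Date", lambda d: (snapshot - d.max()).days),
    Frequency=("Transaction_ID", "nunique"),
    Monetary=("Revenue", "sum"),
)

# Quartile scores 1-4; lower recency is better, so its labels are reversed.
rfm["R"] = pd.qcut(rfm["Recency"], 4, labels=[4, 3, 2, 1])
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4])
rfm["M"] = pd.qcut(rfm["Monetary"], 4, labels=[1, 2, 3, 4])
print(rfm.head())
```

Segments such as "VIP" or "at risk" can then be defined by combining the R, F, and M scores.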
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
File List
all_files.zip (md5: 05899eadd594e72909c652856b0255fa)
benthos.csv (md5: 3967c416398c06a5a1c37ee4213ed601)
barents.csv (md5: 4784e8a75f12c8d54c4e75972e5e45ff)
macro.csv (md5: ec932c53204636c4d9bc0eda978b4161)
seashore.csv (md5: d51a0a57aed6c42ed903280c8ac3a1d0)
texel.csv (md5: f06123b443e088d941dae5c8709be649)
CAbiplot.R (md5: 92afeea2ed19440632b3dac5c8a1b069)
Description
all_files.zip is an archive containing all data and code described below (6 files).
benthos.csv is a comma-separated text file containing the 'benthos' dataset used in the main body of the report. Column definitions are:
1. station id: Sx for 11 polluted stations, Rx for 2 reference stations
2–93. abundances of 92 benthic species
barents.csv is a comma-separated text file containing the 'barents' dataset. Column definitions are:
1. benthic species id (abbreviated) - 446 species
2–11. abundances at 10 sampling sites
macro.csv is a comma-separated text file containing the 'macro' dataset. Column definitions are:
1. sample id - 40 samples
2–198. abundances of macro-invertebrates from two Dutch streams
seashore.csv is a comma-separated text file containing the 'seashore' dataset. Column definitions are:
1. sample id - 126 samples
2–69. abundances of vegetation of rising seashore in Stockholm archipelago - 68 species
texel.csv is a comma-separated text file containing the 'texel' dataset. Column definitions are:
1. sample id - 285 samples
2–222. vegetation on coastal sand dune area on Texel island - 221 species
Note: to align with the original publication, samples with zero abundances have to be eliminated, as well as species that have a single abundance of 1, giving a table with 209 samples and 209 species. The data set 'fish' is not available for release at the moment.
CAbiplot.R is an R script file that produces a contribution biplot in correspondence analysis. The script assumes that the CA package has been installed from CRAN, and the data set should be in the working directory of R when running the code.
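The preprocessing note above is easy to express in code. This is a minimal pandas sketch of the described filtering, assuming texel.csv is laid out with the sample id in the first column as documented; it is an illustration, not the authors' own code.

```python
import pandas as pd

# texel.csv: column 1 is the sample id, columns 2-222 are species abundances.
texel = pd.read_csv("texel.csv", index_col=0)

# Drop samples (rows) whose abundances are all zero.
texel = texel.loc[texel.sum(axis=1) > 0]

# Drop species (columns) that occur in a single sample with abundance 1,
# i.e. a column sum of 1 coming from a single non-zero entry.
singleton = (texel.sum(axis=0) == 1) & ((texel > 0).sum(axis=0) == 1)
texel = texel.loc[:, ~singleton]

print(texel.shape)  # expected (209, 209) per the note above
```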
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Project Documentation: Cucumber Disease Detection
Introduction: The goal of the "Cucumber Disease Detection" project is to develop a machine learning model for the automatic detection of diseases in cucumber plants. This research is crucial because it tackles early disease identification in agriculture, which can increase crop yield and cut down on financial losses. A dataset of images of cucumber plants is used to train and test the model.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images were gathered from agricultural areas using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA) The dataset was examined using visualizations such as scatter plots and histograms and inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used (see the sketch below).
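As an illustration of the two steps above, here is a minimal Keras-style sketch of a small CNN with dropout and L2 regularization, trained with early stopping and model checkpoints. The input resolution, layer sizes, and all hyperparameters are illustrative assumptions; the project's actual architecture is not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers, callbacks

# Small binary classifier: healthy vs. diseased cucumber plant images.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 regularization
    layers.Dropout(0.5),                                     # dropout
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping and model checkpoints, as described above.
cbs = [
    callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]
# train_ds and val_ds are assumed to be prepared tf.data.Dataset objects:
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=cbs)
```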
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets (a metrics sketch follows the discussion below). Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
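The evaluation metrics listed above can be computed with scikit-learn; the labels and predicted probabilities below are illustrative stand-ins for the test set and the trained model's outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Illustrative ground-truth labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])
y_pred = (y_prob > 0.5).astype(int)  # 0.5 decision threshold (assumption)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```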
Results and Discussion Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.
Conclusion The project's key learnings are recapped, its importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Data Description
Water Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.
Countries/Regions: United States, Canada, Ireland, England, China.
Years Covered: 1940-2023.
Data Records: 2.82 million.

Definition of Columns
Country: name of the water-body region.
Area: name of the area in the region.
Waterbody Type: type of the water-body source.
Date: date of the sample collection (dd-mm-yyyy).
Ammonia (mg/l): ammonia concentration.
Biochemical Oxygen Demand (BOD) (mg/l): oxygen demand measurement.
Dissolved Oxygen (DO) (mg/l): concentration of dissolved oxygen.
Orthophosphate (mg/l): orthophosphate concentration.
pH (pH units): pH level of water.
Temperature (°C): temperature in Celsius.
Nitrogen (mg/l): total nitrogen concentration.
Nitrate (mg/l): nitrate concentration.
CCME_Values: calculated water quality index values using the CCME WQI model.
CCME_WQI: water quality index classification based on CCME_Values.

Data Directory Description
Category 1: Dataset
Combined Data: this folder contains two files, Combined_dataset.csv and Summary.xlsx. Combined_dataset.csv includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps such as missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. Summary.xlsx provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).
- Combined_dataset.csv
- Summary.xlsx
Country-wise Data: this folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. Summary_country.xlsx presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).
- England_dataset.csv
- Canada_dataset.csv
- USA_dataset.csv
- Ireland_dataset.csv
- China_dataset.csv
- Summary_country.xlsx
Category 2: Code
Data processing and harmonization code (e.g., language conversion, date conversion, parameter naming and unit conversion, missing value handling, WQI measurement and classification):
- Data_Processing_Harmonnization.ipynb
Code used for technical validation (e.g., assessing the data distribution, outlier detection, water quality trend analysis, and verifying the application of the dataset for the ML models):
- Technical_Validation.ipynb
Category 3: Data Collection Sources
Links to the selected dataset sources, which were used to create the dataset and are provided for further reconstruction or data formation:
- DataCollectionSources.xlsx

Original Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted Research

Abstract
Assessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is a known phenomenon, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting. The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five prominent countries: the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters spanning 1940-2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based data- and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at a cross-regional to global scale.

Note: cite this repository and the original paper when using this dataset.
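Because the CCME_Values column is derived from the CCME WQI model, a compact sketch of that index may help readers. The formula follows the published CCME WQI 1.0 definition (F1 scope, F2 frequency, F3 amplitude); the function name, data layout, and the toy objectives are illustrative assumptions, not values from this dataset.

```python
import numpy as np

def ccme_wqi(tests, objectives):
    """tests: {parameter: list of measurements};
    objectives: {parameter: (limit, 'max' or 'min')} water quality objectives.
    Returns the CCME WQI 1.0 score (0-100)."""
    failed_vars, failed_tests, total_tests, excursions = 0, 0, 0, []
    for param, values in tests.items():
        limit, kind = objectives[param]
        values = np.asarray(values, dtype=float)
        fails = values > limit if kind == "max" else values < limit
        total_tests += len(values)
        failed_tests += int(fails.sum())
        failed_vars += int(fails.any())
        for v in values[fails]:
            excursions.append(v / limit - 1 if kind == "max" else limit / v - 1)
    f1 = 100.0 * failed_vars / len(tests)    # F1: scope
    f2 = 100.0 * failed_tests / total_tests  # F2: frequency
    nse = sum(excursions) / total_tests      # normalized sum of excursions
    f3 = nse / (0.01 * nse + 0.01)           # F3: amplitude
    return 100.0 - np.sqrt(f1**2 + f2**2 + f3**2) / 1.732

# Toy example with placeholder objectives (not regulatory limits).
print(round(ccme_wqi(
    {"DO": [7.9, 6.5, 4.2], "pH": [7.1, 8.9, 7.4]},
    {"DO": (5.0, "min"), "pH": (8.5, "max")},
), 1))
```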
This dataset is a summary of the CTD profiles measured with the RV Belgica. It provides general meta-information such as the campaign code, the date of measurement, and the geographical information. An important field is the profile quality flag, which describes the validity of the data: a quality flag of 2 means the data is generally good, although some outliers can still be present; a quality flag of 4 means the data should not be trusted. One-metre binned data can be downloaded from the SeaDataNet CDI portal (enter the cruise_id in the search bar) ONLY for the good quality profiles. Full acquisition frequency datasets are available on request from BMDC.
✅ Adapted from the original dataset to enable the use of a variety of machine learning techniques
✅ Recommended Data Analysis Process - Verify data (raw) distribution - PRE-PROCESSING: missing values - PRE-PROCESSING: outliers - Data restructuring - Feature Scaling (optional) - ML - Derive insights (a minimal sketch of the pre-processing steps follows the lists below)
✅ Available machine learning techniques - Correlation Analysis - Time Series Analysis - Regression Analysis - Clustering (PCA) - Recommendation Analysis - Causal Inference Analysis
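As a minimal illustration of the recommended pre-processing steps (distribution check, missing values, outliers, feature scaling), here is a pandas/scikit-learn sketch; the file name and the simple median/IQR choices are placeholders, not prescriptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")  # placeholder file name
num_cols = df.select_dtypes("number").columns

# 1) Verify the raw data distribution.
print(df[num_cols].describe())

# 2) Missing values: median imputation (one of several options).
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 3) Outliers: drop rows with any value outside 1.5 * IQR.
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
bad = ((df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)).any(axis=1)
df = df.loc[~bad].copy()

# 4) Feature scaling (optional) before applying ML techniques.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```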
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This data set contains the daily means of flow rate and various chemical/physical properties of water in surface runoff and subsurface lateral flow, plus soil moisture, soil temperature and precipitation, for the fifteen catchments on the North Wyke Farm Platform (NWFP), for the period Jan 2012 – Dec 2023. The data set was calculated from 15-minute time step values that are available from the NWFP data portal [https://nwfp.rothamsted.ac.uk/]. Prior to calculation of the daily means, each of the variables was first screened for potentially poor quality according to the flag assigned to each time step value during the quality control (QC) process. Where data did not meet the QC criteria flag of 'Good', 'Acceptable' or 'Outlier', values were replaced with 'NA' to represent missing data. In addition, since the number of within-day missing values impacts the reliability of the statistic, means were set to 'NA' where the threshold limit for the acceptable number of daily missing values for each variable was exceeded.
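The screening and aggregation described above can be sketched in a few lines of pandas. The QC flag values and the idea of a per-variable missing-value threshold come from the description; the column names, file name, and the specific threshold are assumptions for illustration.

```python
import pandas as pd

# 15-minute time step values (layout assumed for illustration).
ts = pd.read_csv("nwfp_15min.csv", parse_dates=["datetime"])

# Screen by QC flag: anything other than 'Good', 'Acceptable' or 'Outlier'
# becomes missing data.
good = ts["qc_flag"].isin(["Good", "Acceptable", "Outlier"])
ts.loc[~good, "value"] = float("nan")

# Daily mean and daily count of missing values per catchment and variable.
ts["date"] = ts["datetime"].dt.floor("D")
daily = ts.groupby(["catchment", "variable", "date"])["value"].agg(
    mean="mean",
    n_missing=lambda s: int(s.isna().sum()),
)

# Invalidate means where too many 15-minute values are missing
# (24 of the 96 daily steps is an illustrative threshold, not the NWFP one).
daily.loc[daily["n_missing"] > 24, "mean"] = float("nan")
```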
Other data sets that complement this one are ‘NWFP flow rate and chemical/physical parameters 2012-2023', which is the daily means derived from non-adjusted data (https://doi.org/10.23637/mqrcjnur) and ‘NWFP flow rate and chemical/physical parameters 2012-2023 - QC adjusted’, which is the daily means of data that have been adjusted according to QC flag (https://doi.org/10.23637/xfudcf01).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This article introduces a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. It provides theoretical guarantees that the algorithm obtains high accuracy with high probability under certain assumptions. Moreover, it can also be used as an initialization strategy for k-means clustering. Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm when clustering a large number of clusters, and a k-means algorithm initialized by the algorithm outperforms many of the classic clustering methods in both speed and accuracy, while scaling well to large datasets such as ImageNet. Supplementary materials for this article are available online.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Details about binning data for outlier identification, sample size for fuzzy c-means cluster analysis, and modifying resistance surfaces or corridor termini to better capture focal land facets.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The detection of water quality indicators such as temperature, pH, turbidity, conductivity, and TDS involves five national standard methods. Chemically based measurement techniques may generate liquid residue, causing secondary pollution. The water quality monitoring and data analysis system can effectively address the issue that conventional methods require multiple pieces of equipment and repeated measurements. This paper analyzes the distribution characteristics of the historical data from five sensors at a specific time, displays them graphically in real time, and provides an early warning when standards are exceeded. It selects four water samples from different sections of the Li River; based on the national standard method, the average measurement errors of temperature, pH, TDS, conductivity, and turbidity are 0.98%, 2.23%, 2.92%, 3.05%, and 3.98%. It further uses the quartile method to analyze outlier data over 100,000 records, with five historical periods selected. Experimental results show the system is relatively stable in measuring temperature, pH, and TDS, with outlier proportions of 0.42%, 0.84%, and 1.24%; when turbidity and conductivity are measured, the proportions are 3.11% and 2.92%. In an experiment using 7 methods to fill outliers, the K nearest neighbor algorithm performed better than the others. The analysis of data trends, outliers, means, and extreme values assists decision-making, such as updating and maintaining equipment, addressing extreme water quality situations, and enhancing regional water quality oversight.
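For reference, the quartile (IQR) rule for flagging outliers and a K-nearest-neighbor fill of the flagged values can be sketched as follows. scikit-learn's KNNImputer is used as a standard stand-in; the paper's exact implementation, the file name, and the k value are assumptions here.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative sensor table: one column per water quality indicator.
df = pd.read_csv("li_river_sensors.csv")  # placeholder file name
cols = ["Temperature", "pH", "TDS", "Conductivity", "Turbidity"]

# Quartile method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
q1, q3 = df[cols].quantile(0.25), df[cols].quantile(0.75)
iqr = q3 - q1
outliers = (df[cols] < q1 - 1.5 * iqr) | (df[cols] > q3 + 1.5 * iqr)
print("outlier proportion per indicator:\n", outliers.mean())

# Mask outliers, then fill them from the K nearest neighbors (k=5 assumed).
df[cols] = KNNImputer(n_neighbors=5).fit_transform(df[cols].mask(outliers))
```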
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Data associated with the main analysis and figures presented in the manuscript titled: "A newly invasive species may promote dissimilarity of pest populations between organic and conventional farming systems."

The dataset titled "Bloometal_DissimilarityAnalysis_GPSPoints_DataFinal.csv" is associated with the map of site locations shown in Figure 1. Please note that GPS points have been jittered ~5 km to protect the identity of farmer collaborators. The seven datasets titled "Bloom_DissimilarityAnalysis_GAMandHGAMValuesXXX_DataFinal.csv" are files associated with the GAM and HGAM analysis accompanying Figures 2 and 3, where XXX is either the common name of a species or how these abundance data were summed across species (see manuscript for details). The two datasets titled "Bloom_DissimilarityAnalysis_ElasticTemporalBetaDiveristyValuesXXX_DataFinal.csv" are files associated with the temporal beta diversity analysis accompanying Figure 4, where XXX is the year associated with these data (see manuscript for details).

Column definitions across the files are as follows:
site.anonymous - the 27 site names, made anonymous to protect the identity of farmer collaborators and used to display the site, year, and status (see definition) in Figure 1
status - the farming practices used at the site (organic, conventional, or outlier; see manuscript for a description of the outlier site)
latitude.jittered - the latitude of the site, jittered (see description of jittering above)
longitude.jittered - the longitude of the site, jittered (see prior comment)
site.anonymous.repeatedsample - sites that were sampled in both study years, used as a random effect in the GAM and HGAM analysis
year - the study year
region - the geographic region where the site is located, used as a random effect in the HGAM analysis
variable - either the common name of a species as an abbreviation (see manuscript for details) or how data were summed across species (see prior comment)
doy - day of year when the sample was taken, used in the HGAM analysis
value - abundance of a species, or a summation across species using different approaches (see manuscript for details)
site.anonymous.status - the anonymous site name and status, used as a grouping in the temporal beta diversity analysis
date - the date the sample was taken
sm - abundance of Swede midge for the site, status, date combination in the temporal beta diversity analysis
icw - abundance of imported cabbage worm for the site, status, date combination in the temporal beta diversity analysis
diamond back moth - abundance of diamond back moth for the site, status, date combination in the temporal beta diversity analysis
cl - abundance of cabbage looper for the site, status, date combination in the temporal beta diversity analysis
ot - abundance of cabbage looper for the site, status, date combination in the temporal beta diversity analysis

Any transformations of these data and associations of figures and data with tables are described in detail in the manuscript.