Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. However, these methods can cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. To address these problems, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.
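For context, a minimal sketch of the data-level baselines the abstract compares against (random oversampling, random undersampling, and SMOTE), using scikit-learn and imbalanced-learn on synthetic stand-in data; this is not the authors' ViLa method:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic 95:5 imbalanced binary problem as a stand-in dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# The three data-level baselines referenced above.
for name, sampler in [("oversampling", RandomOverSampler(random_state=0)),
                      ("undersampling", RandomUnderSampler(random_state=0)),
                      ("SMOTE", SMOTE(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))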
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
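As a simplified sketch of the thresholding idea (not the exact published GHOST procedure, which selects the threshold by optimizing Cohen's kappa on subsets of the training data), one can sweep candidate thresholds over out-of-fold predicted probabilities, so no retraining or resampling is needed:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict

# Synthetic 90:10 stand-in for an imbalanced structure-activity data set.
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Out-of-fold probabilities for the positive (minority) class.
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# Sweep candidate thresholds and keep the one maximizing Cohen's kappa.
thresholds = np.arange(0.05, 0.55, 0.05)
best = max(thresholds,
           key=lambda t: cohen_kappa_score(y, (proba >= t).astype(int)))
print("selected decision threshold:", best)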
This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5,000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The class distribution is shown in a figure (Cifar_Imbalanced_data.png) accompanying the original dataset page.
The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.
The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.
This dataset is beneficial for: - Developing and testing strategies for handling imbalanced datasets. - Investigating the effects of class imbalance on model performance. - Comparing different machine learning algorithms' performance under class imbalance.
Usage Information:
The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder. You can load the dataset in Python using popular libraries like NumPy and PyTorch, as in the sketch below.
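A minimal loading sketch (the folder name is an assumption) that inspects the per-class counts and derives inverse-frequency class weights, one common mitigation:

from collections import Counter

import torch
from torchvision import datasets, transforms

# Directory name is hypothetical; point `root` at the extracted dataset.
train_set = datasets.ImageFolder(root="cifar_imbalanced/train",
                                 transform=transforms.ToTensor())

# Inspect per-class counts to quantify the imbalance.
counts = Counter(train_set.targets)
print({train_set.classes[c]: n for c, n in sorted(counts.items())})

# Inverse-frequency class weights for a weighted cross-entropy loss.
n = len(train_set)
weights = torch.tensor([n / counts[c] for c in range(len(train_set.classes))])
criterion = torch.nn.CrossEntropyLoss(weight=weights)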
License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.
Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem, and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.
Results: Both types of resampling procedures showed improved performances with respect to the original dataset. The oversampling procedures were found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling ones among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
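For illustration, a minimal sketch comparing the two procedures highlighted above (ADASYN and RUS) on synthetic stand-in data, with the sampler kept inside a pipeline so that only training folds are resampled; this is not the study's SEEG pipeline:

from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the imbalanced SEEG network features.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

for name, sampler in [("ADASYN", ADASYN(random_state=0)),
                      ("RUS", RandomUnderSampler(random_state=0))]:
    # Resampling inside the pipeline: applied to training folds only.
    pipe = Pipeline([("resample", sampler), ("clf", SVC())])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
    print(name, round(scores.mean(), 3))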
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for the details of the datasets.
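As a rough sketch of the per-sample random augmentation described above, assuming (time, 3) acceleration windows; time-warping is omitted and all parameter ranges are illustrative:

import numpy as np

def augment(x, rng):
    # Randomly apply one of five techniques to a (time, 3) window.
    choice = rng.integers(5)
    if choice == 0:                        # none
        return x
    if choice == 1:                        # scaling
        return x * rng.uniform(0.8, 1.2)
    if choice == 2:                        # jittering
        return x + rng.normal(0.0, 0.05, size=x.shape)
    if choice == 3:                        # permutation of segments
        segments = np.array_split(x, 4)
        rng.shuffle(segments)
        return np.concatenate(segments)
    # Random 3-D rotation via an orthogonal matrix (may include reflection).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return x @ q.T

rng = np.random.default_rng(0)
window = rng.normal(size=(256, 3))         # dummy 3-axis acceleration window
augmented = augment(window, rng)           # call once per sample per batch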
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
I wanted a highly imbalanced dataset to share with others, and LendingClub has a perfect one for us.
Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class) and far fewer observations for one or more other classes (referred to as the minority classes).
For example, in this dataset there are far more samples of fully paid borrowers than of borrowers who were not fully paid.
Full LendingClub data available from their site.
For companies like LendingClub, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes many features that make this problem more challenging.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE's number of neighbours set to 5.
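The exact CRN-SMOTE procedure is not reproduced here; as a rough, hedged illustration of the general idea of pairing SMOTE (with 5 neighbours, as in the paper) with a cluster-based noise filter:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced tabular data set.
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# Oversample the minority class with SMOTE, k_neighbors=5.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Crude cluster-based noise filter: cluster each class into two clusters
# and drop the points farthest from their cluster centre (a stand-in for
# the paper's noise reduction, not its actual algorithm).
keep = np.ones(len(y_res), dtype=bool)
for cls in np.unique(y_res):
    idx = np.where(y_res == cls)[0]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_res[idx])
    dist = np.linalg.norm(X_res[idx] - km.cluster_centers_[km.labels_], axis=1)
    keep[idx[dist > np.quantile(dist, 0.95)]] = False

X_clean, y_clean = X_res[keep], y_res[keep]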
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data.
Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.
This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).
The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects, and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).
The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the Fluxes1 experiment with RV Sarmiento de Gamboa.
The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss.
In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on the GMSC dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.
The data is an excellent case study for several key challenges in machine learning, including:
Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.
Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.
Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.
Key Features: The dataset includes a variety of anonymized transaction details:
amount: The value of the transaction.
type: The type of transaction (e.g., TRANSFER, CASH_OUT).
oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.
isFraud: The target variable, a binary flag indicating a fraudulent transaction.
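A sketch tying these points together: an engineered balance-ratio feature, class reweighting via XGBoost's scale_pos_weight, and PR-AUC evaluation. The file name and exact column names are assumptions based on the list above:

import pandas as pd
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("transactions.csv")           # hypothetical file name

# Engineered feature mentioned above: ratio of amount to origin balance
# (column names assumed from the feature list; adjust to the real schema).
df["balance_ratio"] = df["amount"] / (df["oldbalance"] + 1.0)
X = pd.get_dummies(df[["amount", "balance_ratio", "type"]], columns=["type"])
y = df["isFraud"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# scale_pos_weight rebalances the loss towards the rare fraud class.
clf = XGBClassifier(scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum(),
                    eval_metric="aucpr")
clf.fit(X_tr, y_tr)

# PR-AUC (average precision) is far more informative than accuracy here.
print("PR-AUC:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))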
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed experimental results of the different Prototype Generation strategies for k-Nearest Neighbour classification in multilabel data, addressing the particular issues of label-level imbalance and noise:
Corresponds to Section 5.1 in the manuscript.
Noisy scenarios
Study of the noise robustness capabilities of the proposed strategies.
Individual results provided for each corpus.
Statistical tests (Friedman and Bonferroni-Dunn with a significance level of p < 0.01) to assess the improvement compared to the base multilabel PG strategies (a minimal sketch of such a test follows this block).
Corresponds to Section 5.2 in the manuscript.
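A minimal sketch of the Friedman test on hypothetical per-corpus scores; the post-hoc Bonferroni-Dunn comparison could follow, e.g. with the scikit-posthocs package:

import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical score matrix (rows: corpora, columns: PG strategies).
rng = np.random.default_rng(0)
scores = rng.uniform(0.6, 0.9, size=(12, 4))

# Friedman test across the strategies (one sample per column).
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")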
Results ignoring the Editing stage
Assessment of the relevance of the Editing stage in the general pipeline.
Individual results provided for each corpus.
Corresponds to Section 5.3 in the manuscript.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Overview
This dataset contains a diverse collection of 72,000+ high-quality images of fruits and vegetables, carefully curated for machine learning and deep learning applications. It includes 50 unique categories of fruits and vegetables, such as apples, avocados, carrots, mangoes, broccoli, and more. The dataset is perfect for tasks like classification, object detection, image recognition, and educational purposes.
Key Features
Total Images: 72,000+
Image Dimensions: 128x128 pixels (uniform size for consistency and ease of processing). There are also other photos with bigger resolutions.
Classes: 50 categories of fruits and vegetables, including: Apple, Avocado, Banana, Beetroot, Blackberry, Blueberry, Broccoli, Cabbage, Capsicum, Carrot, Cauliflower, Chilli Pepper, Corn, Cucumber, Dates, Dragonfruit, Eggplant, Fig, Garlic, Ginger, Grapes, Guava, Jalapeno, Kiwi, Lemon, Lettuce, Mango, Mushroom, Okra, Olive, Onion, Orange, Paprika, Peanuts, Pear, Peas, Pineapple, Pomegranate, Potato, Pumpkin, Radish, Rambutan, Soy Beans, Spinach, Strawberry, Sweetcorn, Sweet Potato, Tomato, Turnip, Watermelon.
Split: The dataset is divided into training, validation, and test sets, making it ready for machine learning workflows.
Class Imbalance: Not all categories contain the same number of images, making it suitable for testing class imbalance handling techniques in machine learning.
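For instance, rare classes can be oversampled at batch time with a weighted sampler instead of duplicating files (the folder name below is an assumption):

from collections import Counter

from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

# Folder name is hypothetical; adjust to the dataset's actual layout.
train_set = datasets.ImageFolder("fruits_vegetables/train",
                                 transform=transforms.ToTensor())

# Sample each image with probability inversely proportional to its
# class frequency, so mini-batches are roughly balanced.
counts = Counter(train_set.targets)
weights = [1.0 / counts[t] for t in train_set.targets]
sampler = WeightedRandomSampler(weights, num_samples=len(weights))
loader = DataLoader(train_set, batch_size=64, sampler=sampler)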
Why Use This Dataset?
Realistic Data Distribution: With varying volumes of data across categories, the dataset provides a realistic challenge for building robust models that can generalize well.
Preprocessed and Ready-to-Use: All images are resized to 128x128 pixels, saving you preprocessing time.
Diverse Applications: Ideal for fruit and vegetable classification, agriculture-related AI models, health-tech solutions, and educational tools.
Large Scale: With over 72,000 images, the dataset is suitable for training deep learning models with high accuracy.
Applications
Image Classification: Build AI models to classify fruits and vegetables.
Health-Tech Solutions: Use the dataset to develop apps for identifying fruits/vegetables for dietary planning.
Agricultural Technology: Enhance crop identification systems or supply chain management tools.
Education: Provide students and researchers with a practical dataset to learn machine learning techniques.
Licensing and Usage
This dataset is free to use for any purpose, including research, education, and commercial projects.
Acknowledgments
This dataset was created with the goal of advancing AI applications in food technology, agriculture, and education. We hope it helps you build impactful machine learning solutions!
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Retrieved from Mendeley Data on 16-Dec-2024: https://data.mendeley.com/datasets/x8ygrw87jw/1
This dataset comprises vital information on potential cerebral stroke patients, including personal data (e.g., age and gender) and disease history (e.g., hypertension and heart disease), which was collected from HealthData.gov by Liu, Fan & Wu (2019) during their study titled 'A hybrid machine learning approach to cerebral stroke prediction based on an imbalanced medical dataset'. The data collection prioritized physiological indicators over complex medical monitoring to minimize diagnosis expenses.
This cerebral stroke dataset records information from 43400 potential patients, comprising 12 attributes with various data types.
The target variable, 'stroke', is categorized into '0' and '1', representing 'no stroke' and 'stroke', respectively. It is a categorical variable, making the problem a binary classification task. This dataset includes 783 occurrences of stroke, which account for 1.18% of the total, resulting in a highly imbalanced dataset. This imbalance reflects actual clinical practice, where most medical datasets suffer from class imbalance by nature.
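A hedged sketch of a cost-sensitive baseline on such data, assuming a hypothetical cerebral_stroke.csv with the 'stroke' target described above:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("cerebral_stroke.csv")    # hypothetical file name
X = pd.get_dummies(df.drop(columns=["stroke"])).fillna(0)
y = df["stroke"]

# Stratify so the 1.18% positive rate is preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive baseline: class weights inversely proportional to frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))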
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.
Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. ones created with a stratified K-fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide links to the images in bash scripts that download the images. Another bash script re-organises the images into sub-folders with a maximum of 1000 images in each folder.
Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, and private events. Images were annotated with a binary label denoting whether the content was deemed to be public or private. As the images are publicly available, their label is mostly public, so these datasets have a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images, which is limited in PicAlert. Further details can be found in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464
List of datasets and their original source:
Notes:
Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding re-computing them on their own or at each epoch during the training of a model (faster training).
For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.
Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.
Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)
If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on domain-specific article classification to determine whether articles contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients.
Methods: Developing a text classification method can help regulators, such as the FDA, identify facts of potential DILI for specific drugs much faster and at massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, information theory, and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies to handle the imbalance of the data.
Results: Transformers achieve the best results in cases where the distribution of classes and semantics of the test data matches the training set. But in cases of imbalanced data, simple statistical, information-theory-based models can surpass complex transformers, bringing more interpretable results, which is so important for the biomedical domain. As our results show, neural networks can achieve better results if they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution.
Discussion: Overall, transformers are a powerful architecture; however, in certain cases, such as topic classification, their usage can be redundant, and simple statistical approaches can achieve comparable results while being much faster and explainable. However, we see potential in combining results from both worlds. Development of new neural network architectures, loss functions and training procedures that bring stability to unbalanced data is a promising topic of development.
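For illustration, a minimal interpretable baseline of the kind the study compares against (TF-IDF features with Naïve Bayes); the texts and labels below are invented stand-ins, not the study's corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for abstracts labelled DILI-positive (1) or negative (0).
texts = ["hepatotoxicity was observed after drug administration",
         "the compound improved glycaemic control in mice",
         "elevated liver enzymes indicated drug-induced injury",
         "no adverse hepatic events were reported"]
labels = [1, 0, 1, 0]

# Fast, interpretable baseline: TF-IDF features with multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["patients showed signs of liver injury"]))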
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset presents a transaction data simulator of legitimate and fraudulent transactions.
A simulation is necessarily an approximation of reality. Compared to the complexity of the dynamics underlying real-world payment card transaction data, the data simulator that we present below follows a simple design.
This simple design is a choice. First, having simple rules to generate transactions and fraudulent behaviors will help in interpreting the kind of patterns that different fraud detection techniques can identify. Second, while simple in its design, the data simulator will generate datasets that are challenging to deal with.
The simulated datasets will highlight most of the issues that practitioners of fraud detection face using real-world data. In particular, they will include class imbalance (less than 1% of fraudulent transactions), a mix of numerical and categorical features (with categorical features involving a very large number of values), non-trivial relationships between features, and time-dependent fraud scenarios.
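A toy generator in the same spirit (not the published simulator; all feature names, distributions, and rates below are illustrative) shows how such properties can be produced:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Categorical features with many values and skewed transaction amounts.
df = pd.DataFrame({
    "customer_id": rng.integers(0, 500, n),
    "terminal_id": rng.integers(0, 100, n),
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=n).round(2),
    "hour": rng.integers(0, 24, n),
})

# Time-dependent fraud scenario: night-time, high-amount transactions
# are more likely to be fraudulent; the base rate stays well below 1%.
p = 0.001 + 0.02 * ((df["hour"] < 6) & (df["amount"] > 100))
df["is_fraud"] = rng.random(n) < p
print("fraud rate:", df["is_fraud"].mean())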
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Sunflower Growth Stage Image Dataset for Phenological Classification was collected from agricultural fields in Bangladesh, focusing on the identification and classification of sunflower growth stages. Images were captured directly in the field using a Redmi Note 11 smartphone, under natural daylight and varying weather conditions to reflect real-world environments. This dataset is meant to aid research in deep learning, computer vision, and plant phenology by providing data for automated classification of growth stages.
A total of 1,255 original images were gathered, each with a high resolution of 12,288 × 16,320 pixels and approximately 25 MB in size. The images are divided into five classes: Stage1 (Young_Bud) with 238 images, Stage2 (Mature_Bud) with 272 images, Stage3 (Early_Bloom) with 218 images, Stage4 (Full_Bloom) with 213 images, and Stage5 (Wilted) with 314 images. To balance the dataset for training, each class was augmented to have 500 images, resulting in a final balanced collection of 2,500 images.
Validation of the dataset was carried out by a Sub-Assistant Agriculture Officer from the Department of Agricultural Extension (DAE), Bangladesh, ensuring its reliability. The data was collected at two main sites: Daffodil International University (Ashulia Campus) and Model Town Nursery, Ashulia, Bangladesh. The camera used for capturing the images was a Redmi Note 11, with 24-bit color depth, an aperture of f/1.8, and images saved in JPEG format.
Example metadata for an image shows it was taken on 2025-05-22 at 17:47 using the MediaTek Camera Application. The image's dimensions are 12,288 × 16,320 pixels at 72 dpi with 24-bit sRGB color representation. The camera details include Xiaomi as the maker, model 23117RA86G, f-stop f/1.6, exposure time 1/100 sec, ISO 200, focal length 6 mm, and auto white balance. GPS coordinates recorded were Latitude 23.5247046, Longitude 90.1918097, Altitude 34.5 m. The example image file is named IMG_20250522_174724.jpg and is a JPEG of size 26.1 MB.
Attribution Notice
This dataset also includes 24 images derived from the publicly available dataset "Sunflower Plant Health and Growth Stage Image Dataset for Agricultural Machine Learning Applications": Sagor, Saifuddin; Hossan, Md. Faysal; Ahmed, Faruk; Reyad, Md. Zamirul Islam (2025), Mendeley Data, V1, doi: 10.17632/y3ygk98ngr.1
These images were incorporated because the number of collected field images was insufficient for the Stage4 (Full_Bloom) Class. After inclusion, a portion of these images was further augmented to increase the dataset size and maintain class balance. Any modifications or augmentations applied to the derived images are the responsibility of the present authors.
The original dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains synthetic credit card transaction data designed for fraud detection and machine learning research. With over 6.3 million transactions, it provides a realistic simulation of financial transaction patterns including both legitimate and fraudulent activities.
This is a synthetic dataset generated to simulate credit card transaction behavior. The data represents financial transactions over a 30-day period (743 hours) with various transaction types including payments, transfers, cash-outs, debits, and cash-ins.
The dataset is specifically designed for: - Training and testing fraud detection models - Anomaly detection research - Binary classification tasks - Imbalanced learning scenarios - Financial machine learning applications
This dataset exhibits significant class imbalance with only 0.13% fraudulent transactions. This mirrors real-world fraud detection scenarios where fraudulent transactions are rare. Consider using techniques such as: - SMOTE (Synthetic Minority Over-sampling Technique) - Undersampling of majority class - Cost-sensitive learning - Ensemble methods - Anomaly detection algorithms
This dataset is well-suited for: - Logistic Regression - Random Forest - Gradient Boosting (XGBoost, LightGBM, CatBoost) - Neural Networks - Isolation Forest - Autoencoders - Support Vector Machines
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
# Display basic information
print(df.info())
print(df.head())
# Check fraud distribution
print(df['isFraud'].value_counts())
# Visualize fraud distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='isFraud')
plt.title('Distribution of Fraud vs Legitimate Transactions')
plt.xlabel('Is Fraud (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()
# Transaction type distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='type', hue='isFraud')
plt.title('Transaction Types by Fraud Status')
plt.xticks(rotation=45)
plt.show()
This is a static dataset with no planned future updates. It serves as a benchmark for fraud detection research and model development.
This dataset is made available under the MIT License for educational and research purposes in the field of fraud detection and financial machine learning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The development of toxicity classification models using the ToxCast database has been extensively studied. Machine learning approaches are effective in identifying the bioactivity of untested chemicals. However, ToxCast assays differ in the amount of data and degree of class imbalance (CI). Therefore, the resampling algorithm employed should vary depending on the data distribution to achieve optimal classification performance. In this study, the effects of CI and data scarcity (DS) on the performance of binary classification models were investigated using ToxCast bioassay data. An assay matrix based on CI and DS was prepared for 335 assays with biologically intended target information, and 28 CI assays and 3 DS assays were selected. Thirty models established by combining five molecular fingerprints (i.e., Morgan, MACCS, RDKit, Pattern, and Layered) and six algorithms [i.e., gradient boosting tree, random forest (RF), multi-layered perceptron, k-nearest neighbor, logistic regression, and naive Bayes] were trained using the selected assay data set. Of the 30 trained models, MACCS–RF showed the best performance and thus was selected for analyses of the effects of CI and DS. Results showed that recall and F1 were significantly lower when training with the CI assays than with the DS assays. In addition, hyperparameter tuning of the RF algorithm significantly improved F1 on CI assays. This study provided a basis for developing a toxicity classification model with improved performance by evaluating the effects of data set characteristics. This study also emphasized the importance of using appropriate evaluation metrics and tuning hyperparameters in model development.
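As a minimal, hedged sketch of the best-performing pairing reported above (MACCS fingerprints with a random forest, tuned with F1 as the selection metric); the SMILES strings and labels are toy stand-ins, not ToxCast data:

import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy SMILES with made-up activity labels, standing in for an assay set.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "c1ccncc1", "CCCC"]
labels = [0, 1, 0, 0, 1, 0]

# MACCS keys (166-bit structural fingerprints), as in the MACCS-RF model.
X = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)))
              for s in smiles])

# Hyperparameter tuning with F1 as the selection metric, which the study
# found important for class-imbalanced assays.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]},
                    scoring="f1", cv=2)
grid.fit(X, labels)
print(grid.best_params_)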
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Burmese Grape Leaf Disease Dataset comprises 3,103 high-quality images categorized into five distinct classes representing various conditions of grapevine leaves. This dataset is curated to support machine learning, deep learning, and computer vision-based applications for automated plant disease recognition and classification. Each image captures clear visual indicators relevant to the health status of the leaf, aiding in effective feature extraction and model training.
Data Collection Details: Captured Using: 1. Realme 8 (64 MP, f/1.79 aperture) 2. Redmi Note 7 Pro Max (48 MP, f/1.79 aperture)
Data Source Locations: 1. Toponer Lotkon Bagan, Kaligonj-Nagori Road, Nagarvala (Latitude: 23.88658723621705, Longitude: 90.47780500780843) 2. Itakhola Bus Stand, Narsingdi (Latitude: 23.980154076764684, Longitude: 90.7332739352483)
Number of Images: 1. Healthy: 1006 2. Anthracnose (Brown Spot): 447 3. Insect Damage: 990 4. Powdery Mildew: 296 5. Leaf Spot (Yellow): 364
Data Augmentation Techniques: To enhance model generalizability and address data imbalance, the dataset was augmented using the following techniques: 1. Brightness adjustment 2. Contrast enhancement 3. Rotation (random angles) 4. Shear transformation 5. Zoom-in and zoom-out scaling
Augmented Images (15,515 Images): 1. Healthy: 1006*5 = 5,030 2. Anthracnose (Brown Spot): 447*5 = 2,235 3. Insect Damage: 990*5 = 4,950 4. Powdery Mildew: 296*5 = 1,480 5. Leaf Spot (Yellow): 364*5 = 1,820
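A possible torchvision recreation of the five listed techniques (the authors' exact parameters are not specified in this description, so the values below are illustrative):

from torchvision import transforms

# Random augmentations approximating the five techniques listed above.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3),               # brightness
    transforms.ColorJitter(contrast=0.3),                 # contrast
    transforms.RandomRotation(degrees=30),                # rotation
    transforms.RandomAffine(degrees=0, shear=10),         # shear
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)), # zoom in/out
    transforms.ToTensor(),
])
# Pass `augment` as the transform of an ImageFolder/Dataset to apply it
# on the fly during training.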
Key Applications: 1. Automated Disease Detection: Used to train intelligent systems capable of identifying leaf diseases in real time. 2. Precision Viticulture: Enables AI-based monitoring for better vineyard management and targeted treatment. 3. Computer Vision Research: Provides a benchmark for evaluating classification and segmentation models. 4. Transfer Learning & Mobile Deployment: Suitable for fine-tuning pre-trained CNNs and deploying lightweight models on smartphones and edge devices. 5. Explainable AI in Agriculture: Ideal for interpretability research using saliency maps and XAI tools. 6. Academic and Industrial Benchmarking: Can be used in competitions, thesis projects, or commercial AI prototypes for crop health monitoring.