Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. However, these methods can cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. To address these problems, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.
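For context, a minimal sketch of the data-level baselines the abstract compares against (random oversampling, random undersampling, and SMOTE), using scikit-learn and imbalanced-learn on synthetic stand-in data; this is not the authors' ViLa method:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic 95:5 imbalanced binary problem as a stand-in dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# The three data-level baselines referenced above.
for name, sampler in [("oversampling", RandomOverSampler(random_state=0)),
                      ("undersampling", RandomUnderSampler(random_state=0)),
                      ("SMOTE", SMOTE(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))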
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
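As a simplified sketch of the thresholding idea (not the exact published GHOST procedure, which selects the threshold by optimizing Cohen's kappa on subsets of the training data), one can sweep candidate thresholds over out-of-fold predicted probabilities, so no retraining or resampling is needed:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict

# Synthetic 90:10 stand-in for an imbalanced structure-activity data set.
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Out-of-fold probabilities for the positive (minority) class.
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# Sweep candidate thresholds and keep the one maximizing Cohen's kappa.
thresholds = np.arange(0.05, 0.55, 0.05)
best = max(thresholds,
           key=lambda t: cohen_kappa_score(y, (proba >= t).astype(int)))
print("selected decision threshold:", best)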
This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5,000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The class distribution is shown in a figure (Cifar_Imbalanced_data.png) accompanying the original dataset page.
The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.
The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.
This dataset is beneficial for: - Developing and testing strategies for handling imbalanced datasets. - Investigating the effects of class imbalance on model performance. - Comparing different machine learning algorithms' performance under class imbalance.
Usage Information:
The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder. You can load the dataset in Python using popular libraries like NumPy and PyTorch, as in the sketch below.
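A minimal loading sketch (the folder name is an assumption) that inspects the per-class counts and derives inverse-frequency class weights, one common mitigation:

from collections import Counter

import torch
from torchvision import datasets, transforms

# Directory name is hypothetical; point `root` at the extracted dataset.
train_set = datasets.ImageFolder(root="cifar_imbalanced/train",
                                 transform=transforms.ToTensor())

# Inspect per-class counts to quantify the imbalance.
counts = Counter(train_set.targets)
print({train_set.classes[c]: n for c, n in sorted(counts.items())})

# Inverse-frequency class weights for a weighted cross-entropy loss.
n = len(train_set)
weights = torch.tensor([n / counts[c] for c in range(len(train_set.classes))])
criterion = torch.nn.CrossEntropyLoss(weight=weights)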
License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.
Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem, and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.
Results: Both types of resampling procedures showed improved performances with respect to the original dataset. The oversampling procedures were found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling ones among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
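For illustration, a minimal sketch comparing the two procedures highlighted above (ADASYN and RUS) on synthetic stand-in data, with the sampler kept inside a pipeline so that only training folds are resampled; this is not the study's SEEG pipeline:

from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the imbalanced SEEG network features.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

for name, sampler in [("ADASYN", ADASYN(random_state=0)),
                      ("RUS", RandomUnderSampler(random_state=0))]:
    # Resampling inside the pipeline: applied to training folds only.
    pipe = Pipeline([("resample", sampler), ("clf", SVC())])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
    print(name, round(scores.mean(), 3))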
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for the details of the datasets.
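As a rough sketch of the per-sample random augmentation described above, assuming (time, 3) acceleration windows; time-warping is omitted and all parameter ranges are illustrative:

import numpy as np

def augment(x, rng):
    # Randomly apply one of five techniques to a (time, 3) window.
    choice = rng.integers(5)
    if choice == 0:                        # none
        return x
    if choice == 1:                        # scaling
        return x * rng.uniform(0.8, 1.2)
    if choice == 2:                        # jittering
        return x + rng.normal(0.0, 0.05, size=x.shape)
    if choice == 3:                        # permutation of segments
        segments = np.array_split(x, 4)
        rng.shuffle(segments)
        return np.concatenate(segments)
    # Random 3-D rotation via an orthogonal matrix (may include reflection).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return x @ q.T

rng = np.random.default_rng(0)
window = rng.normal(size=(256, 3))         # dummy 3-axis acceleration window
augmented = augment(window, rng)           # call once per sample per batch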
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
I wanted a highly imbalanced dataset to share with others, and LendingClub has a perfect one for us.
Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class) and far fewer observations for one or more other classes (referred to as the minority classes).
For example, in this dataset there are far more samples of fully paid borrowers than of borrowers who were not fully paid.
Full LendingClub data available from their site.
For companies like LendingClub, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes many features that make this problem more challenging.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE's number of neighbours set to 5.
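The exact CRN-SMOTE procedure is not reproduced here; as a rough, hedged illustration of the general idea of pairing SMOTE (with 5 neighbours, as in the paper) with a cluster-based noise filter:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced tabular data set.
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# Oversample the minority class with SMOTE, k_neighbors=5.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Crude cluster-based noise filter: cluster each class into two clusters
# and drop the points farthest from their cluster centre (a stand-in for
# the paper's noise reduction, not its actual algorithm).
keep = np.ones(len(y_res), dtype=bool)
for cls in np.unique(y_res):
    idx = np.where(y_res == cls)[0]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_res[idx])
    dist = np.linalg.norm(X_res[idx] - km.cluster_centers_[km.labels_], axis=1)
    keep[idx[dist > np.quantile(dist, 0.95)]] = False

X_clean, y_clean = X_res[keep], y_res[keep]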
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data.
Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.
This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).
The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects, and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).
The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the Fluxes1 experiment with RV Sarmiento de Gamboa.
The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss.
In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on the GMSC dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.
The data is an excellent case study for several key challenges in machine learning, including:
Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.
Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.
Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.
Key Features: The dataset includes a variety of anonymized transaction details:
amount: The value of the transaction.
type: The type of transaction (e.g., TRANSFER, CASH_OUT).
oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.
isFraud: The target variable, a binary flag indicating a fraudulent transaction.
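A sketch tying these points together: an engineered balance-ratio feature, class reweighting via XGBoost's scale_pos_weight, and PR-AUC evaluation. The file name and exact column names are assumptions based on the list above:

import pandas as pd
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("transactions.csv")           # hypothetical file name

# Engineered feature mentioned above: ratio of amount to origin balance
# (column names assumed from the feature list; adjust to the real schema).
df["balance_ratio"] = df["amount"] / (df["oldbalance"] + 1.0)
X = pd.get_dummies(df[["amount", "balance_ratio", "type"]], columns=["type"])
y = df["isFraud"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# scale_pos_weight rebalances the loss towards the rare fraud class.
clf = XGBClassifier(scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum(),
                    eval_metric="aucpr")
clf.fit(X_tr, y_tr)

# PR-AUC (average precision) is far more informative than accuracy here.
print("PR-AUC:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))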
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed experimental results of the different Prototype Generation strategies for k-Nearest Neighbour classification in multilabel data, addressing the particular issues of label-level imbalance and noise:
Corresponds to Section 5.1 in the manuscript.
Noisy scenarios
Study of the noise robustness capabilities of the proposed strategies.
Individual results provided for each corpus.
Statistical tests (Friedman and Bonferroni-Dunn with a significance level of p < 0.01) to assess the improvement compared to the base multilabel PG strategies (a minimal sketch of such a test follows this block).
Corresponds to Section 5.2 in the manuscript.
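A minimal sketch of the Friedman test on hypothetical per-corpus scores; the post-hoc Bonferroni-Dunn comparison could follow, e.g. with the scikit-posthocs package:

import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical score matrix (rows: corpora, columns: PG strategies).
rng = np.random.default_rng(0)
scores = rng.uniform(0.6, 0.9, size=(12, 4))

# Friedman test across the strategies (one sample per column).
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")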
Results ignoring the Editing stage
Assessment of the relevance of the Editing stage in the general pipeline.
Individual results provided for each corpus.
Corresponds to Section 5.3 in the manuscript.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Overview
This dataset contains a diverse collection of 72,000+ high-quality images of fruits and vegetables, carefully curated for machine learning and deep learning applications. It includes 50 unique categories of fruits and vegetables, such as apples, avocados, carrots, mangoes, broccoli, and more. The dataset is perfect for tasks like classification, object detection, image recognition, and educational purposes.
Key Features
Total Images: 72,000+
Image Dimensions: 128x128 pixels (uniform size for consistency and ease of processing). There are also other photos with bigger resolutions.
Classes: 50 categories of fruits and vegetables, including: Apple, Avocado, Banana, Beetroot, Blackberry, Blueberry, Broccoli, Cabbage, Capsicum, Carrot, Cauliflower, Chilli Pepper, Corn, Cucumber, Dates, Dragonfruit, Eggplant, Fig, Garlic, Ginger, Grapes, Guava, Jalapeno, Kiwi, Lemon, Lettuce, Mango, Mushroom, Okra, Olive, Onion, Orange, Paprika, Peanuts, Pear, Peas, Pineapple, Pomegranate, Potato, Pumpkin, Radish, Rambutan, Soy Beans, Spinach, Strawberry, Sweetcorn, Sweet Potato, Tomato, Turnip, Watermelon.
Split: The dataset is divided into training, validation, and test sets, making it ready for machine learning workflows.
Class Imbalance: Not all categories contain the same number of images, making it suitable for testing class imbalance handling techniques in machine learning.
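For instance, rare classes can be oversampled at batch time with a weighted sampler instead of duplicating files (the folder name below is an assumption):

from collections import Counter

from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

# Folder name is hypothetical; adjust to the dataset's actual layout.
train_set = datasets.ImageFolder("fruits_vegetables/train",
                                 transform=transforms.ToTensor())

# Sample each image with probability inversely proportional to its
# class frequency, so mini-batches are roughly balanced.
counts = Counter(train_set.targets)
weights = [1.0 / counts[t] for t in train_set.targets]
sampler = WeightedRandomSampler(weights, num_samples=len(weights))
loader = DataLoader(train_set, batch_size=64, sampler=sampler)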
Why Use This Dataset?
Realistic Data Distribution: With varying volumes of data across categories, the dataset provides a realistic challenge for building robust models that can generalize well.
Preprocessed and Ready-to-Use: All images are resized to 128x128 pixels, saving you preprocessing time.
Diverse Applications: Ideal for fruit and vegetable classification, agriculture-related AI models, health-tech solutions, and educational tools.
Large Scale: With over 72,000 images, the dataset is suitable for training deep learning models with high accuracy.
Applications
Image Classification: Build AI models to classify fruits and vegetables.
Health-Tech Solutions: Use the dataset to develop apps for identifying fruits/vegetables for dietary planning.
Agricultural Technology: Enhance crop identification systems or supply chain management tools.
Education: Provide students and researchers with a practical dataset to learn machine learning techniques.
Licensing and Usage
This dataset is free to use for any purpose, including research, education, and commercial projects.
Acknowledgments
This dataset was created with the goal of advancing AI applications in food technology, agriculture, and education. We hope it helps you build impactful machine learning solutions!
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Retrieved from Mendeley Data on 16-Dec-2024: https://data.mendeley.com/datasets/x8ygrw87jw/1
This dataset comprises vital information on potential cerebral stroke patients, including personal data (e.g., age and gender) and disease history (e.g., hypertension and heart disease), which was collected from HealthData.gov by Liu, Fan & Wu (2019) during their study titled 'A hybrid machine learning approach to cerebral stroke prediction based on an imbalanced medical dataset'. The data collection prioritized physiological indicators over complex medical monitoring to minimize diagnosis expenses.
This cerebral stroke dataset records information from 43400 potential patients, comprising 12 attributes with various data types.
The target variable, 'stroke', is categorized into '0' and '1', representing 'no stroke' and 'stroke', respectively. It is a categorical variable, making the problem a binary classification task. This dataset includes 783 occurrences of stroke, which account for 1.18% of the total, resulting in a highly imbalanced dataset. This imbalance reflects actual clinical practice, where most medical datasets suffer from class imbalance by nature.
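A hedged sketch of a cost-sensitive baseline on such data, assuming a hypothetical cerebral_stroke.csv with the 'stroke' target described above:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("cerebral_stroke.csv")    # hypothetical file name
X = pd.get_dummies(df.drop(columns=["stroke"])).fillna(0)
y = df["stroke"]

# Stratify so the 1.18% positive rate is preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive baseline: class weights inversely proportional to frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))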
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.
Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. ones created with a stratified K-fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide links to the images in bash scripts that download the images. Another bash script re-organises the images into sub-folders with a maximum of 1000 images in each folder.
Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, and private events. Images were annotated with a binary label denoting whether the content was deemed to be public or private. As the images are publicly available, their label is mostly public, so these datasets have a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images, which is limited in PicAlert. Further details can be found in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464
List of datasets and their original source:
Notes:
Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding re-computing them on their own or at each epoch during the training of a model (faster training).
For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.
Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.
Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)
If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on domain-specific article classification to determine whether articles contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients.
Methods: Developing a text classification method can help regulators, such as the FDA, identify facts of potential DILI for specific drugs much faster and at massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, information theory, and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies to handle the imbalance of the data.
Results: Transformers achieve the best results in cases where the distribution of classes and semantics of the test data matches the training set. But in cases of imbalanced data, simple statistical, information-theory-based models can surpass complex transformers, bringing more interpretable results, which is so important for the biomedical domain. As our results show, neural networks can achieve better results if they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution.
Discussion: Overall, transformers are a powerful architecture; however, in certain cases, such as topic classification, their usage can be redundant, and simple statistical approaches can achieve comparable results while being much faster and explainable. However, we see potential in combining results from both worlds. Development of new neural network architectures, loss functions and training procedures that bring stability to unbalanced data is a promising topic of development.
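For illustration, a minimal interpretable baseline of the kind the study compares against (TF-IDF features with Naïve Bayes); the texts and labels below are invented stand-ins, not the study's corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for abstracts labelled DILI-positive (1) or negative (0).
texts = ["hepatotoxicity was observed after drug administration",
         "the compound improved glycaemic control in mice",
         "elevated liver enzymes indicated drug-induced injury",
         "no adverse hepatic events were reported"]
labels = [1, 0, 1, 0]

# Fast, interpretable baseline: TF-IDF features with multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["patients showed signs of liver injury"]))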
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset presents a transaction data simulator of legitimate and fraudulent transactions.
A simulation is necessarily an approximation of reality. Compared to the complexity of the dynamics underlying real-world payment card transaction data, the data simulator that we present below follows a simple design.
This simple design is a choice. First, having simple rules to generate transactions and fraudulent behaviors will help in interpreting the kind of patterns that different fraud detection techniques can identify. Second, while simple in its design, the data simulator will generate datasets that are challenging to deal with.
The simulated datasets will highlight most of the issues that practitioners of fraud detection face using real-world data. In particular, they will include class imbalance (less than 1% of fraudulent transactions), a mix of numerical and categorical features (with categorical features involving a very large number of values), non-trivial relationships between features, and time-dependent fraud scenarios.
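A toy generator in the same spirit (not the published simulator; all feature names, distributions, and rates below are illustrative) shows how such properties can be produced:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Categorical features with many values and skewed transaction amounts.
df = pd.DataFrame({
    "customer_id": rng.integers(0, 500, n),
    "terminal_id": rng.integers(0, 100, n),
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=n).round(2),
    "hour": rng.integers(0, 24, n),
})

# Time-dependent fraud scenario: night-time, high-amount transactions
# are more likely to be fraudulent; the base rate stays well below 1%.
p = 0.001 + 0.02 * ((df["hour"] < 6) & (df["amount"] > 100))
df["is_fraud"] = rng.random(n) < p
print("fraud rate:", df["is_fraud"].mean())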
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Sunflower Growth Stage Image Dataset for Phenological Classification was collected from agricultural fields in Bangladesh, focusing on the identification and classification of sunflower growth stages. Images were captured directly in the field using a Redmi Note 11 smartphone, under natural daylight and varying weather conditions to reflect real-world environments. This dataset is meant to aid research in deep learning, computer vision, and plant phenology by providing data for automated classification of growth stages.
A total of 1,255 original images were gathered, each with a high resolution of 12,288 × 16,320 pixels and approximately 25 MB in size. The images are divided into five classes: Stage1 (Young_Bud) with 238 images, Stage2 (Mature_Bud) with 272 images, Stage3 (Early_Bloom) with 218 images, Stage4 (Full_Bloom) with 213 images, and Stage5 (Wilted) with 314 images. To balance the dataset for training, each class was augmented to have 500 images, resulting in a final balanced collection of 2,500 images.
Validation of the dataset was carried out by a Sub-Assistant Agriculture Officer from the Department of Agricultural Extension (DAE), Bangladesh, ensuring its reliability. The data was collected at two main sites: Daffodil International University (Ashulia Campus) and Model Town Nursery, Ashulia, Bangladesh. The camera used for capturing the images was a Redmi Note 11, with 24-bit color depth, an aperture of f/1.8, and images saved in JPEG format.
Example metadata for an image shows it was taken on 2025-05-22 at 17:47 using the MediaTek Camera Application. The image's dimensions are 12,288 × 16,320 pixels at 72 dpi with 24-bit sRGB color representation. The camera details include Xiaomi as the maker, model 23117RA86G, f-stop f/1.6, exposure time 1/100 sec, ISO 200, focal length 6 mm, and auto white balance. GPS coordinates recorded were Latitude 23.5247046, Longitude 90.1918097, Altitude 34.5 m. The example image file is named IMG_20250522_174724.jpg and is a JPEG of size 26.1 MB.
Attribution Notice
This dataset also includes 24 images derived from the publicly available dataset "Sunflower Plant Health and Growth Stage Image Dataset for Agricultural Machine Learning Applications": Sagor, Saifuddin; Hossan, Md. Faysal; Ahmed, Faruk; Reyad, Md. Zamirul Islam (2025), Mendeley Data, V1, doi: 10.17632/y3ygk98ngr.1
These images were incorporated because the number of collected field images was insufficient for the Stage4 (Full_Bloom) Class. After inclusion, a portion of these images was further augmented to increase the dataset size and maintain class balance. Any modifications or augmentations applied to the derived images are the responsibility of the present authors.
The original dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains synthetic credit card transaction data designed for fraud detection and machine learning research. With over 6.3 million transactions, it provides a realistic simulation of financial transaction patterns including both legitimate and fraudulent activities.
This is a synthetic dataset generated to simulate credit card transaction behavior. The data represents financial transactions over a 30-day period (743 hours) with various transaction types including payments, transfers, cash-outs, debits, and cash-ins.
The dataset is specifically designed for: - Training and testing fraud detection models - Anomaly detection research - Binary classification tasks - Imbalanced learning scenarios - Financial machine learning applications
This dataset exhibits significant class imbalance with only 0.13% fraudulent transactions. This mirrors real-world fraud detection scenarios where fraudulent transactions are rare. Consider using techniques such as: - SMOTE (Synthetic Minority Over-sampling Technique) - Undersampling of majority class - Cost-sensitive learning - Ensemble methods - Anomaly detection algorithms
This dataset is well-suited for: - Logistic Regression - Random Forest - Gradient Boosting (XGBoost, LightGBM, CatBoost) - Neural Networks - Isolation Forest - Autoencoders - Support Vector Machines
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
# Display basic information
print(df.info())
print(df.head())
# Check fraud distribution
print(df['isFraud'].value_counts())
# Visualize fraud distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='isFraud')
plt.title('Distribution of Fraud vs Legitimate Transactions')
plt.xlabel('Is Fraud (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()
# Transaction type distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='type', hue='isFraud')
plt.title('Transaction Types by Fraud Status')
plt.xticks(rotation=45)
plt.show()
This is a static dataset with no planned future updates. It serves as a benchmark for fraud detection research and model development.
This dataset is made available under the MIT License for educational and research purposes in the field of fraud detection and financial machine learning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The development of toxicity classification models using the ToxCast database has been extensively studied. Machine learning approaches are effective in identifying the bioactivity of untested chemicals. However, ToxCast assays differ in the amount of data and degree of class imbalance (CI). Therefore, the resampling algorithm employed should vary depending on the data distribution to achieve optimal classification performance. In this study, the effects of CI and data scarcity (DS) on the performance of binary classification models were investigated using ToxCast bioassay data. An assay matrix based on CI and DS was prepared for 335 assays with biologically intended target information, and 28 CI assays and 3 DS assays were selected. Thirty models established by combining five molecular fingerprints (i.e., Morgan, MACCS, RDKit, Pattern, and Layered) and six algorithms [i.e., gradient boosting tree, random forest (RF), multi-layered perceptron, k-nearest neighbor, logistic regression, and naive Bayes] were trained using the selected assay data set. Of the 30 trained models, MACCS–RF showed the best performance and thus was selected for analyses of the effects of CI and DS. Results showed that recall and F1 were significantly lower when training with the CI assays than with the DS assays. In addition, hyperparameter tuning of the RF algorithm significantly improved F1 on CI assays. This study provided a basis for developing a toxicity classification model with improved performance by evaluating the effects of data set characteristics. This study also emphasized the importance of using appropriate evaluation metrics and tuning hyperparameters in model development.
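As a minimal, hedged sketch of the best-performing pairing reported above (MACCS fingerprints with a random forest, tuned with F1 as the selection metric); the SMILES strings and labels are toy stand-ins, not ToxCast data:

import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy SMILES with made-up activity labels, standing in for an assay set.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "c1ccncc1", "CCCC"]
labels = [0, 1, 0, 0, 1, 0]

# MACCS keys (166-bit structural fingerprints), as in the MACCS-RF model.
X = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)))
              for s in smiles])

# Hyperparameter tuning with F1 as the selection metric, which the study
# found important for class-imbalanced assays.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]},
                    scoring="f1", cv=2)
grid.fit(X, labels)
print(grid.best_params_)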
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Burmese Grape Leaf Disease Dataset comprises 3,103 high-quality images categorized into five distinct classes representing various conditions of grapevine leaves. This dataset is curated to support machine learning, deep learning, and computer vision-based applications for automated plant disease recognition and classification. Each image captures clear visual indicators relevant to the health status of the leaf, aiding in effective feature extraction and model training.
Data Collection Details: Captured Using: 1. Realme 8 (64 MP, f/1.79 aperture) 2. Redmi Note 7 Pro Max (48 MP, f/1.79 aperture)
Data Source Locations: 1. Toponer Lotkon Bagan, Kaligonj-Nagori Road, Nagarvala (Latitude: 23.88658723621705, Longitude: 90.47780500780843) 2. Itakhola Bus Stand, Narsingdi (Latitude: 23.980154076764684, Longitude: 90.7332739352483)
Number of Images: 1. Healthy: 1006 2. Anthracnose (Brown Spot): 447 3. Insect Damage: 990 4. Powdery Mildew: 296 5. Leaf Spot (Yellow): 364
Data Augmentation Techniques: To enhance model generalizability and address data imbalance, the dataset was augmented using the following techniques: 1. Brightness adjustment 2. Contrast enhancement 3. Rotation (random angles) 4. Shear transformation 5. Zoom-in and zoom-out scaling
Augmented Images (15,515 Images): 1. Healthy: 1006*5 = 5,030 2. Anthracnose (Brown Spot): 447*5 = 2,235 3. Insect Damage: 990*5 = 4,950 4. Powdery Mildew: 296*5 = 1,480 5. Leaf Spot (Yellow): 364*5 = 1,820
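A possible torchvision recreation of the five listed techniques (the authors' exact parameters are not specified in this description, so the values below are illustrative):

from torchvision import transforms

# Random augmentations approximating the five techniques listed above.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3),               # brightness
    transforms.ColorJitter(contrast=0.3),                 # contrast
    transforms.RandomRotation(degrees=30),                # rotation
    transforms.RandomAffine(degrees=0, shear=10),         # shear
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)), # zoom in/out
    transforms.ToTensor(),
])
# Pass `augment` as the transform of an ImageFolder/Dataset to apply it
# on the fly during training.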
Key Applications: 1. Automated Disease Detection: Used to train intelligent systems capable of identifying leaf diseases in real time. 2. Precision Viticulture: Enables AI-based monitoring for better vineyard management and targeted treatment. 3. Computer Vision Research: Provides a benchmark for evaluating classification and segmentation models. 4. Transfer Learning & Mobile Deployment: Suitable for fine-tuning pre-trained CNNs and deploying lightweight models on smartphones and edge devices. 5. Explainable AI in Agriculture: Ideal for interpretability research using saliency maps and XAI tools. 6. Academic and Industrial Benchmarking: Can be used in competitions, thesis projects, or commercial AI prototypes for crop health monitoring.