100+ datasets found
  1. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Explore at:
    Available download formats: text/x-tex
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias: traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. To address these problems, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance-handling methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better as the degree of data imbalance increases, gradually outperforming the other methods.
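
    The data-based versus model-based split described above can be made concrete with a few lines of scikit-learn. The sketch below is a generic illustration (not the paper's ViLa method, which is not in standard libraries) contrasting random oversampling of the minority class with class weighting; the synthetic data is a placeholder for any imbalanced dataset:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    # Synthetic stand-in: roughly 5% minority class.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    # Data-based approach: oversample the minority class up to the majority count.
    X_maj, X_min = X[y == 0], X[y == 1]
    X_min_up = resample(X_min, n_samples=len(X_maj), replace=True, random_state=0)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min_up))]
    clf_data = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # Model-based approach: keep the data as-is and reweight the loss instead.
    clf_model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)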

  2. Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning

    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker (2023). GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.1c00160.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5, which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
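
    The core idea of decision-threshold adjustment can be sketched generically: train once, then sweep candidate thresholds over held-out prediction probabilities and keep the best one. This is a simplified illustration of the principle, not the exact GHOST procedure:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

    # Train once; tune only the threshold on held-out probabilities.
    probs = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
    thresholds = np.arange(0.05, 0.55, 0.05)
    kappas = [cohen_kappa_score(y_val, (probs >= t).astype(int)) for t in thresholds]
    print(f"optimized threshold: {thresholds[int(np.argmax(kappas))]:.2f} (default: 0.5)")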

  3. Imbalanced Cifar-10

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
    Explore at:
    Available download formats: zip (807146485 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Akhil Theerthala
    Description

    This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows,

    [Figure: class distribution of the imbalanced CIFAR-10 training set (Cifar_Imbalanced_data.png)]

    The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

    The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

    This dataset is beneficial for:
    • Developing and testing strategies for handling imbalanced datasets.
    • Investigating the effects of class imbalance on model performance.
    • Comparing different machine learning algorithms' performance under class imbalance.

    Usage Information:

    The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder. You can load the dataset in Python using popular libraries like NumPy and PyTorch.
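
    For instance, a minimal loading sketch with torchvision; the root path below is a placeholder for wherever the archive is extracted:

    from collections import Counter

    import torch
    from torchvision import datasets, transforms

    transform = transforms.Compose([transforms.ToTensor()])
    train_set = datasets.ImageFolder("imbalanced-cifar-10/train", transform=transform)

    # Inspect the induced class imbalance before training.
    counts = Counter(label for _, label in train_set.samples)
    for cls, idx in train_set.class_to_idx.items():
        print(cls, counts[idx])

    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)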

    License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

    Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.

  4. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery. Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered. Results: Both resampling families showed improved performance with respect to the original dataset. The oversampling procedures were found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performance. All the undersampling approaches were more robust than the oversampling approaches across the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method. Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
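
    For readers who want to reproduce this kind of comparison on their own data, the imbalanced-learn library implements both resampling families. A minimal sketch, with synthetic data standing in for the SEEG features (which are not part of this record):

    from imblearn.over_sampling import ADASYN
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [("ADASYN", ADASYN(random_state=0)),
                          ("RUS", RandomUnderSampler(random_state=0))]:
        X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)        # resample training data only
        clf = RandomForestClassifier(random_state=0).fit(X_rs, y_rs)
        print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))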

  5. Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Nagoya University
    Osaka University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms, as well as related training techniques, have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and logger attachment, and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each sample during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with a shortcut connection, showed better performance than the other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
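
    Two of the augmentations named above (jittering and scaling) are easy to express in NumPy. The sketch below is illustrative only; window shapes and noise levels are arbitrary choices, not the paper's settings:

    import random

    import numpy as np

    rng = np.random.default_rng(0)

    def jitter(x, sigma=0.05):
        """Add Gaussian noise to a (time, channels) acceleration window."""
        return x + rng.normal(0.0, sigma, size=x.shape)

    def scale(x, sigma=0.1):
        """Multiply each channel by a random factor close to 1."""
        return x * rng.normal(1.0, sigma, size=(1, x.shape[1]))

    window = rng.standard_normal((256, 3))        # e.g. 256 samples x 3 axes
    aug = random.choice([None, jitter, scale])    # "none" is one of the options
    window = window if aug is None else aug(window)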

    This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.

  6. Lending Club Loan Data

    • kaggle.com
    zip
    Updated Nov 8, 2020
    Cite
    Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset
    Explore at:
    Available download formats: zip (218250 bytes)
    Dataset updated
    Nov 8, 2020
    Authors
    Sweta Shetye
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted a highly imbalanced dataset to share with others, and this LendingClub loan data is a perfect example.

    Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and far fewer observations for one or more other classes (referred to as the minority classes).

    For example, in this dataset there are far more samples of fully paid borrowers than of borrowers who did not fully pay.

    Full LendingClub data available from their site.

    Content

    For companies like LendingClub, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.
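
    One common way to handle such imbalance in a deep learning model is to upweight the positive (default) class in the loss. A minimal PyTorch sketch; the feature count and class counts below are placeholders, not the actual LendingClub schema:

    import torch
    import torch.nn as nn

    n_pos, n_neg = 1_533, 8_045        # hypothetical class counts
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n_neg / n_pos]))

    x = torch.randn(64, 20)            # dummy mini-batch with 20 features
    y = (torch.rand(64, 1) < 0.15).float()
    loss = loss_fn(model(x), y)
    loss.backward()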

  7. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
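
    CRN-SMOTE itself is not available in standard libraries, but the baselines it is compared against are. A minimal sketch of running SMOTE-Tomek and SMOTE-ENN from imbalanced-learn and scoring two of the five reported metrics, on synthetic stand-in data:

    from imblearn.combine import SMOTEENN, SMOTETomek
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import cohen_kappa_score, matthews_corrcoef
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [("SMOTE-Tomek", SMOTETomek(random_state=0)),
                          ("SMOTE-ENN", SMOTEENN(random_state=0))]:
        X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)
        pred = RandomForestClassifier(random_state=0).fit(X_rs, y_rs).predict(X_te)
        print(name, cohen_kappa_score(y_te, pred), matthews_corrcoef(y_te, pred))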

  8. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    image/*
    Updated 2020
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    Available download formats: image/*
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Area covered
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data. Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.

    This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).

    The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects, and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).

    The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the Fluxes1 experiment with RV Sarmiento de Gamboa.

    The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss. In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
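
    Given the manifest structure described above (columns object_id, image_fn, and label), the class imbalance can be quantified with a short pandas sketch, assuming "training.csv" sits in the working directory:

    import pandas as pd

    labels = pd.read_csv("training.csv")["label"]
    counts = labels.value_counts()

    top = counts.head(max(1, len(counts) // 10))    # the top ~10% of classes
    print(f"{len(counts)} classes; sizes span a factor of {counts.max() / counts.min():.0f}")
    print(f"top 10% of classes hold {top.sum() / counts.sum():.0%} of all objects")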

  9. Performance comparison of machine learning models across accuracy, AUC, MCC,...

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.

  10. Financial Transaction Fraud Detection

    • kaggle.com
    zip
    Updated Aug 20, 2025
    Cite
    Abhi pratap (2025). Financial Transaction Fraud Detection [Dataset]. https://www.kaggle.com/datasets/abhipratapsingh/fraud-detection
    Explore at:
    Available download formats: zip (186385507 bytes)
    Dataset updated
    Aug 20, 2025
    Authors
    Abhi pratap
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.

    The data is an excellent case study for several key challenges in machine learning, including:

    • Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.

    • Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.

    • Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.
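
    A minimal sketch of the imbalance-aware evaluation called for in the last point, given true labels and predicted scores from any classifier (toy values shown):

    import numpy as np
    from sklearn.metrics import (average_precision_score, f1_score,
                                 precision_score, recall_score)

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.7, 0.4])
    y_pred = (y_score >= 0.5).astype(int)

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("F1:       ", f1_score(y_true, y_pred))
    print("PR-AUC:   ", average_precision_score(y_true, y_score))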

    Key Features: The dataset includes a variety of anonymized transaction details:

    • amount: The value of the transaction.

    • type: The type of transaction (e.g., TRANSFER, CASH_OUT).

    • oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.

    • isFraud: The target variable, a binary flag indicating a fraudulent transaction.

  11. Detailed results of "Insights into imbalance-aware Multilabel Prototype Generation mechanisms for k-Nearest Neighbor classification in noisy scenarios"

    • data.mendeley.com
    • observatorio-cientifico.ua.es
    Updated Apr 2, 2024
    Cite
    Jose J. Valero-Mas (2024). Detailed results of "Insights into imbalance-aware Multilabel Prototype Generation mechanisms for k-Nearest Neighbor classification in noisy scenarios" [Dataset]. http://doi.org/10.17632/p6ytjt5rfy.1
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Jose J. Valero-Mas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Detailed experimental results of the different Prototype Generation strategies for k-Nearest Neighbour classification in multilabel data attending to the particular issues of label-level imbalance and noise:

    1. Noise-free scenarios
       • Study of the considered strategies for addressing label-level imbalance in PG scenarios without induced noise.
       • Individual results provided for each corpus.
       • Statistical tests (Friedman and Bonferroni-Dunn with significance level of p < 0.01) to assess the improvement compared to the base multilabel PG strategies.
       • Corresponds to Section 5.1 in the manuscript.

    2. Noisy scenarios
       • Study of the noise robustness capabilities of the proposed strategies.
       • Individual results provided for each corpus.
       • Statistical tests (Friedman and Bonferroni-Dunn with significance level of p < 0.01) to assess the improvement compared to the base multilabel PG strategies.
       • Corresponds to Section 5.2 in the manuscript.

    3. Results ignoring the Editing stage
       • Assessment of the relevance of the Editing stage in the general pipeline.
       • Individual results provided for each corpus.
       • Corresponds to Section 5.3 in the manuscript.

  12. Fruit and Vegetables

    • kaggle.com
    zip
    Updated Nov 20, 2024
    Cite
    youssef salah zakria (2024). Fruit and Vegetables [Dataset]. https://www.kaggle.com/datasets/youssefsalahzakria/fruit-and-vegetables-classification
    Explore at:
    Available download formats: zip (5178940148 bytes)
    Dataset updated
    Nov 20, 2024
    Authors
    youssef salah zakria
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview: This dataset contains a diverse collection of 72,000+ high-quality images of fruits and vegetables, carefully curated for machine learning and deep learning applications. It includes 50 unique categories of fruits and vegetables, such as apples, avocados, carrots, mangoes, broccoli, and more. The dataset is perfect for tasks like classification, object detection, image recognition, and educational purposes.

    Key Features

    Total Images: 72,000+

    Image Dimensions: 128x128 pixels (uniform size for consistency and ease of processing). There are also other photos with bigger resolutions.

    Classes: 50 categories of fruits and vegetables, including: Apple, Avocado, Banana, Beetroot, Blackberry, Blueberry, Broccoli, Cabbage, Capsicum, Carrot, Cauliflower, Chilli Pepper, Corn, Cucumber, Dates, Dragonfruit, Eggplant, Fig, Garlic, Ginger, Grapes, Guava, Jalapeno, Kiwi, Lemon, Lettuce, Mango, Mushroom, Okra, Olive, Onion, Orange, Paprika, Peanuts, Pear, Peas, Pineapple, Pomegranate, Potato, Pumpkin, Radish, Rambutan, Soy Beans, Spinach, Strawberry, Sweetcorn, Sweet Potato, Tomato, Turnip, Watermelon.

    Split: The dataset is divided into training, validation, and test sets, making it ready for machine learning workflows.

    Class Imbalance: Not all categories contain the same number of images, making it suitable for testing class imbalance handling techniques in machine learning.

    Why Use This Dataset?
    • Realistic Data Distribution: With varying volumes of data across categories, the dataset provides a realistic challenge for building robust models that can generalize well.
    • Preprocessed and Ready-to-Use: All images are resized to 128x128 pixels, saving you preprocessing time.
    • Diverse Applications: Ideal for fruit and vegetable classification, agriculture-related AI models, health-tech solutions, and educational tools.
    • Large Scale: With over 72,000 images, the dataset is suitable for training deep learning models with high accuracy.

    Applications
    • Image Classification: Build AI models to classify fruits and vegetables.
    • Health-Tech Solutions: Use the dataset to develop apps for identifying fruits/vegetables for dietary planning.
    • Agricultural Technology: Enhance crop identification systems or supply chain management tools.
    • Education: Provide students and researchers with a practical dataset to learn machine learning techniques.

    Licensing and Usage: This dataset is free to use for any purpose, including research, education, and commercial projects.

    Acknowledgments: This dataset was created with the goal of advancing AI applications in food technology, agriculture, and education. We hope it helps you build impactful machine learning solutions!

  13. Cerebral Stroke Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2025
    Cite
    dailydaisy2 (2025). Cerebral Stroke Dataset [Dataset]. https://www.kaggle.com/datasets/viviansam/cerebral-stroke-dataset
    Explore at:
    Available download formats: zip (573312 bytes)
    Dataset updated
    Sep 25, 2025
    Authors
    dailydaisy2
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Retrieved from Mendeley Data on 16-Dec-2024: https://data.mendeley.com/datasets/x8ygrw87jw/1

    This dataset comprises vital information on potential cerebral stroke patients, including personal data (e.g., age, gender, etc.), and disease history (e.g. hypertension, heart disease, etc.), which was collected from HealthData.gov by Liu, Fan & Wu (2019) during their study titled 'A hybrid machine learning approach to cerebral stroke prediction based on an imbalanced medical dataset'. The data collection prioritized physiological indicators over complex medical monitoring to minimize diagnosis expenses.

    This cerebral stroke dataset records information from 43400 potential patients, comprising 12 attributes with various data types.

    1. id - Unique identifier of each patient
    2. gender - Gender of the patient: male, female, other
    3. age - Age of the patient: ranged from 0.08 to 82
    4. hypertension - If the patient has hypertension: 0, 1 (no, yes, respectively)
    5. heart_disease - If the patient has heart disease: 0, 1 (no, yes, respectively)
    6. ever_married - Marital status of patient: No, Yes
    7. work_type - Occupation type of patient: children, private sector, self-employed, government sector, never worked
    8. Residence_type - Residency type of patient: rural, urban
    9. avg_glucose_level - Average glucose level in blood: ranged from 55 to 279.66
    10. bmi - Body mass index: ranged from 10.1 to 97.6
    11. smoking_status - Smoking status: formerly smoked, never smoked, smokes
    12. stroke - If the patient has stroke: 0, 1 (no, yes, respectively)

    The target variable, ‘stroke’, is categorized into ‘0’ and ‘1’, representing ‘no stroke’ and ‘have stroke’ respectively. It is a categorical variable, making the problem a binary classification task. This dataset includes 783 occurrences of stroke, which account for about 1.8% of the total, resulting in a highly imbalanced dataset. This imbalance reflects actual clinical practice, where most medical datasets suffer from class imbalance by nature.
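
    Because the positive class is so rare, a stratified split is advisable so that both partitions keep the original prevalence. A minimal sketch, with the CSV file name as a placeholder and the column names taken from the attribute list above:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("cerebral_stroke.csv")                  # placeholder file name
    X, y = df.drop(columns=["id", "stroke"]), df["stroke"]

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    print(y_tr.mean(), y_te.mean())                          # both stay near 0.018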

  14. Learning Privacy from Visual Entities - Curated data sets and pre-computed...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.
    [arXiv] [code]

    Curated image privacy data sets

    In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

    Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. those created with a stratified K-fold cross-validation procedure. For the original datasets (PicAlert and PrivacyAlert), we provide bash scripts that download the images from their linked locations. Another bash script re-organises the images into sub-folders with a maximum of 1,000 images each.

    Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, and private events. Images were annotated with a binary label denoting whether the content was deemed to be public or private. As the images are publicly available, their label is mostly public. These datasets therefore have a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images, which are already limited in PicAlert. Further details are in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464

    List of datasets and their original source:

    Notes:

    • For PicAlert and PrivacyAlert, only urls to the original locations in Flickr are available in the Zenodo record
    • Collector and authors of the PrivacyAlert dataset selected the images from Flickr under Public Domain license
    • Owners of the photos on Flickr could have removed the photos from the social media platform
    • Running the bash scripts to download the images can result in the "429 Too Many Requests" status code

    Pre-computed visual entities

    Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding re-computing them on their own or at each epoch during the training of a model (faster training).

    For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.

    Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.
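
    A hedged sketch of reading such per-image artefacts; the file names below are placeholders, and the actual layout is documented in the GitHub repository linked below:

    import json

    import pandas as pd

    # Placeholder file names; see the GitHub repository for the real layout.
    scene_probs = pd.read_csv("scene_probabilities.csv")      # per-image scene types
    with open("detections_coco.json") as f:                   # objects, COCO format
        detections = json.load(f)
    with open("node_features.json") as f:                     # graph-ready entities
        node_features = json.load(f)
    adjacency = pd.read_csv("adjacency_matrix.csv", header=None).to_numpy()

    print(scene_probs.shape, len(detections.get("annotations", [])), adjacency.shape)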

    Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)

    Enquiries, questions and comments

    If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.

  15. DataSheet1_Comparative analysis of classification techniques for topic-based...

    • frontiersin.figshare.com
    pdf
    Updated Nov 7, 2023
    Cite
    Ihor Stepanov; Arsentii Ivasiuk; Oleksandr Yavorskyi; Alina Frolova (2023). DataSheet1_Comparative analysis of classification techniques for topic-based biomedical literature categorisation.PDF [Dataset]. http://doi.org/10.3389/fgene.2023.1238140.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 7, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Ihor Stepanov; Arsentii Ivasiuk; Oleksandr Yavorskyi; Alina Frolova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on domain-specific article classification to determine whether articles contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients.

    Methods: Developing a text classification method can help regulators, such as the FDA, identify facts of potential DILI for concrete drugs much faster and at a massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, information theory, and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies to handle the imbalance of the data.

    Results: Transformers achieve the best results in cases where the distribution of classes and the semantics of the test data match the training set. But in cases of imbalanced data, simple statistical-information theory-based models can surpass complex transformers, bringing more interpretable results that are so important for the biomedical domain. As our results show, neural networks can achieve better results if they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution.

    Discussion: Overall, transformers are a powerful architecture; however, in certain cases, such as topic classification, their usage can be redundant, and simple statistical approaches can achieve comparable results while being much faster and more explainable. Still, we see potential in combining results from both worlds. The development of new neural network architectures, loss functions and training procedures that bring stability to unbalanced data is a promising direction.
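
    The "simple and fast" end of the spectrum discussed above can be approximated with a TF-IDF plus Naïve Bayes baseline. This generic scikit-learn sketch (with toy sentences and labels) is illustrative only and is not the authors' method:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["drug induced liver injury reported after treatment",
             "no hepatic adverse events were observed",
             "elevated transaminases suggest hepatotoxicity",
             "study of renal clearance in healthy adults"]
    labels = [1, 0, 1, 0]                      # 1 = DILI-related (toy labels)

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
    print(clf.predict(["patient developed liver injury on the drug"]))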

  16. Credit Card Fraud Detection

    • kaggle.com
    zip
    Updated Jul 18, 2021
    Cite
    Old Monk (2021). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/saurabhbagchi/credit-card-fraud-detection
    Explore at:
    Available download formats: zip (29415944 bytes)
    Dataset updated
    Jul 18, 2021
    Authors
    Old Monk
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset was generated by a transaction data simulator of legitimate and fraudulent transactions.

    A simulation is necessarily an approximation of reality. Compared to the complexity of the dynamics underlying real-world payment card transaction data, the data simulator that we present below follows a simple design.

    This simple design is a choice. First, having simple rules to generate transactions and fraudulent behaviors will help in interpreting the kind of patterns that different fraud detection techniques can identify. Second, while simple in its design, the data simulator will generate datasets that are challenging to deal with.

    The simulated datasets will highlight most of the issues that practitioners of fraud detection face using real-world data. In particular, they will include class imbalance (less than 1% of fraudulent transactions), a mix of numerical and categorical features (with categorical features involving a very large number of values), non-trivial relationships between features, and time-dependent fraud scenarios.

  17. Sunflower Growth Stage Image Dataset for Phenological Classification

    • data.mendeley.com
    Updated Aug 18, 2025
    Cite
    Jahangir Alam Jibon (2025). Sunflower Growth Stage Image Dataset for Phenological Classification [Dataset]. http://doi.org/10.17632/byftmdzg4g.2
    Explore at:
    Dataset updated
    Aug 18, 2025
    Authors
    Jahangir Alam Jibon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Sunflower Growth Stage Image Dataset for Phenological Classification was collected from agricultural fields in Bangladesh, focusing on the identification and classification of sunflower growth stages. Images were captured directly in the field using a Redmi Note 11 smartphone, under natural daylight and varying weather conditions to reflect real-world environments. This dataset is meant to aid research in deep learning, computer vision, and plant phenology by providing data for automated classification of growth stages.

    A total of 1,255 original images were gathered, each with a high resolution of 12,288 × 16,320 pixels and approximately 25 MB in size. The images are divided into five classes: Stage1 (Young_Bud) with 238 images, Stage2 (Mature_Bud) with 272 images, Stage3 (Early_Bloom) with 218 images, Stage4 (Full_Bloom) with 213 images, and Stage5 (Wilted) with 314 images. To balance the dataset for training, each class was augmented to have 500 images, resulting in a final balanced collection of 2,500 images.

    Validation of the dataset was carried out by a Sub-Assistant Agriculture Officer from the Department of Agricultural Extension (DAE), Bangladesh, ensuring its reliability. The data was collected at two main sites: Daffodil International University (Ashulia Campus) and Model Town Nursery, Ashulia, Bangladesh. The camera used for capturing the images was a Redmi Note 11, with 24-bit color depth, an aperture of f/1.8, and images saved in JPEG format.

    Example metadata for an image shows it was taken on 2025-05-22 at 17:47 using the MediaTek Camera Application. The image’s dimensions are 12,288 × 16,320 pixels at 72 dpi with 24-bit sRGB color representation. The camera details include Xiaomi as the maker, model 23117RA86G, f-stop f/1.6, exposure time 1/100 sec, ISO 200, focal length 6 mm, and auto white balance. GPS coordinates recorded were Latitude 23.5247046, Longitude 90.1918097, Altitude 34.5 m. The example image file, IMG_20250522_174724.jpg, is a JPEG of size 26.1 MB.

    Attribution Notice: This dataset also includes 24 images derived from the publicly available dataset "Sunflower Plant Health and Growth Stage Image Dataset for Agricultural Machine Learning Applications": Sagor, Saifuddin; Hossan, Md. Faysal; Ahmed, Faruk; Reyad, Md. Zamirul Islam (2025), Mendeley Data, V1, doi: 10.17632/y3ygk98ngr.1

    These images were incorporated because the number of collected field images was insufficient for the Stage4 (Full_Bloom) Class. After inclusion, a portion of these images was further augmented to increase the dataset size and maintain class balance. Any modifications or augmentations applied to the derived images are the responsibility of the present authors.

    The original dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

  18. Credit Card Fraud Dataset

    • kaggle.com
    zip
    Updated Jun 22, 2024
    Cite
    Dylan Moraes (2024). Credit Card Fraud Dataset [Dataset]. https://www.kaggle.com/datasets/dylanmoraes/credit-card-fraud-dataset/discussion
    Explore at:
    Available download formats: zip (186385507 bytes)
    Dataset updated
    Jun 22, 2024
    Authors
    Dylan Moraes
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains synthetic credit card transaction data designed for fraud detection and machine learning research. With over 6.3 million transactions, it provides a realistic simulation of financial transaction patterns including both legitimate and fraudulent activities.

    Source

    This is a synthetic dataset generated to simulate credit card transaction behavior. The data represents financial transactions over a 30-day period (743 hours) with various transaction types including payments, transfers, cash-outs, debits, and cash-ins.

    Purpose

    The dataset is specifically designed for:
    • Training and testing fraud detection models
    • Anomaly detection research
    • Binary classification tasks
    • Imbalanced learning scenarios
    • Financial machine learning applications

    Column Descriptions

    • step: Maps a unit of time in the real world. 1 step represents 1 hour of time. Range: 1 to 743
    • type: Type of transaction (PAYMENT, TRANSFER, CASH_OUT, DEBIT, CASH_IN)
    • amount: Amount of the transaction in local currency
    • nameOrig: Customer ID who initiated the transaction
    • oldbalanceOrg: Initial balance before the transaction (origin account)
    • newbalanceOrig: New balance after the transaction (origin account)
    • nameDest: Recipient ID of the transaction
    • oldbalanceDest: Initial recipient balance before the transaction
    • newbalanceDest: New recipient balance after the transaction
    • isFraud: Binary flag indicating fraud (1 = fraud, 0 = legitimate)
    • isFlaggedFraud: Flag for illegal attempts to transfer more than 200,000 in a single transaction

    Dataset Statistics

    • Total Transactions: 6,362,620
    • Fraudulent Transactions: 8,213 (~0.13%)
    • Legitimate Transactions: 6,354,407 (~99.87%)
    • Time Period: 30 days (743 hours)
    • File Size: 493.53 MB

    Class Imbalance Note

    This dataset exhibits significant class imbalance with only 0.13% fraudulent transactions. This mirrors real-world fraud detection scenarios where fraudulent transactions are rare. Consider using techniques such as:
    • SMOTE (Synthetic Minority Over-sampling Technique)
    • Undersampling of the majority class
    • Cost-sensitive learning
    • Ensemble methods
    • Anomaly detection algorithms

    Model Suitability

    This dataset is well-suited for:
    • Logistic Regression
    • Random Forest
    • Gradient Boosting (XGBoost, LightGBM, CatBoost)
    • Neural Networks
    • Isolation Forest
    • Autoencoders
    • Support Vector Machines

    Quick Start Example

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
    
    # Display basic information
    print(df.info())
    print(df.head())
    
    # Check fraud distribution
    print(df['isFraud'].value_counts())
    
    # Visualize fraud distribution
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df, x='isFraud')
    plt.title('Distribution of Fraud vs Legitimate Transactions')
    plt.xlabel('Is Fraud (0=No, 1=Yes)')
    plt.ylabel('Count')
    plt.show()
    
    # Transaction type distribution
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x='type', hue='isFraud')
    plt.title('Transaction Types by Fraud Status')
    plt.xticks(rotation=45)
    plt.show()
    

    Usage Tips

    1. Handle Class Imbalance: Use appropriate sampling techniques or algorithms designed for imbalanced data
    2. Feature Engineering: Consider creating features like transaction velocity, time-based patterns, and balance differences
    3. Evaluation Metrics: Use precision, recall, F1-score, and AUC-ROC rather than accuracy due to class imbalance
    4. Cross-validation: Use stratified k-fold to maintain class distribution across folds (see the sketch after this list)
    5. Transaction Patterns: Analyze transaction types - TRANSFER and CASH_OUT are more associated with fraud
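
    A sketch for tip 4, verifying that stratified k-fold preserves the ~0.13% fraud rate in every fold (file path as in the Quick Start example):

    import pandas as pd
    from sklearn.model_selection import StratifiedKFold

    df = pd.read_csv('/kaggle/input/credit-card-fraud-dataset/Fraud.csv')
    X, y = df.drop(columns=['isFraud']), df['isFraud']

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        print(f"fold {fold}: train fraud rate {y.iloc[train_idx].mean():.4%}, "
              f"test fraud rate {y.iloc[test_idx].mean():.4%}")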

    Update Frequency

    This is a static dataset with no planned future updates. It serves as a benchmark for fraud detection research and model development.

    Acknowledgments

    This dataset is made available under the MIT License for educational and research purposes in the field of fraud detection and financial machine learning.

  19. Data from: Effects of Class Imbalance and Data Scarcity on the Performance of Binary Classification Machine Learning Models Developed Based on ToxCast/Tox21 Assay Data

    • acs.figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Changhun Kim; Jaeseong Jeong; Jinhee Choi (2023). Effects of Class Imbalance and Data Scarcity on the Performance of Binary Classification Machine Learning Models Developed Based on ToxCast/Tox21 Assay Data [Dataset]. http://doi.org/10.1021/acs.chemrestox.2c00189.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Changhun Kim; Jaeseong Jeong; Jinhee Choi
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The development of toxicity classification models using the ToxCast database has been extensively studied. Machine learning approaches are effective in identifying the bioactivity of untested chemicals. However, ToxCast assays differ in the amount of data and degree of class imbalance (CI). Therefore, the resampling algorithm employed should vary depending on the data distribution to achieve optimal classification performance. In this study, the effects of CI and data scarcity (DS) on the performance of binary classification models were investigated using ToxCast bioassay data. An assay matrix based on CI and DS was prepared for 335 assays with biologically intended target information, and 28 CI assays and 3 DS assays were selected. Thirty models established by combining five molecular fingerprints (i.e., Morgan, MACCS, RDKit, Pattern, and Layered) and six algorithms [i.e., gradient boosting tree, random forest (RF), multi-layered perceptron, k-nearest neighbor, logistic regression, and naive Bayes] were trained using the selected assay data set. Of the 30 trained models, MACCS–RF showed the best performance and thus was selected for analyses of the effects of CI and DS. Results showed that recall and F1 were significantly lower when training with the CI assays than with the DS assays. In addition, hyperparameter tuning of the RF algorithm significantly improved F1 on CI assays. This study provided a basis for developing a toxicity classification model with improved performance by evaluating the effects of data set characteristics. This study also emphasized the importance of using appropriate evaluation metrics and tuning hyperparameters in model development.
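
    The best-performing MACCS-RF combination can be approximated with RDKit and scikit-learn. A hedged sketch with toy SMILES strings and activity flags in place of the actual ToxCast records:

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from sklearn.ensemble import RandomForestClassifier

    smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
    labels = [0, 1, 0, 1]                       # toy activity flags

    # 167-bit MACCS fingerprints as the molecular representation.
    fps = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)))
                    for s in smiles])
    clf = RandomForestClassifier(random_state=0).fit(fps, labels)
    print(clf.predict(fps[:1]))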

  20. Burmese Grape Leaf Disease Dataset for Computer Vision-Based Plant Health Diagnosis

    • data.mendeley.com
    Updated Apr 9, 2025
    Cite
    Salman Af Rahman (2025). Burmese Grape Leaf Disease Dataset for Computer Vision-Based Plant Health Diagnosis [Dataset]. http://doi.org/10.17632/k6gy38xv89.1
    Explore at:
    Dataset updated
    Apr 9, 2025
    Authors
    Salman Af Rahman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Burmese Grape Leaf Disease Dataset comprises 3,103 high-quality images categorized into five distinct classes representing various conditions of grapevine leaves. This dataset is curated to support machine learning, deep learning, and computer vision-based applications for automated plant disease recognition and classification. Each image captures clear visual indicators relevant to the health status of the leaf, aiding in effective feature extraction and model training.

    Data Collection Details
    Captured Using:
    1. Realme 8 (64 MP, f/1.79 aperture)
    2. Redmi Note 7 Pro Max (48 MP, f/1.79 aperture)

    Data Source Locations:
    1. Toponer Lotkon Bagan, Kaligonj-Nagori Road, Nagarvala (Latitude: 23.88658723621705, Longitude: 90.47780500780843)
    2. Itakhola Bus Stand, Narsingdi (Latitude: 23.980154076764684, Longitude: 90.7332739352483)

    Number of Images:
    1. Healthy: 1006
    2. Anthracnose (Brown Spot): 447
    3. Insect Damage: 990
    4. Powdery Mildew: 296
    5. Leaf Spot (Yellow): 364

    Data Augmentation Techniques: To enhance model generalizability and address data imbalance, the dataset was augmented using the following techniques:
    1. Brightness adjustment
    2. Contrast enhancement
    3. Rotation (random angles)
    4. Shear transformation
    5. Zoom-in and zoom-out scaling
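
    These five augmentations map naturally onto torchvision transforms; the parameter ranges below are illustrative assumptions, not the authors' settings:

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.ColorJitter(brightness=0.3, contrast=0.3),  # 1-2: brightness/contrast
        transforms.RandomRotation(degrees=30),                 # 3: rotation (random angles)
        transforms.RandomAffine(degrees=0, shear=10,           # 4: shear transformation
                                scale=(0.8, 1.2)),             # 5: zoom in/out
    ])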

    Augmented Images (15,515 Images):
    1. Healthy: 1006 × 5 = 5,030
    2. Anthracnose (Brown Spot): 447 × 5 = 2,235
    3. Insect Damage: 990 × 5 = 4,950
    4. Powdery Mildew: 296 × 5 = 1,480
    5. Leaf Spot (Yellow): 364 × 5 = 1,820

    Key Applications:
    1. Automated Disease Detection: Used to train intelligent systems capable of identifying leaf diseases in real time.
    2. Precision Viticulture: Enables AI-based monitoring for better vineyard management and targeted treatment.
    3. Computer Vision Research: Provides a benchmark for evaluating classification and segmentation models.
    4. Transfer Learning & Mobile Deployment: Suitable for fine-tuning pre-trained CNNs and deploying lightweight models on smartphones and edge devices.
    5. Explainable AI in Agriculture: Ideal for interpretability research using saliency maps and XAI tools.
    6. Academic and Industrial Benchmarking: Can be used in competitions, thesis projects, or commercial AI prototypes for crop health monitoring.
