2 datasets found

f
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
Mushroom Classification Enhanced
kaggle.com
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mo_der Steven (2024). Mushroom Classification Enhanced [Dataset]. https://www.kaggle.com/datasets/sakurapuare/mushroom-classification-enhanced/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 18, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mo_der Steven
Description
NOTE

The data for this competition is from the RAICOM Mission Application Competition and Mo in China, originating from https://www.kaggle.com/datasets/uciml/mushroom-classification/

The copyright of datasets belongs to the organizers of "RAICOM Mission Application Competition"

Baseline

The result of Official Baseline is:

Accuracy: 0.7464409388226241 Precision: 0.7591353576942872 Recall: 0.6344086021505376 F1: 0.6911902530459232 Confusion matrix: [[2405 468] [ 850 1475]]

Background

Mushrooms are a beloved delicacy among people, but beneath their glamorous appearance, they may harbor deadly dangers. China is one of the countries with the largest variety of mushrooms in the world. At the same time, mushroom poisoning is one of the most serious food safety issues in China. According to relevant reports, in 2021, China conducted research on 327 mushroom poisoning incidents, involving 923 patients and 20 deaths, with a total mortality rate of 2.17%. For non professionals, it is impossible to distinguish between poisonous mushrooms and edible mushrooms based on their appearance, shape, color, etc. There is no simple standard that can distinguish between poisonous mushrooms and edible mushrooms. To determine whether mushrooms are edible, it is necessary to collect mushrooms with different characteristic attributes and analyze whether they are toxic. In this competition, 22 characteristic attributes of mushrooms were analyzed to obtain a mushroom usability model, which can better predict whether mushrooms are edible.

Metrics

In the context of this mushroom usability model competition, several performance metrics can be utilized to evaluate the predictive accuracy of the model. Among them, the F1 score stands out due to its ability to provide a balance between precision and recall, which are crucial for this classification problem where distinguishing between poisonous and edible mushrooms can have severe real-world implications.

F1 Score The F1 score is the harmonic mean of precision and recall, and it is particularly useful in binary classification scenarios with imbalanced class distribution:

Precision (also known as positive predictive value) indicates the proportion of true positive observations among all observations classified as positive. It measures the accuracy of the positive predictions. \( \text{Precision} = \frac{TP}{TP + FP} \)

Recall (also known as sensitivity or true positive rate) measures the proportion of true positive observations out of all actual positives. It assesses the ability to capture all the true positive instances. \( \text{Recall} = \frac{TP}{TP + FN} \)

The F1 score is calculated as follows:

\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Why F1 Score? Balance Between Precision and Recall: In the context where mushroom classification error can have critical health impacts, favoring either precision or recall solely might be dangerous. F1 score provides a more comprehensive evaluation by balancing these errors.

Handling Imbalanced Classes: Mushroom datasets often have an imbalance between the number of edible and poisonous instances. The F1 score is less influenced by the skewed class distributions compared to accuracy.

Critical Application: Misclassifying a poisonous mushroom as edible can lead to severe health risks. Hence, ensuring both high precision (minimizing false positives) and high recall (capturing all true positives) is crucial. The F1 score encapsulates the tradeoff between these aspects well.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001

Data from: Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.1021/acs.jcim.5c00023.s001

Dataset updated

Apr 15, 2025

Dataset provided by

ACS Publications

Authors

Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.

Clear search

Close search

Google apps

Main menu

Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

Mushroom Classification Enhanced

NOTE

Baseline

Background

Metrics

Data from: Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML