Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets supporting the results reported in the paper: Hellinger Distance Trees for Imbalanced Streams, R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, 22nd International Conference on Pattern Recognition (ICPR), pp. 1969-1974, 2014. DOI: 10.1109/ICPR.2014.344. Contained in this distribution are results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper: Learning Decision Trees for Unbalanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008. The data sets used for these experiments include:
MAGIC Gamma Telescope Data Set: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
MiniBooNE particle identification Data Set: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
Skin Segmentation Data Set: https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
Letter Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition
Pen-Based Recognition of Handwritten Digits Data Set: https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
Statlog (Landsat Satellite) Data Set: https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
Statlog (Image Segmentation) Data Set: https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation)
A further data set used is not publicly available at present. However, we are in the process of releasing it for public use. Please get in touch if you'd like to use it.
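For orientation, the split criterion that gives these trees their name is the Hellinger distance between the per-class distributions induced by a candidate split. A minimal sketch of that calculation is shown below; the count arrays are made up for illustration and this is not the authors' implementation.

```python
# Minimal sketch of the Hellinger distance used as a skew-insensitive split
# criterion (in the spirit of Hellinger distance decision trees); the example
# count arrays are hypothetical.
import numpy as np

def hellinger_distance(pos_counts, neg_counts):
    """Distance between the positive- and negative-class distributions of a
    candidate split, each given as counts of examples falling into each branch."""
    p = np.asarray(pos_counts, dtype=float)
    q = np.asarray(neg_counts, dtype=float)
    p /= p.sum()  # normalise within each class, which is what makes the
    q /= q.sum()  # criterion insensitive to the overall class skew
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Two branches of a split: most positives go left, most negatives go right.
print(hellinger_distance([90, 10], [20, 80]))  # closer to sqrt(2) => stronger separation
```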
A readme file accompanies the data describing it in more detail.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and class weighting of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our study are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern is that the percentage improvement in balanced accuracy increases as the class ratio is systematically decreased from 0.5 to 0.1; for F1 score and MCC, the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study could be carried out considering a sizable number of ML methods and AutoML libraries.
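As a rough illustration of two of the techniques compared above (not the authors' exact protocol), the sketch below rebalances the training split with SMOTETomek and then picks a decision threshold from the precision-recall curve; the synthetic data set is only a stand-in.

```python
# Hedged sketch: SMOTETomek resampling plus threshold selection from the
# precision-recall curve; synthetic data stands in for the study's data sets.
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# (c) data-level balancing with SMOTETomek, applied to the training split only
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

# (a) threshold optimization along the precision-recall curve (here maximising F1;
# in practice the threshold would be chosen on a validation split, not the test set)
proba = clf.predict_proba(X_te)[:, 1]
prec, rec, thr = precision_recall_curve(y_te, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best_threshold = thr[np.argmax(f1[:-1])]
y_pred = (proba >= best_threshold).astype(int)
```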
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Insurance companies that sell life, health, and property and casualty insurance are using machine learning (ML) to drive improvements in customer service, fraud detection, and operational efficiency. The data here is provided by an insurance company that, like other companies, wants to take advantage of ML. This company provides Health Insurance to its customers. We can build a model to predict whether policyholders (customers) from the past year will also be interested in the Vehicle Insurance provided by the company.
An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.
For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/-, so that if, God forbid, you fall ill and need to be hospitalized in that year, the insurance provider company will bear the cost of hospitalization, etc., for up to Rs. 200,000. Now, if you are wondering how the company can bear such a high hospitalization cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers who pay a premium of Rs. 5000 every year, but only a few of them (say 2-3) would get hospitalized that year, not everyone. This way everyone shares the risk of everyone else.
Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider company so that, in case of an unfortunate accident involving the vehicle, the insurance provider company will provide compensation (called the 'sum assured') to the customer.
Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.
We have information about: - Demographics (gender, age, region code type), - Vehicles (Vehicle Age, Damage), - Policy (Premium, sourcing channel) etc.
Update: Test data target values have been added. To evaluate your models more precisely you can use: https://www.kaggle.com/arashnic/answer
Moreover, a supplementary goal is to practice learning from imbalanced data and to verify how the results can help in a real operational process. The Response feature (target) is highly imbalanced.
Class distribution of the Response target: 0: 319594, 1: 62531
Practicing techniques such as resampling is useful for verifying their impact on validation results and the confusion matrix.
Figure: Under-sampling with Tomek links.
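As a rough sketch of the under-sampling idea in the figure, using the imbalanced-learn library (the file name and the assumption of already-encoded numeric features are illustrative, not part of the original description):

```python
# Hedged sketch: removing majority-class samples that form Tomek links with
# minority samples; file name and feature encoding are placeholder assumptions.
import pandas as pd
from imblearn.under_sampling import TomekLinks

df = pd.read_csv("train.csv")                  # hypothetical local copy of the training data
X = df.drop(columns=["Response"])              # assumes features are already numeric/encoded
y = df["Response"]

tl = TomekLinks(sampling_strategy="majority")  # drop only the majority-class member of each link
X_res, y_res = tl.fit_resample(X, y)
print(y.value_counts(), y_res.value_counts(), sep="\n\n")
```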
Predict whether a customer would be interested in Vehicle Insurance
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning (ML) models for screening endocrine-disrupting chemicals (EDCs), such as thyroid stimulating hormone receptor (TSHR) agonists, are essential for sound management of chemicals. Previous models for screening TSHR agonists were built on imbalanced datasets and lacked applicability domain (AD) characterization essential for regulatory application. Herein, an updated TSHR agonist dataset was built, for which the ratio of active to inactive compounds greatly increased to 1:2.6, and chemical spaces of structure–activity landscapes (SALs) were enhanced. Resulting models based on 7 molecular representations and 4 ML algorithms were proven to outperform previous ones. Weighted similarity density (ρs) and weighted inconsistency of activities (IA) were proposed to characterize the SALs, and a state-of-the-art AD characterization methodology ADSAL{ρs, IA} was established. An optimal classifier developed with PubChem fingerprints and the random forest algorithm, coupled with ADSAL{ρs ≥ 0.15, IA ≤ 0.65}, exhibited good performance on the validation set with the area under the receiver operating characteristic curve being 0.984 and balanced accuracy being 0.941 and identified 90 TSHR agonist classes that could not be found previously. The classifier together with the ADSAL{ρs, IA} may serve as efficient tools for screening EDCs, and the AD characterization methodology may be applied to other ML models.
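A minimal sketch of the kind of classifier described above (a random forest trained on a precomputed fingerprint matrix and scored with ROC AUC and balanced accuracy); the file and column names are hypothetical, and the ADSAL applicability-domain filtering is not reproduced here:

```python
# Hedged sketch: random forest on precomputed PubChem-style fingerprint bits,
# evaluated with ROC AUC and balanced accuracy; file/column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

fp = pd.read_csv("tshr_fingerprints.csv")        # hypothetical: one 0/1 column per fingerprint bit
y = fp["active"].values                          # 1 = TSHR agonist, 0 = inactive
X = fp.drop(columns=["active"]).values

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

print("ROC AUC:", roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1]))
print("Balanced accuracy:", balanced_accuracy_score(y_va, rf.predict(X_va)))
```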
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most real-life datasets have imbalanced characteristics that hamper the overall performance of classifiers. Existing data balancing techniques process the whole dataset at once, which sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as "black boxes," raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools both globally and locally. Finally, the XAI results are validated with domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
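As a rough illustration of the global and local explanation step (the description does not name the exact XAI tool, so SHAP on a gradient-boosting model is used here as a stand-in; the synthetic data replaces the thyroid features):

```python
# Hedged sketch: global and local SHAP explanations for a tree-based classifier;
# synthetic data stands in for the (already balanced) thyroid dataset.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)       # one value per sample per feature (binary case)

shap.summary_plot(shap_values, X)            # global view: which features matter overall
shap.force_plot(explainer.expected_value, shap_values[0], X[0], matplotlib=True)  # local view, one sample
```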
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is related to red variants of the Portuguese "Vinho Verde" wine. It describes the amount of various chemicals present in the wine and their effect on its quality. The dataset can be viewed as a classification or regression task. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones). Your task is to predict the quality of wine using the given data. A simple yet challenging project: anticipate the quality of wine. The complexity arises from the fact that the dataset has few samples and is highly imbalanced. Can you overcome these obstacles and build a good predictive model to classify them?
This data frame contains the following columns:
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Acknowledgements: This dataset is also available from Kaggle and the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality.
Objective: Understand the dataset and clean it up (if required). Build classification models to predict the wine quality. Also fine-tune the hyperparameters and compare the evaluation metrics of the various classification algorithms.
This dataset was originally published on Kaggle at https://www.kaggle.com/datasets/yasserh/wine-quality-dataset
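A minimal starting-point sketch for the stated objective; the file name, separator, and the quality >= 7 cut-off for a "good wine" class are assumptions for illustration, not part of the dataset description:

```python
# Hedged sketch: baseline wine-quality classifier with class weighting;
# the CSV path, the ';' separator, and the >=7 binarisation are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("winequality-red.csv", sep=";")   # hypothetical local copy of the UCI file
y = (df["quality"] >= 7).astype(int)               # 1 = good wine, 0 = otherwise
X = df.drop(columns=["quality"])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```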
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets supporting the results reported in the paper: A Study on Classification in Imbalanced and Partially-Labelled Data Streams, R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, Systems, Man, and Cybernetics (SMC), 2013. DOI: 10.1109/SMC.2013.260. Contained in this distribution are results of stream and static classifier performance on four different data sets. These include:
MAGIC Gamma Telescope Data Set: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
MiniBooNE particle identification Data Set: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
Skin Segmentation Data Set: https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
The fourth data set is not publicly available at present. However, we are in the process of releasing it for public use. Please get in touch if you'd like to use it.
Network traffic data is huge, varying, and imbalanced because the various classes are not equally distributed. Machine learning (ML) algorithms for traffic analysis use samples from this data for training and to recommend actions to be taken by network administrators. Due to the imbalance in the dataset, it is difficult to train machine learning algorithms for traffic analysis, and they may give biased or false results, leading to serious degradation in the performance of these algorithms.
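One simple way to account for such unequal class frequencies during training is to weight classes inversely to their frequency; a hedged sketch with made-up traffic class labels:

```python
# Hedged sketch: deriving per-class weights from an unequal label distribution;
# the traffic class labels and counts below are made up for illustration.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(["web"] * 9000 + ["video"] * 800 + ["voip"] * 150 + ["attack"] * 50)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))   # rarer classes receive proportionally larger weights
```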
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The original dataset is the Ot & Sien Dataset (https://lab.kb.nl/dataset/ot-sien-dataset). We corrected mistakes and made it ML-ready.
The purpose of this dataset is to help the development of automatic visual object detection in children's book illustrations. The properties of our dataset are summarized as:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of Deep Feature Extraction Models after Applying Dual-GAN.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of the classification models before applying dual-GAN.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brain tumors are one of the leading diseases, imposing a huge morbidity rate across the world every year. Classifying brain tumors accurately plays a crucial role in clinical diagnosis and improves the overall healthcare process. ML techniques have shown promise in accurately classifying brain tumors based on medical imaging data such as MRI scans. These techniques aid in early detection and treatment planning, improving patient outcomes. However, medical image datasets are frequently affected by a significant class imbalance, especially when benign tumors outnumber malignant tumors. This study presents an explainable ensemble-based pipeline for brain tumor classification that integrates a Dual-GAN mechanism with feature extraction techniques, specifically designed for highly imbalanced data. The Dual-GAN mechanism facilitates the generation of synthetic minority-class samples, addressing the class imbalance issue without compromising the quality of the original data. Additionally, the integration of different feature extraction methods facilitates capturing precise and informative features. This study proposes a novel deep ensemble feature extraction (DeepEFE) framework that surpasses other benchmark ML and deep learning models with an accuracy of 98.15%. This study focuses on achieving high classification accuracy while prioritizing stable performance. By incorporating Grad-CAM, it enhances the transparency and interpretability of the overall classification process. This research identifies the most relevant parts of the input images contributing to accurate outcomes, enhancing the reliability of the proposed pipeline. The significantly improved precision, sensitivity, and F1 score demonstrate the effectiveness of the proposed mechanism in handling class imbalance and improving the overall accuracy. Furthermore, the integration of explainability enhances the transparency of the classification process, establishing a reliable model for brain tumor classification, encouraging its adoption in clinical practice, and promoting trust in decision-making processes.
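As a rough illustration of the Grad-CAM step mentioned above (written against a stock torchvision ResNet rather than the paper's DeepEFE ensemble, with a random tensor standing in for an MRI slice):

```python
# Hedged sketch: Grad-CAM via forward/backward hooks on a stand-in ResNet;
# the backbone, layer choice, and input tensor are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None)          # stand-in backbone, not the paper's model
model.eval()
target_layer = model.layer4[-1]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0].detach()))

x = torch.randn(1, 3, 224, 224)                # placeholder for a preprocessed MRI slice
scores = model(x)
cls = int(scores.argmax(dim=1))
scores[0, cls].backward()                      # gradients of the predicted-class score

weights = grads["v"].mean(dim=(2, 3), keepdim=True)           # channel weights: pooled gradients
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # heatmap normalised to [0, 1]
```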
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
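A hedged sketch of the general recipe described: the augmentation is applied inside the cross-validation training folds only (via an imbalanced-learn pipeline), with a linear-kernel SVM and MCC scoring; SMOTE stands in for the tree-based augmentation, and the synthetic data replaces the 16S rRNA features.

```python
# Hedged sketch: oversampling inside the CV folds with an imbalanced-learn
# pipeline, nested cross-validation, linear SVM, and MCC scoring; the data
# are synthetic placeholders for the saliva microbiome features.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1245, n_features=50, weights=[0.86, 0.14], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),     # resampling is fit only on training folds
                 ("svc", SVC(kernel="linear"))])

mcc = make_scorer(matthews_corrcoef)
inner = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, scoring=mcc,
                     cv=StratifiedKFold(5, shuffle=True, random_state=0))
outer = cross_val_score(inner, X, y, scoring=mcc,
                        cv=StratifiedKFold(5, shuffle=True, random_state=1))
print("Nested-CV MCC:", outer.mean())
```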
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The performance of ML classifiers utilizing K-means+SMOTE+ENN on the Hungarian heart disease dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance measure of our scheme using K-means+SMOTE+KNN.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of instances before and after data balancing.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detection of AMR: predictive performance on test dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features ranking according to XAI tools and domain experts.