7 datasets found
  1. Data from: Decision tree algorithms.

    • plos.figshare.com
    xls
    Updated Sep 19, 2025
    Cite
    Mahbub E. Sobhani; Anika Tasnim Rodela; Dewan Md. Farid (2025). Decision tree algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0331307.t001
    Available download formats: xls
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahbub E. Sobhani; Anika Tasnim Rodela; Dewan Md. Farid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced intrusion classification is a complex and challenging task because imbalanced intrusion datasets contain only a small number of instances of the minority intrusion classes. Data sampling methods such as over-sampling and under-sampling are commonly applied to deal with imbalanced intrusion data. Over-sampling generates synthetic minority instances, e.g. SMOTE (Synthetic Minority Over-sampling Technique); under-sampling, on the contrary, removes majority-class instances to create balanced data, e.g. random under-sampling. Both approaches have disadvantages: over-sampling can cause overfitting, and under-sampling discards a large portion of the data. Ensemble learning is another common supervised machine learning technique for handling imbalanced data: Random Forest and Bagging address the overfitting problem, and Boosting (AdaBoost) gives more attention to minority-class instances in its iterations. In this paper, we propose a method for selecting the most informative instances that represent the overall dataset. We apply both over-sampling and under-sampling to balance the data using the majority and minority informative instances. We use Random Forest, Bagging, and Boosting (AdaBoost) algorithms and compare their performances, with decision tree (C4.5) as the base classifier of the Random Forest and AdaBoost classifiers and naïve Bayes as the base classifier of the Bagging model. The proposed method, Adaptive TreeHive, addresses both class imbalance and high dimensionality, reducing the required computational power and execution time. We evaluated the proposed Adaptive TreeHive method on five large-scale public benchmark datasets. 
    Compared with data balancing methods such as under-sampling and over-sampling, the experimental results show superior performance of Adaptive TreeHive, with accuracy rates of 99.96%, 85.65%, 99.83%, 99.77%, and 95.54% on the NSL-KDD, UNSW-NB15, CIC-IDS2017, CSE-CIC-IDS2018, and CICDDoS2019 datasets, respectively, establishing Adaptive TreeHive as a superior performer compared with traditional ensemble classifiers.
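    The over-sampling idea this abstract contrasts with under-sampling — generating synthetic minority instances by interpolation, as SMOTE does — can be sketched in a few lines. This is a toy illustration on made-up 2-D tuples, not the authors' Adaptive TreeHive code; `smote_sketch` and its parameters are hypothetical, and real pipelines would use a library implementation such as imbalanced-learn's `SMOTE`.

    ```python
    import random

    def smote_sketch(minority, n_new, k=3, seed=0):
        # Generate n_new synthetic minority points: pick a random minority
        # point, find its k nearest minority neighbours (squared Euclidean
        # distance), and interpolate toward one of them by a factor in [0, 1).
        rng = random.Random(seed)
        synthetic = []
        for _ in range(n_new):
            x = rng.choice(minority)
            neighbours = sorted(
                (p for p in minority if p != x),
                key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
            )[:k]
            nn = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
        return synthetic

    # Toy minority class: every synthetic point lands between two real ones
    minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
    new_points = smote_sketch(minority, n_new=4)
    ```

    Because each synthetic point is a convex combination of two minority points, it stays inside the minority region — which is also why, as the abstract notes, over-sampling can encourage overfitting to that region.
    
    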

  2. Table 1_Impact of a multiple oversampling technique-based assessment...

    • frontiersin.figshare.com
    docx
    Updated Jan 20, 2025
    Cite
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang (2025). Table 1_Impact of a multiple oversampling technique-based assessment framework on shallow rockburst prediction models.docx [Dataset]. http://doi.org/10.3389/feart.2024.1514591.s001
    Available download formats: docx
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Frontiers
    Authors
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Class-imbalanced datasets occur frequently in natural science research, making it important to harness them effectively when constructing highly accurate rockburst prediction models. Genuine rockburst incidents within a burial depth of 500 m were first collected from the literature, revealing a small, imbalanced dataset. Using several mainstream oversampling techniques, the dataset was expanded into six new datasets, which were then subjected to 12 classifiers across 84 classification runs. Combining the highest-scoring model on the original dataset with the top two models on the expanded datasets yielded a high-performance model. The findings indicate that the KMeansSMOTE oversampling technique produces the most substantial improvement across the 12 classifiers combined, whereas individual classifiers favor ET+SVMSMOTE and RF+SMOTENC. After multiple rounds of hyperparameter tuning via random cross-validation, the ET+SVMSMOTE combination attained the highest accuracy, 93.75%, surpassing mainstream rockburst prediction models. Moreover, the SVMSMOTE technique, which augments samples from minority categories, showed notable benefits in mitigating overfitting, enhancing generalization, and improving recall and F1 score within RF classifiers. The approach was validated for its high generalization performance, accuracy, and reliability, and the process also provides an efficient framework for model development.
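    The dataset-expansion step hinges on oversampling. As a minimal baseline sketch — not the KMeansSMOTE, SVMSMOTE, or SMOTENC variants the abstract compares — random oversampling simply duplicates randomly chosen rows of the smaller classes; `random_oversample` and the toy rockburst labels below are illustrative.

    ```python
    import random
    from collections import Counter

    def random_oversample(X, y, seed=0):
        # Duplicate randomly chosen rows of each smaller class until
        # every class matches the size of the largest class.
        rng = random.Random(seed)
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, n in counts.items():
            idx = [i for i, lab in enumerate(y) if lab == label]
            for _ in range(target - n):
                i = rng.choice(idx)
                X_out.append(X[i])
                y_out.append(label)
        return X_out, y_out

    # Toy imbalanced data: 4 "no burst" rows vs 2 "burst" rows
    X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
    y = ["no burst"] * 4 + ["burst"] * 2
    X_bal, y_bal = random_oversample(X, y)
    ```

    SMOTE-family methods replace the duplication step with interpolation between neighbours, which is what gives them the overfitting-mitigation benefits the abstract reports.
    
    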

  3. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Available download formats: text/x-tex
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. In this type of research, correctly predicting minority labels is usually more critical than correctly predicting majority labels, yet traditional machine learning techniques easily lead to learning bias: traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies typically approach the problem from one of two perspectives: data-based or model-based. Oversampling and undersampling are examples of data-based approaches, while adding costs, penalties, or weights to the optimization is typical of model-based approaches. Some ensemble methods have also been studied recently. These methods suffer from various problems, such as overfitting, the omission of some information, and long computation times, and they do not apply to all kinds of datasets. To address these problems, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance-handling methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better as the degree of data imbalance increases and gradually outperforms the other methods.
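    The abstract pairs virtual majority labels with equal K-means clustering. A toy sketch of the equal-size-partition idea — `equal_size_partition` is hypothetical and much simpler than the paper's procedure — assigns each majority point to the nearest of k seed centroids that still has spare capacity, yielding k virtual subclasses of near-equal size, each of which then faces the minority class at a far less imbalanced ratio.

    ```python
    import random

    def equal_size_partition(points, k, seed=0):
        # Split one class into k virtual subclasses of (near-)equal size:
        # pick k random seed centroids, then greedily assign each point to
        # the nearest centroid that still has spare capacity.
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        capacity = -(-len(points) // k)  # ceiling division
        labels, sizes = [None] * len(points), [0] * k
        for i, p in enumerate(points):
            by_distance = sorted(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for c in by_distance:
                if sizes[c] < capacity:
                    labels[i] = c
                    sizes[c] += 1
                    break
        return labels

    # Toy majority class of 12 points split into 3 virtual subclasses of 4
    majority = [(float(i), 0.0) for i in range(12)]
    virtual = equal_size_partition(majority, k=3)
    ```
    
    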

  4. Table 2_Impact of a multiple oversampling technique-based assessment...

    • frontiersin.figshare.com
    docx
    Updated Jan 20, 2025
    Cite
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang (2025). Table 2_Impact of a multiple oversampling technique-based assessment framework on shallow rockburst prediction models.docx [Dataset]. http://doi.org/10.3389/feart.2024.1514591.s002
    Available download formats: docx
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Frontiers
    Authors
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Class-imbalanced datasets occur frequently in natural science research, making it important to harness them effectively when constructing highly accurate rockburst prediction models. Genuine rockburst incidents within a burial depth of 500 m were first collected from the literature, revealing a small, imbalanced dataset. Using several mainstream oversampling techniques, the dataset was expanded into six new datasets, which were then subjected to 12 classifiers across 84 classification runs. Combining the highest-scoring model on the original dataset with the top two models on the expanded datasets yielded a high-performance model. The findings indicate that the KMeansSMOTE oversampling technique produces the most substantial improvement across the 12 classifiers combined, whereas individual classifiers favor ET+SVMSMOTE and RF+SMOTENC. After multiple rounds of hyperparameter tuning via random cross-validation, the ET+SVMSMOTE combination attained the highest accuracy, 93.75%, surpassing mainstream rockburst prediction models. Moreover, the SVMSMOTE technique, which augments samples from minority categories, showed notable benefits in mitigating overfitting, enhancing generalization, and improving recall and F1 score within RF classifiers. The approach was validated for its high generalization performance, accuracy, and reliability, and the process also provides an efficient framework for model development.

  5. Data from: Image-based automated species identification: Can virtual data...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 12, 2021
    Cite
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2021). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
    Available download formats: zip
    Dataset updated
    Jul 12, 2021
    Dataset provided by
    University of Bonn
    Zoological Research Museum Alexander Koenig
    Authors
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated machine learning methods that learn efficient and effective species identification from training samples. However, limited infraspecific sampling also remains a key challenge in machine learning. In this study, we assessed whether a data augmentation approach can help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic augmentation approaches and generates artificial images using a Generative Adversarial Network (GAN). Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) and are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. Finally, data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied to four different image datasets, which include scarab beetle genitalia (Pleophylla, Schizonycha) as well as wing patterns of bees (Osmia) and cattleheart butterflies (Parides), our augmentation approach outperformed, in terms of resulting identification accuracy, both a deep learning baseline trained on non-augmented data and a traditional 2D morphometric approach (Procrustes analysis of scarab beetle genitalia).
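    Of the pipeline steps above, Global Average Pooling is the simplest to illustrate: each channel's H×W activation map collapses to its mean, so a C×H×W bottleneck tensor becomes a length-C vector. The sketch below is a plain-Python illustration with made-up toy activations, not the study's VGG-16 code.

    ```python
    def global_average_pool(feature_maps):
        # Collapse each H x W feature map to its mean, turning a
        # C x H x W bottleneck tensor into a length-C feature vector.
        return [
            sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps
        ]

    # Toy "bottleneck": 2 channels of 2 x 2 activations
    maps = [[[1.0, 3.0], [5.0, 7.0]],
            [[0.0, 0.0], [0.0, 4.0]]]
    vec = global_average_pool(maps)  # one mean value per channel
    ```

    Reducing each map to a single statistic is what shrinks the feature dimensionality before the PCA step, helping prevent the overfitting the abstract mentions.
    
    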

  6. Hyper-parameters used in different classifiers.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    + more versions
    Cite
    Sheikh Shah Mohammad Motiur Rahman; Zhihao Chen; Alain Lalande; Thomas Decourselle; Alexandre Cochet; Thibaut Pommier; Yves Cottin; Michel Salomon; Raphaël Couturier (2023). Hyper-parameters used in different classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0285165.t002
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sheikh Shah Mohammad Motiur Rahman; Zhihao Chen; Alain Lalande; Thomas Decourselle; Alexandre Cochet; Thibaut Pommier; Yves Cottin; Michel Salomon; Raphaël Couturier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: In acute cardiovascular disease management, the delay between admission to a hospital emergency department and the assessment of the disease from a Delayed Enhancement cardiac MRI (DE-MRI) scan is one of the barriers to immediate management of patients with suspected myocardial infarction or myocarditis.
    Objectives: This work targets patients who arrive at the hospital with chest pain and are suspected of having a myocardial infarction or a myocarditis. The main objective is to classify these patients based solely on clinical data in order to provide an early accurate diagnosis.
    Methods: Machine learning (ML) and ensemble approaches were used to construct a framework to automatically classify the patients according to their clinical conditions. 10-fold cross-validation is used during the model's training to avoid overfitting. Approaches such as Stratified, Over-sampling, Under-sampling, NearMiss, and SMOTE were tested to address the imbalance of the data (i.e. the proportion of cases per pathology). The ground truth is provided by a DE-MRI exam (normal exam, myocarditis, or myocardial infarction).
    Results: The stacked generalization technique with over-sampling performed best, providing more than 97% accuracy, corresponding to 11 misclassifications among 537 cases. Generally speaking, ensemble classifiers such as Stacking provided the best predictions. The five most important features are troponin, age, tobacco, sex, and FEVG (left ventricular ejection fraction) calculated from echocardiography.
    Conclusion: Our study provides a reliable approach for classifying patients in the emergency department among myocarditis, myocardial infarction, and other conditions from clinical information only, considering DE-MRI as ground truth. Among the different machine learning and ensemble techniques tested, the stacked generalization technique performed best, with an accuracy of 97.4%. 
    This automatic classification could provide a quick answer before an imaging exam such as cardiovascular MRI, depending on the patient's condition.
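    The stratified handling mentioned in the Methods can be illustrated with a minimal round-robin fold assignment: each class is dealt across the k folds in turn, so every fold preserves the overall class proportions. The function name and toy diagnosis labels are hypothetical; production work would use an off-the-shelf implementation such as scikit-learn's `StratifiedKFold`.

    ```python
    from collections import defaultdict

    def stratified_folds(y, k):
        # Assign samples to k folds so that each class is spread
        # round-robin across the folds (the "stratified" idea behind
        # stratified k-fold cross-validation).
        fold_of = [None] * len(y)
        seen = defaultdict(int)
        for i, label in enumerate(y):
            fold_of[i] = seen[label] % k
            seen[label] += 1
        return fold_of

    # Toy imbalanced labels: 6 infarction, 3 myocarditis, 3 normal
    y = ["infarction"] * 6 + ["myocarditis"] * 3 + ["normal"] * 3
    folds = stratified_folds(y, k=3)
    ```

    With k = 3, each fold ends up with 2 infarction, 1 myocarditis, and 1 normal case, matching the 2:1:1 class ratio of the full toy set.
    
    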

  7. f

    Results on WebVision using pre-trained ResNet-50.

    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Qian Zhang; Yi Zhu; Ming Yang; Ge Jin; Yingwen Zhu; Yanjun Lu; Yu Zou; Qiu Chen (2024). Results on WebVision using pre-trained ResNet-50. [Dataset]. http://doi.org/10.1371/journal.pone.0309841.t007
    Available download formats: xls
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Qian Zhang; Yi Zhu; Ming Yang; Ge Jin; Yingwen Zhu; Yanjun Lu; Yu Zou; Qiu Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Deep neural networks have powerful memory capabilities, yet they frequently overfit to noisy labels, degrading classification and generalization performance. To address this issue, sample selection methods that filter out potentially clean labels have been proposed. However, there is a significant size gap between the filtered, possibly clean subset and the unlabeled subset, which becomes particularly pronounced at high noise rates. Consequently, sample selection methods underutilize label-free samples, leaving room for performance improvement. This study introduces an enhanced sample selection framework with an oversampling strategy (SOS) to overcome this limitation. The framework leverages the valuable information contained in label-free instances by combining the oversampling strategy with state-of-the-art sample selection methods. We validate the effectiveness of SOS through extensive experiments on both synthetic noisy datasets and real-world datasets such as CIFAR, WebVision, and Clothing1M. The source code for SOS will be made available at https://github.com/LanXiaoPang613/SOS.
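    The size-gap problem the abstract describes — a small filtered clean subset dwarfed by the unlabeled subset — can be narrowed by oversampling the clean side. A deterministic toy sketch (an illustrative stand-in, not the authors' SOS implementation) just repeats the clean indices cyclically until both subsets contribute comparably per epoch:

    ```python
    from itertools import cycle, islice

    def oversample_subset(indices, target_size):
        # Cyclically repeat the small filtered "clean" subset until it
        # matches the size of the larger unlabeled subset, so both sides
        # are drawn from at comparable rates during training.
        return list(islice(cycle(indices), target_size))

    clean = [3, 7, 11]  # toy indices judged to carry clean labels
    balanced = oversample_subset(clean, target_size=8)
    ```

    A real pipeline would typically sample with shuffling and augmentation rather than plain repetition, but the balancing effect is the same.
    
    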
