7 datasets found
  1. Data from: Decision tree algorithms.

    • plos.figshare.com
    xls
    Updated Sep 19, 2025
    Cite
    Mahbub E. Sobhani; Anika Tasnim Rodela; Dewan Md. Farid (2025). Decision tree algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0331307.t001
    Available download formats: xls
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahbub E. Sobhani; Anika Tasnim Rodela; Dewan Md. Farid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced intrusion classification is a complex and challenging task because imbalanced intrusion datasets contain only a small number of instances of the minority intrusion classes. Data sampling methods such as over-sampling and under-sampling are commonly applied to deal with imbalanced intrusion data. Over-sampling generates synthetic minority instances, e.g. SMOTE (Synthetic Minority Over-sampling Technique); under-sampling, on the contrary, removes majority-class instances to create balanced data, e.g. random under-sampling. Both approaches have disadvantages: over-sampling can cause overfitting, and under-sampling discards a large portion of the data. Ensemble learning is another common supervised machine learning technique for handling imbalanced data: Random Forest and Bagging address the overfitting problem, and Boosting (AdaBoost) gives more attention to minority-class instances in its iterations. In this paper, we propose a method for selecting the most informative instances that represent the overall dataset. We apply both over-sampling and under-sampling to balance the data using the majority and minority informative instances. We use Random Forest, Bagging, and Boosting (AdaBoost) algorithms and compare their performances, with decision tree (C4.5) as the base classifier of the Random Forest and AdaBoost classifiers and naïve Bayes as the base classifier of the Bagging model. The proposed method, Adaptive TreeHive, addresses both class imbalance and high dimensionality, reducing the required computational power and execution time. We evaluated the proposed Adaptive TreeHive method on five large-scale public benchmark datasets. 
    Compared with data balancing methods such as under-sampling and over-sampling, the experimental results show superior performance of Adaptive TreeHive, with accuracy rates of 99.96%, 85.65%, 99.83%, 99.77%, and 95.54% on the NSL-KDD, UNSW-NB15, CIC-IDS2017, CSE-CIC-IDS2018, and CICDDoS2019 datasets, respectively, establishing Adaptive TreeHive as a superior performer compared with traditional ensemble classifiers.
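    The over-sampling idea this abstract contrasts with under-sampling — generating synthetic minority instances by interpolation, as SMOTE does — can be sketched in a few lines. This is a toy illustration on made-up 2-D tuples, not the authors' Adaptive TreeHive code; `smote_sketch` and its parameters are hypothetical, and real pipelines would use a library implementation such as imbalanced-learn's `SMOTE`.

    ```python
    import random

    def smote_sketch(minority, n_new, k=3, seed=0):
        # Generate n_new synthetic minority points: pick a random minority
        # point, find its k nearest minority neighbours (squared Euclidean
        # distance), and interpolate toward one of them by a factor in [0, 1).
        rng = random.Random(seed)
        synthetic = []
        for _ in range(n_new):
            x = rng.choice(minority)
            neighbours = sorted(
                (p for p in minority if p != x),
                key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
            )[:k]
            nn = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
        return synthetic

    # Toy minority class: every synthetic point lands between two real ones
    minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
    new_points = smote_sketch(minority, n_new=4)
    ```

    Because each synthetic point is a convex combination of two minority points, it stays inside the minority region — which is also why, as the abstract notes, over-sampling can encourage overfitting to that region.
    
    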

  2. Table 1_Impact of a multiple oversampling technique-based assessment...

    • frontiersin.figshare.com
    docx
    Updated Jan 20, 2025
    Cite
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang (2025). Table 1_Impact of a multiple oversampling technique-based assessment framework on shallow rockburst prediction models.docx [Dataset]. http://doi.org/10.3389/feart.2024.1514591.s001
    Available download formats: docx
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Frontiers
    Authors
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Class-imbalanced datasets occur frequently in natural science research, making it important to harness them effectively when constructing highly accurate rockburst prediction models. Genuine rockburst incidents within a burial depth of 500 m were first collected from the literature, revealing a small, imbalanced dataset. Using several mainstream oversampling techniques, the dataset was expanded into six new datasets, which were then subjected to 12 classifiers across 84 classification runs. Combining the highest-scoring model on the original dataset with the top two models on the expanded datasets yielded a high-performance model. The findings indicate that the KMeansSMOTE oversampling technique produces the most substantial improvement across the 12 classifiers combined, whereas individual classifiers favor ET+SVMSMOTE and RF+SMOTENC. After multiple rounds of hyperparameter tuning via random cross-validation, the ET+SVMSMOTE combination attained the highest accuracy, 93.75%, surpassing mainstream rockburst prediction models. Moreover, the SVMSMOTE technique, which augments samples from minority categories, showed notable benefits in mitigating overfitting, enhancing generalization, and improving recall and F1 score within RF classifiers. The approach was validated for its high generalization performance, accuracy, and reliability, and the process also provides an efficient framework for model development.
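    The dataset-expansion step hinges on oversampling. As a minimal baseline sketch — not the KMeansSMOTE, SVMSMOTE, or SMOTENC variants the abstract compares — random oversampling simply duplicates randomly chosen rows of the smaller classes; `random_oversample` and the toy rockburst labels below are illustrative.

    ```python
    import random
    from collections import Counter

    def random_oversample(X, y, seed=0):
        # Duplicate randomly chosen rows of each smaller class until
        # every class matches the size of the largest class.
        rng = random.Random(seed)
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, n in counts.items():
            idx = [i for i, lab in enumerate(y) if lab == label]
            for _ in range(target - n):
                i = rng.choice(idx)
                X_out.append(X[i])
                y_out.append(label)
        return X_out, y_out

    # Toy imbalanced data: 4 "no burst" rows vs 2 "burst" rows
    X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
    y = ["no burst"] * 4 + ["burst"] * 2
    X_bal, y_bal = random_oversample(X, y)
    ```

    SMOTE-family methods replace the duplication step with interpolation between neighbours, which is what gives them the overfitting-mitigation benefits the abstract reports.
    
    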

  3. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Available download formats: text/x-tex
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. In this type of research, correctly predicting minority labels is usually more critical than correctly predicting majority labels, yet traditional machine learning techniques easily lead to learning bias: traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies typically approach the problem from one of two perspectives: data-based or model-based. Oversampling and undersampling are examples of data-based approaches, while adding costs, penalties, or weights to the optimization is typical of model-based approaches. Some ensemble methods have also been studied recently. These methods suffer from various problems, such as overfitting, the omission of some information, and long computation times, and they do not apply to all kinds of datasets. To address these problems, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance-handling methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better as the degree of data imbalance increases and gradually outperforms the other methods.
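    The abstract pairs virtual majority labels with equal K-means clustering. A toy sketch of the equal-size-partition idea — `equal_size_partition` is hypothetical and much simpler than the paper's procedure — assigns each majority point to the nearest of k seed centroids that still has spare capacity, yielding k virtual subclasses of near-equal size, each of which then faces the minority class at a far less imbalanced ratio.

    ```python
    import random

    def equal_size_partition(points, k, seed=0):
        # Split one class into k virtual subclasses of (near-)equal size:
        # pick k random seed centroids, then greedily assign each point to
        # the nearest centroid that still has spare capacity.
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        capacity = -(-len(points) // k)  # ceiling division
        labels, sizes = [None] * len(points), [0] * k
        for i, p in enumerate(points):
            by_distance = sorted(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for c in by_distance:
                if sizes[c] < capacity:
                    labels[i] = c
                    sizes[c] += 1
                    break
        return labels

    # Toy majority class of 12 points split into 3 virtual subclasses of 4
    majority = [(float(i), 0.0) for i in range(12)]
    virtual = equal_size_partition(majority, k=3)
    ```
    
    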

  4. Table 2_Impact of a multiple oversampling technique-based assessment...

    • frontiersin.figshare.com
    docx
    Updated Jan 20, 2025
    Cite
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang (2025). Table 2_Impact of a multiple oversampling technique-based assessment framework on shallow rockburst prediction models.docx [Dataset]. http://doi.org/10.3389/feart.2024.1514591.s002
    Available download formats: docx
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Frontiers
    Authors
    Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Class-imbalanced datasets occur frequently in natural science research, making it important to harness them effectively when constructing highly accurate rockburst prediction models. Genuine rockburst incidents within a burial depth of 500 m were first collected from the literature, revealing a small, imbalanced dataset. Using several mainstream oversampling techniques, the dataset was expanded into six new datasets, which were then subjected to 12 classifiers across 84 classification runs. Combining the highest-scoring model on the original dataset with the top two models on the expanded datasets yielded a high-performance model. The findings indicate that the KMeansSMOTE oversampling technique produces the most substantial improvement across the 12 classifiers combined, whereas individual classifiers favor ET+SVMSMOTE and RF+SMOTENC. After multiple rounds of hyperparameter tuning via random cross-validation, the ET+SVMSMOTE combination attained the highest accuracy, 93.75%, surpassing mainstream rockburst prediction models. Moreover, the SVMSMOTE technique, which augments samples from minority categories, showed notable benefits in mitigating overfitting, enhancing generalization, and improving recall and F1 score within RF classifiers. The approach was validated for its high generalization performance, accuracy, and reliability, and the process also provides an efficient framework for model development.

  5. Data from: Image-based automated species identification: Can virtual data...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 12, 2021
    Cite
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2021). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
    Available download formats: zip
    Dataset updated
    Jul 12, 2021
    Dataset provided by
    University of Bonn
    Zoological Research Museum Alexander Koenig
    Authors
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated machine learning methods that learn efficient and effective species identification from training samples. However, limited infraspecific sampling also remains a key challenge in machine learning. In this study, we assessed whether a data augmentation approach can help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic augmentation approaches and generates artificial images using a Generative Adversarial Network (GAN). Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) and are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. Finally, data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied to four different image datasets, which include scarab beetle genitalia (Pleophylla, Schizonycha) as well as wing patterns of bees (Osmia) and cattleheart butterflies (Parides), our augmentation approach outperformed, in terms of resulting identification accuracy, both a deep learning baseline trained on non-augmented data and a traditional 2D morphometric approach (Procrustes analysis of scarab beetle genitalia).
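    Of the pipeline steps above, Global Average Pooling is the simplest to illustrate: each channel's H×W activation map collapses to its mean, so a C×H×W bottleneck tensor becomes a length-C vector. The sketch below is a plain-Python illustration with made-up toy activations, not the study's VGG-16 code.

    ```python
    def global_average_pool(feature_maps):
        # Collapse each H x W feature map to its mean, turning a
        # C x H x W bottleneck tensor into a length-C feature vector.
        return [
            sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps
        ]

    # Toy "bottleneck": 2 channels of 2 x 2 activations
    maps = [[[1.0, 3.0], [5.0, 7.0]],
            [[0.0, 0.0], [0.0, 4.0]]]
    vec = global_average_pool(maps)  # one mean value per channel
    ```

    Reducing each map to a single statistic is what shrinks the feature dimensionality before the PCA step, helping prevent the overfitting the abstract mentions.
    
    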

  6. Hyper-parameters used in different classifiers.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    + more versions
    Cite
    Sheikh Shah Mohammad Motiur Rahman; Zhihao Chen; Alain Lalande; Thomas Decourselle; Alexandre Cochet; Thibaut Pommier; Yves Cottin; Michel Salomon; Raphaël Couturier (2023). Hyper-parameters used in different classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0285165.t002
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sheikh Shah Mohammad Motiur Rahman; Zhihao Chen; Alain Lalande; Thomas Decourselle; Alexandre Cochet; Thibaut Pommier; Yves Cottin; Michel Salomon; Raphaël Couturier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: In acute cardiovascular disease management, the delay between admission to a hospital emergency department and the assessment of the disease from a Delayed Enhancement cardiac MRI (DE-MRI) scan is one of the barriers to immediate management of patients with suspected myocardial infarction or myocarditis.
    Objectives: This work targets patients who arrive at the hospital with chest pain and are suspected of having a myocardial infarction or a myocarditis. The main objective is to classify these patients based solely on clinical data in order to provide an early accurate diagnosis.
    Methods: Machine learning (ML) and ensemble approaches were used to construct a framework to automatically classify the patients according to their clinical conditions. 10-fold cross-validation is used during the model's training to avoid overfitting. Approaches such as Stratified, Over-sampling, Under-sampling, NearMiss, and SMOTE were tested to address the imbalance of the data (i.e. the proportion of cases per pathology). The ground truth is provided by a DE-MRI exam (normal exam, myocarditis, or myocardial infarction).
    Results: The stacked generalization technique with over-sampling performed best, providing more than 97% accuracy, corresponding to 11 misclassifications among 537 cases. Generally speaking, ensemble classifiers such as Stacking provided the best predictions. The five most important features are troponin, age, tobacco, sex, and FEVG (left ventricular ejection fraction) calculated from echocardiography.
    Conclusion: Our study provides a reliable approach for classifying patients in the emergency department among myocarditis, myocardial infarction, and other conditions from clinical information only, considering DE-MRI as ground truth. Among the different machine learning and ensemble techniques tested, the stacked generalization technique performed best, with an accuracy of 97.4%. 
    This automatic classification could provide a quick answer before an imaging exam such as cardiovascular MRI, depending on the patient's condition.
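    The stratified handling mentioned in the Methods can be illustrated with a minimal round-robin fold assignment: each class is dealt across the k folds in turn, so every fold preserves the overall class proportions. The function name and toy diagnosis labels are hypothetical; production work would use an off-the-shelf implementation such as scikit-learn's `StratifiedKFold`.

    ```python
    from collections import defaultdict

    def stratified_folds(y, k):
        # Assign samples to k folds so that each class is spread
        # round-robin across the folds (the "stratified" idea behind
        # stratified k-fold cross-validation).
        fold_of = [None] * len(y)
        seen = defaultdict(int)
        for i, label in enumerate(y):
            fold_of[i] = seen[label] % k
            seen[label] += 1
        return fold_of

    # Toy imbalanced labels: 6 infarction, 3 myocarditis, 3 normal
    y = ["infarction"] * 6 + ["myocarditis"] * 3 + ["normal"] * 3
    folds = stratified_folds(y, k=3)
    ```

    With k = 3, each fold ends up with 2 infarction, 1 myocarditis, and 1 normal case, matching the 2:1:1 class ratio of the full toy set.
    
    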

  7. f

    Results on WebVision using pre-trained ResNet-50.

    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Qian Zhang; Yi Zhu; Ming Yang; Ge Jin; Yingwen Zhu; Yanjun Lu; Yu Zou; Qiu Chen (2024). Results on WebVision using pre-trained ResNet-50. [Dataset]. http://doi.org/10.1371/journal.pone.0309841.t007
    Available download formats: xls
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Qian Zhang; Yi Zhu; Ming Yang; Ge Jin; Yingwen Zhu; Yanjun Lu; Yu Zou; Qiu Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Deep neural networks have powerful memory capabilities, yet they frequently overfit to noisy labels, degrading classification and generalization performance. To address this issue, sample selection methods that filter out potentially clean labels have been proposed. However, there is a significant size gap between the filtered, possibly clean subset and the unlabeled subset, which becomes particularly pronounced at high noise rates. Consequently, sample selection methods underutilize label-free samples, leaving room for performance improvement. This study introduces an enhanced sample selection framework with an oversampling strategy (SOS) to overcome this limitation. The framework leverages the valuable information contained in label-free instances by combining the oversampling strategy with state-of-the-art sample selection methods. We validate the effectiveness of SOS through extensive experiments on both synthetic noisy datasets and real-world datasets such as CIFAR, WebVision, and Clothing1M. The source code for SOS will be made available at https://github.com/LanXiaoPang613/SOS.
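    The size-gap problem the abstract describes — a small filtered clean subset dwarfed by the unlabeled subset — can be narrowed by oversampling the clean side. A deterministic toy sketch (an illustrative stand-in, not the authors' SOS implementation) just repeats the clean indices cyclically until both subsets contribute comparably per epoch:

    ```python
    from itertools import cycle, islice

    def oversample_subset(indices, target_size):
        # Cyclically repeat the small filtered "clean" subset until it
        # matches the size of the larger unlabeled subset, so both sides
        # are drawn from at comparable rates during training.
        return list(islice(cycle(indices), target_size))

    clean = [3, 7, 11]  # toy indices judged to carry clean labels
    balanced = oversample_subset(clean, target_size=8)
    ```

    A real pipeline would typically sample with shuffling and augmentation rather than plain repetition, but the balancing effect is the same.
    
    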
