11 datasets found
  1. Data from: Less is More: An Empirical Study of Undersampling Techniques for...

    • figshare.com
    zip
    Updated May 20, 2024
    Cite
    Gichan Lee (2024). Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.22708036.v1
    Available download formats: zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    figshare
    Authors
    Gichan Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical Debt (TD) prediction is crucial to preventing software quality degradation and rising maintenance costs. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but imbalanced TD datasets can degrade ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority-class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored because of concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction using 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with oversampling techniques widely used in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) combining undersampling and oversampling techniques yields a further performance improvement compared to applying either technique alone. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.

    This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.

    File list:
    - X.csv, Y.csv: the datasets for the study, used in the ipynb file below.
    - under_over_sampling_scripts.ipynb: scripts that can reproduce all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file and must be installed via pip or conda before running.
    - Results_for_all_tables.csv: a CSV file that summarizes all the results obtained from the study.
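The description above does not say which undersampling techniques the study applies. As a minimal illustration of the general idea only, the sketch below performs plain random undersampling with the Python standard library. The function name `random_undersample` and the toy data are assumptions made for this example; they are not taken from the replication package.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so that every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # The smallest class sets the number of rows kept per class.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


if __name__ == "__main__":
    # Hypothetical toy data: 8 majority rows and 2 minority rows.
    toy_rows = [[float(i)] for i in range(10)]
    toy_labels = [0] * 8 + [1] * 2
    new_rows, new_labels = random_undersample(toy_rows, toy_labels)
    print(Counter(new_labels))
```

Rows are discarded only from the larger classes, which is the source of the information-loss concern mentioned in the description.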

  2. Performance measure after applying NearMiss.

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 31, 2024
    + more versions
    Cite
    Sumya Akter; Hossen A. Mustafa (2024). Performance measure after applying NearMiss. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t010
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most of the real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at a time that sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results are validated with the domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
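This entry names NearMiss but does not say which variant the authors applied. As a rough sketch, assuming the NearMiss-1 variant, the code below keeps the majority-class points whose mean distance to their k nearest minority-class points is smallest. The function name `near_miss_1` and the sample points are hypothetical.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""

    def mean_knn_distance(point):
        # Distances from this majority point to every minority point, nearest first.
        distances = sorted(math.dist(point, other) for other in minority)
        nearest = distances[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_knn_distance)
    return ranked[:n_keep]


if __name__ == "__main__":
    # Hypothetical 2-D points, for illustration only.
    minority = [(0.0, 0.0), (1.0, 0.0)]
    majority = [(0.5, 0.5), (5.0, 5.0), (0.0, 1.0), (9.0, 9.0)]
    print(near_miss_1(majority, minority, n_keep=2, k=2))
```

Unlike random undersampling, this selection is deterministic and favors majority points near the class boundary.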

  3. The comparison of different ML algorithms on EN dataset in group AB.s.

    • figshare.com
    • plos.figshare.com
    bin
    Updated Aug 3, 2023
    Cite
    Hakimeh Khojasteh; Jamshid Pirgazi; Ali Ghanbari Sorkhi (2023). The comparison of different ML algorithms on EN dataset in group AB.s. [Dataset]. http://doi.org/10.1371/journal.pone.0288173.t012
    Available download formats: bin
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hakimeh Khojasteh; Jamshid Pirgazi; Ali Ghanbari Sorkhi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The comparison of different ML algorithms on EN dataset in group AB.s.

  4. Performance measure of our scheme using K-means+SMOTE+ENN.

    • figshare.com
    xls
    Updated May 31, 2024
    + more versions
    Cite
    Sumya Akter; Hossen A. Mustafa (2024). Performance measure of our scheme using K-means+SMOTE+ENN. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t005
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance measure of our scheme using K-means+SMOTE+ENN.

  5. Rank and frequency of the domain expert’s opinion.

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Sumya Akter; Hossen A. Mustafa (2024). Rank and frequency of the domain expert’s opinion. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t013
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Rank and frequency of the domain expert’s opinion.

  6. Confusion matrix.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Sumya Akter; Hossen A. Mustafa (2024). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t002
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most of the real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at a time that sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results are validated with the domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
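The listing does not show the contents of this confusion-matrix table. As a generic illustration for the binary case, the sketch below counts the four cells of a confusion matrix and derives the usual summary metrics from them. The function names and sample labels are assumptions for this example, not values from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summary_metrics(tp, fp, fn, tn):
    """Derive precision, recall, F1, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical labels: 3 positives and 5 negatives.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(summary_metrics(*confusion_counts(y_true, y_pred)))
```

On imbalanced data, accuracy can look high while recall on the minority class stays low, which is why the per-class counts are worth inspecting.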

  7. Categories with its number of samples.

    • plos.figshare.com
    xls
    Updated Dec 18, 2024
    Cite
    Himanshi Babbar; Shalli Rani; Maha Driss (2024). Categories with its number of samples. [Dataset]. http://doi.org/10.1371/journal.pone.0314695.t002
    Available download formats: xls
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Himanshi Babbar; Shalli Rani; Maha Driss
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vehicular Networks (VN) utilizing Software Defined Networking (SDN) have garnered significant attention recently, paralleling the advancements in wireless networks. VN are deployed to optimize traffic flow, enhance the driving experience, and ensure road safety. However, VN are vulnerable to Distributed Denial of Service (DDoS) attacks, posing severe threats in the contemporary Internet landscape. With the surge in Internet traffic, this study proposes novel methodologies for effectively detecting DDoS attacks within Software-Defined Vehicular Networks (SDVN), wherein attackers commandeer compromised nodes to monopolize network resources, disrupting communication among vehicles and between vehicles and infrastructure. The proposed methodology aims to: (i) analyze statistical flow and compute entropy, and (ii) implement Machine Learning (ML) algorithms within SDN Intrusion Detection Systems for Internet of Things (IoT) environments. Additionally, the approach distinguishes between reconnaissance, Denial of Service (DoS), and DDoS traffic by addressing the challenges of imbalanced and overfitting dataset traces. One of the significant challenges in this integration is managing the computational load and ensuring real-time performance. The ML models, especially complex ones like Random Forest, require substantial processing power, which necessitates efficient data handling and possibly leveraging edge computing resources to reduce latency. Ensuring scalability and maintaining high detection accuracy as network traffic grows and evolves is another critical challenge. By leveraging a minimal subset of features from a given dataset, a comparative study is conducted to determine the optimal sample size for maximizing model accuracy. Further, the study evaluates the impact of various dataset attributes on performance thresholds. The K-nearest Neighbor, Random Forest, and Logistic Regression supervised ML classifiers are assessed using the BoT-IoT dataset. 
The results indicate that the Random Forest classifier achieves superior performance metrics, with Precision, F1-score, Accuracy, and Recall rates of 92%, 92%, 91%, and 90%, respectively, over five iterations.
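The abstract mentions computing entropy over flow statistics but does not specify which flow attribute is used. As a minimal sketch, assuming Shannon entropy over the destination addresses seen in one time window, the code below shows the calculation. The function name `shannon_entropy` and the sample addresses are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical destination addresses observed in two windows.
    spread_out = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    concentrated = ["10.0.0.1"] * 4
    print(shannon_entropy(spread_out), shannon_entropy(concentrated))
```

Under this assumption, a sharp drop in entropy means traffic is concentrating on few destinations, which is one signal a detector might combine with the ML classifiers.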

  8. DataSheet1_TextNetTopics Pro, a topic model-based text classification for...

    • frontiersin.figshare.com
    xlsx
    Updated Oct 5, 2023
    + more versions
    Cite
    Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef (2023). DataSheet1_TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.xlsx [Dataset]. http://doi.org/10.3389/fgene.2023.1243874.s001
    Available download formats: xlsx
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.

  9. Hyperparameters tuning of the classifiers using gridsearchCV.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Sumya Akter; Hossen A. Mustafa (2024). Hyperparameters tuning of the classifiers using gridsearchCV. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t004
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameters tuning of the classifiers using gridsearchCV.

  10. The summarized responses from the survey in three categories.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    + more versions
    Cite
    Sumya Akter; Hossen A. Mustafa (2024). The summarized responses from the survey in three categories. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t014
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sumya Akter; Hossen A. Mustafa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The summarized responses from the survey in three categories.

  11. Performance comparison with traditional ML models.

    • plos.figshare.com
    bin
    Updated Jun 16, 2023
    Cite
    Md. Mahadi Hasan; Saba Binte Murtaz; Muhammad Usama Islam; Muhammad Jafar Sadeq; Jasim Uddin (2023). Performance comparison with traditional ML models. [Dataset]. http://doi.org/10.1371/journal.pone.0274538.t002
    Available download formats: bin
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Md. Mahadi Hasan; Saba Binte Murtaz; Muhammad Usama Islam; Muhammad Jafar Sadeq; Jasim Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison with traditional ML models.


