11 datasets found

Data from: Less is More: An Empirical Study of Undersampling Techniques for...
figshare.com
zip
Updated May 20, 2024
Cite
Gichan Lee (2024). Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.22708036.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.22708036.v1
Dataset updated
May 20, 2024
Dataset provided by
figshare
Authors
Gichan Lee
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Technical Debt (TD) prediction is crucial to preventing software quality degradation and increased maintenance costs. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but imbalanced TD datasets can have a negative impact on ML model performance. Although previous TD studies have investigated various oversampling techniques, which generate minority-class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored because of concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction by utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with oversampling techniques widely used in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques leads to a further performance improvement compared to applying each technique exclusively. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.

This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.

File list:
X.csv, Y.csv:
- These are the datasets for the study, used in the ipynb file below.
under_over_sampling_scripts.ipynb:
- These scripts can obtain all the experimental results from the study.
- They can be run through Jupyter Notebook or Google Colab.
- The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
Results_for_all_tables.csv:
- This is a CSV file that summarizes all the results obtained from the study.
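The difference between undersampling and oversampling mentioned in the description above can be illustrated with a minimal sketch. This is not the code from the replication package: it uses plain random resampling (the simplest variant of each idea), standard-library Python only, and a made-up toy dataset. The function names `random_undersample` and `random_oversample` are hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group row indices by their class label."""
    groups = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(ix) for ix in groups.values())
    keep = []
    for ix in groups.values():
        keep.extend(rng.sample(ix, target))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate rows until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(ix) for ix in groups.values())
    keep = []
    for ix in groups.values():
        keep.extend(ix)  # keep every original row
        keep.extend(rng.choices(ix, k=target - len(ix)))  # duplicates, with replacement
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced data (an assumption for illustration): 90 majority rows, 10 minority rows.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_u, y_u = random_undersample(X, y)
X_o, y_o = random_oversample(X, y)
print("undersampled:", Counter(y_u))
print("oversampled: ", Counter(y_o))
```

Undersampling discards majority-class rows, which is the information-loss concern the description raises, while oversampling keeps every row and adds duplicates of the minority class.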
Performance measure after applying NearMiss.
figshare.com
plos.figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Performance measure after applying NearMiss. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t010
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t010
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most of the real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at a time that sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results are validated with the domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
The comparison of different ML algorithms on EN dataset in group AB.s.
figshare.com
plos.figshare.com
bin
Updated Aug 3, 2023
Cite
Hakimeh Khojasteh; Jamshid Pirgazi; Ali Ghanbari Sorkhi (2023). The comparison of different ML algorithms on EN dataset in group AB.s. [Dataset]. http://doi.org/10.1371/journal.pone.0288173.t012
Unique identifier
https://doi.org/10.1371/journal.pone.0288173.t012
Dataset updated
Aug 3, 2023
Dataset provided by
PLOS ONE
Authors
Hakimeh Khojasteh; Jamshid Pirgazi; Ali Ghanbari Sorkhi
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The comparison of different ML algorithms on EN dataset in group AB.s.
Performance measure of our scheme using K-means+SMOTE+ENN.
figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Performance measure of our scheme using K-means+SMOTE+ENN. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t005
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t005
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance measure of our scheme using K-means+SMOTE+ENN.
Rank and frequency of the domain expert’s opinion.
figshare.com
plos.figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Rank and frequency of the domain expert’s opinion. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t013
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t013
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Rank and frequency of the domain expert’s opinion.
Confusion matrix.
plos.figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t002
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t002
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Thyroid disease classification plays a crucial role in early diagnosis and effective treatment of thyroid disorders. Machine learning (ML) techniques have demonstrated remarkable potential in this domain, offering accurate and efficient diagnostic tools. Most of the real-life datasets have imbalanced characteristics that hamper the overall performance of the classifiers. Existing data balancing techniques process the whole dataset at a time that sometimes causes overfitting and underfitting. However, the complexity of some ML models, often referred to as “black boxes,” raises concerns about their interpretability and clinical applicability. This paper presents a comprehensive study focused on the analysis and interpretability of various ML models for classifying thyroid diseases. In our work, we first applied a new data-balancing mechanism using a clustering technique and then analyzed the performance of different ML algorithms. To address the interpretability challenge, we explored techniques for model explanation and feature importance analysis using eXplainable Artificial Intelligence (XAI) tools globally as well as locally. Finally, the XAI results are validated with the domain experts. Experimental results have shown that our proposed mechanism is efficient in diagnosing thyroid disease and can explain the models effectively. The findings can contribute to bridging the gap between adopting advanced ML techniques and the clinical requirements of transparency and accountability in diagnostic decision-making.
Categories with its number of samples.
plos.figshare.com
xls
Updated Dec 18, 2024
Cite
Himanshi Babbar; Shalli Rani; Maha Driss (2024). Categories with its number of samples. [Dataset]. http://doi.org/10.1371/journal.pone.0314695.t002
Unique identifier
https://doi.org/10.1371/journal.pone.0314695.t002
Dataset updated
Dec 18, 2024
Dataset provided by
PLOS ONE
Authors
Himanshi Babbar; Shalli Rani; Maha Driss
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vehicular Networks (VN) utilizing Software Defined Networking (SDN) have garnered significant attention recently, paralleling the advancements in wireless networks. VN are deployed to optimize traffic flow, enhance the driving experience, and ensure road safety. However, VN are vulnerable to Distributed Denial of Service (DDoS) attacks, posing severe threats in the contemporary Internet landscape. With the surge in Internet traffic, this study proposes novel methodologies for effectively detecting DDoS attacks within Software-Defined Vehicular Networks (SDVN), wherein attackers commandeer compromised nodes to monopolize network resources, disrupting communication among vehicles and between vehicles and infrastructure. The proposed methodology aims to: (i) analyze statistical flow and compute entropy, and (ii) implement Machine Learning (ML) algorithms within SDN Intrusion Detection Systems for Internet of Things (IoT) environments. Additionally, the approach distinguishes between reconnaissance, Denial of Service (DoS), and DDoS traffic by addressing the challenges of imbalanced and overfitting dataset traces. One of the significant challenges in this integration is managing the computational load and ensuring real-time performance. The ML models, especially complex ones like Random Forest, require substantial processing power, which necessitates efficient data handling and possibly leveraging edge computing resources to reduce latency. Ensuring scalability and maintaining high detection accuracy as network traffic grows and evolves is another critical challenge. By leveraging a minimal subset of features from a given dataset, a comparative study is conducted to determine the optimal sample size for maximizing model accuracy. Further, the study evaluates the impact of various dataset attributes on performance thresholds. The K-nearest Neighbor, Random Forest, and Logistic Regression supervised ML classifiers are assessed using the BoT-IoT dataset. 
The results indicate that the Random Forest classifier achieves superior performance metrics, with Precision, F1-score, Accuracy, and Recall rates of 92%, 92%, 91%, and 90%, respectively, over five iterations.
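The entropy computation mentioned in step (i) of the description above can be sketched as follows. This is only an illustration under stated assumptions: the description does not say which flow attribute the entropy is taken over or what threshold is used, so the choice of destination addresses, the sample windows, and the function name `shannon_entropy` are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical traffic windows: destination addresses seen in consecutive packets.
normal_window = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # evenly spread traffic
attack_window = ["10.0.0.1"] * 7 + ["10.0.0.2"]                   # traffic focused on one target

h_normal = shannon_entropy(normal_window)
h_attack = shannon_entropy(attack_window)
print(f"normal window entropy: {h_normal:.3f} bits")
print(f"attack window entropy: {h_attack:.3f} bits")
```

When traffic concentrates on a single destination, the distribution becomes skewed and the entropy drops, which is why a low-entropy window is commonly treated as a possible sign of a flooding attack.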
DataSheet1_TextNetTopics Pro, a topic model-based text classification for...
frontiersin.figshare.com
xlsx
Updated Oct 5, 2023
Cite
Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef (2023). DataSheet1_TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.xlsx [Dataset]. http://doi.org/10.3389/fgene.2023.1243874.s001
Unique identifier
https://doi.org/10.3389/fgene.2023.1243874.s001
Dataset updated
Oct 5, 2023
Dataset provided by
Frontiers
Authors
Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
Hyperparameters tuning of the classifiers using gridsearchCV.
plos.figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). Hyperparameters tuning of the classifiers using gridsearchCV. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t004
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t004
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Hyperparameters tuning of the classifiers using gridsearchCV.
The summarized responses from the survey in three categories.
plos.figshare.com
xls
Updated May 31, 2024
Cite
Sumya Akter; Hossen A. Mustafa (2024). The summarized responses from the survey in three categories. [Dataset]. http://doi.org/10.1371/journal.pone.0300670.t014
Unique identifier
https://doi.org/10.1371/journal.pone.0300670.t014
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Sumya Akter; Hossen A. Mustafa
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The summarized responses from the survey in three categories.
Performance comparison with traditional ML models.
plos.figshare.com
bin
Updated Jun 16, 2023
Cite
Md. Mahadi Hasan; Saba Binte Murtaz; Muhammad Usama Islam; Muhammad Jafar Sadeq; Jasim Uddin (2023). Performance comparison with traditional ML models. [Dataset]. http://doi.org/10.1371/journal.pone.0274538.t002
Unique identifier
https://doi.org/10.1371/journal.pone.0274538.t002
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Md. Mahadi Hasan; Saba Binte Murtaz; Muhammad Usama Islam; Muhammad Jafar Sadeq; Jasim Uddin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison with traditional ML models.