14 datasets found
  1. Comparison of ML model performances for imbalanced data (imbalance ratio =...

    • plos.figshare.com
    xls
    Updated Mar 22, 2024
    Cite
    Grzegorz Dudek; Sebastian Sakowski; Olga Brzezińska; Joanna Sarnik; Tomasz Budlewski; Grzegorz Dragan; Marta Poplawska; Tomasz Poplawski; Michał Bijak; Joanna Makowska (2024). Comparison of ML model performances for imbalanced data (imbalance ratio = 2). [Dataset]. http://doi.org/10.1371/journal.pone.0300717.t006
    Available download formats
    xls
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Grzegorz Dudek; Sebastian Sakowski; Olga Brzezińska; Joanna Sarnik; Tomasz Budlewski; Grzegorz Dragan; Marta Poplawska; Tomasz Poplawski; Michał Bijak; Joanna Makowska
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of ML model performances for imbalanced data (imbalance ratio = 2).

  2. Names of each attack and category.

    • plos.figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Names of each attack and category. [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t001
    Available download formats
    xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
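
The "top k common features" step in the description above can be pictured as intersecting two rankings. What follows is a minimal sketch, assuming each selection method yields a per-feature score; the function name `top_k_common`, the feature names, and the scores are all hypothetical and are not taken from the study, whose exact procedure the abstract does not spell out.

```python
def top_k_common(scores_a, scores_b, k):
    """Return the features that rank in the top k of both score dictionaries.

    The result keeps the order of the first ranking.
    """
    def top(scores):
        # Sort feature names by descending score and keep the first k.
        return sorted(scores, key=scores.get, reverse=True)[:k]

    top_b = set(top(scores_b))
    return [feature for feature in top(scores_a) if feature in top_b]


# Hypothetical scores from a sequential selector and from mutual information.
sequential_scores = {"duration": 0.9, "src_bytes": 0.8, "flag": 0.4, "count": 0.7}
mutual_info_scores = {"duration": 0.5, "src_bytes": 0.1, "flag": 0.6, "count": 0.55}

common = top_k_common(sequential_scores, mutual_info_scores, k=3)
```

Only features nominated by both methods survive, so the result can hold fewer than k names.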

  3. Data from: Classification Trees for Imbalanced Data: Surface-to-Volume...

    • tandf.figshare.com
    zip
    Updated Feb 14, 2024
    Cite
    Yichen Zhu; Cheng Li; David B. Dunson (2024). Classification Trees for Imbalanced Data: Surface-to-Volume Regularization [Dataset]. http://doi.org/10.6084/m9.figshare.17033038.v1
    Available download formats
    zip
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Yichen Zhu; Cheng Li; David B. Dunson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms that are designed to deal with imbalance through real data applications. Supplementary materials for this article are available online.

  4. DataSheet_1_Construction and validation of a progression prediction model...

    • frontiersin.figshare.com
    txt
    Updated Jan 24, 2024
    Cite
    Jitao Hu; Yuanyuan Sheng; Jinlong Ma; Yujie Tang; Dong Liu; Jianqing Zhang; Xudong Wei; Yang Yang; Yueping Liu; Yongqiang Zhang; Guiying Wang (2024). DataSheet_1_Construction and validation of a progression prediction model for locally advanced rectal cancer patients received neoadjuvant chemoradiotherapy followed by total mesorectal excision based on machine learning.csv [Dataset]. http://doi.org/10.3389/fonc.2023.1231508.s001
    Available download formats
    txt
    Dataset updated
    Jan 24, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Jitao Hu; Yuanyuan Sheng; Jinlong Ma; Yujie Tang; Dong Liu; Jianqing Zhang; Xudong Wei; Yang Yang; Yueping Liu; Yongqiang Zhang; Guiying Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: We attempted to develop a progression prediction model for locally advanced rectal cancer (LARC) patients who received preoperative neoadjuvant chemoradiotherapy (NCRT) and operative treatment to identify high-risk patients in advance. Methods: Data from 272 LARC patients who received NCRT and total mesorectal excision (TME) from 2011 to 2018 at the Fourth Hospital of Hebei Medical University were collected. Data from 161 patients with rectal cancer (each sample with one target variable (progression) and 145 characteristic variables) were included. One Hot Encoding was applied to numerically represent some characteristics. The K-Nearest Neighbor (KNN) filling method was used to determine the missing values, and SmoteTomek comprehensive sampling was used to solve the data imbalance. Eventually, data from 135 patients with 45 characteristic clinical variables were obtained. Random forest, decision tree, support vector machine (SVM), and XGBoost were used to predict whether patients with rectal cancer will exhibit progression. LASSO regression was used to further filter the variables and narrow down the list of variables using a Venn diagram. Eventually, the prediction model was constructed by multivariate logistic regression, and the performance of the model was confirmed in the validation set. Results: Eventually, data from 135 patients including 45 clinical characteristic variables were included in the study. Data were randomly divided in an 8:2 ratio into a data set and a validation set, respectively. Area Under Curve (AUC) values of 0.72 for the decision tree, 0.97 for the random forest, 0.89 for SVM, and 0.94 for XGBoost were obtained from the data set. Similar results were obtained from the validation set. Twenty-three variables were obtained from LASSO regression, and eight variables were obtained by considering the intersection of the variables obtained using the previous four machine learning methods. Furthermore, a multivariate logistic regression model was constructed using the data set; the ROC indicated its good performance. The ROC curve also verified the good predictive performance in the validation set. Conclusions: We constructed a logistic regression model with good predictive performance, which allowed us to accurately predict whether patients who received NCRT and TME will exhibit disease progression.

  5. Evaluation of SMOTE-ENN+SFMI+PCA (in %).

    • plos.figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of SMOTE-ENN+SFMI+PCA (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t010
    Available download formats
    xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.

  6. Detailed overview of feature information.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Sep 4, 2024
    Cite
    Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar (2024). Detailed overview of feature information. [Dataset]. http://doi.org/10.1371/journal.pone.0309383.t001
    Available download formats
    xls
    Dataset updated
    Sep 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Mechanical ventilation (MV) is vital for critically ill ICU patients but carries significant mortality risks. This study aims to develop a predictive model to estimate hospital mortality among MV patients, utilizing comprehensive health data to assist ICU physicians with early-stage alerts. Methods: We developed a Machine Learning (ML) framework to predict hospital mortality in ICU patients receiving MV. Using the MIMIC-III database, we identified 25,202 eligible patients through ICD-9 codes. We employed backward elimination and the Lasso method, selecting 32 features based on clinical insights and literature. Data preprocessing included eliminating columns with over 90% missing data and using mean imputation for the remaining missing values. To address class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE). We evaluated several ML models, including CatBoost, XGBoost, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression, using a 70/30 train-test split. The CatBoost model was chosen for its superior performance in terms of accuracy, precision, recall, F1-score, AUROC metrics, and calibration plots. Results: The study involved a cohort of 25,202 patients on MV. The CatBoost model attained an AUROC of 0.862, an increase from an initial AUROC of 0.821, which was the best reported in the literature. It also demonstrated an accuracy of 0.789, an F1-score of 0.747, and better calibration, outperforming other models. These improvements are due to systematic feature selection and the robust gradient boosting architecture of CatBoost. Conclusion: The preprocessing methodology significantly reduced the number of relevant features, simplifying computational processes, and identified critical features previously overlooked. Integrating these features and tuning the parameters, our model demonstrated strong generalization to unseen data. This highlights the potential of ML as a crucial tool in ICUs, enhancing resource allocation and providing more personalized interventions for MV patients.
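
The preprocessing described above (dropping columns with over 90% missing data, then mean imputation) can be sketched in a few lines. This is an illustrative sketch only, assuming a table held as a dictionary of equal-length, non-empty columns with `None` marking a missing value; the function name `drop_sparse_and_impute` and the column names are hypothetical and do not come from the study.

```python
def drop_sparse_and_impute(columns, max_missing=0.9):
    """Drop columns whose missing fraction exceeds max_missing, then
    replace each remaining None with that column's observed mean."""
    cleaned = {}
    for name, values in columns.items():
        missing = sum(value is None for value in values)
        if missing / len(values) > max_missing:
            continue  # too sparse to impute meaningfully
        observed = [value for value in values if value is not None]
        mean = sum(observed) / len(observed)
        cleaned[name] = [mean if value is None else value for value in values]
    return cleaned


# Hypothetical four-row table: one usable column, one entirely missing.
table = {
    "heart_rate": [80.0, None, 100.0, 90.0],
    "rare_lab": [None, None, None, None],
}
result = drop_sparse_and_impute(table)
```

Here `rare_lab` is dropped because all of its values are missing, and the gap in `heart_rate` is filled with the mean of the three observed values.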

  7. Evaluation index after Decision Tree optimization parameters.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Aug 7, 2025
    Cite
    Xinyi Wei; Boyu Shi (2025). Evaluation index after Decision Tree optimization parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0327569.t007
    Available download formats
    xls
    Dataset updated
    Aug 7, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xinyi Wei; Boyu Shi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation index after Decision Tree optimization parameters.

  8. Spike train classification metric values (for imbalance-robust metrics) for...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Ivan Lazarevich; Ilya Prokin; Boris Gutkin; Victor Kazantsev (2023). Spike train classification metric values (for imbalance-robust metrics) for the retinal neuron activity dataset on a range of models. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010792.t004
    Available download formats
    xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ivan Lazarevich; Ilya Prokin; Boris Gutkin; Victor Kazantsev
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The “simple baseline” model tag corresponds to spike trains encoded with 6 basic distribution statistics; the “raw” tag implies that the model has been directly trained on ISI time-series data without feature extraction. The “tsfresh” tag corresponds to encoding with the full set of time-series features. “ISIe” stands for interspike-interval encoding of the spike train, “SCe” stands for spike-count encoding. “ISIe + SPe” means that feature vectors corresponding to both types of encoding are concatenated. InceptionTimePlus, FCNPlus, ResNetPlus and XceptionTimePlus refer to implementations in the PyTorch-based tsai package.

  9. Evaluation of BFE with SMOTE-ENN (in %).

    • plos.figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of BFE with SMOTE-ENN (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t009
    Available download formats
    xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.

  10. FFT feature extraction.

    • plos.figshare.com
    xls
    Updated Nov 6, 2025
    Cite
    Zawar Ahmed Khan; Muhammad Amir Raza; Muhammad I. Masud; Touqeer Ahmed Jumani; Muhammad Faheem; Mohammed Aman (2025). FFT feature extraction. [Dataset]. http://doi.org/10.1371/journal.pone.0335367.t003
    Available download formats
    xls
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Zawar Ahmed Khan; Muhammad Amir Raza; Muhammad I. Masud; Touqeer Ahmed Jumani; Muhammad Faheem; Mohammed Aman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study explores the design of an effective fault classification algorithm for the three-phase induction motor, an integral unit in many industrial systems. Traditional fault detection methods and deep learning approaches are both effective; however, current techniques can be either computationally exhaustive or low in accuracy, making them inapplicable in many real-world settings. To address these limitations, this study evaluates different machine learning algorithms for accurate and efficient fault detection using a dataset of triaxial vibrational data converted into current variables. A dataset of triaxial vibrational current data at 0.7 mm bearing and rotor faults at various loads (100 W, 200 W, and 300 W) was considered. For data preprocessing, we handled missing values by interpolation and handled the imbalance among fault types with the Synthetic Minority Over-sampling Technique (SMOTE). Through Fast Fourier Transform (FFT) techniques, frequency-domain information, which is key for current signals, was extracted and added to the feature set. In addition, dimensionality reduction was done with Principal Component Analysis (PCA) and feature selection with SelectKBest. Then, different machine learning models such as Random Forest (RF), Decision Tree (DT), k-nearest neighbors (KNN), and eXtreme Gradient Boosting (XGBoost) were trained with hyperparameter optimization to bring each to its best possible performance. The results show the accuracy and performance of all models: DT and RF perform well, with 99.95% accuracy, while KNN also performs well but at a higher computational cost in testing. XGBoost, generally known for its capability to handle complex datasets, did not perform as well in this scenario, reaching an accuracy of 87.13%, which indicates that more optimization may be required for the model. This work serves as the groundwork for future work with a multiplicity of fault types, motor specifications, and the incorporation of additional feature-engineering techniques to develop a more robust and intelligent framework for fault detection.
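
To make the frequency-domain step in the description above concrete, the sketch below pulls one simple feature, the dominant frequency, out of a sampled signal. It is an assumption-laden illustration: it uses a naive O(n^2) discrete Fourier transform (an FFT computes the same transform faster), and the function names, the 64 Hz sample rate, and the 5 Hz test tone are hypothetical rather than taken from the study.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum for bins 0..n/2, computed with a naive DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))) / n
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(signal, sample_rate):
    """Frequency in Hz of the strongest non-DC bin."""
    mags = dft_magnitudes(signal)
    peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(signal)


# Hypothetical test signal: a 5 Hz sine sampled at 64 Hz for one second.
sample_rate = 64
signal = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(64)]
mags = dft_magnitudes(signal)
dominant_hz = dominant_frequency(signal, sample_rate)
```

In practice several such spectral values (peak frequency, band energies, and so on) would be appended to the feature set; which ones the study used is not stated in the description.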

  11. Detailed overview of cohort characteristics for train and test cohort.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Sep 4, 2024
    Cite
    Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar (2024). Detailed overview of cohort characteristics for train and test cohort. [Dataset]. http://doi.org/10.1371/journal.pone.0309383.t002
    Available download formats
    xls
    Dataset updated
    Sep 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Values are presented as means with the standard deviations in parentheses.

  12. Parameter settings of different models.

    • figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Parameter settings of different models. [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t004
    Available download formats
    xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.

  13. Data_Sheet_1_Advancing NSCLC pathological subtype prediction with...

    • frontiersin.figshare.com
    txt
    Updated May 22, 2024
    Cite
    Bingling Kuang; Jingxuan Zhang; Mingqi Zhang; Haoming Xia; Guangliang Qiang; Jiangyu Zhang (2024). Data_Sheet_1_Advancing NSCLC pathological subtype prediction with interpretable machine learning: a comprehensive radiomics-based approach.CSV [Dataset]. http://doi.org/10.3389/fmed.2024.1413990.s001
    Available download formats
    txt
    Dataset updated
    May 22, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Bingling Kuang; Jingxuan Zhang; Mingqi Zhang; Haoming Xia; Guangliang Qiang; Jiangyu Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: This research aims to develop and assess the performance of interpretable machine learning models for diagnosing three histological subtypes of non-small cell lung cancer (NSCLC) utilizing CT imaging data. Methods: A retrospective cohort of 317 patients diagnosed with NSCLC was included in the study. These individuals were randomly segregated into two groups: a training set comprising 222 patients and a validation set with 95 patients, adhering to a 7:3 ratio. A comprehensive extraction yielded 1,834 radiomic features. For feature selection, statistical methodologies such as the Mann–Whitney U test, Spearman’s rank correlation, and one-way logistic regression were employed. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was utilized. The study designed three distinct models to predict adenocarcinoma (ADC), squamous cell carcinoma (SCC), and large cell carcinoma (LCC). Six different classifiers, namely Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, eXtreme Gradient Boosting (XGB), and LightGBM, were deployed for model training. Model performance was gauged through accuracy metrics and the area under the receiver operating characteristic (ROC) curves (AUC). To interpret the diagnostic process, the Shapley Additive Explanations (SHAP) approach was applied. Results: For the ADC, SCC, and LCC groups, 9, 12, and 8 key radiomic features were selected, respectively. In terms of model performance, the XGB model demonstrated superior performance in predicting SCC and LCC, with AUC values of 0.789 and 0.848, respectively. For ADC prediction, the Random Forest model excelled, showcasing an AUC of 0.748. Conclusion: The constructed machine learning models, leveraging CT imaging, exhibited robust predictive capabilities for SCC, LCC, and ADC subtypes of NSCLC. These interpretable models serve as substantial support for clinical decision-making processes.
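
SMOTE, used above to address data imbalance, creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that core idea, assuming at least two minority points given as numeric tuples; the function name `smote_sketch`, its parameters, and the sample points are hypothetical, and this is not the implementation the study used.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Nearest neighbours of x among the other minority points.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to neighbour
        synthetic.append(
            tuple(a + u * (b - a) for a, b in zip(x, neighbour))
        )
    return synthetic


# Hypothetical minority-class points in a two-feature space.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
new_points = smote_sketch(minority_points, n_new=4)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.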

  14. Testing time (Sec) of different models with feature selection.

    • plos.figshare.com
    xls
    Updated Oct 17, 2024
    Cite
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Testing time (Sec) of different models with feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t012
    Available download formats
    xls
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Testing time (Sec) of different models with feature selection.

