14 datasets found

Comparison of ML model performances for imbalanced data (imbalance ratio =...
plos.figshare.com
xls
Updated Mar 22, 2024
Grzegorz Dudek; Sebastian Sakowski; Olga Brzezińska; Joanna Sarnik; Tomasz Budlewski; Grzegorz Dragan; Marta Poplawska; Tomasz Poplawski; Michał Bijak; Joanna Makowska (2024). Comparison of ML model performances for imbalanced data (imbalance ratio = 2). [Dataset]. http://doi.org/10.1371/journal.pone.0300717.t006
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300717.t006
Dataset updated
Mar 22, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Grzegorz Dudek; Sebastian Sakowski; Olga Brzezińska; Joanna Sarnik; Tomasz Budlewski; Grzegorz Dragan; Marta Poplawska; Tomasz Poplawski; Michał Bijak; Joanna Makowska
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of ML model performances for imbalanced data (imbalance ratio = 2).
Names of each attack and category.
plos.figshare.com
xls
Updated Oct 17, 2024
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Names of each attack and category. [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t001
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
Data from: Classification Trees for Imbalanced Data: Surface-to-Volume...
tandf.figshare.com
zip
Updated Feb 14, 2024
Yichen Zhu; Cheng Li; David B. Dunson (2024). Classification Trees for Imbalanced Data: Surface-to-Volume Regularization [Dataset]. http://doi.org/10.6084/m9.figshare.17033038.v1
Available download formats
zip
Unique identifier
https://doi.org/10.6084/m9.figshare.17033038.v1
Dataset updated
Feb 14, 2024
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Yichen Zhu; Cheng Li; David B. Dunson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms that are designed to deal with imbalance through real data applications. Supplementary materials for this article are available online.
DataSheet_1_Construction and validation of a progression prediction model...
frontiersin.figshare.com
txt
Updated Jan 24, 2024
Jitao Hu; Yuanyuan Sheng; Jinlong Ma; Yujie Tang; Dong Liu; Jianqing Zhang; Xudong Wei; Yang Yang; Yueping Liu; Yongqiang Zhang; Guiying Wang (2024). DataSheet_1_Construction and validation of a progression prediction model for locally advanced rectal cancer patients received neoadjuvant chemoradiotherapy followed by total mesorectal excision based on machine learning.csv [Dataset]. http://doi.org/10.3389/fonc.2023.1231508.s001
Available download formats
txt
Unique identifier
https://doi.org/10.3389/fonc.2023.1231508.s001
Dataset updated
Jan 24, 2024
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Jitao Hu; Yuanyuan Sheng; Jinlong Ma; Yujie Tang; Dong Liu; Jianqing Zhang; Xudong Wei; Yang Yang; Yueping Liu; Yongqiang Zhang; Guiying Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: We attempted to develop a progression prediction model for local advanced rectal cancer (LARC) patients who received preoperative neoadjuvant chemoradiotherapy (NCRT) and operative treatment to identify high-risk patients in advance.
Methods: Data from 272 LARC patients who received NCRT and total mesorectal excision (TME) from 2011 to 2018 at the Fourth Hospital of Hebei Medical University were collected. Data from 161 patients with rectal cancer (each sample with one target variable (progression) and 145 characteristic variables) were included. One Hot Encoding was applied to numerically represent some characteristics. The K-Nearest Neighbor (KNN) filling method was used to determine the missing values, and SmoteTomek comprehensive sampling was used to solve the data imbalance. Eventually, data from 135 patients with 45 characteristic clinical variables were obtained. Random forest, decision tree, support vector machine (SVM), and XGBoost were used to predict whether patients with rectal cancer will exhibit progression. LASSO regression was used to further filter the variables and narrow down the list of variables using a Venn diagram. Eventually, the prediction model was constructed by multivariate logistic regression, and the performance of the model was confirmed in the validation set.
Results: Eventually, data from 135 patients including 45 clinical characteristic variables were included in the study. Data were randomly divided in an 8:2 ratio into a data set and a validation set, respectively. Area Under Curve (AUC) values of 0.72 for the decision tree, 0.97 for the random forest, 0.89 for SVM, and 0.94 for XGBoost were obtained from the data set. Similar results were obtained from the validation set. Twenty-three variables were obtained from LASSO regression, and eight variables were obtained by considering the intersection of the variables obtained using the previous four machine learning methods. Furthermore, a multivariate logistic regression model was constructed using the data set; the ROC indicated its good performance. The ROC curve also verified the good predictive performance in the validation set.
Conclusions: We constructed a logistic regression model with good predictive performance, which allowed us to accurately predict whether patients who received NCRT and TME will exhibit disease progression.
Evaluation of SMOTE-ENN+SFMI+PCA (in %).
plos.figshare.com
xls
Updated Oct 17, 2024
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of SMOTE-ENN+SFMI+PCA (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t010
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t010
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
Detailed overview of feature information.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Sep 4, 2024
Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar (2024). Detailed overview of feature information. [Dataset]. http://doi.org/10.1371/journal.pone.0309383.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309383.t001
Dataset updated
Sep 4, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Mechanical ventilation (MV) is vital for critically ill ICU patients but carries significant mortality risks. This study aims to develop a predictive model to estimate hospital mortality among MV patients, utilizing comprehensive health data to assist ICU physicians with early-stage alerts.
Methods: We developed a Machine Learning (ML) framework to predict hospital mortality in ICU patients receiving MV. Using the MIMIC-III database, we identified 25,202 eligible patients through ICD-9 codes. We employed backward elimination and the Lasso method, selecting 32 features based on clinical insights and literature. Data preprocessing included eliminating columns with over 90% missing data and using mean imputation for the remaining missing values. To address class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE). We evaluated several ML models, including CatBoost, XGBoost, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression, using a 70/30 train-test split. The CatBoost model was chosen for its superior performance in terms of accuracy, precision, recall, F1-score, AUROC metrics, and calibration plots.
Results: The study involved a cohort of 25,202 patients on MV. The CatBoost model attained an AUROC of 0.862, an increase from an initial AUROC of 0.821, which was the best reported in the literature. It also demonstrated an accuracy of 0.789, an F1-score of 0.747, and better calibration, outperforming other models. These improvements are due to systematic feature selection and the robust gradient boosting architecture of CatBoost.
Conclusion: The preprocessing methodology significantly reduced the number of relevant features, simplifying computational processes, and identified critical features previously overlooked. Integrating these features and tuning the parameters, our model demonstrated strong generalization to unseen data. This highlights the potential of ML as a crucial tool in ICUs, enhancing resource allocation and providing more personalized interventions for MV patients.
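The description above names SMOTE as the class-imbalance step. As a rough illustration only, and not the authors' code (which is not part of this listing), the core idea of SMOTE can be sketched in plain Python: each synthetic sample is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the toy points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; real projects would normally use a
    maintained library implementation instead.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the region already spanned by the minority class.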
Evaluation index after Decision Tree optimization parameters.
plos.figshare.com
figshare.com
xls
Updated Aug 7, 2025
Xinyi Wei; Boyu Shi (2025). Evaluation index after Decision Tree optimization parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0327569.t007
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0327569.t007
Dataset updated
Aug 7, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Xinyi Wei; Boyu Shi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Evaluation index after Decision Tree optimization parameters.
Spike train classification metric values (for imbalance-robust metrics) for...
plos.figshare.com
xls
Updated Jun 3, 2023
Ivan Lazarevich; Ilya Prokin; Boris Gutkin; Victor Kazantsev (2023). Spike train classification metric values (for imbalance-robust metrics) for the retinal neuron activity dataset on a range of models. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010792.t004
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1010792.t004
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Ivan Lazarevich; Ilya Prokin; Boris Gutkin; Victor Kazantsev
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The “simple baseline” model tag corresponds to spike trains encoded with 6 basic distribution statistics; the “raw” tag implies that the model has been directly trained on ISI time-series data without feature extraction. The “tsfresh” tag corresponds to encoding with the full set of time-series features. “ISIe” stands for interspike-interval encoding of the spike train, and “SCe” stands for spike-count encoding. “ISIe + SPe” means that feature vectors corresponding to both types of encoding are concatenated. InceptionTimePlus, FCNPlus, ResNetPlus and XceptionTimePlus refer to implementations in the PyTorch-based tsai package.
Evaluation of BFE with SMOTE-ENN (in %).
plos.figshare.com
xls
Updated Oct 17, 2024
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Evaluation of BFE with SMOTE-ENN (in %). [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t009
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t009
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
FFT feature extraction.
plos.figshare.com
xls
Updated Nov 6, 2025
Zawar Ahmed Khan; Muhammad Amir Raza; Muhammad I. Masud; Touqeer Ahmed Jumani; Muhammad Faheem; Mohammed Aman (2025). FFT feature extraction. [Dataset]. http://doi.org/10.1371/journal.pone.0335367.t003
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0335367.t003
Dataset updated
Nov 6, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Zawar Ahmed Khan; Muhammad Amir Raza; Muhammad I. Masud; Touqeer Ahmed Jumani; Muhammad Faheem; Mohammed Aman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This study explores the design of an effective fault classification algorithm for the 3-phase induction motor, an integral unit in many industrial systems. Traditional fault detection methods and deep learning approaches are both effective; however, current techniques can be either computationally exhaustive or low in accuracy, making them inapplicable in many real-world settings. To address these limitations, this study evaluates different machine learning algorithms for accurate and efficient fault detection using a dataset of triaxial vibrational data converted into current variables. A dataset of triaxial vibrational current data for 0.7 mm bearing and rotor faults at various loads (100W, 200W, and 300W) was considered. For data preprocessing, missing values were handled by interpolation, and the imbalance among fault types was handled with the Synthetic Minority Over-sampling Technique (SMOTE). Frequency-domain information, which is key for current signals, was extracted with Fast Fourier Transform (FFT) techniques and added to the feature set. In addition, dimensionality reduction was done with Principal Component Analysis (PCA) and feature selection with SelectKBest. Then, machine learning models including Random Forest (RF), Decision Tree (DT), k-nearest neighbors (KNN), and eXtreme Gradient Boosting (XGBoost) were trained with hyperparameter optimization. The results show the accuracy and performance of all models: DT and RF perform well, with 99.95% accuracy, while KNN also performs well but at a higher computational cost in testing. Although generally known for its capability to handle complex datasets, XGBoost underperformed in this scenario, with an accuracy of 87.13%, indicating that more optimization may be required for the model. This work serves as the groundwork for future work with a multiplicity of fault types, motor specifications, and the incorporation of additional feature-engineering techniques to develop a more robust and intelligent framework for fault detection.
Detailed overview of cohort characteristics for train and test cohort.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Sep 4, 2024
Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar (2024). Detailed overview of cohort characteristics for train and test cohort. [Dataset]. http://doi.org/10.1371/journal.pone.0309383.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309383.t002
Dataset updated
Sep 4, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Hexin Li; Negin Ashrafi; Chris Kang; Guanlan Zhao; Yubing Chen; Maryam Pishgar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Values are presented as means with the standard deviations in parentheses.
Parameter settings of different models.
figshare.com
xls
Updated Oct 17, 2024
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Parameter settings of different models. [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t004
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t004
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS ONE
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Internet of things (IoT) facilitates a variety of heterogeneous devices to be enabled with network connectivity via various network architectures to gather and exchange real-time information. On the other hand, the rise of IoT creates Distributed Denial of Services (DDoS) like security threats. The recent advancement of Software Defined-Internet of Things (SDIoT) architecture can provide better security solutions compared to the conventional networking approaches. Moreover, limited computing resources and heterogeneous network protocols are major challenges in the SDIoT ecosystem. Given these circumstances, it is essential to design a low-cost DDoS attack classifier. The current study aims to employ an improved feature selection (FS) technique which determines the most relevant features that can improve the detection rate and reduce the training time. At first, to overcome the data imbalance problem, Edited Nearest Neighbor-based Synthetic Minority Oversampling (SMOTE-ENN) was exploited. The study proposes SFMI, an FS method that combines Sequential Feature Selection (SFE) and Mutual Information (MI) techniques. The top k common features were extracted from the nominated features based on SFE and MI. Further, Principal component analysis (PCA) is employed to address multicollinearity issues in the dataset. Comprehensive experiments have been conducted on two benchmark datasets such as the KDDCup99, CIC IoT-2023 datasets. For classification purposes, Decision Tree, K-Nearest Neighbor, Gaussian Naive Bayes, Random Forest (RF), and Multilayer Perceptron classifiers were employed. The experimental results quantitatively demonstrate that the proposed SMOTE-ENN+SFMI+PCA with RF classifier achieves 99.97% accuracy and 99.39% precision with 10 features.
Data_Sheet_1_Advancing NSCLC pathological subtype prediction with...
frontiersin.figshare.com
txt
Updated May 22, 2024
Bingling Kuang; Jingxuan Zhang; Mingqi Zhang; Haoming Xia; Guangliang Qiang; Jiangyu Zhang (2024). Data_Sheet_1_Advancing NSCLC pathological subtype prediction with interpretable machine learning: a comprehensive radiomics-based approach.CSV [Dataset]. http://doi.org/10.3389/fmed.2024.1413990.s001
Available download formats
txt
Unique identifier
https://doi.org/10.3389/fmed.2024.1413990.s001
Dataset updated
May 22, 2024
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Bingling Kuang; Jingxuan Zhang; Mingqi Zhang; Haoming Xia; Guangliang Qiang; Jiangyu Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: This research aims to develop and assess the performance of interpretable machine learning models for diagnosing three histological subtypes of non-small cell lung cancer (NSCLC) utilizing CT imaging data.
Methods: A retrospective cohort of 317 patients diagnosed with NSCLC was included in the study. These individuals were randomly segregated into two groups: a training set comprising 222 patients and a validation set with 95 patients, adhering to a 7:3 ratio. A comprehensive extraction yielded 1,834 radiomic features. For feature selection, statistical methodologies such as the Mann–Whitney U test, Spearman’s rank correlation, and one-way logistic regression were employed. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was utilized. The study designed three distinct models to predict adenocarcinoma (ADC), squamous cell carcinoma (SCC), and large cell carcinoma (LCC). Six different classifiers, namely Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, eXtreme Gradient Boosting (XGB), and LightGBM, were deployed for model training. Model performance was gauged through accuracy metrics and the area under the receiver operating characteristic (ROC) curves (AUC). To interpret the diagnostic process, the Shapley Additive Explanations (SHAP) approach was applied.
Results: For the ADC, SCC, and LCC groups, 9, 12, and 8 key radiomic features were selected, respectively. In terms of model performance, the XGB model demonstrated superior performance in predicting SCC and LCC, with AUC values of 0.789 and 0.848, respectively. For ADC prediction, the Random Forest model excelled, showcasing an AUC of 0.748.
Conclusion: The constructed machine learning models, leveraging CT imaging, exhibited robust predictive capabilities for SCC, LCC, and ADC subtypes of NSCLC. These interpretable models serve as substantial support for clinical decision-making processes.
Testing time (Sec) of different models with feature selection.
plos.figshare.com
xls
Updated Oct 17, 2024
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal (2024). Testing time (Sec) of different models with feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0309682.t012
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309682.t012
Dataset updated
Oct 17, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Arati Behera; Kshira Sagar Sahoo; Tapas Kumara Mishra; Anand Nayyar; Muhammad Bilal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Testing time (Sec) of different models with feature selection.