Adjust the parameters by extracting only the rows that do not contain any missing values. The best result was obtained when using ver4.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Quantitative gait analysis is important for understanding the non-typical walking patterns associated with mobility impairments. Conventional linear statistical methods and machine learning (ML) models are commonly used to assess gait performance and related changes in gait parameters. Explainable machine learning, however, provides an alternative technique for distinguishing the significant and influential gait changes stemming from a given intervention. The goal of this work was to demonstrate the use of explainable ML models in gait analysis for prosthetic rehabilitation in both population- and sample-based interpretability analyses. Models were developed to classify amputee gait with two types of prosthetic knee joints. Sagittal-plane gait patterns of 21 individuals with unilateral transfemoral amputations were video-recorded, and 19 spatiotemporal and kinematic gait parameters were extracted and included in the models. Four ML models (logistic regression, support vector machine, random forest, and LightGBM) were assessed and tested for accuracy and precision. The Shapley Additive exPlanations (SHAP) framework was applied to examine global and local interpretability. The random forest model yielded the highest classification accuracy (98.3%). The SHAP framework quantified the influence of each gait parameter in the models; knee flexion-related parameters were found to be the most influential in determining model outcomes. The sample-based explainable ML provided additional insights over the population-based analyses, including an understanding of the effect of the knee type on the walking style of a specific sample and whether or not it agreed with the global interpretations. It was concluded that explainable ML models can be powerful tools for assessing gait-related clinical interventions, revealing important parameters that may be overlooked by conventional statistical methods.
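To illustrate the population- and sample-based interpretability described above, here is a minimal sketch of applying SHAP to a fitted tree-based classifier; the synthetic data, feature names, and model settings are assumptions, not the study's actual pipeline:

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the extracted gait parameters and the knee-type label
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=["knee_flexion_max", "stance_time", "step_length", "cadence", "hip_flexion_max"])
y = (X["knee_flexion_max"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# SHAP's Explanation API (recent SHAP versions); slicing selects the positive class
explainer = shap.TreeExplainer(model)
sv = explainer(X)                  # shape: (samples, features, classes)
shap.plots.beeswarm(sv[:, :, 1])   # population-based (global) parameter influence
shap.plots.waterfall(sv[0, :, 1])  # sample-based (local) explanation for one record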
In this dataset, I've compiled and shared the best-fitted models with parameters optimized using GridSearchCV. These parameters have been carefully selected and tuned to provide strong predictive performance for the given task.
The dataset includes pickle files containing the best parameter settings for different machine learning algorithms. Here's what you'll find:
CatBoost Classifier Parameters (catboost.pkl): Unleash the power of gradient boosting with categorical features. The pickle file contains a model with tuned hyperparameters for the CatBoost model.
LightGBM Classifier Parameters (lgbm.pkl): Experience the efficiency and accuracy of LightGBM. The pickle file holds the model with optimized hyperparameters for the LightGBM model.
Random Forest Classifier Parameters (rf.pkl): Embrace the classic Random Forest algorithm. The pickle file presents the model with the best hyperparameters for the Random Forest model.
TabNet Classifier Parameters (tab_net.pkl): Dive into the world of TabNet's attention mechanisms. The pickle file showcases the ideal hyperparameters for the TabNet model. You can use this model directly for prediction, but make sure you use the same columns as in my notebook and apply the exact same feature engineering.
XGBoost Classifier Parameters (xgb.pkl): Harness the power of XGBoost's gradient boosting techniques. The pickle file includes the model with the finest hyperparameter settings for the XGBoost model.
These pickle files provide a snapshot of the hyperparameters that have yielded exceptional results in terms of accuracy and generalization. They are a valuable resource for anyone aiming to enhance their predictive modeling skills or participate in the Spaceship Titanic competition.
Feel free to explore and utilize these best-fitted model parameters in your analysis and modeling endeavors. Let's continue to learn, collaborate, and push the boundaries of data science together.
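As a minimal sketch of how one of these files might be loaded and used (the file names match the list above; the preprocessed test set is a hypothetical placeholder and must reproduce the notebook's columns and feature engineering):

import pickle
import pandas as pd

# Hypothetical preprocessed test set with exactly the columns and
# feature engineering used when the models were fitted
X_test = pd.read_csv("test_processed.csv")

with open("lgbm.pkl", "rb") as f:  # any of the .pkl files listed above
    model = pickle.load(f)

predictions = model.predict(X_test)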
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Optimization range of LightGBM parameters.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A fault diagnosis method for oil-immersed transformers based on principal component analysis (PCA) and SSA-LightGBM is proposed to address the low diagnostic accuracy caused by the complexity of oil-immersed transformer faults. First, data on dissolved gases in oil are collected, and a 17-dimensional fault feature matrix is constructed using the uncoded ratio method; the feature matrix is then standardized to obtain joint features. Second, PCA is used for feature fusion to eliminate information redundancy between variables and construct fused features. Finally, a transformer diagnostic model based on SSA-LightGBM is constructed, and ten-fold cross-validation is used to verify the classification ability of the model. The experimental results show that the proposed SSA-LightGBM model achieves an average fault diagnosis accuracy of 93.6% after optimization with the SSA algorithm, 3.6% higher than before optimization. Compared with the GA-LightGBM and GWO-LightGBM fault diagnosis models, SSA-LightGBM improves diagnostic accuracy by 8.1% and 5.7%, respectively, verifying that this method can effectively improve the fault diagnosis performance of oil-immersed transformers and is superior to other similar methods.
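A minimal sketch of the standardize-PCA-LightGBM pipeline with ten-fold cross-validation; the synthetic data stand in for the 17-dimensional gas-ratio features, and the SSA hyperparameter search from the paper is omitted:

import numpy as np
import lightgbm as lgb
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the dissolved-gas feature matrix and fault labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
y = rng.integers(0, 6, size=500)  # e.g., six fault classes

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=0.95),  # keep components explaining 95% of variance
                     lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(scores.mean())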
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Various methods were used to tackle TPS Oct 2021, including LightGBM, CatBoost, and XGBoost, in which hyperparameters play an important role.
One way is to use a hyperparameter optimization framework such as Optuna to find the hyperparameters. Another way is to reuse parameters that have already been created and produced good results.
In this dataset, I collected all the LightGBM, CatBoost, and XGBoost parameters introduced in TPS Oct 2021.
All parameters were checked under one condition. I used the following setup to measure the score of each parameter set. This is not the final accuracy, because it is measured with only 20% of the data, but it serves as a criterion for comparing the parameter sets.
import datatable as dt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the 20% sample of the training data, dropping the 'id' column
train_20 = dt.fread('sample_train_20.csv',
                    columns=lambda cols: [col.name not in ('id',) for col in cols]).to_pandas()
y = train_20['target']
X = train_20.drop(columns=['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 5, random_state=59)

# model_from_csv / params_from_csv stand for the model class and the
# parameter set read from one row of the CSV files in this dataset
model = model_from_csv(**params_from_csv)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric=["auc"],
    verbose=False,
    early_stopping_rounds=600,
)
y_predicted = model.predict_proba(X_test)
accuracy = roc_auc_score(y_test, y_predicted[:, 1])  # ROC AUC is the comparison metric
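For reference, here is a minimal sketch of the Optuna search mentioned above, reusing the X_train/X_test split and the roc_auc_score import from the snippet above; the search space is an assumption, not the one behind the collected parameters:

import optuna
import lightgbm as lgb

def objective(trial):
    # Hypothetical LightGBM search space
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(n_estimators=500, **params)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, proba)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)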
I will try to create similar collections for future competitions as well if this is helpful for you. If you find this dataset helpful, please don't forget to upvote it. Thank you in advance.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Hyperparameters of LightGBM.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
In this project, I focus on enhancing global building data by combining multiple open-source geospatial datasets to predict building attributes, specifically the number of levels (floors). The core datasets used are the Microsoft Open Buildings dataset, which provides detailed building footprints across many regions, and Google’s Temporal Buildings Dataset (V1), which includes estimated building heights over time derived from satellite imagery. While Google's dataset includes height information for many buildings, a significant portion contains missing or unreliable values.
To address this, I first performed data preprocessing and merged the two datasets based on geographic coordinates. For buildings with missing height values, I used LightGBM, a gradient boosting framework, to impute missing heights using features like footprint area, geometry, and surrounding context. I then brought in OpenStreetMap (OSM) data to enrich the dataset with additional contextual information, such as building type, land use, and nearby infrastructure.
Using the combined dataset — now with both original and imputed heights — I trained a Random Forest Regressor to predict the number of building levels. Since floor count is not always directly available, especially in developing regions, this approach offers a way to estimate it from height and footprint data with relatively high accuracy.
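A minimal sketch of this two-stage approach, with synthetic data and hypothetical column names standing in for the real merged features:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor

# Hypothetical merged building table
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "footprint_area": rng.uniform(20, 2000, n),
    "perimeter": rng.uniform(15, 400, n),
    "height": rng.uniform(3, 60, n),
})
df.loc[rng.random(n) < 0.3, "height"] = np.nan  # simulate missing heights
features = ["footprint_area", "perimeter"]

# Stage 1: impute missing heights with LightGBM
known = df["height"].notna()
height_model = lgb.LGBMRegressor(n_estimators=300)
height_model.fit(df.loc[known, features], df.loc[known, "height"])
df.loc[~known, "height"] = height_model.predict(df.loc[~known, features])

# Stage 2: predict the number of levels from height plus footprint features
# (levels are simulated here; in practice the labels come from OSM tags)
df["levels"] = (df["height"] // 3).clip(lower=1)
levels_model = RandomForestRegressor(n_estimators=300, random_state=0)
levels_model.fit(df[features + ["height"]], df["levels"])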
This kind of modeling has important real-world applications. Predicting building levels can help support urban planning, disaster response, infrastructure development, and climate risk modeling. For example, knowing the number of floors in buildings allows for better estimation of population density, potential occupancy, or structural vulnerability in earthquake-prone or flood-prone regions. It can also help fill gaps in existing GIS data where traditional surveys are too expensive or time-consuming.
In future work, this framework could be extended globally and refined with additional data sources like LIDAR or census information to further improve the accuracy and coverage of building-level models.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of the results of P-feature and TF-feature in LightGBM.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
LightGBM hyperparameters with default values, search ranges, and selected optimal values.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Neural network hyperparameters.
Timely prediction of memory failures is crucial for the stable operation of data centers. However, existing methods often rely on a single classifier, which can lead to inaccurate or unstable predictions. To address this, we propose a new ensemble model for predicting CE-driven memory failures, where a surge of correctable errors (CEs) in memory causes server downtime. Our model combines several strong-performing classifiers, such as Random Forest, LightGBM, and XGBoost, and assigns each a weight based on its performance. By optimizing the decision-making process, the model improves prediction accuracy. We validated the model using memory data from Alibaba's data center, and the results show an accuracy of over 84%, outperforming existing single- and dual-classifier models and confirming its excellent predictive performance.
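A minimal sketch of a performance-weighted soft-voting ensemble in this spirit; the data, weighting metric, and member settings are assumptions, not the paper's exact scheme:

import lightgbm as lgb
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for CE-log features and failure labels
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

members = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("lgbm", lgb.LGBMClassifier(n_estimators=200)),
    ("xgb", xgb.XGBClassifier(n_estimators=200, eval_metric="logloss")),
]

# Weight each member by its cross-validated F1 score on the training data
weights = [cross_val_score(m, X_train, y_train, cv=3, scoring="f1").mean()
           for _, m in members]

ensemble = VotingClassifier(members, voting="soft", weights=weights)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))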
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
MAPE of all methods without noise.
Introduction: Machine learning (ML) is an effective tool for predicting mental states and is a key technology in digital psychiatry. This study aimed to develop ML algorithms to predict the upper-tertile group of various anxiety symptoms based on multimodal data from virtual reality (VR) therapy sessions for social anxiety disorder (SAD) patients and to evaluate their predictive performance across each data type.
Methods: This study included 32 SAD-diagnosed individuals and finalized a dataset of 132 samples from 25 participants. It utilized multimodal (physiological and acoustic) data from VR sessions that simulate social anxiety scenarios. The study employed the extended Geneva minimalistic acoustic parameter set (eGeMAPS) for acoustic feature extraction and extracted statistical attributes from time-series-based physiological responses. We developed ML models that predict the upper-tertile group for various anxiety symptoms in SAD using Random Forest, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost) models. The best parameters were explored through grid search or random search, and the models were validated using stratified cross-validation and leave-one-out cross-validation.
Results: The CatBoost model, using multimodal features, exhibited high performance, particularly for the Social Phobia Scale, with an area under the receiver operating characteristic curve (AUROC) of 0.852. It also showed strong performance in predicting cognitive symptoms, with the highest AUROC of 0.866 for the Post-Event Rumination Scale. For generalized anxiety, the LightGBM prediction for the State-Trait Anxiety Inventory-trait yielded an AUROC of 0.819. In the same analysis, models using only physiological features had AUROCs of 0.626, 0.744, and 0.671, whereas models using only acoustic features had AUROCs of 0.788, 0.823, and 0.754.
Conclusions: This study showed that an ML algorithm using integrated multimodal data can predict upper-tertile anxiety symptoms in patients with SAD with higher performance than acoustic or physiological data obtained during a VR session. The results can be used as evidence for personalized VR sessions and demonstrate the strength of multimodal data in clinical use.
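A minimal sketch of the leave-one-out validation mentioned in the Methods; the synthetic data and model settings are assumptions:

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for 132 multimodal samples with a binary upper-tertile label
X, y = make_classification(n_samples=132, n_features=30, random_state=0)

model = CatBoostClassifier(iterations=200, verbose=False)
proba = cross_val_predict(model, X, y, cv=LeaveOneOut(), method="predict_proba")
print(roc_auc_score(y, proba[:, 1]))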
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of the results of P-feature and TF-feature in GRU.
Objective: Mild Cognitive Impairment (MCI) is a recognized precursor to Alzheimer's Disease (AD), presenting a significant risk of progression. Early detection and intervention in MCI can potentially slow disease advancement, offering substantial clinical benefits. This study employed radiomics and machine learning methodologies to distinguish between MCI and Normal Cognition (NC) groups.
Methods: The study included 172 MCI patients and 183 healthy controls from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, all of whom had 3D T1-weighted MRI structural images. The cerebellar gray and white matter were segmented automatically using volBrain software, and radiomic features were extracted and screened with PyRadiomics. The screened features were then input into various machine learning models, including Random Forest (RF), Logistic Regression (LR), eXtreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Extra Trees, Light Gradient Boosting Machine (LightGBM), and Multilayer Perceptron (MLP). Each model's penalty parameters were optimized through 5-fold cross-validation to construct the radiomic models. The DeLong test was used to evaluate the performance of the different models.
Results: The LightGBM model, which utilizes a combination of cerebellar gray and white matter features (comprising eight gray matter and eight white matter features), emerged as the most effective model for the radiomics feature analysis, achieving an Area Under the Curve (AUC) of 0.863 for the training set and 0.776 for the test set.
Conclusion: Radiomic features based on the cerebellar gray and white matter, combined with machine learning, can objectively diagnose MCI, which provides significant clinical value for assisted diagnosis.
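A minimal sketch of the PyRadiomics extraction step; the file paths and enabled feature classes are assumptions:

from radiomics import featureextractor

# Hypothetical paths to a T1-weighted image and a cerebellar gray-matter mask
image_path = "sub01_T1w.nii.gz"
mask_path = "sub01_cerebellum_gm_mask.nii.gz"

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")
extractor.enableFeatureClassByName("glcm")

features = extractor.execute(image_path, mask_path)
for name, value in features.items():
    if not name.startswith("diagnostics"):  # skip metadata entries
        print(name, value)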
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Background: Colorectal cancer (CRC) is a highly frequent cancer worldwide, with early detection and risk stratification playing a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.
Methods: We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People's Hospital (2020-2024) for model training and internal validation, with 463 patients from Laibin City People's Hospital for external validation. Seven ML algorithms were systematically compared, with Light Gradient Boosting Machine (LightGBM) ultimately selected as the optimal framework. Model performance was rigorously assessed through area under the receiver operating characteristic curve (AUROC) analysis, calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) methodology was employed for feature interpretation.
Results: The LightGBM model demonstrated excellent discrimination, with AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves revealed strong concordance between predictions and actual outcomes (Brier score = 0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value = 0.216) and CA19-9 (mean SHAP value = 0.198) as the dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A clinically implementable web-based risk calculator was developed for real-time probability estimation.
Conclusions: Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, effectively bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.
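A minimal sketch of the calibration assessment described in the Methods, using synthetic data rather than the study cohorts:

import lightgbm as lgb
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(n_estimators=300).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Brier score: mean squared difference between predicted probability and outcome
print("Brier score:", brier_score_loss(y_test, proba))

# Calibration curve: observed event fraction vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
print(list(zip(mean_pred, frac_pos)))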
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of all optimal model results.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The MAPE of the noise experiment on all methods.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Objective: This study aims to develop an artificial intelligence model utilizing clinical blood markers, ultrasound data, and breast biopsy pathological information to predict distant metastasis in breast cancer patients.
Methods: Data from two medical centers were utilized. Clinical blood markers, ultrasound data, and breast biopsy pathological information were separately extracted and selected, and feature dimensionality reduction was performed using Spearman correlation and LASSO regression. Predictive models were constructed using logistic regression (LR) and LightGBM machine learning algorithms and validated on internal and external validation sets. Feature correlation analysis was conducted for both models.
Results: The LR model achieved AUC values of 0.892, 0.816, and 0.817 for the training, internal validation, and external validation cohorts, respectively. The LightGBM model achieved AUC values of 0.971, 0.861, and 0.890 for the same cohorts. Clinical decision curve analysis showed a superior net benefit of the LightGBM model over the LR model in predicting distant metastasis in breast cancer. Key features identified included creatine kinase isoenzyme (CK-MB) and alpha-hydroxybutyrate dehydrogenase.
Conclusion: This study developed an artificial intelligence model using clinical blood markers, ultrasound data, and pathological information to identify distant metastasis in breast cancer patients. The LightGBM model demonstrated superior predictive accuracy and clinical applicability, suggesting it is a promising tool for early diagnosis of distant metastasis in breast cancer.
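A minimal sketch of the two-step feature reduction mentioned in the Methods; the correlation threshold and synthetic data are assumptions:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X = pd.DataFrame(X, columns=[f"marker_{i}" for i in range(30)])

# Step 1: drop one feature from each highly Spearman-correlated pair
rho_vals, _ = spearmanr(X)
rho = pd.DataFrame(rho_vals, index=X.columns, columns=X.columns)
upper = rho.where(np.triu(np.ones(rho.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c].abs() > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: keep features with nonzero LASSO coefficients
X_scaled = StandardScaler().fit_transform(X_reduced)
lasso = LassoCV(cv=5).fit(X_scaled, y)
selected = X_reduced.columns[lasso.coef_ != 0]
print(list(selected))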