Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: We explored the risk factors for intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) and constructed a prediction model based on machine learning algorithms.
Methods: A retrospective study of 1,398 KD patients hospitalized in 7 affiliated hospitals of Chongqing Medical University from January 2015 to August 2020 was conducted. All patients were divided into IVIG-responsive and IVIG-resistant groups, which were randomly divided into training and validation sets. Independent risk factors were determined using logistic regression analysis. Logistic regression nomogram, support vector machine (SVM), XGBoost, and LightGBM prediction models were constructed and compared with previous models.
Results: In total, 1,240 of the 1,398 patients were IVIG responders, while 158 were resistant to IVIG. Logistic regression analysis of the training set identified four independent risk factors: total bilirubin (TBIL) (OR = 1.115, 95% CI 1.067–1.165), procalcitonin (PCT) (OR = 1.511, 95% CI 1.270–1.798), alanine aminotransferase (ALT) (OR = 1.013, 95% CI 1.008–1.018), and platelet count (PLT) (OR = 0.998, 95% CI 0.996–1.000). Logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models were constructed from these independent risk factors. Their sensitivity was 0.617, 0.681, 0.638, and 0.702; specificity was 0.712, 0.841, 0.967, and 0.903; and area under the curve (AUC) was 0.731, 0.814, 0.804, and 0.874, respectively. Among the prediction models, the LightGBM model displayed the best comprehensive predictive ability, with an AUC of 0.874, surpassing the previous classic models of Egami (AUC = 0.581), Kobayashi (AUC = 0.524), Sano (AUC = 0.519), Fu (AUC = 0.578), and Formosa (AUC = 0.575).
Conclusion: The machine learning LightGBM prediction model for IVIG-resistant KD patients was superior to previous models. Our findings may help accomplish early identification of the risk of IVIG resistance and improve patient outcomes.
🚀 Python package-style code for the datasets: LightGBM and TabNet. This is the model training and inference code. On Kaggle we normally use ipynb-style code; I changed the code style to a .py package, which is better for training from a shell command.
The code is adapted from the original notebook by @chumajin, with thanks:
[Notebook] Reference Notebook by chumajin
LightGBM:
-- config : YAML file of parameters for LightGBM
-- models : saved models
-- train.py
-- predict_test.py
-- feature_engineering.py
-- metric.py
-- preprocessing.py
-- seed.py

TabNet:
-- config : tabnet_hyp.yaml / tabnet_config.py
-- models : saved models
-- preprocessing.py
-- predict_test.py
-- train.py
Accurate information concerning the crown profile is critical for analyzing biological processes and providing a more accurate estimate of carbon balance, which is conducive to sustainable forest management and planning. The similarities between the types of data addressed with LSTM algorithms and crown profile data make a compelling argument for integrating deep learning into crown profile modeling. Thus, the aim was to study the application of the deep learning method LSTM and its variant algorithms in crown profile modeling, using the crown profile database from Pinus yunnanensis secondary forests in Yunnan province, southwest China. Furthermore, SHAP (SHapley Additive exPlanations) was used to interpret the predictions of the ensemble and deep learning models. The results showed that LSTM's variant algorithms were competitive with the traditional vanilla LSTM but substantially outperformed the ensemble learning model LightGBM. Specifically, the proposed Hybrid LSTM-LightGBM and Integrated LSTM-LightGBM achieved the best forecasting performance on the training set and testing set, respectively. Furthermore, the feature importance analysis of LightGBM and vanilla LSTM showed that more factors contributed significantly to the vanilla LSTM model than to the LightGBM model. This phenomenon may explain why deep learning outperforms ensemble learning when there are more interrelated features.
This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.
The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.
The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.
The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:
Temperature
Humidity
Wind Speed
Pressure
Rainfall (target variable)
These features are tracked for each weather station over different times, with the goal of predicting rainfall.
Python: The primary programming language for data analysis and machine learning.
scikit-learn: For implementing machine learning models.
XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
Matplotlib/Seaborn: For data visualization.
These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.
Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
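The "trained models saved as .pkl files for reuse without retraining" workflow can be sketched as follows. The model choice, file name, and data are illustrative placeholders (logistic regression on synthetic features), not the project's actual artifacts:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in for the historical weather features and rainfall target
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.gettempdir(), "rain_model.pkl")
joblib.dump(model, path)        # persist the trained model as a .pkl file
reloaded = joblib.load(path)    # later: reuse the model without retraining
print((reloaded.predict(X) == model.predict(X)).all())
```

joblib is the persistence mechanism scikit-learn recommends for estimators; the reloaded model produces identical predictions to the original.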
Peroxisome proliferator-activated receptor gamma (PPARγ), a critical nuclear receptor, plays a pivotal role in regulating metabolic and inflammatory processes. However, various environmental contaminants can disrupt PPARγ function, leading to adverse health effects. This study introduces a novel approach to predict the inhibitory activity (IC50 values) on PPARγ of 140 chemical compounds across 13 categories, including pesticides, organochlorines, dioxins, detergents, flame retardants, and preservatives. The predictive model, based on the light gradient boosting machine (LightGBM) algorithm and trained on a dataset of 1804 molecules, showed r2 values of 0.82 and 0.59, mean absolute error (MAE) of 0.38 and 0.58, and root mean square error (RMSE) of 0.54 and 0.76 for the training and test sets, respectively. This study provides novel insights into the interactions between emerging contaminants and PPARγ, highlighting the potential hazards and risks these chemicals may pose to public health and the environment. The ability to predict PPARγ inhibition by these hazardous contaminants demonstrates the value of this approach in guiding enhanced environmental toxicology research and risk assessment.
Prediction of the effect of a single-nucleotide variant (SNV) in an intronic region on aberrant pre-mRNA splicing is challenging except for an SNV affecting the canonical GU/AG splice sites (ss). To predict pathogenicity of SNVs at intronic positions −50 (Int-50) to −3 (Int-3) close to the 3’ ss, we developed light gradient boosting machine (LightGBM)-based IntSplice2 models using pathogenic SNVs in the human gene mutation database (HGMD) and ClinVar and common SNVs in dbSNP with 0.01 ≤ minor allelic frequency (MAF) < 0.50. The LightGBM models were generated using features representing splicing cis-elements. The average recall/sensitivity and specificity of IntSplice2 by fivefold cross-validation (CV) of the training dataset were 0.764 and 0.884, respectively. The recall/sensitivity of IntSplice2 was lower than the average recall/sensitivity of 0.800 of IntSplice, which we previously made with support vector machine (SVM) modeling for the same intronic positions. In contrast, the specificity of IntSplice2 was higher than the average specificity of 0.849 of IntSplice. For benchmarking (BM) of IntSplice2 against IntSplice, we made a test dataset that was not used to train IntSplice. After excluding the test dataset from the training dataset, we generated IntSplice2-BM and compared it with IntSplice on the test dataset. IntSplice2-BM was superior to IntSplice in all seven statistical measures: accuracy, precision, recall/sensitivity, specificity, F1 score, negative predictive value (NPV), and Matthews correlation coefficient (MCC). We made the IntSplice2 web service available at https://www.med.nagoya-u.ac.jp/neurogenetics/IntSplice2.
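The seven statistical measures used for benchmarking all derive from a 2x2 confusion matrix. The sketch below is a generic implementation; the counts passed at the end are made up for illustration (chosen so that recall and specificity come out near the CV figures quoted above):

```python
def seven_measures(tp, fp, fn, tn):
    """Accuracy, precision, recall/sensitivity, specificity, F1, NPV, and MCC
    computed from the four cells of a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    npv = tn / (tn + fn)
    mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return acc, precision, recall, specificity, f1, npv, mcc

# Hypothetical counts: 100 pathogenic and 100 common SNVs
print(seven_measures(tp=76, fp=12, fn=24, tn=88))
```

MCC is the most informative single number here because, unlike accuracy, it stays meaningful when the two classes are imbalanced.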
Background: Colorectal cancer (CRC) is a highly frequent cancer worldwide, and early detection and risk stratification play a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.
Methods: We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People’s Hospital (2020-2024) for model training/internal validation, with 463 patients from Laibin City People’s Hospital for external validation. Seven ML algorithms were systematically compared, with the Light Gradient Boosting Machine (LightGBM) ultimately selected as the optimal framework. Model performance was rigorously assessed through area under the receiver operating characteristic curve (AUROC) analysis, calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) methodology was employed for feature interpretation.
Results: The LightGBM model demonstrated exceptional discrimination, with AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves revealed strong concordance between predicted and actual outcomes (Brier score = 0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value = 0.216) and CA19-9 (mean SHAP value = 0.198) as the dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A clinically implementable web-based risk calculator was successfully developed for real-time probability estimation.
Conclusions: Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, effectively bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.
Dataset
1. Observation data over 14 weather stations
Variables: hourly near-surface 2-min average wind speed and wind direction
2. ECMWF-IFS forecast data over 14 weather stations
Variables: hourly predictors at surface level and upper level in the next 48 hours (shown in Table 1 and Table 2)

Table 1. ECMWF-IFS forecast data at surface level
| Predictor | Abbreviation | Unit |
| Temperature at 2 m | 2t | ℃ |
| Sea surface temperature | sst | ℃ |
| Dewpoint temperature at 2 m | 2d | ℃ |
| Convective precipitation in the past hour | cp | mm |
| Mean sea level pressure | msl | hPa |
| Zonal component of wind speed at 10 m | 10u | m s-1 |
| Meridional component of wind speed at 10 m | 10v | m s-1 |
| Wind speed at 10 m | 10ws | m s-1 |
| Wind direction at 10 m | 10wd | ° |
| Zonal component of wind speed at 100 m | 100u | m s-1 |
| Meridional component of wind speed at 100 m | 100v | m s-1 |
| Wind speed at 100 m | 100ws | m s-1 |
| Wind direction at 100 m | 100wd | ° |

Table 2. ECMWF-IFS forecast data at upper level
| Predictor | Abbreviation | Unit |
| Relative humidity at xxx hPa | r_Lxxx | % |
| Temperature at xxx hPa | t_Lxxx | ℃ |
| Vertical velocity of wind at xxx hPa | w_Lxxx | Pa s-1 |
| Zonal component of wind at xxx hPa | u_Lxxx | m s-1 |
| Meridional component of wind at xxx hPa | v_Lxxx | m s-1 |
| Wind speed at xxx hPa | ws_Lxxx | m s-1 |
| Wind direction at xxx hPa | wd_Lxxx | ° |

3. Key variables constructed by feature engineering
(1) short-term statistics: maximum, minimum, mean, and variance of key variables (2t, 10u, 10v, and 10ws) from the ECMWF-IFS model during the next 48 hours;
(2) long-term statistics: mean and deviation of key variables (2t, 10u, 10v, and 10ws) from the ECMWF-IFS model over a 3-yr historical period (January 2020–December 2022);
(3) thermodynamic factors: the low-level wind shear between 10ws and 100ws, the vertical wind shear between 200 hPa and 850 hPa, and the difference between sst and 2t.

Scripts
1. Random Forest model training code
2. LightGBM model training code
3. XGBoost model training code
4. TabNet-MTL model training code
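The short-term statistics in item (1) can be sketched with pandas. The values below are random placeholders standing in for one station's 48-hour ECMWF-IFS forecast; only the column names follow the Table 1 abbreviations:

```python
import numpy as np
import pandas as pd

# Synthetic hourly forecasts for one station over the next 48 hours
rng = np.random.default_rng(0)
fc = pd.DataFrame({v: rng.normal(size=48) for v in ["2t", "10u", "10v", "10ws"]})

# Short-term statistics: max, min, mean, and variance of each key variable over 48 h
features = fc.agg(["max", "min", "mean", "var"]).T
features.columns = [f"st_{c}" for c in features.columns]  # e.g. st_max, st_var
print(features)
```

The same `agg` call extends naturally to the long-term statistics in item (2) by running it over the 3-yr historical slice instead of the 48-hour window.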
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Contains resources needed to train, test, and analyze performance of gradient boosting models used to predict venous thromboembolism (VTE) from electronic health record (EHR) data.
"Code for analyses" folder: Contains code we used for the analyses in our paper. Prediction.ipynb: Contains code needed to run trained models. Small, Medium, and Large.xlsx: Excel templates to correctly format data for prediction generation. Models.zip: Contains trained models. Note that this is 0.4 GB once unzipped. Analysis.ipynb: Contains code used to train the models.
Dependencies: Python 3.10.9; Pandas 1.5.1; LightGBM 3.3.2.
Objective: Mild Cognitive Impairment (MCI) is a recognized precursor to Alzheimer’s Disease (AD), presenting a significant risk of progression. Early detection and intervention in MCI can potentially slow disease advancement, offering substantial clinical benefits. This study employed radiomics and machine learning methodologies to distinguish between MCI and Normal Cognition (NC) groups.
Methods: The study included 172 MCI patients and 183 healthy controls from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, all of whom had 3D T1-weighted structural MRI images. The cerebellar gray and white matter were segmented automatically using volBrain software, and radiomic features were extracted and screened with Pyradiomics. The screened features were then input into various machine learning models, including Random Forest (RF), Logistic Regression (LR), eXtreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Extra Trees, Light Gradient Boosting Machine (LightGBM), and Multilayer Perceptron (MLP). Each model's penalty parameters were optimized through 5-fold cross-validation to construct the radiomic models. The DeLong test was used to compare the performance of the different models.
Results: The LightGBM model, which uses a combination of cerebellar gray and white matter features (eight gray matter and eight white matter features), emerged as the most effective model for the radiomics feature analysis. The model demonstrated an area under the curve (AUC) of 0.863 for the training set and 0.776 for the test set.
Conclusion: Radiomic features based on the cerebellar gray and white matter, combined with machine learning, can objectively diagnose MCI, which provides significant clinical value for assisted diagnosis.
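The per-model 5-fold penalty-parameter optimization described above can be sketched with scikit-learn's GridSearchCV. Here a logistic regression's C is tuned on a public dataset that stands in for the radiomic feature matrix; the grid values and scaler are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Public data as a stand-in for the screened radiomic features
X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated search over the penalty parameter C
grid = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Wrapping the scaler inside the pipeline keeps the cross-validation honest: the scaler is refit on each training fold, so no information leaks from the held-out fold.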
X-IDS Dataset & Artifacts Repository
This repository contains all the data assets, experiment results, and preprocessing steps used in the development of the X-IDS system — an Explainable Intrusion Detection System using autoencoders, LightGBM classifiers, and fine-tuned T5-small text generation.
The repository includes: raw and processed data, tensor-formatted datasets for model training, and hyperparameter search results using Optuna.
Folder Structure… See the full description on the dataset page: https://huggingface.co/datasets/luminolous/xids-dataset.
Training dataset partitioning results using CatBoost.
Accurate real-time icing grid fields are critical for preventing ice-related disasters during winter and protecting property. These fields are essential both for mapping ice distribution and for predicting icing using physical models combined with numerical weather prediction systems. However, developing precise real-time icing grids is challenging due to the uneven distribution of monitoring stations, data confidentiality restrictions, and the limitations of existing interpolation methods. In this study, we propose a new approach for constructing real-time icing grid fields using 1,339 online terminal monitoring datasets provided by the China Southern Power Grid Research Institute Co., Ltd. (CSPGRI) during the winter of 2023. Our method integrates static geographic information, dynamic meteorological factors, and ice_kriging values derived from parameter-optimized Empirical Bayesian Kriging Interpolation (EBKI) to create a spatiotemporally matched, multi-source fused icing thickness grid dataset. We applied five machine learning algorithms—Random Forest, XGBoost, LightGBM, Stacking, and Convolutional Neural Network Transformers (CNNT)—and evaluated their performance using six metrics: R, RMSE, CSI, MAR, FAR, and fbias, on both validation and testing sets. The stacking model performed best: comparing predicted icing values with actual measurements on pylons, it achieved an R value of 0.634 (0.893), an RMSE of 3.424 mm (2.834 mm), a CSI of 0.514 (0.774), a MAR of 0.309 (0.091), a FAR of 0.332 (0.161), and an fbias of 1.034 (1.084) on the validation (testing) sets. Additionally, we employed the SHAP method to provide a physical interpretation of the stacking model, confirming the independence of the selected features. Meteorological factors such as relative humidity (RH), 10-meter wind speed (WS10), 2-meter temperature (T2), and precipitation (PRE) demonstrated a range of positive and negative contributions consistent with the observed growth of icing.
Thus, our multi-source remote sensing data fusion approach, combined with the stacking model, offers a highly accurate and interpretable solution for generating real-time icing grid fields.
In this repository you can find a variety of data and scripts to approximate the UTCI in southern South America and to apply it to forecasts generated by data-driven models:
1) UTCI data from ERA5-HEAT and different meteorological variables from ERA5.
2) LightGBM models trained to estimate the UTCI from different predictors.
3) Two example scripts to train the LGBM models.
4) Scripts for metric estimation on the test sample of different LightGBM-based models with different predictors.
5) Forecasts from the traditional GFS model and from data-driven models during a heat wave in central Argentina in March 2023.
6) Scripts to apply the UTCI approach to the forecasts mentioned in the previous item.
This material is related to the article "Forecasting Heat Stress in southern South America from data-driven model outputs".
Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared for identifying tissue-specific genes, particularly in plants.
Methods: In this study, an expression matrix built from 1,548 maize multi-tissue RNA-seq samples obtained from a public database was processed with a linear model (Limma), a machine learning model (LightGBM), and a deep learning model (CNN), using information gain and the SHAP strategy, to identify tissue-specific genes. For validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.
Results: Based on the clustering validation, the convolutional neural network outperformed the others with the highest V-measure value of 0.647, indicating that its gene set covered as many specific properties of the various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.
Discussion: Different tissue-specific gene sets were identified owing to the distinct interpretation strategies of the machine learning models; researchers may use multiple methodologies and strategies for tissue-specific gene sets depending on their goals, types of data, and computational resources. This study provides comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving the high-dimensionality and bias difficulties in bioinformatics data processing.
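The V-measure validation step can be sketched as follows: cluster gene expression profiles with k-means and score the agreement between cluster assignments and the true tissue labels. The expression data here are synthetic (each "tissue-specific" gene is elevated in exactly one tissue), so only the evaluation recipe matches the study:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
n_tissues, genes_per_tissue = 5, 40

# Synthetic profiles: one row per gene, one column per tissue;
# each tissue-specific gene is high in its own tissue
tissue = np.repeat(np.arange(n_tissues), genes_per_tissue)
expr = rng.normal(size=(n_tissues * genes_per_tissue, n_tissues))
expr[np.arange(len(tissue)), tissue] += 4.0

labels = KMeans(n_clusters=n_tissues, n_init=10, random_state=0).fit_predict(expr)
print("V-measure:", round(v_measure_score(tissue, labels), 3))
```

V-measure is the harmonic mean of homogeneity and completeness, so a gene set that clusters cleanly by tissue scores close to 1 regardless of how the cluster IDs are permuted.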
ObjectiveMetagenomic next-generation sequencing (mNGS) can potentially detect various pathogenic microorganisms without bias to improve the diagnostic rate of fever of unknown origin (FUO), but there are no effective methods to predict mNGS-positive results. This study aimed to develop an interpretable machine learning algorithm for the effective prediction of mNGS results in patients with FUO.MethodsA clinical dataset from a large medical institution was used to develop and compare the performance of several predictive models, namely eXtreme Gradient Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM), and Random Forest, and the Shapley additive explanation (SHAP) method was employed to interpret and analyze the results.ResultsThe mNGS-positive rate among 284 patients with FUO reached 64.1%. Overall, the LightGBM-based model exhibited the best comprehensive predictive performance, with areas under the curve of 0.84 and 0.93 for the training and validation sets, respectively. Using the SHAP method, the five most important factors for predicting mNGS-positive results were albumin, procalcitonin, blood culture, disease type, and sample type.ConclusionThe validated LightGBM-based predictive model could have practical clinical value in enhancing the application of mNGS in the etiological diagnosis of FUO, representing a powerful tool to optimize the timing of mNGS.
The accuracy of digital elevation models (DEMs) in forested areas plays a crucial role in canopy height monitoring and ecological sensitivity analysis. Despite extensive research on DEMs in recent years, significant errors still exist in forested areas due to factors such as canopy occlusion, terrain complexity, and limited penetration, posing challenges for subsequent analyses based on DEMs. Therefore, a CNN-LightGBM hybrid model is proposed in this paper, with four different types of forest (tropical rainforest, coniferous forest, mixed coniferous and broad-leaved forest, and broad-leaved forest) selected as study sites to validate the performance of the hybrid model in correcting COP30DEM across different forest-area DEMs. The hybrid model uses the DenseNet architecture for its CNN component, with LightGBM as the primary model. This choice is based on LightGBM's leaf-wise growth strategy and histogram-based methods, which are effective in reducing the data's memory footprint and utilising more of the data without sacrificing speed. The study uses elevation values from ICESat-2 as ground truth, covering several parameters including COP30DEM, canopy height, forest coverage, slope, terrain roughness, and relief amplitude. To validate the superiority of the CNN-LightGBM hybrid model in DEM correction, it is compared against the LightGBM, CNN-SVR, and SVR models within the same sample space. Common meta-heuristic optimisation algorithms can alleviate overfitting and underfitting during model training to a certain extent, but they still have shortcomings. To overcome them, this paper introduces an improved SSA search algorithm that incorporates the ingestion strategy of the FA algorithm to increase the diversity of solutions and the global search capability: the Firefly Algorithm-based Sparrow Search Optimization Algorithm (FA-SSA algorithm).
By comparing multiple models and validating against an airborne LiDAR reference dataset, the results show that the R2 (R-squared) of the CNN-LightGBM model improves by more than 0.05 compared with the other models. The FA-SSA-CNN-LightGBM model has the highest accuracy, with an RMSE of 1.09 meters, a reduction of more than 30% in RMSE compared with LightGBM and the other hybrid models. Compared with other forested-area DEMs (such as FABDEM and GEDI), its accuracy is improved by more than 50%, and its performance is significantly better than other DEMs commonly used in forested areas, indicating the feasibility of this method for correcting elevation errors in forested-area DEMs and its significance for advancing global topographic mapping.
Paired t-test for detecting the difference between DMFDEM errors and other types of DEM errors.
Objective: Distant metastasis other than non-regional lymph nodes and lung (i.e., M1b stage) significantly contributes to the poor survival prognosis of patients with germ cell testicular cancer (GCTC). The aim of this study was to develop a machine learning (ML) model to predict the risk of patients with GCTC developing M1b-stage disease, which can be used to assist in early intervention.
Methods: The clinical and pathological data of patients with GCTC were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Combining the patients' characteristic variables, we applied six ML algorithms to develop the predictive models: logistic regression (LR), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), random forest (RF), multilayer perceptron (MLP), and k-nearest neighbor (kNN). Model performance was evaluated by 10-fold cross-validated receiver operating characteristic (ROC) curves, from which the area under the curve (AUC) was calculated as the measure of predictive accuracy. A total of 54 patients from our own center (October 2006 to June 2021) were collected as the external validation cohort.
Results: A total of 4,323 eligible patients were screened for enrollment from the SEER database, of whom 178 (4.12%) developed M1b stage. Multivariate logistic regression showed that lymph node dissection (LND), T stage, N stage, lung metastases, and distant lymph node metastases were independent predictors of the risk of developing M1b stage. The models based on both the XGBoost and RF algorithms showed stable and efficient prediction performance in the training and external validation groups.
Conclusion: S-stage is not an independent factor for predicting the risk of developing M1b stage in patients with GCTC. The ML models based on the XGBoost and RF algorithms have high predictive effectiveness and may be used to predict the risk of developing M1b stage in patients with GCTC, which is of promising value in clinical decision-making. The models still need to be tested on a larger sample of real-world data.
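The 10-fold cross-validated ROC evaluation can be sketched with scikit-learn. The imbalanced synthetic cohort (roughly 10% positives, echoing the rarity of M1b-stage disease) and the random forest settings are illustrative, not the study's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic cohort standing in for the SEER-derived variables
X, y = make_classification(n_samples=800, n_features=12, n_informative=6,
                           weights=[0.9], random_state=0)

# Stratified 10-fold CV keeps the rare positive class represented in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                       X, y, cv=cv, scoring="roc_auc")
print(f"mean 10-fold AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```

With only ~4% of patients reaching M1b stage in the actual cohort, stratified folds (rather than plain k-fold) are what keep the per-fold AUC estimates stable.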