Adjust the parameters by extracting only the rows that do not contain any missing values. The best result was obtained when using ver4.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Quantitative gait analysis is important for understanding the non-typical walking patterns associated with mobility impairments. Conventional linear statistical methods and machine learning (ML) models are commonly used to assess gait performance and related changes in gait parameters. Explainable machine learning, however, provides an alternative technique for distinguishing the significant and influential gait changes stemming from a given intervention. The goal of this work was to demonstrate the use of explainable ML models in gait analysis for prosthetic rehabilitation in both population- and sample-based interpretability analyses. Models were developed to classify amputee gait with two types of prosthetic knee joints. Sagittal-plane gait patterns of 21 individuals with unilateral transfemoral amputations were video-recorded, and 19 spatiotemporal and kinematic gait parameters were extracted and included in the models. Four ML models (logistic regression, support vector machine, random forest, and LightGBM) were assessed and tested for accuracy and precision. The Shapley Additive exPlanations (SHAP) framework was applied to examine global and local interpretability. The random forest model yielded the highest classification accuracy (98.3%). The SHAP framework quantified the influence of each gait parameter in the models; knee flexion-related parameters were found to be the most influential in determining model outcomes. The sample-based explainable ML provided additional insights over the population-based analyses, including an understanding of the effect of the knee type on the walking style of a specific sample and whether or not it agreed with the global interpretations. It was concluded that explainable ML models can be powerful tools for assessing gait-related clinical interventions, revealing important parameters that may be overlooked by conventional statistical methods.
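To illustrate the population- and sample-based interpretability described above, here is a minimal sketch of applying SHAP to a fitted tree-based classifier; the synthetic data, feature names, and model settings are assumptions, not the study's actual pipeline:

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the extracted gait parameters and the knee-type label
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=["knee_flexion_max", "stance_time", "step_length", "cadence", "hip_flexion_max"])
y = (X["knee_flexion_max"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# SHAP's Explanation API (recent SHAP versions); slicing selects the positive class
explainer = shap.TreeExplainer(model)
sv = explainer(X)                  # shape: (samples, features, classes)
shap.plots.beeswarm(sv[:, :, 1])   # population-based (global) parameter influence
shap.plots.waterfall(sv[0, :, 1])  # sample-based (local) explanation for one record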
In this dataset, I've compiled and shared the best-fitted models with parameters optimized using GridSearchCV. These parameters have been carefully selected and tuned to provide strong predictive performance for the given task.
The dataset includes pickle files containing the best parameter settings for different machine learning algorithms. Here's what you'll find:
CatBoost Classifier Parameters (catboost.pkl): Unleash the power of gradient boosting with categorical features. The pickle file contains a model with tuned hyperparameters for the CatBoost model.
LightGBM Classifier Parameters (lgbm.pkl): Experience the efficiency and accuracy of LightGBM. The pickle file holds the model with optimized hyperparameters for the LightGBM model.
Random Forest Classifier Parameters (rf.pkl): Embrace the classic Random Forest algorithm. The pickle file presents the model with the best hyperparameters for the Random Forest model.
TabNet Classifier Parameters (tab_net.pkl): Dive into the world of TabNet's attention mechanisms. The pickle file showcases the ideal hyperparameters for the TabNet model. You can use this model directly for prediction, but make sure you use the same columns as in my notebook and apply the exact same feature engineering.
XGBoost Classifier Parameters (xgb.pkl): Harness the power of XGBoost's gradient boosting techniques. The pickle file includes the model with the finest hyperparameter settings for the XGBoost model.
These pickle files provide a snapshot of the hyperparameters that have yielded exceptional results in terms of accuracy and generalization. They are a valuable resource for anyone aiming to enhance their predictive modeling skills or participate in the Spaceship Titanic competition.
Feel free to explore and utilize these best-fitted model parameters in your analysis and modeling endeavors. Let's continue to learn, collaborate, and push the boundaries of data science together.
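As a minimal sketch of how one of these files might be loaded and used (the file names match the list above; the preprocessed test set is a hypothetical placeholder and must reproduce the notebook's columns and feature engineering):

import pickle
import pandas as pd

# Hypothetical preprocessed test set with exactly the columns and
# feature engineering used when the models were fitted
X_test = pd.read_csv("test_processed.csv")

with open("lgbm.pkl", "rb") as f:  # any of the .pkl files listed above
    model = pickle.load(f)

predictions = model.predict(X_test)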
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Optimization range of LightGBM parameters.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A fault diagnosis method for oil-immersed transformers based on principal component analysis (PCA) and SSA-LightGBM is proposed to address the low diagnostic accuracy caused by the complexity of oil-immersed transformer faults. First, data on dissolved gases in oil are collected, and a 17-dimensional fault feature matrix is constructed using the uncoded ratio method; the feature matrix is then standardized to obtain joint features. Second, PCA is used for feature fusion to eliminate information redundancy between variables and construct fused features. Finally, a transformer diagnostic model based on SSA-LightGBM is constructed, and ten-fold cross-validation is used to verify the classification ability of the model. The experimental results show that the proposed SSA-LightGBM model achieves an average fault diagnosis accuracy of 93.6% after optimization with the SSA algorithm, 3.6% higher than before optimization. Compared with the GA-LightGBM and GWO-LightGBM fault diagnosis models, SSA-LightGBM improves diagnostic accuracy by 8.1% and 5.7%, respectively, verifying that this method can effectively improve the fault diagnosis performance of oil-immersed transformers and is superior to other similar methods.
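A minimal sketch of the standardize-PCA-LightGBM pipeline with ten-fold cross-validation; the synthetic data stand in for the 17-dimensional gas-ratio features, and the SSA hyperparameter search from the paper is omitted:

import numpy as np
import lightgbm as lgb
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the dissolved-gas feature matrix and fault labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
y = rng.integers(0, 6, size=500)  # e.g., six fault classes

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=0.95),  # keep components explaining 95% of variance
                     lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(scores.mean())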
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Various methods were used to tackle TPS Oct 2021, including LightGBM, CatBoost, and XGBoost, in which hyperparameters play an important role.
One way is to use a hyperparameter optimization framework such as Optuna to find the hyperparameters. Another way is to reuse parameters that have already been created and produced good results.
In this dataset, I collected all the LightGBM, CatBoost, and XGBoost parameters introduced in TPS Oct 2021.
All parameters were checked under one condition. I used the following setup to measure the score of each parameter set. This is not the final accuracy, because it is measured with only 20% of the data, but it serves as a criterion for comparing the parameter sets.
import datatable as dt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the 20% sample of the training data, dropping the 'id' column
train_20 = dt.fread('sample_train_20.csv',
                    columns=lambda cols: [col.name not in ('id',) for col in cols]).to_pandas()
y = train_20['target']
X = train_20.drop(columns=['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 5, random_state=59)

# model_from_csv / params_from_csv stand for the model class and the
# parameter set read from one row of the CSV files in this dataset
model = model_from_csv(**params_from_csv)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric=["auc"],
    verbose=False,
    early_stopping_rounds=600,
)
y_predicted = model.predict_proba(X_test)
accuracy = roc_auc_score(y_test, y_predicted[:, 1])  # ROC AUC is the comparison metric
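For reference, here is a minimal sketch of the Optuna search mentioned above, reusing the X_train/X_test split and the roc_auc_score import from the snippet above; the search space is an assumption, not the one behind the collected parameters:

import optuna
import lightgbm as lgb

def objective(trial):
    # Hypothetical LightGBM search space
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(n_estimators=500, **params)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, proba)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)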
I will try to create similar collections for future competitions as well if this is helpful for you. If you find this dataset helpful, please don't forget to upvote it. Thank you in advance.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Hyperparameters of LightGBM.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
In this project, I focus on enhancing global building data by combining multiple open-source geospatial datasets to predict building attributes, specifically the number of levels (floors). The core datasets used are the Microsoft Open Buildings dataset, which provides detailed building footprints across many regions, and Google’s Temporal Buildings Dataset (V1), which includes estimated building heights over time derived from satellite imagery. While Google's dataset includes height information for many buildings, a significant portion contains missing or unreliable values.
To address this, I first performed data preprocessing and merged the two datasets based on geographic coordinates. For buildings with missing height values, I used LightGBM, a gradient boosting framework, to impute missing heights using features like footprint area, geometry, and surrounding context. I then brought in OpenStreetMap (OSM) data to enrich the dataset with additional contextual information, such as building type, land use, and nearby infrastructure.
Using the combined dataset — now with both original and imputed heights — I trained a Random Forest Regressor to predict the number of building levels. Since floor count is not always directly available, especially in developing regions, this approach offers a way to estimate it from height and footprint data with relatively high accuracy.
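A minimal sketch of this two-stage approach, with synthetic data and hypothetical column names standing in for the real merged features:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor

# Hypothetical merged building table
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "footprint_area": rng.uniform(20, 2000, n),
    "perimeter": rng.uniform(15, 400, n),
    "height": rng.uniform(3, 60, n),
})
df.loc[rng.random(n) < 0.3, "height"] = np.nan  # simulate missing heights
features = ["footprint_area", "perimeter"]

# Stage 1: impute missing heights with LightGBM
known = df["height"].notna()
height_model = lgb.LGBMRegressor(n_estimators=300)
height_model.fit(df.loc[known, features], df.loc[known, "height"])
df.loc[~known, "height"] = height_model.predict(df.loc[~known, features])

# Stage 2: predict the number of levels from height plus footprint features
# (levels are simulated here; in practice the labels come from OSM tags)
df["levels"] = (df["height"] // 3).clip(lower=1)
levels_model = RandomForestRegressor(n_estimators=300, random_state=0)
levels_model.fit(df[features + ["height"]], df["levels"])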
This kind of modeling has important real-world applications. Predicting building levels can help support urban planning, disaster response, infrastructure development, and climate risk modeling. For example, knowing the number of floors in buildings allows for better estimation of population density, potential occupancy, or structural vulnerability in earthquake-prone or flood-prone regions. It can also help fill gaps in existing GIS data where traditional surveys are too expensive or time-consuming.
In future work, this framework could be extended globally and refined with additional data sources like LIDAR or census information to further improve the accuracy and coverage of building-level models.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of the results of P-feature and TF-feature in LightGBM.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
LightGBM hyperparameters with default values, search ranges, and selected optimal values.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Neural network hyperparameters.
Timely prediction of memory failures is crucial for the stable operation of data centers. However, existing methods often rely on a single classifier, which can lead to inaccurate or unstable predictions. To address this, we propose a new ensemble model for predicting CE-driven memory failures, where a surge of correctable errors (CEs) in memory causes server downtime. Our model combines several strong-performing classifiers, such as Random Forest, LightGBM, and XGBoost, and assigns each a weight based on its performance. By optimizing the decision-making process, the model improves prediction accuracy. We validated the model using memory data from Alibaba's data center, and the results show an accuracy of over 84%, outperforming existing single- and dual-classifier models and confirming its excellent predictive performance.
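A minimal sketch of a performance-weighted soft-voting ensemble in this spirit; the data, weighting metric, and member settings are assumptions, not the paper's exact scheme:

import lightgbm as lgb
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for CE-log features and failure labels
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

members = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("lgbm", lgb.LGBMClassifier(n_estimators=200)),
    ("xgb", xgb.XGBClassifier(n_estimators=200, eval_metric="logloss")),
]

# Weight each member by its cross-validated F1 score on the training data
weights = [cross_val_score(m, X_train, y_train, cv=3, scoring="f1").mean()
           for _, m in members]

ensemble = VotingClassifier(members, voting="soft", weights=weights)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))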
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
MAPE of all methods without noise.
Introduction: Machine learning (ML) is an effective tool for predicting mental states and is a key technology in digital psychiatry. This study aimed to develop ML algorithms to predict the upper-tertile group of various anxiety symptoms based on multimodal data from virtual reality (VR) therapy sessions for social anxiety disorder (SAD) patients and to evaluate their predictive performance across each data type.
Methods: This study included 32 SAD-diagnosed individuals and finalized a dataset of 132 samples from 25 participants. It utilized multimodal (physiological and acoustic) data from VR sessions that simulate social anxiety scenarios. The study employed the extended Geneva minimalistic acoustic parameter set (eGeMAPS) for acoustic feature extraction and extracted statistical attributes from time-series-based physiological responses. We developed ML models that predict the upper-tertile group for various anxiety symptoms in SAD using Random Forest, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost) models. The best parameters were explored through grid search or random search, and the models were validated using stratified cross-validation and leave-one-out cross-validation.
Results: The CatBoost model, using multimodal features, exhibited high performance, particularly for the Social Phobia Scale, with an area under the receiver operating characteristic curve (AUROC) of 0.852. It also showed strong performance in predicting cognitive symptoms, with the highest AUROC of 0.866 for the Post-Event Rumination Scale. For generalized anxiety, the LightGBM prediction for the State-Trait Anxiety Inventory-trait yielded an AUROC of 0.819. In the same analysis, models using only physiological features had AUROCs of 0.626, 0.744, and 0.671, whereas models using only acoustic features had AUROCs of 0.788, 0.823, and 0.754.
Conclusions: This study showed that an ML algorithm using integrated multimodal data can predict upper-tertile anxiety symptoms in patients with SAD with higher performance than acoustic or physiological data obtained during a VR session. The results can be used as evidence for personalized VR sessions and demonstrate the strength of multimodal data in clinical use.
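A minimal sketch of the leave-one-out validation mentioned in the Methods; the synthetic data and model settings are assumptions:

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for 132 multimodal samples with a binary upper-tertile label
X, y = make_classification(n_samples=132, n_features=30, random_state=0)

model = CatBoostClassifier(iterations=200, verbose=False)
proba = cross_val_predict(model, X, y, cv=LeaveOneOut(), method="predict_proba")
print(roc_auc_score(y, proba[:, 1]))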
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of the results of P-feature and TF-feature in GRU.
Objective: Mild Cognitive Impairment (MCI) is a recognized precursor to Alzheimer's Disease (AD), presenting a significant risk of progression. Early detection and intervention in MCI can potentially slow disease advancement, offering substantial clinical benefits. This study employed radiomics and machine learning methodologies to distinguish between MCI and Normal Cognition (NC) groups.
Methods: The study included 172 MCI patients and 183 healthy controls from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, all of whom had 3D T1-weighted MRI structural images. The cerebellar gray and white matter were segmented automatically using volBrain software, and radiomic features were extracted and screened with PyRadiomics. The screened features were then input into various machine learning models, including Random Forest (RF), Logistic Regression (LR), eXtreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Extra Trees, Light Gradient Boosting Machine (LightGBM), and Multilayer Perceptron (MLP). Each model's penalty parameters were optimized through 5-fold cross-validation to construct the radiomic models. The DeLong test was used to evaluate the performance of the different models.
Results: The LightGBM model, which utilizes a combination of cerebellar gray and white matter features (comprising eight gray matter and eight white matter features), emerged as the most effective model for the radiomics feature analysis, achieving an Area Under the Curve (AUC) of 0.863 for the training set and 0.776 for the test set.
Conclusion: Radiomic features based on the cerebellar gray and white matter, combined with machine learning, can objectively diagnose MCI, which provides significant clinical value for assisted diagnosis.
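A minimal sketch of the PyRadiomics extraction step; the file paths and enabled feature classes are assumptions:

from radiomics import featureextractor

# Hypothetical paths to a T1-weighted image and a cerebellar gray-matter mask
image_path = "sub01_T1w.nii.gz"
mask_path = "sub01_cerebellum_gm_mask.nii.gz"

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")
extractor.enableFeatureClassByName("glcm")

features = extractor.execute(image_path, mask_path)
for name, value in features.items():
    if not name.startswith("diagnostics"):  # skip metadata entries
        print(name, value)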
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Background: Colorectal cancer (CRC) is a highly frequent cancer worldwide, with early detection and risk stratification playing a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.
Methods: We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People's Hospital (2020-2024) for model training and internal validation, with 463 patients from Laibin City People's Hospital for external validation. Seven ML algorithms were systematically compared, with Light Gradient Boosting Machine (LightGBM) ultimately selected as the optimal framework. Model performance was rigorously assessed through area under the receiver operating characteristic curve (AUROC) analysis, calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) methodology was employed for feature interpretation.
Results: The LightGBM model demonstrated excellent discrimination, with AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves revealed strong concordance between predictions and actual outcomes (Brier score = 0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value = 0.216) and CA19-9 (mean SHAP value = 0.198) as the dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A clinically implementable web-based risk calculator was developed for real-time probability estimation.
Conclusions: Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, effectively bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.
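A minimal sketch of the calibration assessment described in the Methods, using synthetic data rather than the study cohorts:

import lightgbm as lgb
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(n_estimators=300).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Brier score: mean squared difference between predicted probability and outcome
print("Brier score:", brier_score_loss(y_test, proba))

# Calibration curve: observed event fraction vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
print(list(zip(mean_pred, frac_pos)))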
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Comparison of all optimal model results.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The MAPE of the noise experiment on all methods.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Objective: This study aims to develop an artificial intelligence model utilizing clinical blood markers, ultrasound data, and breast biopsy pathological information to predict distant metastasis in breast cancer patients.
Methods: Data from two medical centers were utilized. Clinical blood markers, ultrasound data, and breast biopsy pathological information were separately extracted and selected, and feature dimensionality reduction was performed using Spearman correlation and LASSO regression. Predictive models were constructed using logistic regression (LR) and LightGBM machine learning algorithms and validated on internal and external validation sets. Feature correlation analysis was conducted for both models.
Results: The LR model achieved AUC values of 0.892, 0.816, and 0.817 for the training, internal validation, and external validation cohorts, respectively. The LightGBM model achieved AUC values of 0.971, 0.861, and 0.890 for the same cohorts. Clinical decision curve analysis showed a superior net benefit of the LightGBM model over the LR model in predicting distant metastasis in breast cancer. Key features identified included creatine kinase isoenzyme (CK-MB) and alpha-hydroxybutyrate dehydrogenase.
Conclusion: This study developed an artificial intelligence model using clinical blood markers, ultrasound data, and pathological information to identify distant metastasis in breast cancer patients. The LightGBM model demonstrated superior predictive accuracy and clinical applicability, suggesting it is a promising tool for early diagnosis of distant metastasis in breast cancer.
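A minimal sketch of the two-step feature reduction mentioned in the Methods; the correlation threshold and synthetic data are assumptions:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X = pd.DataFrame(X, columns=[f"marker_{i}" for i in range(30)])

# Step 1: drop one feature from each highly Spearman-correlated pair
rho_vals, _ = spearmanr(X)
rho = pd.DataFrame(rho_vals, index=X.columns, columns=X.columns)
upper = rho.where(np.triu(np.ones(rho.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c].abs() > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: keep features with nonzero LASSO coefficients
X_scaled = StandardScaler().fit_transform(X_reduced)
lasso = LassoCV(cv=5).fit(X_scaled, y)
selected = X_reduced.columns[lasso.coef_ != 0]
print(list(selected))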