Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used VGG16 for feature extraction. VGG16 is a 16-layer CNN that was trained on millions of images from the ImageNet database. The we used XGboost for regression and LightGBM for classification of rooftop heights.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative gait analysis is important for understanding the non-typical walking patterns associated with mobility impairments. Conventional linear statistical methods and machine learning (ML) models are commonly used to assess gait performance and related changes in the gait parameters. Nonetheless, explainable machine learning provides an alternative technique for distinguishing the significant and influential gait changes stemming from a given intervention. The goal of this work was to demonstrate the use of explainable ML models in gait analysis for prosthetic rehabilitation in both population- and sample-based interpretability analyses. Models were developed to classify amputee gait with two types of prosthetic knee joints. Sagittal plane gait patterns of 21 individuals with unilateral transfemoral amputations were video-recorded and 19 spatiotemporal and kinematic gait parameters were extracted and included in the models. Four ML models—logistic regression, support vector machine, random forest, and LightGBM—were assessed and tested for accuracy and precision. The Shapley Additive exPlanations (SHAP) framework was applied to examine global and local interpretability. Random Forest yielded the highest classification accuracy (98.3%). The SHAP framework quantified the level of influence of each gait parameter in the models where knee flexion-related parameters were found the most influential factors in yielding the outcomes of the models. The sample-based explainable ML provided additional insights over the population-based analyses, including an understanding of the effect of the knee type on the walking style of a specific sample, and whether or not it agreed with global interpretations. It was concluded that explainable ML models can be powerful tools for the assessment of gait-related clinical interventions, revealing important parameters that may be overlooked using conventional statistical methods.
Sentinel QoS Training Dataset
This dataset contains synthetic network traffic features used to train the Sentry LightGBM classifier in the Sentinel-QoS project. Files
training_data.csv — Tabular CSV with per-flow/session features and a target label.
Columns (example)
src_ip, dst_ip, src_port, dst_port protocol — e.g., TCP/UDP bytes, packets, duration app_label — human-readable application class (e.g., Video, Gaming, Browsing) target — numeric label used for model training
Usage… See the full description on the dataset page: https://huggingface.co/datasets/Pulast/sentry_training_data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
X-IDS Dataset & Artifacts Repository
This repository contains all the data assets, experiment results, and preprocessing steps used in the development of the X-IDS system — an Explainable Intrusion Detection System using autoencoders, LightGBM classifiers, and fine-tune T5-small text generation.
The repository includes: raw and processed data, tensor-formatted datasets for model training, and hyperparameter search results using Optuna.
Folder Structure… See the full description on the dataset page: https://huggingface.co/datasets/luminolous/xids-dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundEndoscopic submucosal dissection has become the primary option of treatment for early gastric cancer. However, lymph node metastasis may lead to poor prognosis. We analyzed factors related to lymph node metastasis in EGC patients, and we developed a construction prediction model with machine learning using data from a retrospective series.MethodsTwo independent cohorts’ series were evaluated including 305 patients with EGC from China as cohort I and 35 patients from Spain as cohort II. Five classifiers obtained from machine learning were selected to establish a robust prediction model for lymph node metastasis in EGC.ResultsThe clinical variables such as invasion depth, histologic type, ulceration, tumor location, tumor size, Lauren classification, and age were selected to establish the five prediction models: linear support vector classifier (Linear SVC), logistic regression model, extreme gradient boosting model (XGBoost), light gradient boosting machine model (LightGBM), and Gaussian process classification model. Interestingly, all prediction models of cohort I showed accuracy between 70 and 81%. Furthermore, the prediction models of the cohort II exhibited accuracy between 48 and 82%. The areas under curve (AUC) of the five models between cohort I and cohort II were between 0.736 and 0.830.ConclusionsOur results support that the machine learning method could be used to predict lymph node metastasis in early gastric cancer and perhaps provide another evaluation method to choose the suited treatment for patients.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Timely prediction of memory failures is crucial for the stable operation of data centers. However, existing methods often rely on a single classifier, which can lead to inaccurate or unstable predictions. To address this, we propose a new ensemble model for predicting CE-driven memory failures, where failures occur due to a surge of correctable errors (CEs) in memory, causing server downtime. Our model combines several strong-performing classifiers, such as Random Forest, LightGBM, and XGBoost, and assigns different weights to each based on its performance. By optimizing the decision-making process, the model improves prediction accuracy. We validate the model using in-memory data from Alibaba’s data center, and the results show an accuracy of over 84%, outperforming existing single and dual-classifier models, further confirming its excellent predictive performance.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A fault diagnosis method for oil immersed transformers based on principal component analysis and SSA LightGBM is proposed to address the problem of low diagnostic accuracy caused by the complexity of current oil immersed transformer faults. Firstly, data on dissolved gases in oil is collected, and a 17 dimensional fault feature matrix is constructed using the uncoded ratio method. The feature matrix is then standardized to obtain joint features. Secondly, principal component analysis is used for feature fusion to eliminate information redundancy between variables and construct fused features. Finally, a transformer diagnostic model based on SSA-LightGBM was constructed, and the ten fold cross validation method was used to verify the classification ability of the model. The experimental results show that the SSA-LightGBM model proposed in this paper has an average fault diagnosis accuracy of 93.6% after SSA algorithm optimization, which is 3.6% higher than before optimization. At the same time, compared with the GA-LightGBM and GWO-LightGBM fault diagnosis models, SSA-LightGBM has improved the diagnostic accuracy by 8.1% and 5.7% respectively, verifying that this method can effectively improve the fault diagnosis performance of oil immersed transformers and is superior to other similar methods.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset and static code base associated with the paper: "Superphot+: Real-Time Fitting and Classification of Supernova Light Curves". The contents are as follows:
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.
In this way, the TPS competition starts!
The dataset contains all infomation of diverse trainings: train_predictions and test predictions. I tried diverse models: 11 lightgbm, 4 xgboost, 7 catboost, 1 keras, 1 deebtable, 2 logistic regressions, 5 autolightml. This models was obtained with different preprocessing - Considering create categorical features. Low range of values (max=10, 15 values). - Trying a diversity of encoding in numerical, categorical features - Binning some features. - Considering cluster as a feature. - Considering interactions between features - Remove duplicates.
Things to do: I think i should tried an autoencoder.
Thanks to Kaggle community!
See this reference notebooks. This two guys deserves all the claps. - https://www.kaggle.com/davidedwards1/tabmar21-tabular-blend-final-sub - https://www.kaggle.com/hiro5299834/3rd-tps-mar-2021-stacking
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning hyperparameters.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of detected semantic clones using the proposed technique.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MicroRNAs (miRNAs) are closely associated with the occurrences and developments of many complex human diseases. Increasing studies have shown that miRNAs emerge as new therapeutic targets of small molecule (SM) drugs. Since traditional experiment methods are expensive and time consuming, it is particularly crucial to find efficient computational approaches to predict potential small molecule-miRNA (SM-miRNA) associations. Considering that integrating multi-source heterogeneous information related with SM-miRNA association prediction would provide a comprehensive insight into the features of both SMs and miRNAs, we proposed a novel model of Small Molecule-MiRNA Association prediction based on Heterogeneous Network Representation Learning (SMMA-HNRL) for more precisely predicting the potential SM-miRNA associations. In SMMA-HNRL, a novel heterogeneous information network was constructed with SM nodes, miRNA nodes and disease nodes. To access and utilize of the topological information of the heterogeneous information network, feature vectors of SM and miRNA nodes were obtained by two different heterogeneous network representation learning algorithms (HeGAN and HIN2Vec) respectively and merged with connect operation. Finally, LightGBM was chosen as the classifier of SMMA-HNRL for predicting potential SM-miRNA associations. The 10-fold cross validations were conducted to evaluate the prediction performance of SMMA-HNRL, it achieved an area under of ROC curve of 0.9875, which was superior to other three state-of-the-art models. With two independent validation datasets, the test experiment results revealed the robustness of our model. Moreover, three case studies were performed. As a result, 35, 37, and 22 miRNAs among the top 50 predicting miRNAs associated with 5-FU, cisplatin, and imatinib were validated by experimental literature works respectively, which confirmed the effectiveness of SMMA-HNRL. The source code and experimental data of SMMA-HNRL are available at https://github.com/SMMA-HNRL/SMMA-HNRL.
Background and purposeCorpus callosum (CC) infarction is an extremely rare subtype of cerebral ischemic stroke, however, the symptoms of cognitive impairment often fail to attract early attention of patients, which seriously affects the long-term prognosis, such as high mortality, personality changes, mood disorders, psychotic reactions, financial burden and so on. This study seeks to develop and validate models for early predicting the risk of subjective cognitive decline (SCD) after CC infarction by machine learning (ML) algorithms.MethodsThis is a prospective study that enrolled 213 (only 3.7%) CC infarction patients from a nine-year cohort comprising 8,555 patients with acute ischemic stroke. Telephone follow-up surveys were carried out for the patients with definite diagnosis of CC infarction one-year after disease onset, and SCD was identified by Behavioral Risk Factor Surveillance System (BRFSS) questionnaire. Based on the significant features selected by the least absolute shrinkage and selection operator (LASSO), seven ML models including Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gaussian Naïve Bayes (GNB), Complement Naïve Bayes (CNB), and Support vector machine (SVM) were established and their predictive performances were compared by different metrics. Importantly, the SHapley Additive exPlanations (SHAP) was also utilized to examine internal behavior of the highest-performance ML classifier.ResultsThe Logistic Regression (LR)-model performed better than other six ML-models in SCD predictability after the CC infarction, with the area under the receiver characteristic operator curve (AUC) of 77.1% in the validation set. Using LASSO and SHAP analysis, we found that infarction subregions of CC infarction, female, 3-month modified Rankin Scale (mRS) score, age, homocysteine, location of angiostenosis, neutrophil to lymphocyte ratio, pure CC infarction, and number of angiostenosis were the top-nine significant predictors in the order of importance for the output of LR-model. Meanwhile, we identified that infarction subregion of CC, female, 3-month mRS score and pure CC infarction were the factors which independently associated with the cognitive outcome.ConclusionOur study firstly demonstrated that the LR-model with 9 common variables has the best-performance to predict the risk of post-stroke SCD due to CC infarcton. Particularly, the combination of LR-model and SHAP-explainer could aid in achieving personalized risk prediction and be served as a decision-making tool for early intervention since its poor long-term outcome.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code representation techniques used in the literature.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Computational prediction to screen potential vaccine candidates has been proven to be a reliable way to provide guarantees for vaccine discovery in infectious diseases. As an important class of organisms causing infectious diseases, pathogenic eukaryotes (such as parasitic protozoans) have evolved the ability to colonize a wide range of hosts, including humans and animals; meanwhile, protective vaccines are urgently needed. Inspired by the immunological idea that pathogen-derived epitopes are able to mediate the CD8+ T-cell-related host adaptive immune response and with the available positive and negative CD8+ T-cell epitopes (TCEs), we proposed a novel predictor called CD8TCEI-EukPath to detect CD8+ TCEs of eukaryotic pathogens. Our method integrated multiple amino acid sequence-based hybrid features, employed a well-established feature selection technique, and eventually built an efficient machine learning classifier to differentiate CD8+ TCEs from non-CD8+ TCEs. Based on the feature selection results, 520 optimal hybrid features were used for modeling by utilizing the LightGBM algorithm. CD8TCEI-EukPath achieved impressive performance, with an accuracy of 79.255% in ten-fold cross-validation and an accuracy of 78.169% in the independent test. Collectively, CD8TCEI-EukPath will contribute to rapidly screening epitope-based vaccine candidates, particularly from large peptide-coding datasets. To conduct the prediction of CD8+ TCEs conveniently, an online web server is freely accessible (http://lab.malab.cn/∼hrs/CD8TCEI-EukPath/).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LightGBM hyperparameters with default values, search ranges, and selected optimal values.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used VGG16 for feature extraction. VGG16 is a 16-layer CNN that was trained on millions of images from the ImageNet database. The we used XGboost for regression and LightGBM for classification of rooftop heights.