37 datasets found

VGG16 + XGBoost (or LightGBM)
catalog.eoxhub.fairicube.eu
bin, data
Updated Jul 3, 2025
Cite
(2025). VGG16 + XGBoost (or LightGBM) [Dataset]. https://catalog.eoxhub.fairicube.eu/collections/ML%20collection/items/13DFOCKBVL
Available download formats: data, bin
Dataset updated
Jul 3, 2025
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 3, 2025
Area covered
Earth
Description
We used VGG16 for feature extraction. VGG16 is a 16-layer CNN that was trained on millions of images from the ImageNet database. Then we used XGBoost for regression and LightGBM for classification of rooftop heights.
Hyperparameters obtained for the classifiers.
plos.figshare.com
xls
Updated Apr 2, 2024
Cite
Mohammad Pourmahmood Aghababa; Jan Andrysek (2024). Hyperparameters obtained for the classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0300447.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0300447.t002
Dataset updated
Apr 2, 2024
Dataset provided by
PLOS ONE
Authors
Mohammad Pourmahmood Aghababa; Jan Andrysek
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Quantitative gait analysis is important for understanding the non-typical walking patterns associated with mobility impairments. Conventional linear statistical methods and machine learning (ML) models are commonly used to assess gait performance and related changes in the gait parameters. Nonetheless, explainable machine learning provides an alternative technique for distinguishing the significant and influential gait changes stemming from a given intervention. The goal of this work was to demonstrate the use of explainable ML models in gait analysis for prosthetic rehabilitation in both population- and sample-based interpretability analyses. Models were developed to classify amputee gait with two types of prosthetic knee joints. Sagittal plane gait patterns of 21 individuals with unilateral transfemoral amputations were video-recorded and 19 spatiotemporal and kinematic gait parameters were extracted and included in the models. Four ML models—logistic regression, support vector machine, random forest, and LightGBM—were assessed and tested for accuracy and precision. The Shapley Additive exPlanations (SHAP) framework was applied to examine global and local interpretability. Random Forest yielded the highest classification accuracy (98.3%). The SHAP framework quantified the level of influence of each gait parameter in the models where knee flexion-related parameters were found the most influential factors in yielding the outcomes of the models. The sample-based explainable ML provided additional insights over the population-based analyses, including an understanding of the effect of the knee type on the walking style of a specific sample, and whether or not it agreed with global interpretations. It was concluded that explainable ML models can be powerful tools for the assessment of gait-related clinical interventions, revealing important parameters that may be overlooked using conventional statistical methods.
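To make the attribution idea behind SHAP more concrete, here is a minimal sketch that computes exact Shapley values by enumerating feature coalitions. It does not use the `shap` library or the study's data. The model `toy_model`, its weights, the sample, and the baseline are all hypothetical values chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline point.

    Features outside a coalition take their baseline value. The cost grows
    as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)
    features = list(range(n))

    def value(coalition):
        point = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(point)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical linear model over three made-up gait parameters.
WEIGHTS = [0.5, -0.2, 0.1]


def toy_model(point):
    return sum(w * v for w, v in zip(WEIGHTS, point))


sample = [60.0, 1.2, 0.4]    # hypothetical individual sample
baseline = [50.0, 1.0, 0.5]  # hypothetical population average
phi = shapley_values(toy_model, sample, baseline)
```

For a linear model each attribution reduces to the weight times the feature's deviation from the baseline, and the attributions sum to the difference between the sample prediction and the baseline prediction. That additivity is what lets sample-based (local) explanations be compared with population-based (global) ones.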
sentry_training_data
huggingface.co
Updated Aug 31, 2025
Cite
Pulast S Tiwari (2025). sentry_training_data [Dataset]. https://huggingface.co/datasets/Pulast/sentry_training_data
Dataset updated
Aug 31, 2025
Authors
Pulast S Tiwari
Description
Sentinel QoS Training Dataset

This dataset contains synthetic network traffic features used to train the Sentry LightGBM classifier in the Sentinel-QoS project.

Files

training_data.csv: tabular CSV with per-flow/session features and a target label.

Columns (example)

src_ip, dst_ip, src_port, dst_port
protocol: e.g., TCP/UDP
bytes, packets, duration
app_label: human-readable application class (e.g., Video, Gaming, Browsing)
target: numeric label used for model training

Usage… See the full description on the dataset page: https://huggingface.co/datasets/Pulast/sentry_training_data.
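As a rough illustration of how a file with these columns could be read into features and labels, here is a minimal sketch using only the standard library. The column names come from the description above, but the column order, the three sample rows, and the choice of numeric features are assumptions made for this example. The real `training_data.csv` may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only the column names come from the dataset description.
SAMPLE_CSV = """src_ip,dst_ip,src_port,dst_port,protocol,bytes,packets,duration,app_label,target
10.0.0.1,10.0.0.2,50000,443,TCP,120000,90,12.5,Video,0
10.0.0.3,10.0.0.4,50001,3074,UDP,8000,200,30.0,Gaming,1
10.0.0.5,10.0.0.6,50002,80,TCP,4000,12,1.5,Browsing,2
"""

# Assumed subset of columns to treat as numeric model inputs.
NUMERIC_FEATURES = ["src_port", "dst_port", "bytes", "packets", "duration"]


def load_flows(handle):
    """Return (features, labels) parsed from a CSV file-like object."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        features.append([float(row[name]) for name in NUMERIC_FEATURES])
        labels.append(int(row["target"]))
    return features, labels


X, y = load_flows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(y)  # quick check of class balance
```

To read the actual file, the same function can be given `open("training_data.csv", newline="")` in place of the in-memory sample.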
Details of dataset information.
plos.figshare.com
xls
Updated May 10, 2024
Cite
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Details of dataset information. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302333.t005
Dataset updated
May 10, 2024
Dataset provided by
PLOS ONE
Authors
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
Selected AST non-terminal nodes.
plos.figshare.com
xls
Updated May 10, 2024
Cite
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Selected AST non-terminal nodes. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302333.t002
Dataset updated
May 10, 2024
Dataset provided by
PLOS ONE
Authors
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
xids-dataset
huggingface.co
Updated Aug 27, 2025
Cite
Lumy (2025). xids-dataset [Dataset]. https://huggingface.co/datasets/luminolous/xids-dataset
Dataset updated
Aug 27, 2025
Authors
Lumy
Description
X-IDS Dataset & Artifacts Repository

This repository contains all the data assets, experiment results, and preprocessing steps used in the development of the X-IDS system, an Explainable Intrusion Detection System using autoencoders, LightGBM classifiers, and fine-tuned T5-small text generation.

The repository includes: raw and processed data, tensor-formatted datasets for model training, and hyperparameter search results using Optuna.

Folder Structure… See the full description on the dataset page: https://huggingface.co/datasets/luminolous/xids-dataset.
DataSheet_1_A retrospective analysis based on multiple machine learning...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Jun 9, 2023
Cite
Tao Yang; Javier Martinez-Useros; JingWen Liu; Isaias Alarcón; Chao Li; WeiYao Li; Yuanxun Xiao; Xiang Ji; YanDong Zhao; Lei Wang; Salvador Morales-Conde; Zuli Yang (2023). DataSheet_1_A retrospective analysis based on multiple machine learning models to predict lymph node metastasis in early gastric cancer.zip [Dataset]. http://doi.org/10.3389/fonc.2022.1023110.s001
Available download formats: zip
Unique identifier
https://doi.org/10.3389/fonc.2022.1023110.s001
Dataset updated
Jun 9, 2023
Dataset provided by
Frontiers
Authors
Tao Yang; Javier Martinez-Useros; JingWen Liu; Isaias Alarcón; Chao Li; WeiYao Li; Yuanxun Xiao; Xiang Ji; YanDong Zhao; Lei Wang; Salvador Morales-Conde; Zuli Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Endoscopic submucosal dissection has become the primary option of treatment for early gastric cancer. However, lymph node metastasis may lead to poor prognosis. We analyzed factors related to lymph node metastasis in EGC patients, and we developed a construction prediction model with machine learning using data from a retrospective series.

Methods: Two independent cohorts’ series were evaluated including 305 patients with EGC from China as cohort I and 35 patients from Spain as cohort II. Five classifiers obtained from machine learning were selected to establish a robust prediction model for lymph node metastasis in EGC.

Results: The clinical variables such as invasion depth, histologic type, ulceration, tumor location, tumor size, Lauren classification, and age were selected to establish the five prediction models: linear support vector classifier (Linear SVC), logistic regression model, extreme gradient boosting model (XGBoost), light gradient boosting machine model (LightGBM), and Gaussian process classification model. Interestingly, all prediction models of cohort I showed accuracy between 70 and 81%. Furthermore, the prediction models of the cohort II exhibited accuracy between 48 and 82%. The areas under curve (AUC) of the five models between cohort I and cohort II were between 0.736 and 0.830.

Conclusions: Our results support that the machine learning method could be used to predict lymph node metastasis in early gastric cancer and perhaps provide another evaluation method to choose the suited treatment for patients.
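The two metrics reported above, accuracy and AUC, can be sketched in a few lines. This is an illustrative implementation on made-up labels and scores, not the study's evaluation code. The 0.5 decision threshold and the toy data are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case is scored higher
    than a random negative case (ties count as one half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = metastasis) and predicted probabilities.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the cases, which is one reason the two can diverge on a small external cohort.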
Best parameters of base classifiers.
plos.figshare.com
xls
Updated Apr 23, 2025
Cite
Peng Zhang; Jialiang Zhang; Yi Li (2025). Best parameters of base classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0321954.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0321954.t001
Dataset updated
Apr 23, 2025
Dataset provided by
PLOS ONE
Authors
Peng Zhang; Jialiang Zhang; Yi Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Timely prediction of memory failures is crucial for the stable operation of data centers. However, existing methods often rely on a single classifier, which can lead to inaccurate or unstable predictions. To address this, we propose a new ensemble model for predicting CE-driven memory failures, where failures occur due to a surge of correctable errors (CEs) in memory, causing server downtime. Our model combines several strong-performing classifiers, such as Random Forest, LightGBM, and XGBoost, and assigns different weights to each based on its performance. By optimizing the decision-making process, the model improves prediction accuracy. We validate the model using in-memory data from Alibaba’s data center, and the results show an accuracy of over 84%, outperforming existing single and dual-classifier models, further confirming its excellent predictive performance.
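The idea of weighting each classifier by its performance can be sketched as a weighted soft vote. The description does not state the exact weighting rule, so the scheme below (weights proportional to validation accuracy) is an assumption, and the accuracies and probabilities are hypothetical numbers.

```python
def accuracy_weights(val_accuracies):
    """Normalize validation accuracies so the weights sum to 1.
    Assumed weighting rule; the paper's actual rule may differ."""
    total = sum(val_accuracies.values())
    return {name: acc / total for name, acc in val_accuracies.items()}


def weighted_vote(probabilities, weights, threshold=0.5):
    """Combine per-model failure probabilities for one sample.

    Returns the weighted score and the resulting binary label.
    """
    score = sum(weights[name] * p for name, p in probabilities.items())
    return score, int(score >= threshold)


# Hypothetical validation accuracies of the base classifiers.
val_accuracies = {"random_forest": 0.80, "lightgbm": 0.85, "xgboost": 0.75}
weights = accuracy_weights(val_accuracies)

# Hypothetical predicted failure probabilities for one server.
sample_probs = {"random_forest": 0.40, "lightgbm": 0.70, "xgboost": 0.55}
score, label = weighted_vote(sample_probs, weights)
```

In this toy case one model votes "no failure" on its own, but the higher-weighted models pull the combined score above the threshold.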
Hyperparameter of tuned Random Forest classifier.
figshare.com
xls
Updated May 31, 2024
Cite
Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Hyperparameter of tuned Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302196.t003
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. 
The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
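The randomized search step mentioned above can be illustrated with a small standard-library sketch: sample parameter combinations at random and keep the best-scoring one. The search space and the scoring function `toy_score` are hypothetical stand-ins. In practice the score would come from cross-validating a real classifier.

```python
import random


def randomized_search(param_space, score_fn, n_iter=20, seed=0):
    """Sample n_iter random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the names mirror common tree-ensemble settings.
param_space = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, 16],
    "learning_rate": [0.01, 0.05, 0.1],
}


def toy_score(params):
    """Made-up objective that peaks at max_depth=8 and learning_rate=0.05."""
    return 1.0 - abs(params["max_depth"] - 8) / 16 - abs(params["learning_rate"] - 0.05)


best_params, best_score = randomized_search(param_space, toy_score, n_iter=50)
```

Unlike an exhaustive grid search, the number of evaluations is fixed by `n_iter` rather than by the size of the search space, which keeps tuning several base models affordable.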
Key parameters of LightGBM.
plos.figshare.com
xls
Updated Feb 19, 2025
Cite
Jizhong Wang; Jianfei Chi; Yeqiang Ding; Haiyan Yao; Qiang Guo (2025). Key parameters of LightGBM. [Dataset]. http://doi.org/10.1371/journal.pone.0314481.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0314481.t002
Dataset updated
Feb 19, 2025
Dataset provided by
PLOS ONE
Authors
Jizhong Wang; Jianfei Chi; Yeqiang Ding; Haiyan Yao; Qiang Guo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A fault diagnosis method for oil-immersed transformers based on principal component analysis and SSA-LightGBM is proposed to address the problem of low diagnostic accuracy caused by the complexity of current oil-immersed transformer faults. Firstly, data on dissolved gases in oil are collected, and a 17-dimensional fault feature matrix is constructed using the uncoded ratio method. The feature matrix is then standardized to obtain joint features. Secondly, principal component analysis is used for feature fusion to eliminate information redundancy between variables and construct fused features. Finally, a transformer diagnostic model based on SSA-LightGBM is constructed, and ten-fold cross-validation is used to verify the classification ability of the model. The experimental results show that the SSA-LightGBM model proposed in this paper has an average fault diagnosis accuracy of 93.6% after SSA algorithm optimization, which is 3.6% higher than before optimization. At the same time, compared with the GA-LightGBM and GWO-LightGBM fault diagnosis models, SSA-LightGBM improves the diagnostic accuracy by 8.1% and 5.7% respectively, verifying that this method can effectively improve the fault diagnosis performance of oil-immersed transformers and is superior to other similar methods.
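Two of the steps in this description, standardizing a feature column and splitting samples for ten-fold cross-validation, can be sketched with the standard library. This is an illustration under assumptions: the sample count, the shuffling seed, and the toy column are made up, and the PCA and SSA-LightGBM stages are not reproduced here.

```python
import random
from statistics import mean, pstdev


def standardize(column):
    """Z-score a feature column: zero mean and unit (population) variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train_idx, test_idx


# Hypothetical: 25 samples, one toy feature column.
splits = list(k_fold_indices(25, k=10))
z = standardize([1.0, 2.0, 3.0, 4.0])
```

Each sample lands in exactly one test fold, so averaging the ten per-fold accuracies gives the kind of "average accuracy" figure quoted in the description.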
Data for "Superphot+: Real-Time Fitting and Classification of Supernova...
zenodo.org
bin, csv, tar
Updated Jun 24, 2024
Cite
Kaylee de Soto; Ashley Villar; Edo Berger; Sebastian Gomez; Griffin Hosseinzadeh; Doug Branton; Sandro Campos; Melissa DeLucchi; Jeremy Kubica; Olivia Lynn; Konstantin Malanchev; Alex I. Malz (2024). Data for "Superphot+: Real-Time Fitting and Classification of Supernova Light Curves" [Dataset]. http://doi.org/10.5281/zenodo.10798425
Available download formats: bin, csv, tar
Unique identifier
https://doi.org/10.5281/zenodo.10798425
Dataset updated
Jun 24, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Kaylee de Soto; Ashley Villar; Edo Berger; Sebastian Gomez; Griffin Hosseinzadeh; Doug Branton; Sandro Campos; Melissa DeLucchi; Jeremy Kubica; Olivia Lynn; Konstantin Malanchev; Alex I. Malz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset and static code base associated with the paper: "Superphot+: Real-Time Fitting and Classification of Supernova Light Curves". The contents are as follows:

superphot-plus-v0.0.7.tar: Superphot+ code base downloaded at time of paper submission. Static copy of the Github repo: https://github.com/VTDA-Group/superphot-plus

dataset_spec_pruned.csv: Spectroscopic dataset pruned according to Table 1 of the paper.

dataset_phot_final.csv: Photometric dataset (without spectroscopic labels) pruned according to Section 2 of the paper. Label and probability columns are values from the ALeRCE-SN classifier.

model_0.pt: One of the 10 (redshift-independent) LightGBM models trained for 5-way SN classification.

model_0.yaml: Configuration file associated with model_0.pt.

model_z_0.pt: Same as model_0.pt, but trained using redshift information.

model_z_0.yaml: Configuration file associated with model_z_0.pt.

early_phase_classifier_0.pt: Same as model_0.pt, but trained only using early-phase light curve features. Tailored for realtime classification.

early_phase_classifier_0.yaml: Configuration file for early_phase_classifier_0.pt.

probs_concat.csv: Spectroscopic set's classification results without using redshift information.

probs_z_concat.csv: Spectroscopic set's classification results using redshift information.

probs_photometric.mrt: Superphot+'s probabilities for the photometric set without using redshift information.
[Tps May] 1st stage of modeling
kaggle.com
Updated Jun 1, 2021
Cite
Lázaro (2021). [Tps May] 1st stage of modeling [Dataset]. https://www.kaggle.com/lazaro97/tps-may-1st-stage-of-modeling
Dataset updated
Jun 1, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Lázaro
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.

In this way, the TPS competition starts!

Content

The dataset contains all information from diverse training runs: train_predictions and test predictions. I tried diverse models: 11 lightgbm, 4 xgboost, 7 catboost, 1 keras, 1 deebtable, 2 logistic regressions, 5 autolightml. These models were obtained with different preprocessing:
- Considering creating categorical features (low range of values: max=10, 15 values).
- Trying a diversity of encodings for numerical and categorical features.
- Binning some features.
- Considering cluster as a feature.
- Considering interactions between features.
- Removing duplicates.

Things to do: I think I should try an autoencoder.

Acknowledgements

Thanks to Kaggle community!

Inspiration

See these reference notebooks. These two guys deserve all the claps.
- https://www.kaggle.com/davidedwards1/tabmar21-tabular-blend-final-sub
- https://www.kaggle.com/hiro5299834/3rd-tps-mar-2021-stacking
Machine learning hyperparameters.
plos.figshare.com
bin
Updated Jun 16, 2023
Cite
Huy Le; Beverly Peng; Janelle Uy; Daniel Carrillo; Yun Zhang; Brian D. Aevermann; Richard H. Scheuermann (2023). Machine learning hyperparameters. [Dataset]. http://doi.org/10.1371/journal.pone.0275070.t001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0275070.t001
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Huy Le; Beverly Peng; Janelle Uy; Daniel Carrillo; Yun Zhang; Brian D. Aevermann; Richard H. Scheuermann
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning hyperparameters.
Results of detected semantic clones using the proposed technique.
plos.figshare.com
xls
Updated May 10, 2024
Cite
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Results of detected semantic clones using the proposed technique. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302333.t006
Dataset updated
May 10, 2024
Dataset provided by
PLOS ONE
Authors
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of detected semantic clones using the proposed technique.
Hyperparameter of Catboost classifier.
figshare.com
xls
Updated May 31, 2024
Cite
Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Hyperparameter of Catboost classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302196.t005
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. 
The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
Table7_Prediction of potential small molecule−miRNA associations based on...
figshare.com
datasetcatalog.nlm.nih.gov
xlsx
Updated Jun 21, 2023
Cite
Jianwei Li; Hongxin Lin; Yinfei Wang; Zhiguang Li; Baoqin Wu (2023). Table7_Prediction of potential small molecule−miRNA associations based on heterogeneous network representation learning.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.1079053.s008
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2022.1079053.s008
Dataset updated
Jun 21, 2023
Dataset provided by
Frontiers
Authors
Jianwei Li; Hongxin Lin; Yinfei Wang; Zhiguang Li; Baoqin Wu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MicroRNAs (miRNAs) are closely associated with the occurrences and developments of many complex human diseases. Increasing studies have shown that miRNAs emerge as new therapeutic targets of small molecule (SM) drugs. Since traditional experiment methods are expensive and time-consuming, it is particularly crucial to find efficient computational approaches to predict potential small molecule-miRNA (SM-miRNA) associations. Considering that integrating multi-source heterogeneous information related to SM-miRNA association prediction would provide a comprehensive insight into the features of both SMs and miRNAs, we proposed a novel model of Small Molecule-MiRNA Association prediction based on Heterogeneous Network Representation Learning (SMMA-HNRL) for more precisely predicting the potential SM-miRNA associations. In SMMA-HNRL, a novel heterogeneous information network was constructed with SM nodes, miRNA nodes and disease nodes. To access and utilize the topological information of the heterogeneous information network, feature vectors of SM and miRNA nodes were obtained by two different heterogeneous network representation learning algorithms (HeGAN and HIN2Vec) respectively and merged with a connect operation. Finally, LightGBM was chosen as the classifier of SMMA-HNRL for predicting potential SM-miRNA associations. Ten-fold cross-validation was conducted to evaluate the prediction performance of SMMA-HNRL; it achieved an area under the ROC curve of 0.9875, which was superior to three other state-of-the-art models. With two independent validation datasets, the test experiment results revealed the robustness of our model. Moreover, three case studies were performed. As a result, 35, 37, and 22 miRNAs among the top 50 predicted miRNAs associated with 5-FU, cisplatin, and imatinib were validated by experimental literature works respectively, which confirmed the effectiveness of SMMA-HNRL.
The source code and experimental data of SMMA-HNRL are available at https://github.com/SMMA-HNRL/SMMA-HNRL.
Data_Sheet_1_Prediction of subjective cognitive decline after corpus...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Jun 21, 2023
Cite
Liu, Yanqun; Huang, Yuxin; Xu, Yawen; Song, Chenrui; Yin, Ge; Ding, Qichao; Sun, Rui; Liang, Meng; Du, Bingying; Sun, Xu; Bi, Xiaoying (2023). Data_Sheet_1_Prediction of subjective cognitive decline after corpus callosum infarction by an interpretable machine learning-derived early warning strategy.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001107172
Dataset updated
Jun 21, 2023
Authors
Liu, Yanqun; Huang, Yuxin; Xu, Yawen; Song, Chenrui; Yin, Ge; Ding, Qichao; Sun, Rui; Liang, Meng; Du, Bingying; Sun, Xu; Bi, Xiaoying
Description
Background and purpose: Corpus callosum (CC) infarction is an extremely rare subtype of cerebral ischemic stroke. However, the symptoms of cognitive impairment often fail to attract the early attention of patients, which seriously affects long-term prognosis, with consequences such as high mortality, personality changes, mood disorders, psychotic reactions, and financial burden. This study seeks to develop and validate machine learning (ML) models for early prediction of the risk of subjective cognitive decline (SCD) after CC infarction.
Methods: This is a prospective study that enrolled 213 (only 3.7%) CC infarction patients from a nine-year cohort comprising 8,555 patients with acute ischemic stroke. Telephone follow-up surveys were carried out one year after disease onset for patients with a definite diagnosis of CC infarction, and SCD was identified with the Behavioral Risk Factor Surveillance System (BRFSS) questionnaire. Based on the significant features selected by the least absolute shrinkage and selection operator (LASSO), seven ML models were established: Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gaussian Naïve Bayes (GNB), Complement Naïve Bayes (CNB), and Support Vector Machine (SVM). Their predictive performances were compared using different metrics. SHapley Additive exPlanations (SHAP) was also used to examine the internal behavior of the highest-performing ML classifier.
Results: The LR model performed better than the other six ML models in predicting SCD after CC infarction, with an area under the receiver operating characteristic curve (AUC) of 77.1% in the validation set. Using LASSO and SHAP analysis, we found that infarction subregion of the CC, female sex, 3-month modified Rankin Scale (mRS) score, age, homocysteine, location of angiostenosis, neutrophil-to-lymphocyte ratio, pure CC infarction, and number of angiostenoses were the top nine predictors, in order of importance, for the output of the LR model. Meanwhile, infarction subregion of the CC, female sex, 3-month mRS score, and pure CC infarction were independently associated with the cognitive outcome.
Conclusion: Our study is the first to demonstrate that an LR model with nine common variables performs best in predicting the risk of post-stroke SCD due to CC infarction. In particular, the combination of the LR model and the SHAP explainer could aid personalized risk prediction and serve as a decision-making tool for early intervention, given the poor long-term outcome.
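The description above compares classifiers by AUC. As a rough illustration of what that metric measures, the sketch below computes AUC as the fraction of positive/negative pairs in which the positive case receives the higher score (ties count as half). The function name roc_auc and the toy labels and scores are assumptions made for illustration; they are not taken from the study or its data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (1 = outcome present, 0 = outcome absent).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a reported 77.1% sits between the two.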
f
Code representation techniques used in the literature.
plos.figshare.com
xls
Updated May 10, 2024
Cite
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Code representation techniques used in the literature. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302333.t001
Dataset updated
May 10, 2024
Dataset provided by
PLOS ONE
Authors
Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Code representation techniques used in the literature.
f
DataSheet1_CD8TCEI-EukPath: A Novel Predictor to Rapidly Identify CD8+...
figshare.com
docx
Updated Jun 14, 2023
Cite
Rui-Si Hu; Jin Wu; Lichao Zhang; Xun Zhou; Ying Zhang (2023). DataSheet1_CD8TCEI-EukPath: A Novel Predictor to Rapidly Identify CD8+ T-Cell Epitopes of Eukaryotic Pathogens Using a Hybrid Feature Selection Approach.docx [Dataset]. http://doi.org/10.3389/fgene.2022.935989.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fgene.2022.935989.s001
Dataset updated
Jun 14, 2023
Dataset provided by
Frontiers
Authors
Rui-Si Hu; Jin Wu; Lichao Zhang; Xun Zhou; Ying Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Computational prediction has proven to be a reliable way to screen potential vaccine candidates and support vaccine discovery in infectious diseases. As an important class of organisms causing infectious diseases, pathogenic eukaryotes (such as parasitic protozoans) have evolved the ability to colonize a wide range of hosts, including humans and animals; protective vaccines are therefore urgently needed. Inspired by the immunological idea that pathogen-derived epitopes can mediate the CD8+ T-cell-related host adaptive immune response, and using the available positive and negative CD8+ T-cell epitopes (TCEs), we proposed a novel predictor called CD8TCEI-EukPath to detect CD8+ TCEs of eukaryotic pathogens. Our method integrated multiple amino acid sequence-based hybrid features, employed a well-established feature selection technique, and built a machine learning classifier to differentiate CD8+ TCEs from non-CD8+ TCEs. Based on the feature selection results, 520 optimal hybrid features were used for modeling with the LightGBM algorithm. CD8TCEI-EukPath achieved an accuracy of 79.255% in ten-fold cross-validation and an accuracy of 78.169% in the independent test. Collectively, CD8TCEI-EukPath will contribute to rapidly screening epitope-based vaccine candidates, particularly from large peptide-coding datasets. To make prediction of CD8+ TCEs convenient, an online web server is freely accessible (http://lab.malab.cn/~hrs/CD8TCEI-EukPath/).
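The description above reports accuracy under ten-fold cross-validation. The sketch below shows, in plain Python, how such an estimate is typically formed: the samples are shuffled into ten folds, each fold is held out once for testing, and the per-fold accuracies are averaged. Everything here is an assumption for illustration: the helper names, the majority-class baseline that stands in for the LightGBM classifier, and the toy data are hypothetical and are not the authors' code.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled, near-equal test folds."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    fold_accuracies = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        fold_accuracies.append(correct / len(test_idx))
    return sum(fold_accuracies) / k


# Hypothetical stand-in model: always predicts the majority training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, X_test):
    return [model] * len(X_test)


toy_X = [[i] for i in range(25)]
toy_y = [1] * 25
toy_accuracy = cross_val_accuracy(toy_X, toy_y, fit_majority, predict_majority)
```

Any classifier exposing comparable fit and predict callables could replace the baseline. The independent-test accuracy in the description is a separate measurement on data never used in the folds.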
f
LightGBM hyperparameters with default values, search ranges, and selected...
plos.figshare.com
xls
Updated Jun 20, 2025
Cite
Shimels Derso Kebede; Agmasie Damtew Walle; Daniel Niguse Mamo; Ermias Bekele Enyew; Jibril Bashir Adem; Meron Asmamaw Alemayehu (2025). LightGBM hyperparameters with default values, search ranges, and selected optimal values. [Dataset]. http://doi.org/10.1371/journal.pgph.0004787.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pgph.0004787.t003
Dataset updated
Jun 20, 2025
Dataset provided by
PLOS Global Public Health
Authors
Shimels Derso Kebede; Agmasie Damtew Walle; Daniel Niguse Mamo; Ermias Bekele Enyew; Jibril Bashir Adem; Meron Asmamaw Alemayehu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LightGBM hyperparameters with default values, search ranges, and selected optimal values.

