37 datasets found
  1. VGG16 + XGBoost (or LightGBM)

    • catalog.eoxhub.fairicube.eu
    bin, data
    Updated Jul 3, 2025
    Cite
    (2025). VGG16 + XGBoost (or LightGBM) [Dataset]. https://catalog.eoxhub.fairicube.eu/collections/ML%20collection/items/13DFOCKBVL
    Available download formats: data, bin
    Dataset updated
    Jul 3, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 3, 2025
    Area covered
    Earth
    Description

    We used VGG16 for feature extraction. VGG16 is a 16-layer CNN trained on millions of images from the ImageNet database. We then used XGBoost for regression and LightGBM for classification of rooftop heights.
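
    As an illustration only (not code from the dataset), a minimal sketch of this pipeline could look as follows; the synthetic feature matrix and labels stand in for real VGG16 outputs and rooftop heights.

```python
# Hedged sketch: a frozen VGG16 backbone yields per-image feature vectors,
# XGBoost regresses rooftop height, LightGBM classifies it.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
import xgboost as xgb
import lightgbm as lgb

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-d pooled features

def vgg16_features(images):
    """images: float array of shape (n, 224, 224, 3), RGB order."""
    return extractor.predict(preprocess_input(images.copy()), verbose=0)

# Synthetic stand-ins for extracted features and rooftop labels.
rng = np.random.default_rng(0)
X = rng.random((64, 512), dtype="float32")
heights_m = rng.uniform(3.0, 30.0, size=64)      # regression target
height_class = (heights_m > 10.0).astype(int)    # classification target

reg = xgb.XGBRegressor(n_estimators=200).fit(X, heights_m)
clf = lgb.LGBMClassifier(n_estimators=200).fit(X, height_class)
print(reg.predict(X[:3]), clf.predict(X[:3]))
```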

  2. Hyperparameters obtained for the classifiers.

    • plos.figshare.com
    xls
    Updated Apr 2, 2024
    + more versions
    Cite
    Mohammad Pourmahmood Aghababa; Jan Andrysek (2024). Hyperparameters obtained for the classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0300447.t002
    Available download formats: xls
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mohammad Pourmahmood Aghababa; Jan Andrysek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative gait analysis is important for understanding the non-typical walking patterns associated with mobility impairments. Conventional linear statistical methods and machine learning (ML) models are commonly used to assess gait performance and related changes in the gait parameters. Nonetheless, explainable machine learning provides an alternative technique for distinguishing the significant and influential gait changes stemming from a given intervention. The goal of this work was to demonstrate the use of explainable ML models in gait analysis for prosthetic rehabilitation in both population- and sample-based interpretability analyses. Models were developed to classify amputee gait with two types of prosthetic knee joints. Sagittal plane gait patterns of 21 individuals with unilateral transfemoral amputations were video-recorded and 19 spatiotemporal and kinematic gait parameters were extracted and included in the models. Four ML models—logistic regression, support vector machine, random forest, and LightGBM—were assessed and tested for accuracy and precision. The Shapley Additive exPlanations (SHAP) framework was applied to examine global and local interpretability. Random Forest yielded the highest classification accuracy (98.3%). The SHAP framework quantified the level of influence of each gait parameter in the models where knee flexion-related parameters were found the most influential factors in yielding the outcomes of the models. The sample-based explainable ML provided additional insights over the population-based analyses, including an understanding of the effect of the knee type on the walking style of a specific sample, and whether or not it agreed with global interpretations. It was concluded that explainable ML models can be powerful tools for the assessment of gait-related clinical interventions, revealing important parameters that may be overlooked using conventional statistical methods.
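
    Purely as an illustration of the population- vs. sample-level SHAP analysis described above (not the authors' code), a sketch with synthetic stand-ins for the 19 gait parameters:

```python
# Hedged sketch: train a LightGBM classifier and inspect global and per-sample
# SHAP attributions, mirroring the population- and sample-based analyses.
import numpy as np
import shap
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 19))                     # 19 gait parameters (synthetic)
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)      # binary knee-type label (synthetic)

model = LGBMClassifier(n_estimators=200).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
vals = shap_values[1] if isinstance(shap_values, list) else shap_values  # positive-class attributions

# Global (population-level) importance: mean |SHAP| per gait parameter.
print(np.abs(vals).mean(axis=0).argsort()[::-1][:5])  # indices of the five most influential parameters

# Local (sample-level) explanation for one participant's trial.
print(vals[0])
```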

  3. sentry_training_data

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    Pulast S Tiwari (2025). sentry_training_data [Dataset]. https://huggingface.co/datasets/Pulast/sentry_training_data
    Dataset updated
    Aug 31, 2025
    Authors
    Pulast S Tiwari
    Description

    Sentinel QoS Training Dataset

    This dataset contains synthetic network traffic features used to train the Sentry LightGBM classifier in the Sentinel-QoS project. Files

    training_data.csv — Tabular CSV with per-flow/session features and a target label.

    Columns (example)

    src_ip, dst_ip, src_port, dst_port
    protocol — e.g., TCP/UDP
    bytes, packets, duration
    app_label — human-readable application class (e.g., Video, Gaming, Browsing)
    target — numeric label used for model training

    Usage… See the full description on the dataset page: https://huggingface.co/datasets/Pulast/sentry_training_data.
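
    A minimal sketch of consuming this table with a LightGBM classifier (assuming the CSV has been downloaded locally and using a few of the numeric columns listed above; adjust to the actual schema):

```python
# Hedged sketch: load the per-flow table and fit the kind of LightGBM classifier
# the Sentinel-QoS description mentions. The selected feature columns are assumptions.
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")                                     # file named in the listing
feature_cols = ["src_port", "dst_port", "bytes", "packets", "duration"]   # assumed numeric features

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["target"], test_size=0.2, random_state=42, stratify=df["target"]
)
clf = LGBMClassifier(n_estimators=300, learning_rate=0.05).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```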

  4. Details of dataset information.

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Details of dataset information. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t005
    Available download formats: xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
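
    To make the feature-combination idea concrete, a hedged sketch (synthetic vectors, not the paper's implementation) of linearly combining an AST-based view with an IR-based view and training a LightGBM clone classifier:

```python
# Hedged sketch: weighted linear combination of two feature views per code-fragment
# pair, scored with a LightGBM clone/non-clone classifier.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
ast_feats = rng.normal(size=(500, 64))    # high-level AST features (synthetic)
ir_feats = rng.normal(size=(500, 64))     # low-level Soot IR features (synthetic)
labels = rng.integers(0, 2, size=500)     # 1 = clone pair, 0 = non-clone (synthetic)

alpha = 0.5
X = alpha * ast_feats + (1 - alpha) * ir_feats   # linear combination of the two views

print(cross_val_score(LGBMClassifier(n_estimators=200), X, labels, cv=5, scoring="f1").mean())
```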

  5. Selected AST non-terminal nodes.

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Selected AST non-terminal nodes. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t002
    Available download formats: xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.

  6. xids-dataset

    • huggingface.co
    Updated Aug 27, 2025
    Cite
    Lumy (2025). xids-dataset [Dataset]. https://huggingface.co/datasets/luminolous/xids-dataset
    Dataset updated
    Aug 27, 2025
    Authors
    Lumy
    Description

    X-IDS Dataset & Artifacts Repository

    This repository contains all the data assets, experiment results, and preprocessing steps used in the development of the X-IDS system — an Explainable Intrusion Detection System using autoencoders, LightGBM classifiers, and fine-tuned T5-small text generation.

    The repository includes: raw and processed data, tensor-formatted datasets for model training, and hyperparameter search results using Optuna.

      Folder Structure… See the full description on the dataset page: https://huggingface.co/datasets/luminolous/xids-dataset.
    
  7. DataSheet_1_A retrospective analysis based on multiple machine learning...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Jun 9, 2023
    Cite
    Tao Yang; Javier Martinez-Useros; JingWen Liu; Isaias Alarcón; Chao Li; WeiYao Li; Yuanxun Xiao; Xiang Ji; YanDong Zhao; Lei Wang; Salvador Morales-Conde; Zuli Yang (2023). DataSheet_1_A retrospective analysis based on multiple machine learning models to predict lymph node metastasis in early gastric cancer.zip [Dataset]. http://doi.org/10.3389/fonc.2022.1023110.s001
    Available download formats: zip
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Frontiers
    Authors
    Tao Yang; Javier Martinez-Useros; JingWen Liu; Isaias Alarcón; Chao Li; WeiYao Li; Yuanxun Xiao; Xiang Ji; YanDong Zhao; Lei Wang; Salvador Morales-Conde; Zuli Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Endoscopic submucosal dissection has become the primary treatment option for early gastric cancer (EGC). However, lymph node metastasis may lead to poor prognosis. We analyzed factors related to lymph node metastasis in EGC patients and constructed a prediction model with machine learning using data from a retrospective series. Methods: Two independent cohorts were evaluated, including 305 patients with EGC from China as cohort I and 35 patients from Spain as cohort II. Five classifiers obtained from machine learning were selected to establish a robust prediction model for lymph node metastasis in EGC. Results: Clinical variables such as invasion depth, histologic type, ulceration, tumor location, tumor size, Lauren classification, and age were selected to establish the five prediction models: linear support vector classifier (Linear SVC), logistic regression, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and Gaussian process classification. All prediction models of cohort I showed accuracy between 70 and 81%, while the prediction models of cohort II exhibited accuracy between 48 and 82%. The areas under the curve (AUC) of the five models across cohort I and cohort II were between 0.736 and 0.830. Conclusions: Our results support that machine learning methods could be used to predict lymph node metastasis in early gastric cancer and perhaps provide another evaluation method for choosing suitable treatment for patients.
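
    A hedged sketch of comparing the five classifier families named above by cross-validated AUC (synthetic data standing in for the seven clinical variables; not the study's code):

```python
# Hedged sketch: fit each of the five model families and report mean cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.gaussian_process import GaussianProcessClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=300, n_features=7, n_informative=5, random_state=0)

models = {
    "Linear SVC": LinearSVC(max_iter=5000),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
    "Gaussian process": GaussianProcessClassifier(),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```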

  8. Best parameters of base classifiers.

    • plos.figshare.com
    xls
    Updated Apr 23, 2025
    + more versions
    Cite
    Peng Zhang; Jialiang Zhang; Yi Li (2025). Best parameters of base classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0321954.t001
    Available download formats: xls
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Peng Zhang; Jialiang Zhang; Yi Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timely prediction of memory failures is crucial for the stable operation of data centers. However, existing methods often rely on a single classifier, which can lead to inaccurate or unstable predictions. To address this, we propose a new ensemble model for predicting CE-driven memory failures, where failures occur due to a surge of correctable errors (CEs) in memory, causing server downtime. Our model combines several strong-performing classifiers, such as Random Forest, LightGBM, and XGBoost, and assigns different weights to each based on its performance. By optimizing the decision-making process, the model improves prediction accuracy. We validate the model using in-memory data from Alibaba’s data center, and the results show an accuracy of over 84%, outperforming existing single and dual-classifier models, further confirming its excellent predictive performance.
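
    As a rough illustration of the weighting idea (assumptions, not the paper's implementation): each base classifier can be weighted by its own cross-validated score inside a soft-voting ensemble.

```python
# Hedged sketch: performance-weighted soft voting over Random Forest, LightGBM and XGBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("lgbm", LGBMClassifier(n_estimators=200)),
    ("xgb", XGBClassifier()),
]
# Weight each base model by its own cross-validated F1 score on the training split.
weights = [cross_val_score(m, X_tr, y_tr, cv=3, scoring="f1").mean() for _, m in base]
ens = VotingClassifier(estimators=base, voting="soft", weights=weights).fit(X_tr, y_tr)
print("weighted-ensemble accuracy:", ens.score(X_te, y_te))
```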

  9. Hyperparameter of tuned Random Forest classifier.

    • figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Hyperparameter of tuned Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t003
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.
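
    A condensed sketch, under stated assumptions (synthetic features instead of URL lexical features), of the stacking design described above, with Randomized Search shown for one base learner:

```python
# Hedged sketch: RF, XGBoost, LightGBM and CatBoost stacked under a logistic-regression
# meta-learner; RandomizedSearchCV tunes the LightGBM base model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=4, random_state=0)   # 4 classes: phishing/malware/defacement/benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

search = RandomizedSearchCV(
    LGBMClassifier(),
    {"n_estimators": [200, 400, 800], "num_leaves": [31, 63, 127], "learning_rate": [0.01, 0.05, 0.1]},
    n_iter=5, cv=3, random_state=0,
).fit(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300)),
        ("xgb", XGBClassifier()),
        ("lgbm", search.best_estimator_),
        ("cat", CatBoostClassifier(verbose=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_tr, y_tr)
print("stacked accuracy:", stack.score(X_te, y_te))
```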

  10. Key parameters of LightGBM.

    • plos.figshare.com
    xls
    Updated Feb 19, 2025
    + more versions
    Cite
    Jizhong Wang; Jianfei Chi; Yeqiang Ding; Haiyan Yao; Qiang Guo (2025). Key parameters of LightGBM. [Dataset]. http://doi.org/10.1371/journal.pone.0314481.t002
    Available download formats: xls
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jizhong Wang; Jianfei Chi; Yeqiang Ding; Haiyan Yao; Qiang Guo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A fault diagnosis method for oil-immersed transformers based on principal component analysis and SSA-LightGBM is proposed to address the low diagnostic accuracy caused by the complexity of current oil-immersed transformer faults. Firstly, data on dissolved gases in oil are collected, and a 17-dimensional fault feature matrix is constructed using the uncoded ratio method. The feature matrix is then standardized to obtain joint features. Secondly, principal component analysis is used for feature fusion to eliminate information redundancy between variables and construct fused features. Finally, a transformer diagnostic model based on SSA-LightGBM is constructed, and ten-fold cross-validation is used to verify the classification ability of the model. The experimental results show that the proposed SSA-LightGBM model achieves an average fault diagnosis accuracy of 93.6% after SSA optimization, which is 3.6% higher than before optimization. Compared with the GA-LightGBM and GWO-LightGBM fault diagnosis models, SSA-LightGBM improves diagnostic accuracy by 8.1% and 5.7% respectively, verifying that this method can effectively improve the fault diagnosis performance of oil-immersed transformers and is superior to other similar methods.
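
    As a simplified illustration (the SSA hyperparameter search is omitted), the standardize-fuse-classify chain could be sketched like this with synthetic gas-ratio features:

```python
# Hedged sketch: standardization, PCA feature fusion, LightGBM classification,
# scored by ten-fold cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 17))      # 17-dimensional uncoded-ratio feature matrix (synthetic)
y = rng.integers(0, 6, size=400)    # fault classes (synthetic)

pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LGBMClassifier(n_estimators=300))
print("ten-fold accuracy: %.3f" % cross_val_score(pipe, X, y, cv=10).mean())
```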

  11. Data for "Superphot+: Real-Time Fitting and Classification of Supernova...

    • zenodo.org
    bin, csv, tar
    Updated Jun 24, 2024
    Cite
    Kaylee de Soto; Ashley Villar; Edo Berger; Sebastian Gomez; Griffin Hosseinzadeh; Doug Branton; Sandro Campos; Melissa DeLucchi; Jeremy Kubica; Olivia Lynn; Konstantin Malanchev; Alex I. Malz (2024). Data for "Superphot+: Real-Time Fitting and Classification of Supernova Light Curves" [Dataset]. http://doi.org/10.5281/zenodo.10798425
    Available download formats: bin, csv, tar
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaylee de Soto; Ashley Villar; Edo Berger; Sebastian Gomez; Griffin Hosseinzadeh; Doug Branton; Sandro Campos; Melissa DeLucchi; Jeremy Kubica; Olivia Lynn; Konstantin Malanchev; Alex I. Malz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset and static code base associated with the paper: "Superphot+: Real-Time Fitting and Classification of Supernova Light Curves". The contents are as follows:

    • superphot-plus-v0.0.7.tar: Superphot+ code base downloaded at time of paper submission. Static copy of the Github repo: https://github.com/VTDA-Group/superphot-plus
    • dataset_spec_pruned.csv: Spectroscopic dataset pruned according to Table 1 of the paper.
    • dataset_phot_final.csv: Photometric dataset (without spectroscopic labels) pruned according to Section 2 of the paper. Label and probability columns are values from the ALeRCE-SN classifier.
    • model_0.pt: One of the 10 (redshift-independent) LightGBM models trained for 5-way SN classification.
    • model_0.yaml: Configuration file associated with model_0.pt.
    • model_z_0.pt: Same as model_0.pt, but trained using redshift information.
    • model_z_0.yaml: Configuration file associated with model_z_0.pt.
    • early_phase_classifier_0.pt: Same as model_0.pt, but trained only using early-phase light curve features. Tailored for realtime classification.
    • early_phase_classifier_0.yaml: Configuration file for early_phase_classifier_0.pt.
    • probs_concat.csv: Spectroscopic set's classification results without using redshift information.
    • probs_z_concat.csv: Spectroscopic set's classification results using redshift information.
    • probs_photometric.mrt: Superphot+'s probabilities for the photometric set without using redshift information.
  12. [Tps May] 1st stage of modeling

    • kaggle.com
    Updated Jun 1, 2021
    Cite
    Lázaro (2021). [Tps May] 1st stage of modeling [Dataset]. https://www.kaggle.com/lazaro97/tps-may-1st-stage-of-modeling
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 1, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lázaro
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.

    In this way, the TPS competition starts!

    Content

    The dataset contains all information from the diverse training runs: train_predictions and test predictions. I tried diverse models: 11 LightGBM, 4 XGBoost, 7 CatBoost, 1 Keras, 1 deebtable, 2 logistic regressions, 5 autolightml. These models were obtained with different preprocessing:
    - Creating categorical features from low-range values (max = 10, 15 values).
    - Trying a diversity of encodings for numerical and categorical features.
    - Binning some features.
    - Considering cluster assignments as a feature.
    - Considering interactions between features.
    - Removing duplicates.
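
    A hedged sketch of how such first-stage (out-of-fold) predictions are typically produced, shown here for a single LightGBM model on synthetic data:

```python
# Hedged sketch: out-of-fold probabilities from one level-1 model; the dataset
# collects columns like this from many different models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
oof = np.zeros(len(y))

for tr_idx, va_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = LGBMClassifier(n_estimators=300).fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = model.predict_proba(X[va_idx])[:, 1]   # out-of-fold probabilities

print(oof[:5])  # one such column per model becomes the second-stage training matrix
```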

    Things to do: I think I should have tried an autoencoder.

    Acknowledgements

    Thanks to Kaggle community!

    Inspiration

    See these reference notebooks. These two authors deserve all the claps. - https://www.kaggle.com/davidedwards1/tabmar21-tabular-blend-final-sub - https://www.kaggle.com/hiro5299834/3rd-tps-mar-2021-stacking

  13. Machine learning hyperparameters.

    • plos.figshare.com
    bin
    Updated Jun 16, 2023
    + more versions
    Cite
    Huy Le; Beverly Peng; Janelle Uy; Daniel Carrillo; Yun Zhang; Brian D. Aevermann; Richard H. Scheuermann (2023). Machine learning hyperparameters. [Dataset]. http://doi.org/10.1371/journal.pone.0275070.t001
    Available download formats: bin
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Huy Le; Beverly Peng; Janelle Uy; Daniel Carrillo; Yun Zhang; Brian D. Aevermann; Richard H. Scheuermann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning hyperparameters.

  14. Results of detected semantic clones using the proposed technique.

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Results of detected semantic clones using the proposed technique. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t006
    Available download formats: xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of detected semantic clones using the proposed technique.

  15. Hyperparameter of Catboost classifier.

    • figshare.com
    xls
    Updated May 31, 2024
    Cite
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait (2024). Hyperparameter of Catboost classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0302196.t005
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Suresh Sankaranarayanan; Arvinthan Thevar Sivachandran; Anis Salwa Mohd Khairuddin; Khairunnisa Hasikin; Abdul Rahman Wahab Sait
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web applications are important for various online businesses and operations because of their platform stability and low operation cost. The increasing usage of Internet-of-Things (IoT) devices within a network has contributed to the rise of network intrusion issues due to malicious Uniform Resource Locators (URLs). Generally, malicious URLs are initiated to promote scams, attacks, and frauds which can lead to high-risk intrusion. Several methods have been developed to detect malicious URLs in previous works. There has been a good amount of work done to detect malicious URLs using various methods such as random forest, regression, LightGBM, and more as reported in the literature. However, most of the previous works focused on the binary classification of malicious URLs and are tested on limited URL datasets. Nevertheless, the detection of malicious URLs remains a challenging task that remains open to research. Hence, this work proposed a stacking-based ensemble classifier to perform multi-class classification of malicious URLs on larger URL datasets to justify the robustness of the proposed method. This study focuses on obtaining lexical features directly from the URL to identify malicious websites. Then, the proposed stacking-based ensemble classifier is developed by integrating Random Forest, XGBoost, LightGBM, and CatBoost. In addition, hyperparameter tuning was performed using the Randomized Search method to optimize the proposed classifier. The proposed stacking-based ensemble classifier aims to take advantage of the performance of each machine learning model and aggregate the output to improve prediction accuracy. The classification accuracies of the machine learning model when applied individually are 93.6%, 95.2%, 95.7% and 94.8% for random forest, XGBoost, LightGBM, and CatBoost respectively. The proposed stacking-based ensemble classifier has shown significant results in classifying four classes of malicious URLs (phishing, malware, defacement, and benign) with an average accuracy of 96.8% when benchmarked with previous works.

  16. Table7_Prediction of potential small molecule−miRNA associations based on...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 21, 2023
    + more versions
    Cite
    Jianwei Li; Hongxin Lin; Yinfei Wang; Zhiguang Li; Baoqin Wu (2023). Table7_Prediction of potential small molecule−miRNA associations based on heterogeneous network representation learning.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.1079053.s008
    Available download formats: xlsx
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Frontiers
    Authors
    Jianwei Li; Hongxin Lin; Yinfei Wang; Zhiguang Li; Baoqin Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MicroRNAs (miRNAs) are closely associated with the occurrences and developments of many complex human diseases. Increasing studies have shown that miRNAs emerge as new therapeutic targets of small molecule (SM) drugs. Since traditional experimental methods are expensive and time-consuming, it is particularly crucial to find efficient computational approaches to predict potential small molecule-miRNA (SM-miRNA) associations. Considering that integrating multi-source heterogeneous information related to SM-miRNA association prediction would provide a comprehensive insight into the features of both SMs and miRNAs, we proposed a novel model of Small Molecule-MiRNA Association prediction based on Heterogeneous Network Representation Learning (SMMA-HNRL) for more precisely predicting the potential SM-miRNA associations. In SMMA-HNRL, a novel heterogeneous information network was constructed with SM nodes, miRNA nodes and disease nodes. To access and utilize the topological information of the heterogeneous information network, feature vectors of SM and miRNA nodes were obtained by two different heterogeneous network representation learning algorithms (HeGAN and HIN2Vec) respectively and merged with a connect (concatenation) operation. Finally, LightGBM was chosen as the classifier of SMMA-HNRL for predicting potential SM-miRNA associations. 10-fold cross-validation was conducted to evaluate the prediction performance of SMMA-HNRL; it achieved an area under the ROC curve of 0.9875, which was superior to three other state-of-the-art models. With two independent validation datasets, the test experiment results revealed the robustness of our model. Moreover, three case studies were performed. As a result, 35, 37, and 22 miRNAs among the top 50 predicted miRNAs associated with 5-FU, cisplatin, and imatinib were validated by experimental literature works respectively, which confirmed the effectiveness of SMMA-HNRL. The source code and experimental data of SMMA-HNRL are available at https://github.com/SMMA-HNRL/SMMA-HNRL.
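
    Conceptually (not the SMMA-HNRL code), the final step can be sketched as concatenating the learned SM and miRNA embeddings per candidate pair and scoring them with LightGBM; the embeddings below are synthetic placeholders.

```python
# Hedged sketch: pair-level features = concatenated node embeddings, classified by LightGBM.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_pairs, dim = 600, 64
sm_emb = rng.normal(size=(n_pairs, dim))      # small-molecule node embeddings (synthetic)
mirna_emb = rng.normal(size=(n_pairs, dim))   # miRNA node embeddings (synthetic)
labels = rng.integers(0, 2, size=n_pairs)     # known association vs. negative sample (synthetic)

X = np.concatenate([sm_emb, mirna_emb], axis=1)   # "connect" (concatenation) operation
print("10-fold AUC: %.3f" % cross_val_score(LGBMClassifier(n_estimators=300), X, labels,
                                            cv=10, scoring="roc_auc").mean())
```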

  17. Data_Sheet_1_Prediction of subjective cognitive decline after corpus...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jun 21, 2023
    Cite
    Liu, Yanqun; Huang, Yuxin; Xu, Yawen; Song, Chenrui; Yin, Ge; Ding, Qichao; Sun, Rui; Liang, Meng; Du, Bingying; Sun, Xu; Bi, Xiaoying (2023). Data_Sheet_1_Prediction of subjective cognitive decline after corpus callosum infarction by an interpretable machine learning-derived early warning strategy.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001107172
    Dataset updated
    Jun 21, 2023
    Authors
    Liu, Yanqun; Huang, Yuxin; Xu, Yawen; Song, Chenrui; Yin, Ge; Ding, Qichao; Sun, Rui; Liang, Meng; Du, Bingying; Sun, Xu; Bi, Xiaoying
    Description

    Background and purpose: Corpus callosum (CC) infarction is an extremely rare subtype of cerebral ischemic stroke; however, the symptoms of cognitive impairment often fail to attract the early attention of patients, which seriously affects the long-term prognosis, such as high mortality, personality changes, mood disorders, psychotic reactions, financial burden and so on. This study seeks to develop and validate models for early prediction of the risk of subjective cognitive decline (SCD) after CC infarction by machine learning (ML) algorithms. Methods: This is a prospective study that enrolled 213 (only 3.7%) CC infarction patients from a nine-year cohort comprising 8,555 patients with acute ischemic stroke. Telephone follow-up surveys were carried out for the patients with a definite diagnosis of CC infarction one year after disease onset, and SCD was identified by the Behavioral Risk Factor Surveillance System (BRFSS) questionnaire. Based on the significant features selected by the least absolute shrinkage and selection operator (LASSO), seven ML models including Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gaussian Naïve Bayes (GNB), Complement Naïve Bayes (CNB), and Support Vector Machine (SVM) were established and their predictive performances were compared by different metrics. Importantly, SHapley Additive exPlanations (SHAP) was also utilized to examine the internal behavior of the highest-performance ML classifier. Results: The Logistic Regression (LR) model performed better than the other six ML models in SCD predictability after CC infarction, with an area under the receiver operating characteristic curve (AUC) of 77.1% in the validation set. Using LASSO and SHAP analysis, we found that infarction subregion of CC, female sex, 3-month modified Rankin Scale (mRS) score, age, homocysteine, location of angiostenosis, neutrophil-to-lymphocyte ratio, pure CC infarction, and number of angiostenoses were the top nine significant predictors, in order of importance, for the output of the LR model. Meanwhile, we identified that infarction subregion of CC, female sex, 3-month mRS score and pure CC infarction were the factors independently associated with the cognitive outcome. Conclusion: Our study is the first to demonstrate that the LR model with 9 common variables has the best performance to predict the risk of post-stroke SCD due to CC infarction. Particularly, the combination of the LR model and the SHAP explainer could aid in achieving personalized risk prediction and serve as a decision-making tool for early intervention, given the poor long-term outcome.
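
    A simplified, hedged sketch of the LASSO-then-classify strategy described above (synthetic data; not the study's code):

```python
# Hedged sketch: LASSO-based feature selection feeding a logistic-regression model,
# evaluated by cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=213, n_features=30, n_informative=9, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LassoCV(cv=5)),        # keep features with non-zero LASSO coefficients
    LogisticRegression(max_iter=1000),
)
print("AUC: %.3f" % cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```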

  18. Code representation techniques used in the literature.

    • plos.figshare.com
    xls
    Updated May 10, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan (2024). Code representation techniques used in the literature. [Dataset]. http://doi.org/10.1371/journal.pone.0302333.t001
    Available download formats: xls
    Dataset updated
    May 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad Saeed; Mubarak M. Sufyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code representation techniques used in the literature.

  19. DataSheet1_CD8TCEI-EukPath: A Novel Predictor to Rapidly Identify CD8+...

    • figshare.com
    docx
    Updated Jun 14, 2023
    Cite
    Rui-Si Hu; Jin Wu; Lichao Zhang; Xun Zhou; Ying Zhang (2023). DataSheet1_CD8TCEI-EukPath: A Novel Predictor to Rapidly Identify CD8+ T-Cell Epitopes of Eukaryotic Pathogens Using a Hybrid Feature Selection Approach.docx [Dataset]. http://doi.org/10.3389/fgene.2022.935989.s001
    Available download formats: docx
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers
    Authors
    Rui-Si Hu; Jin Wu; Lichao Zhang; Xun Zhou; Ying Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computational prediction to screen potential vaccine candidates has been proven to be a reliable way to provide guarantees for vaccine discovery in infectious diseases. As an important class of organisms causing infectious diseases, pathogenic eukaryotes (such as parasitic protozoans) have evolved the ability to colonize a wide range of hosts, including humans and animals; meanwhile, protective vaccines are urgently needed. Inspired by the immunological idea that pathogen-derived epitopes are able to mediate the CD8+ T-cell-related host adaptive immune response and with the available positive and negative CD8+ T-cell epitopes (TCEs), we proposed a novel predictor called CD8TCEI-EukPath to detect CD8+ TCEs of eukaryotic pathogens. Our method integrated multiple amino acid sequence-based hybrid features, employed a well-established feature selection technique, and eventually built an efficient machine learning classifier to differentiate CD8+ TCEs from non-CD8+ TCEs. Based on the feature selection results, 520 optimal hybrid features were used for modeling by utilizing the LightGBM algorithm. CD8TCEI-EukPath achieved impressive performance, with an accuracy of 79.255% in ten-fold cross-validation and an accuracy of 78.169% in the independent test. Collectively, CD8TCEI-EukPath will contribute to rapidly screening epitope-based vaccine candidates, particularly from large peptide-coding datasets. To conduct the prediction of CD8+ TCEs conveniently, an online web server is freely accessible (http://lab.malab.cn/∼hrs/CD8TCEI-EukPath/).

  20. LightGBM hyperparameters with default values, search ranges, and selected...

    • plos.figshare.com
    xls
    Updated Jun 20, 2025
    Cite
    Shimels Derso Kebede; Agmasie Damtew Walle; Daniel Niguse Mamo; Ermias Bekele Enyew; Jibril Bashir Adem; Meron Asmamaw Alemayehu (2025). LightGBM hyperparameters with default values, search ranges, and selected optimal values. [Dataset]. http://doi.org/10.1371/journal.pgph.0004787.t003
    Available download formats: xls
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    PLOS Global Public Health
    Authors
    Shimels Derso Kebede; Agmasie Damtew Walle; Daniel Niguse Mamo; Ermias Bekele Enyew; Jibril Bashir Adem; Meron Asmamaw Alemayehu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LightGBM hyperparameters with default values, search ranges, and selected optimal values.
