32 datasets found
  1. Data_Sheet_1_A Machine Learning Model to Predict Intravenous...

    • frontiersin.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Jie Liu; Jian Zhang; Haodong Huang; Yunting Wang; Zuyue Zhang; Yunfeng Ma; Xiangqian He (2023). Data_Sheet_1_A Machine Learning Model to Predict Intravenous Immunoglobulin-Resistant Kawasaki Disease Patients: A Retrospective Study Based on the Chongqing Population.CSV [Dataset]. http://doi.org/10.3389/fped.2021.756095.s001
    Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Frontiers
    Authors
    Jie Liu; Jian Zhang; Haodong Huang; Yunting Wang; Zuyue Zhang; Yunfeng Ma; Xiangqian He
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Chongqing
    Description

    Objective: We explored the risk factors for intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) and constructed a prediction model based on machine learning algorithms.

    Methods: A retrospective study including 1,398 KD patients hospitalized in 7 affiliated hospitals of Chongqing Medical University from January 2015 to August 2020 was conducted. All patients were divided into IVIG-responsive and IVIG-resistant groups, which were randomly divided into training and validation sets. The independent risk factors were determined using logistic regression analysis. Logistic regression nomograms, support vector machine (SVM), XGBoost and LightGBM prediction models were constructed and compared with the previous models.

    Results: In total, 1,240 out of 1,398 patients were IVIG responders, while 158 were resistant to IVIG. According to the results of logistic regression analysis of the training set, four independent risk factors were identified, including total bilirubin (TBIL) (OR = 1.115, 95% CI 1.067–1.165), procalcitonin (PCT) (OR = 1.511, 95% CI 1.270–1.798), alanine aminotransferase (ALT) (OR = 1.013, 95% CI 1.008–1.018) and platelet count (PLT) (OR = 0.998, 95% CI 0.996–1). Logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models were constructed based on the above independent risk factors. The sensitivity was 0.617, 0.681, 0.638, and 0.702, the specificity was 0.712, 0.841, 0.967, and 0.903, and the area under curve (AUC) was 0.731, 0.814, 0.804, and 0.874, respectively. Among the prediction models, the LightGBM model displayed the best ability for comprehensive prediction, with an AUC of 0.874, which surpassed the previous classic models of Egami (AUC = 0.581), Kobayashi (AUC = 0.524), Sano (AUC = 0.519), Fu (AUC = 0.578), and Formosa (AUC = 0.575).

    Conclusion: The machine learning LightGBM prediction model for IVIG-resistant KD patients was superior to previous models. Our findings may help to accomplish early identification of the risk of IVIG resistance and improve patient outcomes.
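
    The odds ratios quoted above can be read as per-unit multipliers on the odds of IVIG resistance. The following is a minimal sketch of that arithmetic, assuming the usual logistic-regression interpretation (other covariates held fixed); the measurement units are not stated in the description, and the 10-unit step is only an illustration.

```python
# Sketch: how an odds ratio compounds over several units of a predictor.
# Assumption: standard logistic-regression reading, where each one-unit
# increase multiplies the odds by the OR with other covariates held fixed.

def odds_multiplier(odds_ratio: float, units: float) -> float:
    """Multiplicative change in the odds for a `units` increase in the predictor."""
    return odds_ratio ** units

# Odds ratios quoted in the description above.
reported_or = {"TBIL": 1.115, "PCT": 1.511, "ALT": 1.013, "PLT": 0.998}

for name, value in reported_or.items():
    # The 10-unit step is an illustrative choice, not taken from the study.
    print(f"{name}: odds x{odds_multiplier(value, 10):.3f} per 10-unit increase")

# Share of IVIG-resistant patients in the cohort (158 of 1,398).
resistant_share = 158 / 1398
print(f"resistant share: {resistant_share:.3f}")
```

    An OR below 1, as reported for PLT, shrinks the odds as the predictor rises, which is why the same function covers both risk and protective factors.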

  2. Py style code for volatility

    • kaggle.com
    zip
    Updated Aug 25, 2021
    Cite
    sushi (2021). Py style code for volatility [Dataset]. https://www.kaggle.com/madquer/volatility
    Available download formats: zip (24231567 bytes)
    Dataset updated
    Aug 25, 2021
    Authors
    sushi
    Description

    Context

    🚀 Python package-style code for LightGBM and TabNet. This is the code for model training and inference. Kaggle code is normally written as ipynb notebooks; I changed the code style to a py package, which is better for training with shell commands.

    I referred to the original code below. Thanks to @chumajin.

    [Notebook] Reference Notebook by chumajin

    Content

    1. contents in directory of src

    • prepare : data preparation (with feature engineering)
    • lightgbm : train and predict
    • tabnet : train and predict
    • volatility_2021.ipynb : the notebook of local version for last submission with shell command.

    2. structure in detail

    • light_gbm

    -- config : yaml file of parameter for lightgbm
    -- models : saved model
    -- train.py
    -- predict_test.py

    • prepare

    -- feature_engineering.py
    -- metric.py
    -- preprocessing.py
    -- seed.py
    -- tabnet_preprocessing.py

    • tabnet

    -- config : tabnet_hyp.yaml / tabnet_config.py
    -- models : saved model
    -- predict_test.py
    -- train.py

    • volatility_2021.ipynb

    Acknowledgements

    I referred to the original code below. Thanks to @chumajin.

    [Notebook] Reference Notebook by chumajin

  3. Table_1_Deep learning for crown profile modelling of Pinus yunnanensis...

    • frontiersin.figshare.com
    docx
    Updated Jun 10, 2023
    Cite
    Yuling Chen; Jianming Wang (2023). Table_1_Deep learning for crown profile modelling of Pinus yunnanensis secondary forests in Southwest China.docx [Dataset]. http://doi.org/10.3389/fpls.2023.1093905.s001
    Available download formats: docx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Yuling Chen; Jianming Wang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Southwestern China
    Description

    Accurate information concerning crown profile is critical in analyzing biological processes and providing a more accurate estimate of carbon balance, which is conducive to sustainable forest management and planning. The similarities between the types of data addressed with LSTM algorithms and crown profile data make a compelling argument for the integration of deep learning into crown profile modeling. Thus, the aim was to study the application of the deep learning method LSTM and its variant algorithms in crown profile modeling, using the crown profile database from Pinus yunnanensis secondary forests in Yunnan province, in southwest China. Furthermore, SHAP (SHapley Additive exPlanations) was used to interpret the predictions of ensemble or deep learning models. The results showed that LSTM’s variant algorithms were competitive with the traditional Vanilla LSTM, but substantially outperformed the ensemble learning model LightGBM. Specifically, the proposed Hybrid LSTM-LightGBM and Integrated LSTM-LightGBM achieved the best forecasting performance on the training set and testing set, respectively. Furthermore, the feature importance analysis of LightGBM and Vanilla LSTM showed that more factors contribute significantly to the Vanilla LSTM model than to the LightGBM model. This phenomenon can explain why deep learning outperforms ensemble learning when there are more interrelated features.

  4. Rainfall Prediction: Comparison of 7 Popular Models

    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated Apr 28, 2025
    Cite
    Kaya Ali Kus; Kaya Ali Kus (2025). Rainfall Prediction: Comparison of 7 Popular Models [Dataset]. http://doi.org/10.70124/p7rh4-0g783
    Available download formats: png, text/markdown, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Kaya Ali Kus; Kaya Ali Kus
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Rainfall Prediction using 7 Popular Models

    Context and Methodology

    Research Domain/Project:

    This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.

    Purpose:

    The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.

    Creation Process:

    The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.

    Technical Details


    Dataset Structure:

    The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:

    • Temperature
    • Humidity
    • Wind Speed
    • Pressure
    • Rainfall (target variable)

    These features are tracked for each weather station over different times, with the goal of predicting rainfall.

    Software Requirements:

    • Python: The primary programming language for data analysis and machine learning.
    • scikit-learn: For implementing machine learning models.
    • XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
    • Matplotlib/Seaborn: For data visualization.

    These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.

    • DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.

    Additional Resources

    • Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
    • Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
    • Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
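
    Reusing a model saved as a .pkl file generally comes down to a pickle round trip. The sketch below uses a plain dictionary as a stand-in for a trained model and a hypothetical file name; the actual file names and model classes in this dataset are not listed in the description.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model. The real objects stored in the dataset's
# .pkl files are not described here, so a plain dict is used for illustration.
model = {"name": "demo_model", "threshold": 0.5}

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# Save once after training...
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and load later without retraining.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

    Unpickling can execute arbitrary code, so .pkl files should only be loaded from sources you trust, and real model files typically need the same library versions that were used to save them.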

  5. Data from: Predictive modelling of peroxisome proliferator-activated...

    • tandf.figshare.com
    xlsx
    Updated Apr 7, 2025
    Cite
    A. Awomuti; Z. Yu; O. Adesina; O.W. Samuel; A.W. Mumbi; D. Yin (2025). Predictive modelling of peroxisome proliferator-activated receptor gamma (PPARγ) IC50 inhibition by emerging pollutants using light gradient boosting machine [Dataset]. http://doi.org/10.6084/m9.figshare.28652570.v1
    Available download formats: xlsx
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    A. Awomuti; Z. Yu; O. Adesina; O.W. Samuel; A.W. Mumbi; D. Yin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Peroxisome proliferator-activated receptor gamma (PPARγ), a critical nuclear receptor, plays a pivotal role in regulating metabolic and inflammatory processes. However, various environmental contaminants can disrupt PPARγ function, leading to adverse health effects. This study introduces a novel approach to predict the inhibitory activity (IC50 values) of 140 chemical compounds across 13 categories, including pesticides, organochlorines, dioxins, detergents, flame retardants, and preservatives, on PPARγ. The predictive model, based on the light-gradient boosting machine (LightGBM) algorithm, was trained on a dataset of 1804 molecules and showed r2 values of 0.82 and 0.59, Mean Absolute Error (MAE) of 0.38 and 0.58, and Root Mean Square Error (RMSE) of 0.54 and 0.76 for the training and test sets, respectively. This study provides novel insights into the interactions between emerging contaminants and PPARγ, highlighting the potential hazards and risks these chemicals may pose to public health and the environment. The ability to predict PPARγ inhibition by these hazardous contaminants demonstrates the value of this approach in guiding enhanced environmental toxicology research and risk assessment.
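
    For readers unfamiliar with the three regression metrics quoted above, here is a minimal sketch of how they are commonly computed. It assumes r2 means the coefficient of determination (the description does not say whether it is that or a squared correlation), and the sample values are made up.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R^2 (coefficient of determination), MAE and RMSE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }

# Made-up values, purely illustrative (not data from the study).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

    A gap between training and test values, such as the 0.82 versus 0.59 reported above, is the usual signal to check for overfitting.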

  6. Data_Sheet_1_IntSplice2: Prediction of the Splicing Effects of Intronic...

    • frontiersin.figshare.com
    docx
    Updated Jun 5, 2023
    Cite
    Jun-ichi Takeda; Sae Fukami; Akira Tamura; Akihide Shibata; Kinji Ohno (2023). Data_Sheet_1_IntSplice2: Prediction of the Splicing Effects of Intronic Single-Nucleotide Variants Using LightGBM Modeling.docx [Dataset]. http://doi.org/10.3389/fgene.2021.701076.s001
    Available download formats: docx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Jun-ichi Takeda; Sae Fukami; Akira Tamura; Akihide Shibata; Kinji Ohno
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Prediction of the effect of a single-nucleotide variant (SNV) in an intronic region on aberrant pre-mRNA splicing is challenging except for an SNV affecting the canonical GU/AG splice sites (ss). To predict pathogenicity of SNVs at intronic positions −50 (Int-50) to −3 (Int-3) close to the 3’ ss, we developed light gradient boosting machine (LightGBM)-based IntSplice2 models using pathogenic SNVs in the human gene mutation database (HGMD) and ClinVar and common SNVs in dbSNP with 0.01 ≤ minor allelic frequency (MAF) < 0.50. The LightGBM models were generated using features representing splicing cis-elements. The average recall/sensitivity and specificity of IntSplice2 by fivefold cross-validation (CV) of the training dataset were 0.764 and 0.884, respectively. The recall/sensitivity of IntSplice2 was lower than the average recall/sensitivity of 0.800 of IntSplice that we previously made with support vector machine (SVM) modeling for the same intronic positions. In contrast, the specificity of IntSplice2 was higher than the average specificity of 0.849 of IntSplice. For benchmarking (BM) of IntSplice2 with IntSplice, we made a test dataset that was not used to train IntSplice. After excluding the test dataset from the training dataset, we generated IntSplice2-BM and compared it with IntSplice using the test dataset. IntSplice2-BM was superior to IntSplice in all of the seven statistical measures of accuracy, precision, recall/sensitivity, specificity, F1 score, negative predictive value (NPV), and Matthews correlation coefficient (MCC). We made the IntSplice2 web service at https://www.med.nagoya-u.ac.jp/neurogenetics/IntSplice2.
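
    All seven measures named above can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the counts are made up for illustration and are not taken from the IntSplice2 benchmark.

```python
import math

def classification_measures(tp, fp, tn, fn):
    """Seven common binary-classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Made-up counts, purely illustrative.
measures = classification_measures(tp=40, fp=10, tn=45, fn=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

    MCC uses all four cells at once, which makes it more informative than accuracy when the two classes are unbalanced.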

  7. Data Sheet 2_Population-based colorectal cancer risk prediction using a...

    • frontiersin.figshare.com
    xlsx
    Updated Jul 17, 2025
    + more versions
    Cite
    Guinian Du; Hui Lv; Yishan Liang; Jingyue Zhang; Qiaoling Huang; Guiming Xie; Xian Wu; Hao Zeng; Lijuan Wu; Jianbo Ye; Wentan Xie; Xia Li; Yifan Sun (2025). Data Sheet 2_Population-based colorectal cancer risk prediction using a SHAP-enhanced LightGBM model.xlsx [Dataset]. http://doi.org/10.3389/fonc.2025.1575844.s005
    Available download formats: xlsx
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Frontiers
    Authors
    Guinian Du; Hui Lv; Yishan Liang; Jingyue Zhang; Qiaoling Huang; Guiming Xie; Xian Wu; Hao Zeng; Lijuan Wu; Jianbo Ye; Wentan Xie; Xia Li; Yifan Sun
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Colorectal cancer (CRC) is a highly frequent cancer worldwide, and early detection and risk stratification play a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.

    Methods: We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People’s Hospital (2020-2024) for model training/internal validation, with 463 patients from Laibin City People’s Hospital for external validation. Seven ML algorithms were systematically compared, with Light Gradient Boosting Machine (LightGBM) ultimately selected as the optimal framework. Model performance was rigorously assessed through area under the receiver operating characteristic (AUROC) analysis, calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) methodology was employed for feature interpretation.

    Results: The LightGBM model demonstrated exceptional discrimination with AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves revealed strong prediction-actual outcome concordance (Brier score=0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value=0.216) and CA19-9 (mean SHAP value=0.198) as dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A clinically implementable web-based risk calculator was successfully developed for real-time probability estimation.

    Conclusions: Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, effectively bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.
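
    The Brier score mentioned above is the mean squared difference between predicted probabilities and observed 0/1 outcomes. A minimal sketch with made-up predictions (not data from the study):

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Made-up predictions for four cases, purely illustrative.
score = brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
print(f"Brier score: {score:.3f}")
```

    Lower is better: 0 means perfectly confident and correct predictions, while always predicting 0.5 gives 0.25.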

  8. Skillful bias correction of offshore near-surface wind speed and wind...

    • explore.openaire.eu
    • zenodo.org
    Updated Jun 25, 2024
    + more versions
    Cite
    Qiyang Liu; Anboyu Guo; Xinjian Ma; Fengxue Qiao; Yan-an Liu; Yong Huang; Rui Wang (2024). Skillful bias correction of offshore near-surface wind speed and wind direction forecasting based on a multi-task machine learning model [Dataset]. http://doi.org/10.5281/zenodo.11044037
    Dataset updated
    Jun 25, 2024
    Authors
    Qiyang Liu; Anboyu Guo; Xinjian Ma; Fengxue Qiao; Yan-an Liu; Yong Huang; Rui Wang
    Description

    Dataset

    1. Observation data over 14 weather stations
       Variables: hourly near-surface 2-min average wind speed, wind direction

    2. ECMWF-IFS forecast data over 14 weather stations
       Variables: hourly predictors at surface level and upper level in next 48 hours (shown in Table 1 and Table 2)

    Table 1. ECMWF-IFS forecast data at surface level

    Predictors                                    Abbreviation   Unit
    Temperature at 2 m                            2t             ℃
    Sea surface temperature                       sst            ℃
    Dewpoint temperature at 2 m                   2d             ℃
    Convective precipitation in the past hour     cp             mm
    Mean sea level pressure                       msl            hPa
    Zonal component of wind speed at 10 m         10u            m s-1
    Meridional component of wind speed at 10 m    10v            m s-1
    Wind speed at 10 m                            10ws           m s-1
    Wind direction at 10 m                        10wd           °
    Zonal component of wind speed at 100 m        100u           m s-1
    Meridional component of wind speed at 100 m   100v           m s-1
    Wind speed at 100 m                           100ws          m s-1
    Wind direction at 100 m                       100wd          °

    Table 2. ECMWF-IFS forecast data at upper level

    Predictors                                    Abbreviation   Unit
    Relative humidity at xxx hPa                  r_Lxxx         %
    Temperature at xxx hPa                        t_Lxxx         ℃
    Vertical velocity of wind at xxx hPa          w_Lxxx         Pa s-1
    Zonal component of wind at xxx hPa            u_Lxxx         m s-1
    Meridional component of wind at xxx hPa       v_Lxxx         m s-1
    Wind speed at xxx hPa                         ws_Lxxx        m s-1
    Wind direction at xxx hPa                     wd_Lxxx        °

    3. Key variables constructed by feature engineering
       (1) short-term statistics, including maximum, minimum, mean and variance of key variables (2t, 10u, 10v and 10ws) from ECMWF-IFS model during the next 48 hours,
       (2) long-term statistics, including mean and deviation of key variables (2t, 10u, 10v and 10ws) from ECMWF-IFS model during history 3-yr period (January 2020–December 2022),
       (3) thermodynamic factors, including the low-level wind shear between 10ws and 100ws, vertical wind shear between 200 hPa and 850 hPa, the differences between sst and 2t.

    Scripts

    1. Random Forest model training code
    2. LightGBM model training code
    3. XGBoost model training code
    4. TabNet-MTL model training code
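
    Several of the predictors listed above are derived from the zonal (u) and meridional (v) wind components. The following is a minimal sketch of those derivations using the standard meteorological convention for wind direction. The function names are hypothetical, and taking wind shear as a plain speed difference and using the population variance are assumptions, since the description does not define either.

```python
import math
from statistics import mean, pvariance

def wind_speed(u, v):
    """Wind speed from zonal (u) and meridional (v) components."""
    return math.hypot(u, v)

def wind_direction(u, v):
    """Meteorological direction in degrees: where the wind blows FROM (0 = north, 90 = east)."""
    return (180.0 + math.degrees(math.atan2(u, v))) % 360.0

def short_term_stats(values):
    """Maximum, minimum, mean and variance over a forecast window."""
    # Assumption: population variance; the description does not say which variant is used.
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "variance": pvariance(values),
    }

# Made-up components at 10 m and 100 m, purely illustrative.
ws10 = wind_speed(3.0, 4.0)
ws100 = wind_speed(6.0, 8.0)

# Assumption: low-level shear taken as the speed difference between the two levels.
low_level_shear = ws100 - ws10

print(ws10, ws100, low_level_shear, wind_direction(0.0, -10.0))
print(short_term_stats([1.0, 2.0, 3.0, 4.0]))
```

    A wind with u = 0 and v < 0 moves air southward, so it is reported as a northerly wind with direction 0°.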

  9. Prediction of Venous Thromboembolism in Diverse Populations Using Machine...

    • data.mendeley.com
    Updated Oct 25, 2023
    + more versions
    Cite
    Robert Chen (2023). Prediction of Venous Thromboembolism in Diverse Populations Using Machine Learning and Electronic Health Records [Dataset]. http://doi.org/10.17632/tkwzysr4y6.6
    Dataset updated
    Oct 25, 2023
    Authors
    Robert Chen
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Contains resources needed to train, test, and analyze performance of gradient boosting models used to predict venous thromboembolism (VTE) from electronic health record (EHR) data.

    • "Code for analyses" folder: Contains code we used for the analyses in our paper.
    • Prediction.ipynb: Contains code needed to run trained models.
    • Small, Medium, and Large.xlsx: Excel templates to correctly format data for prediction generation.
    • Models.zip: Contains trained models. Note that this is 0.4 GB once unzipped.
    • Analysis.ipynb: Contains code used to train the models.

    Dependencies: Python 3.10.9; Pandas 1.5.1; LightGBM 3.3.2.

  10. Table_1_MRI radiomics combined with machine learning for diagnosing mild...

    • datasetcatalog.nlm.nih.gov
    Updated Oct 4, 2024
    Cite
    Zhang, Yaping; Chen, Yi; Ye, Zhinan; Luo, Weili; Chen, Yini; Chen, Ying; Wang, Wenjie; Lin, Andong (2024). Table_1_MRI radiomics combined with machine learning for diagnosing mild cognitive impairment: a focus on the cerebellar gray and white matter.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001275457
    Dataset updated
    Oct 4, 2024
    Authors
    Zhang, Yaping; Chen, Yi; Ye, Zhinan; Luo, Weili; Chen, Yini; Chen, Ying; Wang, Wenjie; Lin, Andong
    Description

    Objective: Mild Cognitive Impairment (MCI) is a recognized precursor to Alzheimer’s Disease (AD), presenting a significant risk of progression. Early detection and intervention in MCI can potentially slow disease advancement, offering substantial clinical benefits. This study employed radiomics and machine learning methodologies to distinguish between MCI and Normal Cognition (NC) groups.

    Methods: The study included 172 MCI patients and 183 healthy controls from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, all of whom had 3D-T1 weighted MRI structural images. The cerebellar gray and white matter were segmented automatically using volBrain software, and radiomic features were extracted and screened through Pyradiomics. The screened features were then input into various machine learning models, including Random Forest (RF), Logistic Regression (LR), eXtreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), K Nearest Neighbors (KNN), Extra Trees, Light Gradient Boosting Machine (LightGBM), and Multilayer Perceptron (MLP). Each model was optimized for penalty parameters through 5-fold cross-validation to construct radiomic models. The DeLong test was used to evaluate the performance of different models.

    Results: The LightGBM model, which utilizes a combination of cerebellar gray and white matter features (comprising eight gray matter and eight white matter features), emerges as the most effective model for radiomics feature analysis. The model demonstrates an Area Under the Curve (AUC) of 0.863 for the training set and 0.776 for the test set.

    Conclusion: Radiomic features based on the cerebellar gray and white matter, combined with machine learning, can objectively diagnose MCI, which provides significant clinical value for assisted diagnosis.
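
    The 5-fold cross-validation mentioned in the Methods can be illustrated with a small index splitter. This is a generic sketch, not the authors' pipeline: the shuffling seed and the function name are hypothetical, and the sample count is simply the 172 + 183 subjects quoted above.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]    # k near-equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# 172 MCI patients + 183 controls = 355 subjects, as stated in the description.
splits = list(k_fold_indices(172 + 183, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

    Each subject lands in exactly one test fold, so every sample is used for validation once and for training four times.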

  11. xids-dataset

    • huggingface.co
    Updated Aug 8, 2025
    Cite
    Lumy (2025). xids-dataset [Dataset]. https://huggingface.co/datasets/luminolous/xids-dataset
    Dataset updated
    Aug 8, 2025
    Authors
    Lumy
    Description

    X-IDS Dataset & Artifacts Repository

    This repository contains all the data assets, experiment results, and preprocessing steps used in the development of the X-IDS system, an Explainable Intrusion Detection System using autoencoders, LightGBM classifiers, and fine-tuned T5-small text generation.

    The repository includes: raw and processed data, tensor-formatted datasets for model training, and hyperparameter search results using Optuna.

      Folder Structure… See the full description on the dataset page: https://huggingface.co/datasets/luminolous/xids-dataset.
    
  12. Training dataset portioning results using CatBoost.

    • figshare.com
    xls
    Updated Aug 26, 2024
    + more versions
    Cite
    Pavithra Mahesh; Rajkumar Soundrapandiyan (2024). Training dataset portioning results using CatBoost. [Dataset]. http://doi.org/10.1371/journal.pone.0291928.t002
    Available download formats: xls
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Pavithra Mahesh; Rajkumar Soundrapandiyan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training dataset portioning results using CatBoost.

  13. Ensemble Learning for Spatial Modeling of Icing Fields from Multi-Source...

    • zenodo.org
    zip
    Updated Jun 9, 2025
    Cite
    shaohui zhou; shaohui zhou (2025). Ensemble Learning for Spatial Modeling of Icing Fields from Multi-Source Remote Sensing Data: Partial Data and Training Code [Dataset]. http://doi.org/10.5281/zenodo.15622908
    Available download formats: zip
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    shaohui zhou; shaohui zhou
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate real-time icing grid fields are critical for preventing ice-related disasters during winter and protecting property. These fields are essential both for mapping ice distribution and for predicting icing using physical models combined with numerical weather prediction systems. However, developing precise real-time icing grids is challenging due to the uneven distribution of monitoring stations, data confidentiality restrictions, and the limitations of existing interpolation methods. In this study, we propose a new approach for constructing real-time icing grid fields using 1,339 online terminal monitoring datasets provided by the China Southern Power Grid Research Institute Co., Ltd. (CSPGRI) during the winter of 2023. Our method integrates static geographic information, dynamic meteorological factors, and ice_kriging values derived from parameter-optimized Empirical Bayesian Kriging Interpolation (EBKI) to create a spatiotemporally matched, multi-source fused icing thickness grid dataset. We applied five machine learning algorithms—Random Forest, XGBoost, LightGBM, Stacking, and Convolutional Neural Network Transformers (CNNT)—and evaluated their performance using six metrics: R, RMSE, CSI, MAR, FAR, and fbias, on both validation and testing sets. The stacking model performed best, achieving an R value of 0.634 (0.893), RMSE of 3.424 mm (2.834 mm), CSI of 0.514 (0.774), MAR of 0.309 (0.091), FAR of 0.332 (0.161), and fbias of 1.034 (1.084), respectively, when comparing predicted icing values with actual measurements on pylons. Additionally, we employed the SHAP model to provide a physical interpretation of the stacking model, confirming the independence of selected features. Meteorological factors such as relative humidity (RH), 10-meter wind speed (WS10), 2-meter temperature (T2), and precipitation (PRE) demonstrated a range of positive and negative contributions consistent with the observed growth of icing. 
    Thus, our multi-source remote sensing data fusion approach, combined with the stacking model, offers a highly accurate and interpretable solution for generating real-time icing grid fields.
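
    Four of the six metrics above (CSI, MAR, FAR and fbias) are categorical scores built from a contingency table of hits, misses and false alarms. The sketch below uses the definitions common in forecast verification; the description itself does not define them or the event threshold, so treat both the formulas and the counts as assumptions.

```python
def categorical_scores(hits, misses, false_alarms):
    """Categorical verification scores from contingency-table counts.

    Assumed definitions (not stated in the description):
      CSI   = hits / (hits + misses + false_alarms)   critical success index
      MAR   = misses / (hits + misses)                miss rate
      FAR   = false_alarms / (hits + false_alarms)    false alarm ratio
      fbias = (hits + false_alarms) / (hits + misses) frequency bias
    """
    return {
        "csi": hits / (hits + misses + false_alarms),
        "mar": misses / (hits + misses),
        "far": false_alarms / (hits + false_alarms),
        "fbias": (hits + false_alarms) / (hits + misses),
    }

# Made-up counts, purely illustrative.
scores = categorical_scores(hits=50, misses=25, false_alarms=25)
print(scores)
```

    Under these definitions an fbias above 1 means the event is predicted more often than it is observed, which matches the slight over-forecasting suggested by the reported values of 1.034 and 1.084.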

  14. Data from: Approximated UTCI

    • produccioncientifica.ucm.es
    Updated 2025
    Cite
    Collazo, Soledad; Collazo, Soledad (2025). Approximated UTCI [Dataset]. https://produccioncientifica.ucm.es/documentos/688b603d17bb6239d2d4a144
    Dataset updated
    2025
    Authors
    Collazo, Soledad; Collazo, Soledad
    Description

    In the repository you can find a variety of data and scripts to approximate the UTCI in southern South America and apply it to forecasts generated by data-driven models:

    1) UTCI data from ERA5-HEAT and different meteorological variables from ERA5.
    2) LightGBM models trained to estimate the UTCI from different predictors.
    3) Two example scripts to train the LGBM models.
    4) Scripts for metric estimation on the test sample of different LightGBM-based models with different predictors.
    5) Forecasts of the traditional GFS model, and data-driven models during a heat wave in central Argentina during March 2023.
    6) Scripts to apply the UTCI approach on the forecasts mentioned in the previous item.

    This material is related to the article "Forecasting Heat Stress in southern South America from data-driven model outputs"

  15. DataSheet1_Comparative analysis of tissue-specific genes in maize based on...

    • frontiersin.figshare.com
    docx
    Updated Jun 2, 2023
    + more versions
    Cite
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang (2023). DataSheet1_Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest.docx [Dataset]. http://doi.org/10.3389/fgene.2023.1190887.s001
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants.

    Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy based on 1,548 maize multi-tissue RNA-seq data obtained from a public database to identify tissue-specific genes. In terms of validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.

    Results: Based on clustering validation, the convolutional neural network outperformed others with higher V-measure values as 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.

    Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategy for machine learning models and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensions and bias difficulties in bioinformatics data processing.

  16. f

    Data Sheet 1_Enhancing fever of unknown origin diagnosis: machine learning...

    • frontiersin.figshare.com
    docx
    Updated Apr 15, 2025
    Cite
    Zhi Gao; Yongfang Jiang; Mengxuan Chen; Weihang Wang; Qiyao Liu; Jing Ma (2025). Data Sheet 1_Enhancing fever of unknown origin diagnosis: machine learning approaches to predict metagenomic next-generation sequencing positivity.docx [Dataset]. http://doi.org/10.3389/fcimb.2025.1550933.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Frontiers
    Authors
    Zhi Gao; Yongfang Jiang; Mengxuan Chen; Weihang Wang; Qiyao Liu; Jing Ma
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Metagenomic next-generation sequencing (mNGS) can potentially detect various pathogenic microorganisms without bias to improve the diagnostic rate of fever of unknown origin (FUO), but there are no effective methods to predict mNGS-positive results. This study aimed to develop an interpretable machine learning algorithm for the effective prediction of mNGS results in patients with FUO. Methods: A clinical dataset from a large medical institution was used to develop and compare the performance of several predictive models, namely eXtreme Gradient Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM), and Random Forest, and the Shapley additive explanation (SHAP) method was employed to interpret and analyze the results. Results: The mNGS-positive rate among 284 patients with FUO reached 64.1%. Overall, the LightGBM-based model exhibited the best comprehensive predictive performance, with areas under the curve of 0.84 and 0.93 for the training and validation sets, respectively. Using the SHAP method, the five most important factors for predicting mNGS-positive results were albumin, procalcitonin, blood culture, disease type, and sample type. Conclusion: The validated LightGBM-based predictive model could have practical clinical value in enhancing the application of mNGS in the etiological diagnosis of FUO, representing a powerful tool to optimize the timing of mNGS.

  17. f

    DEM error verified by airborne data.

    • plos.figshare.com
    xls
    Updated Oct 7, 2024
    Cite
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia (2024). DEM error verified by airborne data. [Dataset]. http://doi.org/10.1371/journal.pone.0309025.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The accuracy of digital elevation models (DEMs) in forested areas plays a crucial role in canopy height monitoring and ecological sensitivity analysis. Despite extensive research on DEMs in recent years, significant errors still exist in forested areas due to factors such as canopy occlusion, terrain complexity, and limited penetration, posing challenges for subsequent analyses based on DEMs. Therefore, a CNN-LightGBM hybrid model is proposed in this paper, with four different types of forests (tropical rainforest, coniferous forest, mixed coniferous and broad-leaved forest, and broad-leaved forest) selected as study sites to validate the performance of the hybrid model in correcting COP30DEM in different forest area DEMs. The hybrid model uses the Densenet architecture of CNN models with LightGBM as the primary model. This choice is based on LightGBM’s leaf-growth strategy and histogram linking methods, which are effective in reducing the data’s memory footprint and utilising more of the data without sacrificing speed. The study uses elevation values from ICESat-2 as ground truth, covering several parameters including COP30DEM, canopy height, forest coverage, slope, terrain roughness and relief amplitude. To validate the superiority of the CNN-LightGBM hybrid model in DEM correction compared to other models, the LightGBM, CNN-SVR, and SVR models are tested within the same sample space. Common meta-heuristic optimisation algorithms can alleviate issues such as overfitting or underfitting during model training to a certain extent, but they still have some shortcomings. To overcome these shortcomings, this paper introduces the Firefly Algorithm-based Sparrow Search Optimization Algorithm (FA-SSA), an improved SSA search algorithm that incorporates the ingestion strategy of the FA algorithm to increase the diversity of solutions and global search capability. Comparing multiple models and validating the data with an airborne LiDAR reference dataset shows that the R² (R-squared) of the CNN-LightGBM model improves by more than 0.05 compared to the other models, and that it performs better in the experiments. The FA-SSA-CNN-LightGBM model has the highest accuracy, with an RMSE of 1.09 meters, a reduction of more than 30% in RMSE compared to the LightGBM and other hybrid models. Compared to other forested area DEMs (such as FABDEM and GEDI), its accuracy is improved by more than 50%, and the performance is significantly better than other commonly used DEMs in forested areas, indicating the feasibility of this method in correcting elevation errors in forested area DEMs and its significant importance in advancing global topographic mapping.

  18. f

    Paired t-test for detecting the difference between DMFDEM errors and other...

    • plos.figshare.com
    xls
    Updated Oct 7, 2024
    Cite
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia (2024). Paired t-test for detecting the difference between DMFDEM errors and other types of DEM errors. [Dataset]. http://doi.org/10.1371/journal.pone.0309025.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paired t-test for detecting the difference between DMFDEM errors and other types of DEM errors.

  19. f

    Table_1_A Machine Learning Algorithm for Predicting the Risk of Developing...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Li Ding; Kun Wang; Chi Zhang; Yang Zhang; Kanlirong Wang; Wang Li; Junqi Wang (2023). Table_1_A Machine Learning Algorithm for Predicting the Risk of Developing to M1b Stage of Patients With Germ Cell Testicular Cancer.XLSX [Dataset]. http://doi.org/10.3389/fpubh.2022.916513.s004
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Li Ding; Kun Wang; Chi Zhang; Yang Zhang; Kanlirong Wang; Wang Li; Junqi Wang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Distant metastasis other than non-regional lymph nodes and lung (i.e., M1b stage) significantly contributes to the poor survival prognosis of patients with germ cell testicular cancer (GCTC). The aim of this study was to develop a machine learning (ML) algorithm model to predict the risk of patients with GCTC developing the M1b stage, which can be used to assist in early intervention of patients. Methods: The clinical and pathological data of patients with GCTC were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Combining the patients' characteristic variables, we applied six ML algorithms to develop the predictive models, including logistic regression (LR), eXtreme Gradient Boosting (XGBoost), light Gradient Boosting Machine (lightGBM), random forest (RF), multilayer perceptron (MLP), and k-nearest neighbor (kNN). Model performances were evaluated by 10-fold cross-receiver operating characteristic (ROC) curves, which calculated the area under the curve (AUC) of models for predictive accuracy. A total of 54 patients from our own center (October 2006 to June 2021) were collected as the external validation cohort. Results: A total of 4,323 patients eligible for inclusion were screened for enrollment from the SEER database, of whom 178 (4.12%) developed the M1b stage. Multivariate logistic regression showed that lymph node dissection (LND), T stage, N stage, lung metastases, and distant lymph node metastases were the independent predictors of developing M1b stage risk. The models based on both the XGBoost and RF algorithms showed stable and efficient prediction performance in the training and external validation groups. Conclusion: S-stage is not an independent factor for predicting the risk of developing the M1b stage of patients with GCTC. The ML models based on both XGBoost and RF algorithms have high predictive effectiveness and may be used to predict the risk of developing the M1b stage of patients with GCTC, which is of promising value in clinical decision-making. Models still need to be tested with a larger sample of real-world data.

  20. f

    Error of ICESat-2 with respect to airborne data.

    • plos.figshare.com
    xls
    Updated Oct 7, 2024
    Cite
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia (2024). Error of ICESat-2 with respect to airborne data. [Dataset]. http://doi.org/10.1371/journal.pone.0309025.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The accuracy of digital elevation models (DEMs) in forested areas plays a crucial role in canopy height monitoring and ecological sensitivity analysis. Despite extensive research on DEMs in recent years, significant errors still exist in forested areas due to factors such as canopy occlusion, terrain complexity, and limited penetration, posing challenges for subsequent analyses based on DEMs. Therefore, a CNN-LightGBM hybrid model is proposed in this paper, with four different types of forests (tropical rainforest, coniferous forest, mixed coniferous and broad-leaved forest, and broad-leaved forest) selected as study sites to validate the performance of the hybrid model in correcting COP30DEM in different forest area DEMs. The hybrid model uses the Densenet architecture of CNN models with LightGBM as the primary model. This choice is based on LightGBM’s leaf-growth strategy and histogram linking methods, which are effective in reducing the data’s memory footprint and utilising more of the data without sacrificing speed. The study uses elevation values from ICESat-2 as ground truth, covering several parameters including COP30DEM, canopy height, forest coverage, slope, terrain roughness and relief amplitude. To validate the superiority of the CNN-LightGBM hybrid model in DEM correction compared to other models, the LightGBM, CNN-SVR, and SVR models are tested within the same sample space. Common meta-heuristic optimisation algorithms can alleviate issues such as overfitting or underfitting during model training to a certain extent, but they still have some shortcomings. To overcome these shortcomings, this paper introduces the Firefly Algorithm-based Sparrow Search Optimization Algorithm (FA-SSA), an improved SSA search algorithm that incorporates the ingestion strategy of the FA algorithm to increase the diversity of solutions and global search capability. Comparing multiple models and validating the data with an airborne LiDAR reference dataset shows that the R² (R-squared) of the CNN-LightGBM model improves by more than 0.05 compared to the other models, and that it performs better in the experiments. The FA-SSA-CNN-LightGBM model has the highest accuracy, with an RMSE of 1.09 meters, a reduction of more than 30% in RMSE compared to the LightGBM and other hybrid models. Compared to other forested area DEMs (such as FABDEM and GEDI), its accuracy is improved by more than 50%, and the performance is significantly better than other commonly used DEMs in forested areas, indicating the feasibility of this method in correcting elevation errors in forested area DEMs and its significant importance in advancing global topographic mapping.

Cite
Jie Liu; Jian Zhang; Haodong Huang; Yunting Wang; Zuyue Zhang; Yunfeng Ma; Xiangqian He (2023). Data_Sheet_1_A Machine Learning Model to Predict Intravenous Immunoglobulin-Resistant Kawasaki Disease Patients: A Retrospective Study Based on the Chongqing Population.CSV [Dataset]. http://doi.org/10.3389/fped.2021.756095.s001

Data_Sheet_1_A Machine Learning Model to Predict Intravenous Immunoglobulin-Resistant Kawasaki Disease Patients: A Retrospective Study Based on the Chongqing Population.CSV

Explore at:
Available download formats: txt
Dataset updated
Jun 8, 2023
Dataset provided by
Frontiers
Authors
Jie Liu; Jian Zhang; Haodong Huang; Yunting Wang; Zuyue Zhang; Yunfeng Ma; Xiangqian He
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
Chongqing
Description

Objective: We explored the risk factors for intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) and constructed a prediction model based on machine learning algorithms. Methods: A retrospective study including 1,398 KD patients hospitalized in 7 affiliated hospitals of Chongqing Medical University from January 2015 to August 2020 was conducted. All patients were divided into IVIG-responsive and IVIG-resistant groups, which were randomly divided into training and validation sets. The independent risk factors were determined using logistic regression analysis. Logistic regression nomograms, support vector machine (SVM), XGBoost and LightGBM prediction models were constructed and compared with the previous models. Results: In total, 1,240 out of 1,398 patients were IVIG responders, while 158 were resistant to IVIG. According to the results of logistic regression analysis of the training set, four independent risk factors were identified, including total bilirubin (TBIL) (OR = 1.115, 95% CI 1.067–1.165), procalcitonin (PCT) (OR = 1.511, 95% CI 1.270–1.798), alanine aminotransferase (ALT) (OR = 1.013, 95% CI 1.008–1.018) and platelet count (PLT) (OR = 0.998, 95% CI 0.996–1). Logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models were constructed based on the above independent risk factors. The sensitivity was 0.617, 0.681, 0.638, and 0.702, the specificity was 0.712, 0.841, 0.967, and 0.903, and the area under curve (AUC) was 0.731, 0.814, 0.804, and 0.874, respectively. Among the prediction models, the LightGBM model displayed the best ability for comprehensive prediction, with an AUC of 0.874, which surpassed the previous classic models of Egami (AUC = 0.581), Kobayashi (AUC = 0.524), Sano (AUC = 0.519), Fu (AUC = 0.578), and Formosa (AUC = 0.575). Conclusion: The machine learning LightGBM prediction model for IVIG-resistant KD patients was superior to previous models. Our findings may help to accomplish early identification of the risk of IVIG resistance and improve patient outcomes.
