12 datasets found
  1. DataSheet1_Comparative analysis of tissue-specific genes in maize based on...

    • frontiersin.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang (2023). DataSheet1_Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest.docx [Dataset]. http://doi.org/10.3389/fgene.2023.1190887.s001
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: With the advancement of RNA-seq technology and machine learning, training machine learning models on large-scale RNA-seq data from databases can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants.

    Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy, based on 1,548 maize multi-tissue RNA-seq data obtained from a public database, to identify tissue-specific genes. In terms of validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.

    Results: Based on clustering validation, the convolutional neural network outperformed the others with a higher V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.

    Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategies of the machine learning models, and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensions and bias difficulties in bioinformatics data processing.
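
    The V-measure used for clustering validation above is the harmonic mean of homogeneity and completeness. A minimal sketch of that standard definition in plain Python follows; the function names and the toy tissue labels are illustrative assumptions, not taken from the study's code.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_true = _entropy(true_labels)
    h_clust = _entropy(cluster_labels)
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - _conditional_entropy(true_labels, cluster_labels) / h_true
    )
    completeness = (
        1.0 if h_clust == 0
        else 1.0 - _conditional_entropy(cluster_labels, true_labels) / h_clust
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical example: tissue labels versus k-means cluster assignments.
tissues = ["leaf", "leaf", "root", "root"]
clusters = [0, 0, 1, 1]
score = v_measure(tissues, clusters)
```

    A score near 1 means each cluster contains a single tissue and each tissue falls in a single cluster; a score near 0 means the clustering carries no information about the tissue labels.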

  2. Py style code for volatility

    • kaggle.com
    zip
    Updated Aug 25, 2021
    Cite
    sushi (2021). Py style code for volatility [Dataset]. https://www.kaggle.com/madquer/volatility
    Available download formats: zip (24231567 bytes)
    Dataset updated
    Aug 25, 2021
    Authors
    sushi
    Description

    Context

    🚀 Python package-style code, with the package code on datasets, for LightGBM and TabNet. This is the code for model training and inference. Normally we use ipynb-style code on Kaggle. I changed the code style to a Python package, which is better for training with shell commands.

    I referred to the original code below; thanks to @chumajin.

    [Notebook] Reference Notebook by chumajin

    Content

    1. contents in directory of src

    • prepare data(with feature engineering),
    • lightgbm : train and predict
    • tabnet : train and predict
    • volatility_2021.ipynb : the notebook of local version for last submission with shell command.

    2. structure in detail

    • light_gbm

    -- config : yaml file of parameter for lightgbm
    -- models : saved model
    -- train.py
    -- predict_test.py

    • prepare

    -- feature_engineering.py
    -- metric.py
    -- preprocessing.py
    -- seed.py
    -- tabnet_preprocessing.py

    • tabnet

    -- config : tabnet_hyp.yaml / tabnet_config.py
    -- models : saved model
    -- predict_test.py
    -- train.py

    • volatility_2021.ipynb

    Acknowledgements

    I referred to the original code below; thanks to @chumajin.

    [Notebook] Reference Notebook by chumajin

  3. Table_1_Deep learning for crown profile modelling of Pinus yunnanensis...

    • figshare.com
    • frontiersin.figshare.com
    docx
    Updated Jun 10, 2023
    Cite
    Yuling Chen; Jianming Wang (2023). Table_1_Deep learning for crown profile modelling of Pinus yunnanensis secondary forests in Southwest China.docx [Dataset]. http://doi.org/10.3389/fpls.2023.1093905.s001
    Available download formats: docx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Yuling Chen; Jianming Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Southwestern China
    Description

    Accurate information concerning crown profile is critical in analyzing biological processes and providing a more accurate estimate of carbon balance, which is conducive to sustainable forest management and planning. The similarities between the types of data addressed with LSTM algorithms and crown profile data make a compelling argument for the integration of deep learning into crown profile modeling. Thus, the aim was to study the application of the deep learning method LSTM and its variant algorithms in crown profile modeling, using the crown profile database from Pinus yunnanensis secondary forests in Yunnan province, in southwest China. Furthermore, SHAP (SHapley Additive exPlanations) was used to interpret the predictions of ensemble or deep learning models. The results showed that LSTM's variant algorithms were competitive with the traditional Vanilla LSTM, but substantially outperformed the ensemble learning model LightGBM. Specifically, the proposed Hybrid LSTM-LightGBM and Integrated LSTM-LightGBM achieved the best forecasting performance on the training set and testing set, respectively. Furthermore, the feature importance analysis of LightGBM and Vanilla LSTM showed that more factors contributed significantly to the Vanilla LSTM model than to the LightGBM model. This phenomenon can explain why deep learning outperforms ensemble learning when there are more interrelated features.

  4. Sentinel-2 Cloud Mask Catalogue

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, zip
    Updated Jul 19, 2024
    Cite
    Alistair Francis; John Mrziglod; Panagiotis Sidiropoulos; Jan-Peter Muller (2024). Sentinel-2 Cloud Mask Catalogue [Dataset]. http://doi.org/10.5281/zenodo.4172871
    Available download formats: pdf, zip, csv
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alistair Francis; John Mrziglod; Panagiotis Sidiropoulos; Jan-Peter Muller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset comprises cloud masks for 513 1022-by-1022 pixel subscenes, at 20m resolution, sampled randomly from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from some observations about cloud masking: (i) performance over an entire product is highly correlated, thus subscenes provide more value per-pixel than full scenes, (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, which introduces a bias into the dataset that is not representative of the real-world data, (iii) cloud mask performance appears to be highly correlated to surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.

    The data was annotated semi-automatically, using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM), speeding up annotations by iteratively improving its predictions, but preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a large enough dataset to approximate the statistics of the whole Sentinel-2 archive.

    In addition to the pixel-wise, 3 class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:

    • SURFACE TYPE: 11 categories
    • CLOUD TYPE: 7 categories
    • CLOUD HEIGHT: low, high
    • CLOUD THICKNESS: thin, thick
    • CLOUD EXTENT: isolated, extended

    Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible due to high-relief terrain or large ambiguities. In total, 424 subscenes were marked with shadows (if present), and 89 have shadows that were not annotatable due to very ambiguous shadow boundaries, or terrain that cast significant shadows. If users wish to train an algorithm specifically for cloud shadow masks, we advise them to remove those 89 images for which shadow annotation was not possible. However, bear in mind that this will systematically reduce the difficulty of the shadow class compared to real-world use, as these contain the most difficult shadow examples.
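
    The advice above about removing subscenes without annotatable shadows can be sketched as a simple filter. This is only an illustration: the record layout and the `shadows_marked` field name are hypothetical, so check the dataset README for the actual file and column names.

```python
def split_by_shadow_annotation(subscenes):
    """Separate subscenes with usable shadow annotations from the rest.

    `subscenes` is a list of dicts; the boolean key "shadows_marked" is a
    hypothetical name standing in for whatever flag the catalogue provides.
    """
    usable = [s for s in subscenes if s["shadows_marked"]]
    excluded = [s for s in subscenes if not s["shadows_marked"]]
    return usable, excluded


# Hypothetical records, not real catalogue entries.
records = [
    {"scene": "subscene_a", "shadows_marked": True},
    {"scene": "subscene_b", "shadows_marked": False},
    {"scene": "subscene_c", "shadows_marked": True},
]
train_for_shadow, held_out = split_by_shadow_annotation(records)
```

    Keeping the excluded list, rather than discarding it, makes it easy to report how much harder the shadow class becomes when those subscenes are included.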

    In addition to the 20m sampled subscenes and masks, we also provide users with shapefiles that define the boundary of the mask on the original Sentinel-2 scene. If users wish to retrieve the L1C bands at their original resolutions, they can use these to do so.

    Please see the README for further details on the dataset structure and more.

    Contributions & Acknowledgements

    The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.

    Support and advice were provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.

    We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.

    Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.

  5. DataSheet1_Blood pressure stratification using photoplethysmography and...

    • frontiersin.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Xudong Hu; Shimin Yin; Xizhuang Zhang; Carlo Menon; Cheng Fang; Zhencheng Chen; Mohamed Elgendi; Yongbo Liang (2023). DataSheet1_Blood pressure stratification using photoplethysmography and light gradient boosting machine.ZIP [Dataset]. http://doi.org/10.3389/fphys.2023.1072273.s001
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Xudong Hu; Shimin Yin; Xizhuang Zhang; Carlo Menon; Cheng Fang; Zhencheng Chen; Mohamed Elgendi; Yongbo Liang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Globally, hypertension (HT) is a substantial risk factor for cardiovascular disease and mortality; hence, rapid identification and treatment of HT is crucial. In this study, we tested the light gradient boosting machine (LightGBM) machine learning method for blood pressure stratification based on photoplethysmography (PPG), which is used in most wearable devices.

    Methods: We used 121 records of PPG and arterial blood pressure (ABP) signals from the Medical Information Mart for Intensive Care III public database. PPG, velocity plethysmography, and acceleration plethysmography were used to estimate blood pressure; the ABP signals were used to determine the blood pressure stratification categories. Seven feature sets were established and used to train the Optuna-tuned LightGBM model. Three trials compared normotension (NT) vs. prehypertension (PHT), NT vs. HT, and NT + PHT vs. HT.

    Results: The F1 scores for these three classification trials were 90.18%, 97.51%, and 92.77%, respectively. The results showed that combining multiple features from PPG and its derivative led to a more accurate classification of HT classes than using features from only the PPG signal.

    Discussion: The proposed method showed high accuracy in stratifying HT risks, providing a noninvasive, rapid, and robust method for the early detection of HT, with promising applications in the field of wearable cuffless blood pressure measurement.
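
    The three binary trials and the F1 metric described above can be illustrated in plain Python. This is a sketch under stated assumptions: the helper names and the toy category list are hypothetical, and how the study derived the NT/PHT/HT categories from the ABP signals is not shown here.

```python
# Each trial maps to (negative group, positive group) of category labels.
TRIALS = {
    "NT vs PHT": ({"NT"}, {"PHT"}),
    "NT vs HT": ({"NT"}, {"HT"}),
    "NT+PHT vs HT": ({"NT", "PHT"}, {"HT"}),
}


def make_trial(categories, negative, positive):
    """Return (indices, binary_labels) for records belonging to either group."""
    indices, labels = [], []
    for i, cat in enumerate(categories):
        if cat in positive:
            indices.append(i)
            labels.append(1)
        elif cat in negative:
            indices.append(i)
            labels.append(0)
    return indices, labels


def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical category labels for four records.
categories = ["NT", "PHT", "HT", "NT"]
idx, y = make_trial(categories, *TRIALS["NT vs HT"])
```

    Records outside both groups are dropped for that trial, which is why the first two trials use fewer records than the third.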

  6. Table3_Comparative analysis of tissue-specific genes in maize based on...

    • figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang (2023). Table3_Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest.xlsx [Dataset]. http://doi.org/10.3389/fgene.2023.1190887.s006
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: With the advancement of RNA-seq technology and machine learning, training machine learning models on large-scale RNA-seq data from databases can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants.

    Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy, based on 1,548 maize multi-tissue RNA-seq data obtained from a public database, to identify tissue-specific genes. In terms of validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.

    Results: Based on clustering validation, the convolutional neural network outperformed the others with a higher V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.

    Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategies of the machine learning models, and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensions and bias difficulties in bioinformatics data processing.

  7. Data_Sheet_3_Prediction model of acute kidney injury after different types...

    • figshare.com
    txt
    Updated Jun 16, 2023
    Cite
    Li Xinsai; Wang Zhengye; Huang Xuan; Chu Xueqian; Peng Kai; Chen Sisi; Jiang Xuyan; Li Suhua (2023). Data_Sheet_3_Prediction model of acute kidney injury after different types of acute aortic dissection based on machine learning.CSV [Dataset]. http://doi.org/10.3389/fcvm.2022.984772.s003
    Available download formats: txt
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers
    Authors
    Li Xinsai; Wang Zhengye; Huang Xuan; Chu Xueqian; Peng Kai; Chen Sisi; Jiang Xuyan; Li Suhua
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: A clinical prediction model for postoperative combined acute kidney injury (AKI) in patients with Type A acute aortic dissection (TAAAD) and Type B acute aortic dissection (TBAAD) was constructed using machine learning (ML).

    Methods: Baseline data were collected from acute aortic dissection (AAD) patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2019 and December 31, 2021. (1) We identified baseline serum creatinine (SCR) estimation methods and used them as a basis for the diagnosis of AKI. (2) The total dataset was randomly divided into a training set (70%) and a test set (30%); features were modeled and validated with bootstrapping using multiple ML methods in the training set, and the models with the largest area under the curve (AUC) were selected for follow-up studies. (3) The variables of the best ML models were screened with the model visualization tools SHapley Additive exPlanations (SHAP) and Recursive feature reduction (REF). (4) Finally, the pre-screened prediction models were evaluated using test set data from three aspects: discrimination, calibration, and clinical benefit.

    Results: The final incidence of AKI was 69.4% (120/173) in 173 patients with TAAAD and 28.6% (81/283) in 283 patients with TBAAD. For TAAAD-AKI, the Random Forest (RF) model showed the best prediction performance in the training set (AUC = 0.760, 95% CI:0.630–0.881); while for TBAAD-AKI, the Light Gradient Boosting Machine (LightGBM) model worked best (AUC = 0.734, 95% CI:0.623–0.847). Screening of the characteristic variables revealed that the common predictors among the two final prediction models for postoperative AKI due to AAD were baseline SCR, Blood urea nitrogen (BUN) and Uric acid (UA) at admission, and Mechanical ventilation time (MVT). The specific predictors in the TAAAD-AKI model are: White blood cell (WBC), Platelet (PLT) and D dimer at admission, Plasma The specific predictors in the TBAAD-AKI model were N-terminal pro B-type natriuretic peptide (BNP), Serum kalium, Activated partial thromboplastin time (APTT) and Systolic blood pressure (SBP) at admission, and Combined renal arteriography in surgery. In terms of discrimination, the ROC value of the RF model for TAAAD was 0.81 and the ROC value of the LightGBM model for TBAAD was 0.74, both with good accuracy. In terms of calibration, the calibration curve of TAAAD-AKI's RF model fits the ideal curve the best and has the smallest Brier score (0.16). Similarly, the calibration curve of TBAAD-AKI's LightGBM model fits the ideal curve the best and has the smallest Brier score (0.15). In terms of clinical benefit, the best ML models for both types of AAD have good net benefit as shown by Decision Curve Analysis (DCA).

    Conclusion: We successfully constructed and validated clinical prediction models for the occurrence of AKI after surgery in TAAAD and TBAAD patients using different ML algorithms. The main predictors of the two types of AAD-AKI are somewhat different, and the strategies for early prevention and control of AKI are also different and need more external data for validation.
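
    Two steps from the abstract above, the random 70%/30% split and the Brier score used for calibration, can be sketched with the Python standard library. The function names, the seed, and the toy inputs are assumptions for illustration; they are not taken from the study.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split them into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical patient identifiers and predictions.
train, test = train_test_split(range(10))
score = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

    A Brier score of 0 indicates perfect probabilistic predictions, while always predicting 0.5 yields 0.25, so the values of 0.15 and 0.16 reported above sit between those two reference points.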

  8. Table_1_Prediction of recurrence of ischemic stroke within 1 year of...

    • frontiersin.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Jianmo Liu; Yifan Wu; Weijie Jia; Mengqi Han; Yongsen Chen; Jingyi Li; Bin Wu; Shujuan Yin; Xiaolin Zhang; Jibiao Chen; Pengfei Yu; Haowen Luo; Jianglong Tu; Fan Zhou; Xuexin Cheng; Yingping Yi (2023). Table_1_Prediction of recurrence of ischemic stroke within 1 year of discharge based on machine learning MRI radiomics.DOCX [Dataset]. http://doi.org/10.3389/fnins.2023.1110579.s003
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Jianmo Liu; Yifan Wu; Weijie Jia; Mengqi Han; Yongsen Chen; Jingyi Li; Bin Wu; Shujuan Yin; Xiaolin Zhang; Jibiao Chen; Pengfei Yu; Haowen Luo; Jianglong Tu; Fan Zhou; Xuexin Cheng; Yingping Yi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: This study aimed to investigate the value of a machine learning-based magnetic resonance imaging (MRI) radiomics model in predicting the risk of recurrence within 1 year following an acute ischemic stroke (AIS).

    Methods: The MRI and clinical data of 612 patients diagnosed with AIS at the Second Affiliated Hospital of Nanchang University from March 1, 2019, to March 5, 2021, were obtained. The patients were divided into recurrence and non-recurrence groups according to whether they had a recurrent stroke within 1 year after discharge. Randomized splitting was used to divide the data into training and validation sets using a ratio of 7:3. Two radiologists used the 3D-slicer software to label the lesions on brain diffusion-weighted (DWI) MRI sequences. Radiomics features were extracted from the annotated images using the pyradiomics software package, and the features were filtered using the Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis. Four machine learning algorithms, logistic regression (LR), Support Vector Classification (SVC), LightGBM, and Random forest (RF), were used to construct a recurrence prediction model. For each algorithm, three models were constructed based on the MRI radiomics features, clinical features, and combined MRI radiomics and clinical features. The sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC) were used to compare the predictive efficacy of the models.

    Results: Twenty features were selected from 1,037 radiomics features extracted from DWI images. The LightGBM model based on data with three different features achieved the best prediction accuracy from all 4 models in the validation set. The LightGBM model based solely on radiomics features achieved a sensitivity, specificity, and AUC of 0.65, 0.671, and 0.647, respectively, and the model based on clinical data achieved a sensitivity, specificity, and AUC of 0.7, 0.799, and 0.735, respectively. The LightGBM model based on both radiomics and clinical features achieved the best performance, with a sensitivity, specificity, and AUC of 0.85, 0.805, and 0.789, respectively.

    Conclusion: The ischemic stroke recurrence prediction model based on LightGBM achieved the best prediction of recurrence within 1 year following an AIS. The combination of MRI radiomics features and clinical data improved the prediction performance of the model.
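
    The three metrics compared above can be computed from labels and model scores as follows. This is a minimal standard-library sketch; the function names, the 0.5 threshold, and the toy data are illustrative assumptions, and the AUC uses the pairwise (Mann-Whitney) formulation rather than any particular library routine.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical recurrence labels (1 = recurrence) and predicted scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
sens, spec = sensitivity_specificity(labels, predictions)
auc = roc_auc(labels, scores)
```

    Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the abstract reports all three.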

  9. DEM error verified by airborne data.

    • plos.figshare.com
    xls
    Updated Oct 7, 2024
    Cite
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia (2024). DEM error verified by airborne data. [Dataset]. http://doi.org/10.1371/journal.pone.0309025.t005
    Available download formats: xls
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Qinghua Li; Dong Wang; Fengying Liu; Jiachen Yu; Zheng Jia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The accuracy of digital elevation models (DEMs) in forested areas plays a crucial role in canopy height monitoring and ecological sensitivity analysis. Despite extensive research on DEMs in recent years, significant errors still exist in forested areas due to factors such as canopy occlusion, terrain complexity, and limited penetration, posing challenges for subsequent analyses based on DEMs. Therefore, a CNN-LightGBM hybrid model is proposed in this paper, with four different types of forests (tropical rainforest, coniferous forest, mixed coniferous and broad-leaved forest, and broad-leaved forest) selected as study sites to validate the performance of the hybrid model in correcting COP30DEM in different forest area DEMs. In the hybrid model of this paper, the DenseNet architecture was chosen for the CNN, with LightGBM as the primary model. This choice is based on LightGBM's leaf-growth strategy and histogram linking methods, which are effective in reducing the data's memory footprint and utilising more of the data without sacrificing speed. The study uses elevation values from ICESat-2 as ground truth, covering several parameters including COP30DEM, canopy height, forest coverage, slope, terrain roughness and relief amplitude. To validate the superiority of the CNN-LightGBM hybrid model in DEM correction compared to other models, the LightGBM, CNN-SVR, and SVR models are tested within the same sample space. Although common meta-heuristic optimisation algorithms can alleviate issues such as overfitting or underfitting during model training to a certain extent, they still have some shortcomings. To overcome these shortcomings, this paper introduces the Firefly Algorithm-based Sparrow Search Optimization Algorithm (FA-SSA), an improved SSA search algorithm that incorporates the ingestion strategy of the FA algorithm to increase the diversity of solutions and global search capability. By comparing multiple models and validating the data with an airborne LiDAR reference dataset, the results show that the R² (R-squared) of the CNN-LightGBM model improves by more than 0.05 compared to the other models, and that it performs better in the experiments. The FA-SSA-CNN-LightGBM model has the highest accuracy, with an RMSE of 1.09 meters, a reduction of more than 30% in RMSE compared to the LightGBM and other hybrid models. Compared to other forested area DEMs (such as FABDEM and GEDI), its accuracy is improved by more than 50%, and the performance is significantly better than other commonly used DEMs in forested areas, indicating the feasibility of this method in correcting elevation errors in forested area DEMs and its significant importance in advancing global topographic mapping.
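
    The two accuracy measures quoted above, RMSE and R², follow standard definitions and can be computed as below. This is a generic sketch; the function names and the elevation values are made up for illustration and do not come from the dataset.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between reference and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference elevations (m) and corrected DEM elevations (m).
reference = [100.0, 102.0, 104.0]
corrected = [101.0, 102.0, 103.0]
error_m = rmse(reference, corrected)
fit = r_squared(reference, corrected)
```

    RMSE is expressed in the same unit as the elevations (meters here), while R² is unitless, so the two are usually reported together as in the abstract.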

  10. Table_1_Association between TyG index trajectory and new-onset lean NAFLD: a...

    • frontiersin.figshare.com
    docx
    Updated Feb 27, 2024
    Cite
    Haoshuang Liu; Jingfeng Chen; Qian Qin; Su Yan; Youxiang Wang; Jiaoyan Li; Suying Ding (2024). Table_1_Association between TyG index trajectory and new-onset lean NAFLD: a longitudinal study.docx [Dataset]. http://doi.org/10.3389/fendo.2024.1321922.s001
    Available download formats: docx
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Frontiers
    Authors
    Haoshuang Liu; Jingfeng Chen; Qian Qin; Su Yan; Youxiang Wang; Jiaoyan Li; Suying Ding
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The purpose of this manuscript is to identify longitudinal trajectories of changes in triglyceride glucose (TyG) index and investigate the association of TyG index trajectories with risk of lean nonalcoholic fatty liver disease (NAFLD).

    Methods: Using data from 1,109 participants in the Health Management Cohort longitudinal study, we used Latent Class Growth Modeling (LCGM) to develop TyG index trajectories. Using a Cox proportional hazard model, the relationship between TyG index trajectories and incident lean NAFLD was analyzed. Restricted cubic splines (RCS) were used to visually display the dose-response association between TyG index and lean NAFLD. We also deployed machine learning (ML) via Light Gradient Boosting Machine (LightGBM) to predict lean NAFLD, validated by receiver operating characteristic curves (ROCs). The LightGBM model was used to create an online tool for medical use. In addition, NAFLD was assessed by abdominal ultrasound after excluding other liver fat causes.

    Results: The median age of the population was 46.6 years, and 440 (39.68%) of the participants were men. Three distinct TyG index trajectories were identified: “low stable” (TyG index ranged from 7.66 to 7.71, n=206, 18.5%), “moderate stable” (TyG index ranged from 8.11 to 8.15, n=542, 48.8%), and “high stable” (TyG index ranged from 8.61 to 8.67, n=363, 32.7%). Using a “low stable” trajectory as a reference, a “high stable” trajectory was associated with an increased risk of lean-NAFLD (HR: 2.668, 95% CI: 1.098-6.484). After adjusting for baseline age, WC, SBP, BMI, and ALT, HR increased slightly in “moderate stable” and “high stable” trajectories to 1.767 (95% CI:0.730-4.275) and 2.668 (95% CI:1.098-6.484), respectively. RCS analysis showed a significant nonlinear dose-response relationship between TyG index and lean NAFLD risk (χ2 = 11.5, P=0.003). The LightGBM model demonstrated high accuracy (Train AUC 0.870, Test AUC 0.766). An online tool based on our model was developed to assist clinicians in assessing lean NAFLD risk.

    Conclusion: The TyG index serves as a promising noninvasive marker for lean NAFLD, with significant implications for clinical practice and public health policy.
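
    For readers unfamiliar with the TyG index, the commonly used definition is ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2). The abstract does not state which formula or units the study used, so the sketch below is an assumption based on that common definition; the function name and input values are illustrative.

```python
import math


def tyg_index(triglycerides_mg_dl, glucose_mg_dl):
    """TyG index under the common definition: ln(TG * FPG / 2), both in mg/dL."""
    if triglycerides_mg_dl <= 0 or glucose_mg_dl <= 0:
        raise ValueError("triglycerides and glucose must be positive")
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2.0)


# Hypothetical fasting values in mg/dL.
value = tyg_index(150.0, 100.0)
```

    With these inputs the index is ln(7500), roughly 8.92, which is on the same scale as the trajectory ranges reported in the abstract.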

  11. Table_1_Early prediction of body composition parameters on metabolically...

    • frontiersin.figshare.com
    docx
    Updated Aug 29, 2023
    Cite
    Xiujuan Deng; Lin Qiu; Xin Sun; Hui Li; Zejiao Chen; Min Huang; Fangxing Hu; Zhenyi Zhang (2023). Table_1_Early prediction of body composition parameters on metabolically unhealthy in the Chinese population via advanced machine learning.docx [Dataset]. http://doi.org/10.3389/fendo.2023.1228300.s001
    Available download formats: docx
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiujuan Deng; Lin Qiu; Xin Sun; Hui Li; Zejiao Chen; Min Huang; Fangxing Hu; Zhenyi Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Metabolic syndrome (Mets) is considered a global epidemic of the 21st century, predisposing to cardiometabolic diseases. This study aims to describe and compare the body composition profiles of the metabolically healthy (MH) and metabolically unhealthy (MU) phenotypes in the normal-weight and obese populations in China, and to explore the ability of body composition indices to distinguish MU using machine learning algorithms. Methods: A cross-sectional study was conducted, enrolling subjects who came to the hospital for a health examination. Body composition was assessed using a bioelectrical impedance analyser. A model generator with a gradient-boosting tree algorithm (LightGBM), combined with the SHapley Additive exPlanations method, was adapted to train and interpret the model. Receiver-operating characteristic curves were used to analyze the predictive value. Results: Body composition parameters differed significantly between the metabolically healthy normal weight (MHNW), metabolically healthy obesity (MHO), metabolically unhealthy normal weight (MUNW) and metabolically unhealthy obesity (MUO) individuals, especially among the MHNW, MUNW and MUO phenotypes. The MHNW phenotype had significantly lower whole fat mass (FM), trunk FM, trunk fat-free mass (FFM) and visceral fat area than the MUNW and MUO phenotypes. The bioimpedance phase angle, waist-hip ratio (WHR) and fat-free mass index (FFMI) were markedly lower in the MHNW group than in the MUNW and MUO groups, and lower in the MHO group than in the MUO group. For predictive analysis, the LightGBM-based model identified 32 status-predicting features for MUNW with MHNW as the reference, MUO with MHO as the reference and MUO with MHNW as the reference, and achieved high discriminative power, with area under the curve (AUC) values of 0.842 [0.658, 1.000] for MUNW vs. MHNW, 0.746 [0.599, 0.893] for MUO vs. MHO and 0.968 [0.968, 1.000] for MUO vs. MHNW, respectively. A 2-variable model was developed for more practical clinical applications. WHR > 0.92 and FFMI > 18.5 kg/m² predict an increased risk of MU. Conclusion: Body composition measurement and validation of this model could be a valuable approach for the early management and prevention of MU, whether in the obese or the normal-weight population.

  12. f

    Average performance metrics across training, validation, and held-out test...

    • plos.figshare.com
    xls
    Updated Oct 31, 2024
    Maya Wardeh; Jack Pilgrim; Melody Hui; Aurelia Kotsiri; Matthew Baylis; Marcus S. C. Blagrove (2024). Average performance metrics across training, validation, and held-out test sets for all class-balancing ensembles and test set performance for top-10 ensembles, for all routes/modes. [Dataset]. http://doi.org/10.1371/journal.ppat.1012629.t001
    Available download formats: xls
    Dataset updated: Oct 31, 2024
    Dataset provided by: PLOS Pathogens
    Authors: Maya Wardeh; Jack Pilgrim; Melody Hui; Aurelia Kotsiri; Matthew Baylis; Marcus S. C. Blagrove
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The average performance is calculated as the mean over 50 iterations for the training, validation, and test sets, and over 10 iterations for the top-10 ensembles. All metrics except ROC-AUC and PR-AUC were computed at a probability threshold of >0.5. The top-10 ensembles were selected by ranking the class-balancing ensembles for each route/mode (n = 50) by the average of four metrics computed on the test sets (AUC, PR-AUC, PPV/Precision, and adjusted Brier score, i.e. 1 − actual score) and then keeping the best-ranked 20%. Brier scores range from 0 (best performance) to 1 (worst performance), while MCC values range from +1 (best performance) to -1 (worst performance). ± values indicate standard deviation from the mean. Values in square brackets indicate the worst and best performing ensembles, respectively. S4 Dataset provides the average performance metrics (and their standard deviations) across the training, validation, and held-out test sets, as well as the percentage of positive class instances for each route/mode.
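
    The selection rule described above can be sketched roughly as follows. This is an illustration only, not the authors' code: it assumes the four metric values are averaged directly (the description could also mean averaging ranks), and the function names, dictionary keys, and numbers are all hypothetical.

```python
from statistics import mean


def composite_score(metrics):
    """Average the four test-set metrics named in the description.

    The Brier score is flipped to (1 - Brier) so that, like the other
    three metrics, a higher value means better performance.
    """
    return mean([
        metrics["roc_auc"],
        metrics["pr_auc"],
        metrics["ppv"],
        1.0 - metrics["brier"],
    ])


def select_top_ensembles(ensembles, fraction=0.2):
    """Return the names of the best-scoring `fraction` of ensembles."""
    ranked = sorted(
        ensembles,
        key=lambda name: composite_score(ensembles[name]),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical test-set metrics for five ensembles (made-up numbers).
example = {
    "ensemble_a": {"roc_auc": 0.90, "pr_auc": 0.80, "ppv": 0.75, "brier": 0.10},
    "ensemble_b": {"roc_auc": 0.85, "pr_auc": 0.70, "ppv": 0.80, "brier": 0.15},
    "ensemble_c": {"roc_auc": 0.95, "pr_auc": 0.85, "ppv": 0.70, "brier": 0.05},
    "ensemble_d": {"roc_auc": 0.60, "pr_auc": 0.50, "ppv": 0.55, "brier": 0.30},
    "ensemble_e": {"roc_auc": 0.80, "pr_auc": 0.65, "ppv": 0.60, "brier": 0.20},
}

top = select_top_ensembles(example)
print(top)
```

    With 50 ensembles per route/mode and `fraction=0.2`, the same function would keep 10 of them, which matches the "top-10" count in the description.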

