Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: With advances in RNA-seq technology and machine learning, training machine learning models on large-scale RNA-seq data from public databases can identify genes with important regulatory roles that standard linear analyses miss. Finding tissue-specific genes could improve our understanding of the relationship between tissues and genes. However, few machine learning models have been deployed and compared on transcriptome data to identify tissue-specific genes, particularly in plants. Methods: In this study, an expression matrix built from 1,548 maize multi-tissue RNA-seq samples obtained from a public database was processed with a linear model (Limma), a machine learning model (LightGBM), and a deep learning model (CNN), using information gain and the SHAP strategy, to identify tissue-specific genes. For validation, V-measure values were computed from k-means clustering of the resulting gene sets to evaluate their technical complementarity, and GO analysis and literature retrieval were used to verify the functions and research status of these genes. Results: Based on the clustering validation, the convolutional neural network outperformed the other methods with the highest V-measure of 0.647, indicating that its gene set covered the specific properties of the various tissues most completely, whereas LightGBM identified key transcription factors. Combining the three gene sets yielded 78 core tissue-specific genes that the literature had previously shown to be biologically significant. Discussion: The different interpretation strategies of the models produced different tissue-specific gene sets, and researchers may combine methodologies and strategies according to their goals, data types, and computational resources. This study provides comparative insight for large-scale mining of transcriptome datasets and sheds light on handling high dimensionality and bias in bioinformatics data processing.
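For readers who want to see what the LightGBM + SHAP step described above looks like in practice, the following is a minimal sketch only, not the study's actual pipeline: the file names (expression_matrix.csv, tissue_labels.csv), hyperparameters, and the top-50 cutoff are placeholders.

```python
# Minimal sketch: rank genes by mean |SHAP| from a LightGBM tissue classifier.
# File names, label column, and hyperparameters are illustrative placeholders.
import lightgbm as lgb
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split

expr = pd.read_csv("expression_matrix.csv", index_col=0)            # samples x genes (hypothetical file)
labels = pd.read_csv("tissue_labels.csv", index_col=0)["tissue"]    # tissue label per sample

X_train, X_test, y_train, y_test = train_test_split(
    expr, labels.astype("category").cat.codes,
    test_size=0.2, stratify=labels, random_state=0,
)

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)

# TreeExplainer returns per-class SHAP values; average |SHAP| over samples and classes.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):                        # one array per class (older SHAP versions)
    importance = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
else:                                                    # (n_samples, n_features, n_classes)
    importance = np.abs(shap_values).mean(axis=(0, 2))

top_genes = pd.Series(importance, index=expr.columns).sort_values(ascending=False).head(50)
print(top_genes)
```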
🚀 Python package-style code for the dataset - LightGBM and TabNet. This is the code for model training and inference. On Kaggle we normally write notebook-style (.ipynb) code; I converted it into a .py package, which is better suited to launching training from a shell command (see the sketch after the file layout below).
This code is based on the original notebook referenced below; thanks to @chumajin.
[Notebook] Reference Notebook by chumajin
LightGBM
-- config : YAML parameter file for LightGBM
-- models : saved models
-- train.py
-- predict_test.py
-- feature_engineering.py
-- metric.py
-- preprocessing.py
-- seed.py
TabNet
-- preprocessing.py
-- config : tabnet_hyp.yaml / tabnet_config.py
-- models : saved models
-- predict_test.py
-- train.py
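A minimal sketch of what a config-driven train.py for the LightGBM part might look like; the YAML path and keys (train_csv, target, params, num_boost_round) are hypothetical and may differ from the actual package.

```python
# Hypothetical sketch of a config-driven train.py; YAML path and keys are placeholders.
import argparse
import lightgbm as lgb
import pandas as pd
import yaml

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config/lightgbm.yaml")
    args = parser.parse_args()

    with open(args.config) as f:
        cfg = yaml.safe_load(f)                     # e.g. {"params": {...}, "num_boost_round": 1000}

    train = pd.read_csv(cfg["train_csv"])           # hypothetical config key
    X, y = train.drop(columns=[cfg["target"]]), train[cfg["target"]]

    dtrain = lgb.Dataset(X, label=y)
    booster = lgb.train(cfg["params"], dtrain, num_boost_round=cfg.get("num_boost_round", 1000))
    booster.save_model("models/lightgbm.txt")

if __name__ == "__main__":
    main()
```

With a layout like this, training runs from the shell as, for example, `python train.py --config config/lightgbm.yaml`.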
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate information on crown profiles is critical for analyzing biological processes and providing more accurate estimates of carbon balance, which supports sustainable forest management and planning. The similarity between the types of data LSTM algorithms handle and crown profile data makes a compelling case for integrating deep learning into crown profile modeling. The aim of this study was therefore to apply the deep learning method LSTM and its variants to crown profile modeling, using a crown profile database from Pinus yunnanensis secondary forests in Yunnan province, southwest China. Furthermore, SHAP (SHapley Additive exPlanations) was used to interpret the predictions of the ensemble and deep learning models. The results showed that the LSTM variants were competitive with the traditional vanilla LSTM but substantially outperformed the ensemble learning model LightGBM. Specifically, the proposed Hybrid LSTM-LightGBM and Integrated LSTM-LightGBM achieved the best forecasting performance on the training set and the testing set, respectively. Furthermore, the feature importance analysis of LightGBM and the vanilla LSTM showed that more factors contributed significantly to the vanilla LSTM model than to the LightGBM model, which may explain why deep learning outperforms ensemble learning when there are more interrelated features.
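As an illustration only, and not the paper's model, the sketch below shows a generic sequence LSTM regressor in PyTorch of the kind that could map per-position crown predictors to crown radii; the input dimensions, hidden size, and dummy data are all assumptions.

```python
# Generic PyTorch sketch of a sequence LSTM regressor, not the paper's exact model.
# Input: sequences of per-position predictors (e.g. relative depth plus tree-level covariates);
# output: predicted crown radius at each position. Shapes and sizes are illustrative.
import torch
import torch.nn as nn

class CrownLSTM(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)   # (batch, seq_len) predicted radii

model = CrownLSTM(n_features=6)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 20, 6)                  # dummy batch: 32 trees, 20 positions, 6 predictors
y = torch.randn(32, 20)                     # dummy crown radii
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```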
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset comprises cloud masks for 513 subscenes of 1022 × 1022 pixels at 20 m resolution, sampled at random from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from several observations about cloud masking: (i) performance over an entire product is highly correlated, so subscenes provide more value per pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions or hand-select the products used, which introduces a bias that makes the dataset unrepresentative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.
The data were annotated semi-automatically using the IRIS toolkit, which lets users dynamically train a Random Forest (implemented using LightGBM), speeding up annotation by iteratively improving its predictions while preserving the annotator's ability to make final manual changes where needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital for creating a dataset large enough to approximate the statistics of the whole Sentinel-2 archive.
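LightGBM can emulate a random forest through its "rf" boosting mode, which is the style of classifier IRIS retrains interactively; the sketch below shows that mode in isolation with placeholder per-pixel features and is not IRIS's internal code.

```python
# Sketch of LightGBM's random-forest boosting mode (boosting_type="rf");
# the per-pixel features and labels here are dummy placeholders.
import lightgbm as lgb
import numpy as np

X = np.random.rand(10_000, 13)            # e.g. per-pixel band values / indices (dummy data)
y = np.random.randint(0, 3, 10_000)       # 0 = CLEAR, 1 = CLOUD, 2 = CLOUD_SHADOW

clf = lgb.LGBMClassifier(
    boosting_type="rf",                   # random-forest mode
    n_estimators=200,
    subsample=0.8, subsample_freq=1,      # rf mode requires bagging (fraction < 1, freq >= 1)
    colsample_bytree=0.8,
)
clf.fit(X, y)
print(clf.predict(X[:5]))
```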
In addition to the pixel-wise, 3-class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:
Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible due to high-relief terrain or large ambiguities. In total, 424 subscenes were marked with shadows (where present), and 89 have shadows that could not be annotated because of very ambiguous shadow boundaries or terrain that cast significant shadows. If users wish to train an algorithm specifically for cloud shadow masks, we advise removing those 89 images for which shadow annotation was not possible; bear in mind, however, that this will systematically reduce the difficulty of the shadow class compared to real-world use, as these images contain the most difficult shadow examples.
In addition to the 20m sampled subscenes and masks, we also provide users with shapefiles that define the boundary of the mask on the original Sentinel-2 scene. If users wish to retrieve the L1C bands at their original resolutions, they can use these to do so.
Please see the README for further details on the dataset structure and more.
Contributions & Acknowledgements
The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.
Support and advice was provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.
We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.
Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Globally, hypertension (HT) is a substantial risk factor for cardiovascular disease and mortality; hence, rapid identification and treatment of HT are crucial. In this study, we tested the light gradient boosting machine (LightGBM) method for blood pressure stratification based on photoplethysmography (PPG), which is used in most wearable devices. Methods: We used 121 records of PPG and arterial blood pressure (ABP) signals from the Medical Information Mart for Intensive Care III public database. PPG, velocity plethysmography, and acceleration plethysmography were used to estimate blood pressure, and the ABP signals were used to determine the blood pressure stratification categories. Seven feature sets were established and used to train an Optuna-tuned LightGBM model. Three trials compared normotension (NT) vs. prehypertension (PHT), NT vs. HT, and NT + PHT vs. HT. Results: The F1 scores for these three classification trials were 90.18%, 97.51%, and 92.77%, respectively. The results showed that combining multiple features from PPG and its derivatives led to more accurate classification of HT classes than using features from the PPG signal alone. Discussion: The proposed method stratified HT risk with high accuracy, providing a noninvasive, rapid, and robust method for the early detection of HT, with promising applications in wearable cuffless blood pressure measurement.
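The Optuna-plus-LightGBM tuning loop described above generally looks like the sketch below; the feature matrix, search space, and cross-validated F1 scoring are illustrative assumptions rather than the study's exact configuration.

```python
# Sketch of Optuna-tuned LightGBM for a binary blood-pressure class task (e.g. NT vs. HT);
# features, labels, and the search space are placeholders.
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 30)               # placeholder PPG/VPG/APG features
y = np.random.randint(0, 2, 500)          # 0 = NT, 1 = HT (dummy labels)

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    clf = lgb.LGBMClassifier(**params)
    return cross_val_score(clf, X, y, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```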
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: A clinical prediction model for postoperative acute kidney injury (AKI) in patients with Type A acute aortic dissection (TAAAD) and Type B acute aortic dissection (TBAAD) was constructed using machine learning (ML). Methods: Baseline data were collected from acute aortic dissection (AAD) patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2019 and December 31, 2021. (1) Baseline serum creatinine (SCR) estimation methods were identified and used as the basis for diagnosing AKI. (2) The total dataset was randomly divided into a training set (70%) and a test set (30%); features were modeled and validated by bootstrap with multiple ML methods in the training set, and the models with the largest area under the curve (AUC) were selected for follow-up study. (3) The variables of the best ML models were screened with the model visualization tool SHapley Additive exPlanations (SHAP) and recursive feature elimination (RFE). (4) Finally, the pre-screened prediction models were evaluated on the test set in terms of discrimination, calibration, and clinical benefit. Results: The incidence of AKI was 69.4% (120/173) in patients with TAAAD and 28.6% (81/283) in patients with TBAAD. For TAAAD-AKI, the Random Forest (RF) model showed the best prediction performance in the training set (AUC = 0.760, 95% CI: 0.630–0.881), while for TBAAD-AKI the Light Gradient Boosting Machine (LightGBM) model worked best (AUC = 0.734, 95% CI: 0.623–0.847). Screening of the characteristic variables revealed that the predictors common to the two final models of postoperative AAD-AKI were baseline SCR, blood urea nitrogen (BUN) and uric acid (UA) at admission, and mechanical ventilation time (MVT). The predictors specific to the TAAAD-AKI model were white blood cell count (WBC), platelet count (PLT) and D-dimer at admission; the predictors specific to the TBAAD-AKI model were plasma N-terminal pro-B-type natriuretic peptide (BNP), serum potassium, activated partial thromboplastin time (APTT) and systolic blood pressure (SBP) at admission, and combined renal arteriography during surgery. In terms of discrimination, the ROC value of the RF model for TAAAD was 0.81 and that of the LightGBM model for TBAAD was 0.74, both with good accuracy. In terms of calibration, the calibration curve of the TAAAD-AKI RF model fit the ideal curve best and had the smallest Brier score (0.16); similarly, the calibration curve of the TBAAD-AKI LightGBM model fit the ideal curve best and had the smallest Brier score (0.15). In terms of clinical benefit, the best ML models for both types of AAD showed good net benefit in decision curve analysis (DCA). Conclusion: We constructed and validated clinical prediction models for postoperative AKI in TAAAD and TBAAD patients using different ML algorithms. The main predictors of the two types of AAD-AKI differ somewhat, so strategies for early prevention and control of AKI also differ, and more external data are needed for validation.
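A minimal sketch of the kind of model comparison reported above, fitting RF and LightGBM on a 70/30 split and comparing AUC and Brier score; the data, predictors, and hyperparameters are placeholders, not the study's variables.

```python
# Sketch: compare RF and LightGBM on a 70/30 split by AUC and Brier score.
# All data below are dummy placeholders, not the study's clinical variables.
import lightgbm as lgb
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(456, 20)               # placeholder baseline variables
y = np.random.randint(0, 2, 456)          # 1 = postoperative AKI (dummy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

for name, clf in [("RF", RandomForestClassifier(n_estimators=500, random_state=0)),
                  ("LightGBM", lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05))]:
    clf.fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]
    print(name, "AUC=%.3f" % roc_auc_score(y_te, p), "Brier=%.3f" % brier_score_loss(y_te, p))
```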
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Purpose: This study aimed to investigate the value of a machine learning-based magnetic resonance imaging (MRI) radiomics model for predicting the risk of recurrence within 1 year after an acute ischemic stroke (AIS). Methods: MRI and clinical data were obtained for 612 patients diagnosed with AIS at the Second Affiliated Hospital of Nanchang University from March 1, 2019, to March 5, 2021. Patients were divided into recurrence and non-recurrence groups according to whether they had a recurrent stroke within 1 year after discharge. The data were randomly split into training and validation sets at a ratio of 7:3. Two radiologists used the 3D Slicer software to label the lesions on brain diffusion-weighted imaging (DWI) MRI sequences. Radiomics features were extracted from the annotated images with the pyradiomics package and filtered using Least Absolute Shrinkage and Selection Operator (LASSO) regression. Four machine learning algorithms, logistic regression (LR), support vector classification (SVC), LightGBM, and random forest (RF), were used to construct recurrence prediction models. For each algorithm, three models were built: one based on MRI radiomics features, one on clinical features, and one on combined radiomics and clinical features. Sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC) were used to compare the models' predictive efficacy. Results: Twenty features were selected from the 1,037 radiomics features extracted from the DWI images. Across the three feature settings, the LightGBM model achieved the best prediction accuracy of all four algorithms in the validation set. The LightGBM model based solely on radiomics features achieved a sensitivity, specificity, and AUC of 0.65, 0.671, and 0.647, respectively; the model based on clinical data achieved 0.7, 0.799, and 0.735, respectively; and the model based on both radiomics and clinical features performed best, with a sensitivity, specificity, and AUC of 0.85, 0.805, and 0.789, respectively. Conclusion: The LightGBM-based model achieved the best prediction of recurrence within 1 year after an AIS, and combining MRI radiomics features with clinical data improved the model's predictive performance.
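The LASSO-then-LightGBM pipeline described above can be sketched as below; the radiomics matrix, labels, and parameter choices are placeholders, and the fallback when LASSO selects no features is an added safeguard rather than part of the study.

```python
# Sketch: LASSO feature selection on radiomics features followed by a LightGBM classifier.
# All data and parameters are placeholders.
import lightgbm as lgb
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(612, 1037)             # placeholder radiomics features from DWI
y = np.random.randint(0, 2, 612)          # 1 = recurrence within one year (dummy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_tr)
lasso = LassoCV(cv=5, random_state=0).fit(scaler.transform(X_tr), y_tr)
selected = np.flatnonzero(lasso.coef_)    # features with non-zero LASSO coefficients
if selected.size == 0:                    # safeguard for degenerate fits on dummy data
    selected = np.arange(X_tr.shape[1])
print("selected features:", selected.size)

clf = lgb.LGBMClassifier(n_estimators=300).fit(X_tr[:, selected], y_tr)
print("AUC: %.3f" % roc_auc_score(y_te, clf.predict_proba(X_te[:, selected])[:, 1]))
```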
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The accuracy of digital elevation models (DEMs) in forested areas plays a crucial role in canopy height monitoring and ecological sensitivity analysis. Despite extensive research on DEMs in recent years, significant errors remain in forested areas due to factors such as canopy occlusion, terrain complexity, and limited penetration, posing challenges for subsequent DEM-based analyses. A CNN-LightGBM hybrid model is therefore proposed in this paper, with four forest types (tropical rainforest, coniferous forest, mixed coniferous and broad-leaved forest, and broad-leaved forest) selected as study sites to validate its performance in correcting COP30DEM across different forested areas. The hybrid model uses the DenseNet CNN architecture with LightGBM as the primary model; this choice is based on LightGBM's leaf-wise growth strategy and histogram-based method, which reduce the data's memory footprint and make use of more of the data without sacrificing speed. The study uses elevation values from ICESat-2 as ground truth and covers several parameters, including COP30DEM, canopy height, forest coverage, slope, terrain roughness, and relief amplitude. To demonstrate the superiority of the CNN-LightGBM hybrid model for DEM correction, it is compared against a LightGBM model, a CNN-SVR model, and an SVR model within the same sample space. To avoid overfitting or underfitting during model training, and because common meta-heuristic optimisation algorithms only partially address these problems, this paper introduces the Firefly Algorithm-based Sparrow Search Optimization Algorithm (FA-SSA), an improved SSA that incorporates a strategy from the Firefly Algorithm to increase solution diversity and global search capability. Comparing multiple models and validating against an airborne LiDAR reference dataset, the results show that the R2 (R-square) of the CNN-LightGBM model improves by more than 0.05 over the other models. The FA-SSA-CNN-LightGBM model has the highest accuracy, with an RMSE of 1.09 m, a reduction of more than 30% compared with LightGBM and the other hybrid models. Compared with other forested-area DEMs (such as FABDEM and GEDI), its accuracy improves by more than 50%, and it clearly outperforms other DEMs commonly used in forested areas, indicating that the method is feasible for correcting elevation errors in forested-area DEMs and is of significant value for advancing global topographic mapping.
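As a heavily simplified illustration of the stacking idea only (not the paper's DenseNet or FA-SSA pipeline), the sketch below encodes terrain/canopy patches with a small untrained CNN and lets LightGBM regress the DEM elevation error; every input, dimension, and layer choice is a placeholder.

```python
# Heavily simplified CNN + LightGBM hybrid sketch: a small CNN encodes local
# terrain/canopy patches into features, and LightGBM regresses the DEM elevation error.
import lightgbm as lgb
import numpy as np
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    def __init__(self, in_ch: int = 6, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )

    def forward(self, x):
        return self.net(x)

# dummy patches: channels for COP30DEM, canopy height, forest cover, slope, roughness, relief
patches = torch.randn(1000, 6, 16, 16)
dem_error = np.random.randn(1000)          # ICESat-2 elevation minus COP30DEM (dummy values)

encoder = PatchEncoder()
with torch.no_grad():
    feats = encoder(patches).numpy()       # in practice the encoder would be trained first

reg = lgb.LGBMRegressor(n_estimators=300).fit(feats, dem_error)
print(reg.predict(feats[:5]))
```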
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: The purpose of this study was to identify longitudinal trajectories of change in the triglyceride glucose (TyG) index and to investigate the association between TyG index trajectories and the risk of lean nonalcoholic fatty liver disease (NAFLD). Methods: Using data from 1,109 participants in the Health Management Cohort longitudinal study, Latent Class Growth Modeling (LCGM) was applied to derive TyG index trajectories. A Cox proportional hazards model was used to analyze the relationship between TyG index trajectories and incident lean NAFLD, and restricted cubic splines (RCS) were used to visualize the dose-response association between the TyG index and lean NAFLD. We also deployed machine learning (ML) via the Light Gradient Boosting Machine (LightGBM) to predict lean NAFLD, validated by receiver operating characteristic (ROC) curves, and used the LightGBM model to create an online tool for medical use. NAFLD was assessed by abdominal ultrasound after excluding other causes of liver fat. Results: The median age of the population was 46.6 years, and 440 participants (39.68%) were men. Three distinct TyG index trajectories were identified: "low stable" (TyG index 7.66-7.71, n=206, 18.5%), "moderate stable" (TyG index 8.11-8.15, n=542, 48.8%), and "high stable" (TyG index 8.61-8.67, n=363, 32.7%). With the "low stable" trajectory as reference, the "high stable" trajectory was associated with an increased risk of lean NAFLD (HR: 2.668, 95% CI: 1.098-6.484). After adjusting for baseline age, WC, SBP, BMI, and ALT, the HRs for the "moderate stable" and "high stable" trajectories were 1.767 (95% CI: 0.730-4.275) and 2.668 (95% CI: 1.098-6.484), respectively. RCS analysis showed a significant nonlinear dose-response relationship between the TyG index and lean NAFLD risk (χ2 = 11.5, P=0.003). The LightGBM model demonstrated high accuracy (train AUC 0.870, test AUC 0.766), and an online tool based on the model was developed to help clinicians assess lean NAFLD risk. Conclusion: The TyG index is a promising noninvasive marker for lean NAFLD, with significant implications for clinical practice and public health policy.
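The TyG index is commonly defined as ln[fasting triglycerides (mg/dL) × fasting glucose (mg/dL) / 2]; the sketch below implements that common definition and should be checked against the study's own definition before reuse.

```python
# Sketch of the commonly used TyG index definition:
#   TyG = ln( fasting triglycerides [mg/dL] * fasting glucose [mg/dL] / 2 )
# Confirm against the study's definition before reuse.
import math

def tyg_index(triglycerides_mg_dl: float, glucose_mg_dl: float) -> float:
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2)

# Example: TG = 80 mg/dL, FPG = 85 mg/dL
print(round(tyg_index(80, 85), 2))   # ~8.13, in the "moderate stable" range reported above
```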
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Metabolic syndrome (MetS) is considered a global epidemic of the 21st century, predisposing to cardiometabolic diseases. This study aims to describe and compare body composition profiles between metabolically healthy (MH) and metabolically unhealthy (MU) phenotypes in normal-weight and obese populations in China, and to explore how well body composition indices can distinguish MU by building machine learning algorithms. Methods: A cross-sectional study was conducted in subjects who came to the hospital for a health examination. Body composition was assessed using a bioelectrical impedance analyser. A model generator based on a gradient-boosting tree algorithm (LightGBM), combined with the SHapley Additive exPlanations (SHAP) method, was used to train and interpret the model, and receiver operating characteristic curves were used to analyze predictive value. Results: We found significant differences in body composition parameters between the metabolically healthy normal weight (MHNW), metabolically healthy obesity (MHO), metabolically unhealthy normal weight (MUNW) and metabolically unhealthy obesity (MUO) groups, especially among the MHNW, MUNW and MUO phenotypes. The MHNW phenotype had significantly lower whole-body fat mass (FM), trunk FM and trunk fat-free mass (FFM), and significantly lower visceral fat area, compared with the MUNW and MUO phenotypes. The bioimpedance phase angle, waist-hip ratio (WHR) and fat-free mass index (FFMI) were markedly lower in MHNW than in the MUNW and MUO groups, and lower in MHO than in the MUO group. For the predictive analysis, the LightGBM-based model identified 32 status-predicting features for MUNW (with MHNW as the reference), MUO (with MHO as the reference) and MUO (with MHNW as the reference), achieving high discriminative power with area under the curve (AUC) values of 0.842 [0.658, 1.000] for MUNW vs. MHNW, 0.746 [0.599, 0.893] for MUO vs. MHO, and 0.968 [0.968, 1.000] for MUO vs. MHNW. A 2-variable model was developed for more practical clinical application: WHR > 0.92 and FFMI > 18.5 kg/m2 predict an increased risk of MU. Conclusion: Body composition measurement and validation of this model could be a valuable approach for the early management and prevention of MU in both obese and normal-weight populations.
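The 2-variable rule quoted above translates directly into code; treating the two thresholds as a joint (AND) condition follows the abstract's wording but is still an interpretation.

```python
# Direct transcription of the 2-variable rule quoted above (WHR > 0.92 and
# FFMI > 18.5 kg/m^2 indicating increased MU risk); combining the thresholds
# with a logical AND is an assumption based on the abstract's wording.
def high_mu_risk(waist_hip_ratio: float, ffmi_kg_m2: float) -> bool:
    return waist_hip_ratio > 0.92 and ffmi_kg_m2 > 18.5

print(high_mu_risk(0.95, 19.2))   # True: both thresholds exceeded
print(high_mu_risk(0.88, 19.2))   # False: WHR below threshold
```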
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The average performance is calculated as the mean over 50 iterations for the training, validation, and test sets, and over 10 iterations for the top-10 ensembles. Except for ROC-AUC and PR-AUC, all metrics were computed at a >0.5 probability threshold. The top-10 ensembles were selected by ranking each route/mode's class-balancing ensembles (n = 50) by the average of four metrics computed on the test sets (AUC, PR-AUC, PPV/precision, and adjusted Brier score, defined as 1 - actual score) and keeping the best-ranked 20%. Brier scores range from 0 (best performance) to 1 (worst performance), while MCC values range from +1 (best performance) to -1 (worst performance). ± values indicate the standard deviation from the mean, and values in square brackets indicate the worst- and best-performing ensembles, respectively. S4 Dataset provides the average performance metrics (and their standard deviations) across the training, validation, and held-out test sets, as well as the percentage of positive-class instances for each route/mode.
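For reference, the metrics named in this note can be computed with scikit-learn as sketched below, with the "adjusted" Brier score taken as 1 minus the Brier score per the note; the labels and probabilities are dummy values.

```python
# Sketch of the metrics referenced in the note above, computed with scikit-learn.
# Labels and probabilities are dummy values; adjusted Brier = 1 - Brier per the note.
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             matthews_corrcoef, precision_score, roc_auc_score)

y_true = np.random.randint(0, 2, 1000)
y_prob = np.clip(y_true * 0.6 + np.random.rand(1000) * 0.5, 0, 1)   # dummy scores
y_pred = (y_prob > 0.5).astype(int)                                 # >0.5 threshold

print("ROC-AUC       :", roc_auc_score(y_true, y_prob))
print("PR-AUC        :", average_precision_score(y_true, y_prob))
print("PPV/Precision :", precision_score(y_true, y_pred))
print("Adjusted Brier:", 1 - brier_score_loss(y_true, y_prob))
print("MCC           :", matthews_corrcoef(y_true, y_pred))
```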