Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models given the lack of effective treatment. Predictive models based on machine learning algorithms show promising potential, yet it remains controversial whether machine learning methods should be preferred over traditional statistical models.
Methods
We employed logistic regression and six machine learning methods as binary predictive models on a dataset of 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After imputation of missing values, statistical description and comparison were conducted to explore the characteristics of the 73 documented variables. Correlation analysis and feature selection were then performed as preprocessing steps to filter the variables contributing to model development. The models were evaluated by multiple criteria.
Results
We first found that the influential variables screened by the preprocessing steps did not overlap with those identified by statistical differences. Secondly, the most accurate imputation method was K-Nearest Neighbors, and the imputation process had little effect on the performance of the developed models. Finally, the performance of the models was investigated. The random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power, as evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration, as verified by the calibration curve.
Conclusion
Machine learning algorithms can accomplish predictive modeling and demonstrate superior discrimination, while logistic regression calibrates well. Statistical analysis and machine learning are two scientific domains sharing similar themes. The predictive abilities of such models vary with the characteristics of the datasets; larger sample sizes and more influential predictors are still needed to accumulate evidence.
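A minimal sketch of the comparison workflow this abstract describes, using scikit-learn: KNN imputation followed by a logistic regression and one representative machine learning model, scored on discrimination (ROC AUC) and calibration. The synthetic data, model choices, and hyperparameters are placeholder assumptions, not the study's configuration.

```python
# Sketch: KNN imputation, then logistic regression vs. random forest compared
# on discrimination (ROC AUC) and calibration. Data and settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

# Placeholder data standing in for the 733-patient, 73-variable dataset
X, y = make_classification(n_samples=733, n_features=73, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan            # inject ~10% missingness

X_imp = KNNImputer(n_neighbors=5).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    auc = roc_auc_score(y_te, prob)                             # discrimination
    frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)  # calibration
    print(f"{name}: AUC={auc:.3f}")
```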
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. The study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
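A hedged sketch of the ensemble design described above, assuming scikit-learn: three classifiers combined in a soft-voting ensemble behind a KNN imputer. The component models and all parameters are illustrative assumptions; the abstract does not specify them.

```python
# Sketch: a KNN imputer feeding a three-model soft-voting ensemble.
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

ensemble = make_pipeline(
    KNNImputer(n_neighbors=5),
    VotingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000)),
                    ("dt", DecisionTreeClassifier(random_state=0))],
        voting="soft",   # average the predicted probabilities
    ),
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```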
Phylogenetic imputation has recently emerged as a potentially powerful tool for predicting missing data in functional traits datasets. As such, understanding the limitations of phylogenetic modelling in predicting trait values is critical if we are to use them in subsequent analyses. Previous studies have focused on the relationship between phylogenetic signal and clade-level prediction accuracy, yet variability in prediction accuracy among individual tips of phylogenies remains largely unexplored. Here, we used simulations of trait evolution along the branches of phylogenetic trees to show how the accuracy of phylogenetic imputations is influenced by the combined effects of (1) the amount of phylogenetic signal in the traits and (2) the branch length of the tips to be imputed. Specifically, we conducted cross-validation trials to estimate the variability in prediction accuracy among individual tips on the phylogenies (hereafter “tip-level accuracy”). We found that under a Brownian moti...
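A minimal sketch of the simulation device the study relies on, in Python: traits evolving by Brownian motion along the branches of a toy tree, where a tip's variance around its ancestor grows with its branch length. The tree, rate, and node numbering are illustrative.

```python
# Sketch: Brownian-motion trait evolution along a toy tree given as parent
# pointers with branch lengths; node 0 is the root.
import numpy as np

rng = np.random.default_rng(1)
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
blen   = {1: 1.0, 2: 1.0, 3: 0.2, 4: 1.8, 5: 0.5, 6: 1.5}
sigma2 = 1.0                       # BM rate

trait = {0: 0.0}                   # root state
for node in sorted(parent):        # parents precede children in this numbering
    trait[node] = trait[parent[node]] + rng.normal(0.0, np.sqrt(sigma2 * blen[node]))

tips = [3, 4, 5, 6]
print({t: round(trait[t], 3) for t in tips})
# Under BM, tips on long terminal branches (e.g. node 4) drift further from
# their ancestors, which is why their values are harder to impute accurately.
```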
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Efficient real-time prediction of free lime (f-CaO) content is critical for cement clinker quality control. However, process parameters from the distributed control system (DCS) often suffer from significant missing values, while time delays, nonlinear relationships, and coupling effects further hinder prediction accuracy. To overcome these challenges, this study presents a GRU-based f-CaO prediction model. Sixteen key parameters were first selected from 180 DCS-collected variables, then reconstructed and temporally aligned through the integration of empirical formulas, expert knowledge, and the time window principle. An encoder module equipped with a masked-attention mechanism processes missing values and extracts essential features, while the GRU framework predicts the f-CaO content. To mitigate data imbalance issues, a weighted loss function was applied during training, achieving 16.7%, 25.0%, and 17.2% reductions in MAE, MSE, and RMSE, respectively. The model achieves both high accuracy and low latency, meeting the needs of real-time and stable cement production.
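A simplified PyTorch sketch of the architecture outlined above: an encoder whose attention weights are masked at missing time steps, a GRU producing the f-CaO prediction, and a weighted regression loss that up-weights rare samples. Dimensions, the masking granularity, and the weighting rule are assumptions, not the paper's implementation.

```python
# Sketch: mask-aware attention over time steps, then a GRU regressor,
# trained with a weighted MSE to counter data imbalance.
import torch
import torch.nn as nn

class MaskedGRURegressor(nn.Module):
    def __init__(self, n_feat=16, hidden=64):
        super().__init__()
        self.attn = nn.Linear(n_feat, 1)        # per-time-step attention score
        self.gru = nn.GRU(n_feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, observed_mask):
        # x: (batch, time, feat); observed_mask: (batch, time), 1 if observed
        x = torch.nan_to_num(x)                 # zero-fill missing entries
        score = self.attn(x).squeeze(-1)        # (batch, time)
        score = score.masked_fill(observed_mask == 0, float("-inf"))
        w = torch.softmax(score, dim=1).unsqueeze(-1)
        out, _ = self.gru(x * w * x.size(1))    # re-weighted (rescaled by T)
        return self.head(out[:, -1]).squeeze(-1)

def weighted_mse(pred, target, hi_thresh=1.5, hi_weight=3.0):
    # Up-weight rare high f-CaO samples (assumed weighting scheme).
    w = torch.ones_like(target)
    w[target > hi_thresh] = hi_weight
    return (w * (pred - target) ** 2).mean()
```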
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, for whom lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) using a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted testing GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery, in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time, in line with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r = 0.99), and decreased substantially with increasing age difference between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition retained a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA to each other, and both outperformed the generalized ridge regression heteroscedastic effect model for the traits evaluated.
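A sketch, in Python, of two of the marker-imputation ideas compared above: starting from mean imputation and refining it with an iterative singular value decomposition (low-rank) fit of the SNP matrix. The rank and iteration count are illustrative.

```python
# Sketch: iterative SVD imputation of a SNP matrix coded 0/1/2 with NaN gaps.
import numpy as np

def svd_impute(X, rank=10, n_iter=20):
    X = X.astype(float)
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X_hat = np.where(miss, col_mean, X)         # start from mean imputation
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X_hat[miss] = low_rank[miss]            # refresh only the missing cells
    return X_hat
```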
https://spdx.org/licenses/CC0-1.0.html
Aim
Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.
Location
Global
Time period
Contemporary
Major taxa studied
Seabirds
Methods
We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.
Results
We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.
Main conclusions
Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.
Methods
Trait Selection and Data
We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.
Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
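A small pandas sketch of the guild-assignment rule just described: the dominant category is used when its score exceeds 50%, otherwise the species falls back to generalist (foraging) or omnivore (diet). Column names and scores are made up.

```python
# Sketch: assign each species the guild with the highest usage score if > 50%,
# otherwise fall back to the generalist/omnivore category.
import pandas as pd

scores = pd.DataFrame(
    {"diver": [80, 30], "surface": [10, 30], "ground": [5, 20], "generalist": [5, 20]},
    index=["species_A", "species_B"],
)

def assign_guild(row, fallback="generalist"):
    top = row.idxmax()
    return top if row[top] > 50 else fallback

print(scores.apply(assign_guild, axis=1))
# species_A -> diver (80 > 50); species_B -> generalist (no score > 50)
```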
Multiple Imputation
All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).
We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complex relationships, including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from the package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees, a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) to the square root of the number of variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4), a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20) to ensure the imputations finished due to the stopping criterion and not the iteration limit (the imputed datasets generally finished after 4–10 iterations).
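The study uses R's missForest; as a rough Python analogue (an assumption, not the authors' code), scikit-learn's IterativeImputer can be run with random forest regressors, with max_features mirroring mtry = 4 and max_iter mirroring maxiter = 20.

```python
# Sketch: a missForest-like iterative random forest imputer in scikit-learn.
# (Categorical traits would need encoding; R's missForest handles them natively.)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=1000, max_features=4, random_state=0),
    max_iter=20,
    random_state=0,
)
# X has 18 columns: 8 traits + the first 10 phylogenetic eigenvectors
# X_imputed = imputer.fit_transform(X)
```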
Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Diniz-Filho et al., 2012; Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. However, these eigenvectors do not capture fine-scale differences between species (Diniz-Filho et al., 2012), and the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits and could lead to excessive noise (Diniz-Filho et al., 2012; Peres-Neto & Legendre, 2010). Thus, including only the first 10 phylogenetic eigenvectors reduces imputation error and balances the inclusion of detailed phylogenetic information against diluting the information contained in the other traits.
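A numpy sketch of the generic computation behind these eigenvectors: a principal coordinate analysis (classical multidimensional scaling) of the phylogenetic distance matrix, keeping the first 10 axes. The study itself uses the MPSEM package; this is only the textbook PCoA.

```python
# Sketch: classical MDS / PCoA of a patristic distance matrix D (n x n).
import numpy as np

def phylo_eigenvectors(D, k=10):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered Gower matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]              # largest eigenvalues first
    vals, vecs = vals[order][:k], vecs[:, order][:, :k]
    return vecs * np.sqrt(np.maximum(vals, 0))  # scaled coordinates

# eigvec = phylo_eigenvectors(patristic_distance_matrix, k=10)
```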
To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root mean squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35%, generation length = 0.6 ± 0.02%). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97%, foraging guild = 18.0 ± 1.05%, pelagic specialism = 11.2 ± 0.66%, migration status = 18.8 ± 0.58%). Since body mass and habitat breadth had complete trait coverage, they did not require imputation. Low imputation accuracy is reflected in high out-of-bag error values; diet guild had the lowest imputation accuracy, with 28.6% falsely classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Micro-nanoplastics (MNPs) enter biological systems, forming a protein corona (PC) by adsorbing proteins from bodily fluids, influencing their biological effects. Mass spectrometry-based proteomics characterizes PC composition, and recent advances have leveraged protein amino acid sequence-derived features to predict PC formation using a supervised random forest (RF) classifier. However, mass spectrometry often generates substantial missing values (MVs), which may hinder the model’s predictive performance and the understanding of protein–particle interactions. This study assessed the impact of 20 imputation methods on RF classifier performance in predicting human plasma PC formation on polylactic acid (PLA) and photoaged PLA microplastics (MPs), considering their rising ecological and health concerns. The results showed that five left-censored imputation methods (Zero, Half-min, Min, QRILC, GSimp) achieved the best performance, with high accuracy (0.80–0.82), AUC (0.78–0.84), precision (0.78–0.80), and recall (0.97–0.98). Protein spatial features, including secondary sheet structure (negative) and absolute solvent-accessible area (positive), were identified as key factors influencing protein adsorption onto MPs. Additionally, UV aging increased the importance ranking of features frac_aa_S and fraction_exposed_exposed_S, highlighting altered protein–MPs interactions, likely through hydrogen bonding and electrostatic forces. This study demonstrates the potential of left-censored imputation methods in enhancing RF classifier performance for predicting PC formation.
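A sketch of the three simplest left-censored imputations named above (Zero, Half-min, Min) applied column-wise to an intensity matrix; QRILC and GSimp are model-based R methods and are not reproduced here.

```python
# Sketch: left-censored imputation of a proteomics matrix where NaN marks
# values assumed to fall below the detection limit.
import numpy as np

def impute_left_censored(X, method="half-min"):
    X = X.astype(float).copy()
    for j in range(X.shape[1]):                 # per-protein (column) minimum
        col = X[:, j]
        m = np.nanmin(col)
        fill = {"zero": 0.0, "min": m, "half-min": m / 2.0}[method]
        col[np.isnan(col)] = fill
    return X
```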
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Type 2 diabetes mellitus remains a critical global health challenge, with rising incidence rates placing immense pressure on healthcare systems worldwide. This chronic metabolic disorder affects diverse populations, including the elderly and children, leading to severe complications. Early and accurate prediction is essential to mitigate these consequences, yet traditional models often struggle with challenges such as imbalanced datasets, high-dimensional data, missing values, and outliers, resulting in limited predictive performance and interpretability. This study introduces DiabetesXpertNet, an innovative deep learning framework designed to enhance the prediction of Type 2 diabetes mellitus. Unlike existing convolutional neural network models optimized for image data, which focus on generalized attention mechanisms, DiabetesXpertNet is specifically tailored for tabular medical data. It incorporates a convolutional neural network architecture with dynamic channel attention modules to prioritize clinically significant features, such as glucose and insulin levels, and a context-aware feature enhancer to capture complex sequential relationships within structured datasets. The model employs advanced preprocessing techniques, including mean imputation for missing values, median replacement for outliers, and feature selection through mutual information and LASSO regression, to improve dataset quality and computational efficiency. Additionally, a logistic regression-based class weighting strategy addresses class imbalance, enhancing model fairness. Evaluated on the PID and Frankfurt Hospital (Germany) diabetes datasets, DiabetesXpertNet achieves an accuracy of 89.98%, AUC of 91.95%, precision of 89.08%, recall of 88.11%, and F1-score of 88.01%, outperforming existing machine learning and deep learning models. Compared to traditional machine learning approaches, it demonstrates significant improvements in precision (+5.1%), recall (+4.8%), F1-score (+5.1%), accuracy (+6.0%), and AUC (+4.5%). Against other convolutional neural network models, it shows meaningful gains in precision (+2.2%), recall (+1.1%), F1-score (+1.2%), accuracy (+1.9%), and AUC (+0.6%). These results underscore the robustness and interpretability of DiabetesXpertNet, making it a promising tool for early Type 2 diabetes diagnosis in clinical settings.
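A sketch of the preprocessing stages listed above, assuming scikit-learn: mean imputation, median replacement of outliers, then feature selection by mutual information and LASSO. The IQR outlier rule, the number of features kept, and all thresholds are assumptions.

```python
# Sketch: mean imputation -> median outlier replacement -> MI + LASSO selection.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_classif, SelectFromModel
from sklearn.linear_model import LassoCV

def preprocess(X, y, k_mi=12):
    X = SimpleImputer(strategy="mean").fit_transform(X)       # mean imputation
    # Replace outliers (outside 1.5 * IQR) with the column median
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    med = np.median(X, axis=0)
    outlier = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
    X = np.where(outlier, med, X)
    # Keep the k features with highest mutual information, then LASSO-select
    mi = mutual_info_classif(X, y, random_state=0)
    keep = np.argsort(mi)[::-1][:k_mi]
    X = X[:, keep]
    lasso = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)
    return lasso.transform(X)
```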
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Structured expert judgement (SEJ) is a suite of techniques used to elicit expert predictions, e.g. probability predictions of the occurrence of events, for situations in which data are too expensive or impossible to obtain. The quality of expert predictions can be assessed using Brier scores and calibration questions. In practice, these scores are computed from data that may have a correlation structure due to sharing the effects of the same levels of grouping factors of the experimental design. For example, asking experts a common set of questions may result in correlated probability predictions due to shared question effects. Furthermore, experts commonly fail to answer all of the questions posed. Here, we focus on (i) improving the computation of standard error estimates of expert Brier scores by using mixed-effects models that support design-based correlation structures of observations, and (ii) imputation of missing probability predictions in computing expert Brier scores to enhance the comparability of the prediction accuracy of experts. We show that the accuracy of estimating standard errors of expert Brier scores can be improved by incorporating the within-question correlations that arise from asking common questions. We recommend the use of multiple imputation to correct for missing data in expert elicitation exercises. We also discuss the implications of adopting a formal experimental design approach for SEJ exercises.
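A sketch of the Brier-score side of this, with missing probability predictions handled by a deliberately naive multiple imputation (scores averaged over several completions drawn around the expert's observed mean). The imputation model is purely illustrative; the paper's mixed-effects machinery is not reproduced.

```python
# Sketch: an expert's Brier score averaged over multiple naive imputations
# of unanswered questions (NaN entries).
import numpy as np
from sklearn.metrics import brier_score_loss

def expert_brier(probs, outcomes, n_imputations=20, seed=0):
    probs = np.asarray(probs, dtype=float)      # NaN marks unanswered questions
    rng = np.random.default_rng(seed)
    miss = np.isnan(probs)
    scores = []
    for _ in range(n_imputations):
        p = probs.copy()
        draw = rng.normal(np.nanmean(probs), np.nanstd(probs), miss.sum())
        p[miss] = np.clip(draw, 0.0, 1.0)
        scores.append(brier_score_loss(outcomes, p))
    # pooled score and between-imputation spread
    return np.mean(scores), np.std(scores)

mean_bs, sd_bs = expert_brier([0.9, np.nan, 0.2, 0.7], [1, 0, 0, 1])
```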
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diabetes is a chronic disease characterized by abnormally high blood sugar levels. It may affect various organs and tissues, and can even lead to life-threatening complications. Accurate prediction of diabetes can significantly reduce its incidence. However, current prediction methods struggle to accurately capture the essential characteristics of nonlinear data, and their black-box nature hampers clinical application. To address these challenges, we propose KCCAM_DNN, a diabetes prediction method that integrates Kendall’s correlation coefficient and an attention mechanism within a deep neural network. In KCCAM_DNN, Kendall’s correlation coefficient is first employed for feature selection, which effectively filters out key features influencing diabetes prediction. Missing values in the data are imputed using polynomial regression, ensuring data completeness. Subsequently, we construct a deep neural network based on the self-attention mechanism, which assigns greater weight to crucial features affecting diabetes and enhances the model’s predictive performance. Finally, we employ the SHAP model to analyze the impact of each feature on diabetes prediction, augmenting the model’s interpretability. Experimental results show that KCCAM_DNN exhibits superior performance on both the PIMA Indian and LMCH diabetes datasets, achieving test accuracies of 99.090% and 99.333%, respectively, approximately 2% higher than the best existing method. These results suggest that KCCAM_DNN is proficient in diabetes prediction, providing a foundation for informed decision-making in the diagnosis and prevention of diabetes.
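A sketch of the two preprocessing steps named above: Kendall's tau (via scipy) for feature screening, and polynomial regression to impute one column's missing values from a correlated predictor column. The degree, threshold, and pairing of columns are assumptions.

```python
# Sketch: Kendall-tau feature screening + polynomial-regression imputation.
import numpy as np
from scipy.stats import kendalltau

def select_by_kendall(X, y, thresh=0.1):
    taus = np.array([kendalltau(X[:, j], y)[0] for j in range(X.shape[1])])
    return np.where(np.abs(taus) >= thresh)[0]   # indices of retained features

def polyfit_impute(target, predictor, degree=2):
    # Fit target ~ poly(predictor) on observed rows, predict the missing ones.
    miss = np.isnan(target)
    coeffs = np.polyfit(predictor[~miss], target[~miss], degree)
    out = target.copy()
    out[miss] = np.polyval(coeffs, predictor[miss])
    return out
```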
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In response to Taiwan’s rapidly aging population and the rising demand for personalized health care, accurately assessing individual physiological aging has become an essential area of study. This research utilizes health examination data to propose a machine learning-based biological age prediction model that quantifies physiological age through residual life estimation. The model leverages LightGBM, which shows an 11.40% improvement in predictive performance (R-squared) compared to the XGBoost model. In the experiments, the use of MICE imputation for missing data significantly enhanced prediction accuracy, resulting in a 23.35% improvement in predictive performance. Kaplan-Meier (K-M) survival analysis revealed that the model effectively differentiates between groups with varying health levels, underscoring the validity of biological age as a health status indicator. Additionally, the model identified the ten biomarkers most influential in aging for both men and women, with a 69.23% overlap with Taiwan’s leading causes of death and previously identified top health-impact factors, further validating its practical relevance. Based on SHAP and PCC interpretations, the model provides multidimensional health recommendations; if implemented, these recommendations could potentially extend life expectancy for 64.58% of individuals. This study provides new methodological support and data backing for precision health interventions and life extension.
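A sketch of the modelling pipeline described above, assuming scikit-learn and the lightgbm package: MICE-style iterative imputation followed by a LightGBM regressor for residual life. Hyperparameters are illustrative, not the study's tuned values.

```python
# Sketch: MICE-style imputation, then a LightGBM regressor on residual life.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from lightgbm import LGBMRegressor

imputer = IterativeImputer(sample_posterior=True, random_state=0)  # MICE-like
model = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=0)

# X_imp = imputer.fit_transform(X_health_exam)
# model.fit(X_imp, residual_life)   # target: estimated remaining lifetime
```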
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cervical cancer remains a leading cause of female mortality, particularly in developing regions, underscoring the critical need for early detection and intervention guided by skilled medical professionals. While Pap smear images serve as valuable diagnostic tools, many available datasets for automated cervical cancer detection contain missing data, posing challenges for machine learning models’ efficacy. To address these hurdles, this study presents an automated system adept at managing missing information using ADASYN characteristics, resulting in exceptional accuracy. The proposed methodology integrates a voting classifier model harnessing the predictive capacity of three distinct machine learning models. It further incorporates SVM Imputer and ADASYN up-sampled features to mitigate missing value concerns, while leveraging CNN-generated features to augment the model’s capabilities. Notably, this model achieves remarkable performance metrics, boasting a 99.99% accuracy, precision, recall, and F1 score. A comprehensive comparative analysis evaluates the proposed model against various machine learning algorithms across four scenarios: original dataset usage, SVM imputation, ADASYN feature utilization, and CNN-generated features. Results indicate the superior efficacy of the proposed model over existing state-of-the-art techniques. This research not only introduces a novel approach but also offers actionable suggestions for refining automated cervical cancer detection systems. Its impact extends to benefiting medical practitioners by enabling earlier detection and improved patient care. Furthermore, the study’s findings have substantial societal implications, potentially reducing the burden of cervical cancer through enhanced diagnostic accuracy and timely intervention.
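A sketch of the missing-value and imbalance handling named above: an SVM-based imputer (SVR as the per-feature estimator inside scikit-learn's IterativeImputer) and ADASYN up-sampling from imbalanced-learn. Parameters are assumptions; the CNN feature extractor is omitted.

```python
# Sketch: SVM-based iterative imputation, then ADASYN oversampling.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import SVR
from imblearn.over_sampling import ADASYN

svm_imputer = IterativeImputer(estimator=SVR(kernel="rbf"), max_iter=10,
                               random_state=0)
# X_imp = svm_imputer.fit_transform(X)
# X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_imp, y)
```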
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ryegrass single plants, bi-parental family pools, and multi-parental family pools are often genotyped based on allele frequencies using genotyping-by-sequencing (GBS) assays. GBS assays can be performed at low coverage depth to reduce costs. However, reducing the coverage depth leads to a higher proportion of missing data and to less accurate allele-frequency estimates at each locus. As a consequence of the latter, genomic relationship matrices (GRMs) will be biased. This bias in GRMs affects variance estimates and the accuracy of GBLUP for genomic prediction (GBLUP-GP). We derived equations that describe the bias from low-coverage sequencing as an effect of binomial sampling of sequence reads, allowing for any ploidy level of the sample considered. This allowed us to combine individual and pool genotypes in one GRM, treating a pool genotype as a polyploid genotype whose ploidy equals the total ploidy level of the pool’s parents. Using simulated data, we verified the magnitude of the GRM bias at different coverage depths for three kinds of ryegrass breeding material: individual genotypes from single plants, pool genotypes from F2 families, and pool genotypes from synthetic varieties. To better handle missing data, we also tested imputation procedures suited to allele-frequency genomic data. The relative advantages of bias correction and imputation of missing data were evaluated using real data. We examined a large dataset, including single plants, F2 families, and synthetic varieties genotyped in three GBS assays, each with a different coverage depth, and evaluated them for heading date, crown rust resistance, and seed yield. Cross-validation was used to test predictive accuracy with GBLUP approaches, demonstrating the feasibility of predicting among different breeding materials. Bias-corrected GRMs increased predictive accuracy compared with standard approaches to constructing GRMs. Among the imputation methods we tested, the random forest method yielded the highest predictive accuracy. Combining the two methods resulted in a meaningful increase in predictive ability (up to 0.09). The possibility of predicting across individuals and pools provides new opportunities for improving ryegrass breeding schemes.
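For orientation, a numpy sketch of a generic VanRaden-style GRM built from sample allele frequencies, generalised to an arbitrary ploidy level m; the paper's binomial-sampling bias correction for low coverage is not reproduced here.

```python
# Sketch: a VanRaden-style GRM from allele frequencies at ploidy m.
import numpy as np

def grm_from_freqs(F, p, m=2):
    """F: n_samples x n_loci matrix of sample allele frequencies in [0, 1];
    p: population allele frequencies per locus; m: total ploidy of the sample.
    For diploids with dosages X in {0,1,2}, F = X/2 recovers the usual GRM."""
    Z = m * (F - p)                     # centred allele dosages
    denom = m * np.sum(p * (1 - p))     # expected variance scaling
    return Z @ Z.T / denom
```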
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Real-time soil matric potential (SMP) measurements are currently used in precision irrigation to determine water availability for potato production. It is well known that managing irrigation based on SMP helps increase water use efficiency and reduce the environmental impact of the crop. Yet, SMP monitoring presents challenges and sometimes leads to gaps in the collected data. This research sought to address these gaps in the SMP time series. Using meteorological and field measurements, we developed a filtering and imputation algorithm that implements three prominent predictive models to estimate missing values. Over 2 months, we gathered hourly SMP values from a field north of the Péribonka River in Lac-Saint-Jean, Québec, Canada. Our study evaluated various input combinations, including only meteorological data, only SMP measurements, or a mix of both. The Extreme Learning Machine (ELM) model proved the most effective among the tested models, outperforming the k-Nearest Neighbors (kNN) model and the Evolutionary Optimized Inverse Distance Method (gaIDW). The ELM model with five inputs comprising SMP measurements achieved a correlation coefficient of 0.992, a root-mean-square error of 0.164 cm, a mean absolute error of 0.122 cm, and a Nash-Sutcliffe efficiency of 0.983. The ELM model requires at least five inputs to achieve the best results in the study context; these can be meteorological inputs like relative humidity and dew temperature, land inputs, or a combination of both, with results within 5% of the best-performing input combination identified earlier. To mitigate the computational demands of these models, a quicker baseline model can be used for initial input filtering; with this method, we expect the output of simpler models such as gaIDW and kNN to vary by no more than 20%. Nevertheless, this discrepancy can be efficiently managed by leveraging more sophisticated models.
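A minimal numpy sketch of the Extreme Learning Machine used above: a fixed random hidden layer followed by a single least-squares solve for the output weights. Hidden size and activation are illustrative choices.

```python
# Sketch: an Extreme Learning Machine (random hidden layer + least squares).
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        # Hidden weights are drawn once at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# elm = ELM().fit(X_train, smp_train); gap_fill = elm.predict(X_gap_rows)
```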