23 datasets found
  1. Table_3_Comparison of machine learning and logistic regression as predictive...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 13, 2023
    Cite
    Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao (2023). Table_3_Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study.XLSX [Dataset]. http://doi.org/10.3389/fcvm.2022.959649.s005
    Dataset provided by
    Frontiers
    Authors
    Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models because effective treatments are lacking. Predictive models based on machine learning algorithms show promising potential, yet it remains controversial whether machine learning methods should be preferred over traditional statistical models.

    Methods: We employed logistic regression and six machine learning methods as binary predictive models on a dataset of 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After imputation of missing values, a preliminary statistical description and comparison were conducted to explore the characteristics of the 73 documented variables. Correlation analysis and feature selection were then performed as preprocessing steps to filter the contributing variables used to develop the models. The models were evaluated by multiple criteria.

    Results: First, the influential variables screened by the preprocessing steps did not overlap with those identified by statistical differences. Second, the most accurate imputation method was K-Nearest Neighbor, and the imputation process had little effect on the performance of the developed models. Finally, among the models investigated, the random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power, as evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration, as verified by the calibration curve.

    Conclusion: Machine learning algorithms can accomplish predictive modeling and demonstrate superior discrimination, while logistic regression can be calibrated well. Statistical analysis and machine learning are two scientific domains that share similar themes. The predictive ability of the developed models varies with the characteristics of the dataset; larger sample sizes and more influential predictors are still needed to accumulate evidence.
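    The abstract reports K-Nearest Neighbor as the most accurate of the imputation methods compared. The idea can be sketched in a few lines of Python (a toy illustration with made-up numbers, not the study's code): each incomplete row borrows the column mean of its k nearest complete rows.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean of the k nearest complete rows
    (Euclidean distance computed over the columns observed in both rows)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        miss = [j for j, v in enumerate(row) if v is None]
        if not miss:
            continue
        # candidate donors: rows with no missing values
        donors = [r for r in rows if all(v is not None for v in r)]

        def dist(r):
            shared = [(a, b) for a, b in zip(row, r) if a is not None]
            return math.sqrt(sum((a - b) ** 2 for a, b in shared))

        donors.sort(key=dist)
        nearest = donors[:k]
        for j in miss:
            filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [5.0, 6.0, 7.0],
    [1.05, None, 3.0],   # missing value to impute
]
imputed = knn_impute(data, k=2)
```

    The distance is computed only over columns observed in the incomplete row, which is the standard trick that lets rows with different missingness patterns be compared.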

  2. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Dataset provided by
    HKU Data Repository
    Authors
    Wen Ma
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    1. NCCTG lung cancer dataset: survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

    2. CNV measurements of GBM: this dataset records copy number variation (CNV) in glioblastoma (GBM).

    Abstract: In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called "imputation". Existing imputation methods establish a model based on the mechanism behind the missing values, and they work well under two assumptions: 1) the data are missing completely at random, and 2) the percentage of missing values is not high. Neither holds for biomedical datasets such as the Cancer Genome Atlas Glioblastoma copy-number variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group lung cancer dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: Last Observation Carried Forward (LOCF) and K-Nearest Neighbors (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset, which has fewer columns, and failed when the dataset contained 45% missing data. The quality of the values imputed by existing methods is poor because these datasets do not satisfy the two assumptions. In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values.
    RBM is an undirected, probabilistic, parameterized two-layer neural network model, often used for extracting abstract information from data, especially high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG and 2) TCGA. The running time and root mean squared error (RMSE) of the different methods were measured. The benchmarks for the NCCTG dataset show that our method performs better than the others when 5% of the data is missing, with an RMSE 4.64 lower than the best KNN result. For the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN result. In addition to imputation, RBM can make predictions simultaneously. We compared the RBM model with four traditional prediction methods, measuring running time and area under the curve (AUC) to evaluate performance. Our RBM-based approach outperformed the traditional methods: the AUC was up to 19.8% higher than the multivariate logistic regression model on the NCCTG lung cancer dataset and 28.1% higher than the Cox proportional hazards regression model on the TCGA dataset. Apart from imputation and prediction, RBM models can detect outliers in one pass, because all inputs in the visible layer are reconstructed within a single backward pass. Our results show that RBM models achieve higher precision and recall in detecting outliers than other methods.
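    The visible-hidden structure the abstract describes can be sketched minimally in Python (a toy model with random weights, not the authors' implementation): one forward pass yields hidden probabilities, and one backward pass reconstructs every visible unit at once, which is what makes single-pass imputation and outlier scoring possible.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyRBM:
    """Minimal binary RBM: visible -> hidden, then hidden -> visible.
    The backward pass reconstructs all visible units simultaneously."""

    def __init__(self, n_vis, n_hid, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0, 0.1) for _ in range(n_hid)] for _ in range(n_vis)]
        self.b_vis = [0.0] * n_vis
        self.b_hid = [0.0] * n_hid

    def hidden_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def reconstruct(self, v):
        h = self.hidden_probs(v)
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(v))]

rbm = TinyRBM(n_vis=4, n_hid=3)
recon = rbm.reconstruct([1, 0, 1, 0])  # probabilities for every visible unit
```

    Training (e.g. by contrastive divergence) is omitted; the sketch only shows why a fitted RBM can fill missing visible units and flag poorly reconstructed ones in a single forward/backward step.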
  3. Results of the ML models using PCA imputer.

    • plos.figshare.com
    xls
    Updated Jan 3, 2024
    Cite
    Turki Aljrees (2024). Results of the ML models using PCA imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t006
    Dataset provided by
    PLOS ONE
    Authors
    Turki Aljrees
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. This study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
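    A majority-vote ensemble of the kind described can be sketched in Python (hypothetical class labels; the study's actual stacked classifier and KNN imputer are not reproduced here):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model class predictions by majority vote; ties fall back
    to the first model's prediction (a simplifying assumption)."""
    top = Counter(predictions).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return predictions[0]
    return top[0][0]

# hypothetical per-sample outputs of three fitted classifiers
model_outputs = [
    [1, 1, 0],  # sample 1: two of three models predict class 1
    [0, 0, 1],  # sample 2: majority predicts class 0
]
ensemble = [majority_vote(p) for p in model_outputs]  # -> [1, 0]
```

    In the study the three base models would each be fit on the imputed feature matrix before voting; the sketch only shows the combination step.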

  4. Assessing among-lineage variability in phylogenetic imputation of functional...

    • search.dataone.org
    • zenodo.org
    • +1 more
    Updated Jul 4, 2025
    Cite
    Rafael Molina-Venegas; Juan Carlos Moreno-Saiz; Isabel Castro Castro; T. Jonathan Davies; Pedro R. Peres-Neto; Miguel Á. Rodriguez; Rafael Molina-Venegas (2025). Assessing among-lineage variability in phylogenetic imputation of functional trait datasets [Dataset]. http://doi.org/10.5061/dryad.12111
    Dataset provided by
    Dryad Digital Repository
    Authors
    Rafael Molina-Venegas; Juan Carlos Moreno-Saiz; Isabel Castro Castro; T. Jonathan Davies; Pedro R. Peres-Neto; Miguel Á. Rodriguez; Rafael Molina-Venegas
    Time period covered
    Jan 23, 2018
    Description

    Phylogenetic imputation has recently emerged as a potentially powerful tool for predicting missing data in functional trait datasets. As such, understanding the limitations of phylogenetic modelling in predicting trait values is critical if we are to use them in subsequent analyses. Previous studies have focused on the relationship between phylogenetic signal and clade-level prediction accuracy, yet variability in prediction accuracy among individual tips of phylogenies remains largely unexplored. Here, we used simulations of trait evolution along the branches of phylogenetic trees to show how the accuracy of phylogenetic imputations is influenced by the combined effects of (1) the amount of phylogenetic signal in the traits and (2) the branch length of the tips to be imputed. Specifically, we conducted cross-validation trials to estimate the variability in prediction accuracy among individual tips on the phylogenies (hereafter "tip-level accuracy"). We found that under a Brownian moti...
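    The core of such a simulation, the dependence of tip values on terminal branch length under Brownian motion, can be sketched in Python (an illustrative fragment, not the authors' pipeline): the trait change along a branch is Gaussian with variance proportional to branch length, so tips on long terminal branches are inherently harder to impute from their relatives.

```python
import random

def simulate_bm_tip(root_value, branch_length, sigma2=1.0, seed=42):
    """Simulate a trait value at a tip under Brownian motion: the change
    from the ancestral value is Gaussian with variance sigma2 * branch_length."""
    rng = random.Random(seed)
    return root_value + rng.gauss(0.0, (sigma2 * branch_length) ** 0.5)

# same random seed, so the only difference is the branch-length scaling
short_tip = simulate_bm_tip(0.0, branch_length=0.01)
long_tip = simulate_bm_tip(0.0, branch_length=100.0)
```

    With identical seeds the draw is the same standardized deviate, scaled by the square root of branch length, which makes the branch-length effect on tip predictability explicit.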

  5. Data from: Prediction of f-CaO content in cement clinker using a GRU-based...

    • tandf.figshare.com
    xlsx
    Updated Jun 18, 2025
    Cite
    Xinyue Yao; Xuehong Ren; Jiayuan Ye; Fuli Cao; Chengwen Xu; Munan Zhai; Wensheng Zhang (2025). Prediction of f-CaO content in cement clinker using a GRU-based deep learning model with masked-attention mechanism for incomplete DCS data [Dataset]. http://doi.org/10.6084/m9.figshare.29203416.v1
    Dataset provided by
    Taylor & Francis
    Authors
    Xinyue Yao; Xuehong Ren; Jiayuan Ye; Fuli Cao; Chengwen Xu; Munan Zhai; Wensheng Zhang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Efficient real-time prediction of free lime (f-CaO) content is critical for cement clinker quality control. However, process parameters from the distributed control system (DCS) often suffer from significant missing values, while time delays, nonlinear relationships, and coupling effects further hinder prediction accuracy. To overcome these challenges, this study presents a GRU-based f-CaO prediction model. Sixteen key parameters were first selected from 180 DCS-collected variables, then reconstructed and temporally aligned through the integration of empirical formulas, expert knowledge, and the time window principle. An encoder module equipped with a masked-attention mechanism processes missing values and extracts essential features, while the GRU framework predicts the f-CaO content. To mitigate data imbalance issues, a weighted loss function was applied during training, achieving 16.7%, 25.0%, and 17.2% reductions in MAE, MSE, and RMSE, respectively. The model achieves both high accuracy and low latency, meeting the needs of real-time and stable cement production.
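    The masked, weighted loss idea can be sketched in Python (toy values; the paper's GRU and attention modules are not reproduced): positions with missing targets are masked out of the loss, and per-sample weights counter the imbalance between common and rare f-CaO levels.

```python
def masked_weighted_mse(preds, targets, mask, weights):
    """Mean squared error over observed positions only (mask == 1),
    with per-sample weights to counter data imbalance."""
    num = sum(w * m * (p - t) ** 2
              for p, t, m, w in zip(preds, targets, mask, weights))
    den = sum(w * m for m, w in zip(mask, weights))
    return num / den if den else 0.0

loss = masked_weighted_mse(
    preds=[1.0, 2.0, 9.9],
    targets=[1.5, 2.0, 0.0],
    mask=[1, 1, 0],        # third target missing -> excluded from the loss
    weights=[1.0, 2.0, 1.0],
)
```

    Masking at the loss (and, in the paper, inside the attention) lets the model train on incomplete DCS records without first imputing them.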

  6. Data from: A comparison of genomic selection models across time in interior...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    Updated May 27, 2022
    Cite
    Blaise Ratcliffe; Omnia Gamal El-Dien; Jaroslav Klápště; Ilga Porth; Charles Chen; Barry Jaquish; Yousry A. El-Kassaby (2022). Data from: A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods [Dataset]. http://doi.org/10.5061/dryad.m4vh4
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blaise Ratcliffe; Omnia Gamal El-Dien; Jaroslav Klápště; Ilga Porth; Charles Chen; Barry Jaquish; Yousry A. El-Kassaby
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31–0.55) were observed, sufficient to deliver an improved selection response over TS. Additionally, PA varied substantially through time in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r = 0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA to each other, and better PA than the generalized ridge regression heteroscedastic effect model, for the traits evaluated.
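    Mean imputation, the baseline that KNN and SVD outperformed here, can be sketched in Python for an unordered SNP matrix (toy 0/1/2 genotype calls, not the study's data): each missing call is replaced by the per-marker mean of the observed allele counts.

```python
def mean_impute_snps(geno):
    """Replace missing genotype calls (None) in a samples x markers matrix
    with the per-marker mean of the observed 0/1/2 allele counts."""
    n_markers = len(geno[0])
    out = [list(row) for row in geno]
    for j in range(n_markers):
        observed = [row[j] for row in geno if row[j] is not None]
        col_mean = sum(observed) / len(observed)
        for row in out:
            if row[j] is None:
                row[j] = col_mean
    return out

geno = [
    [0, 2, None],  # sample 1, missing call at marker 3
    [1, None, 2],  # sample 2, missing call at marker 2
    [2, 0, 2],
]
imputed_geno = mean_impute_snps(geno)
```

    Mean imputation ignores linkage between markers, which is one plausible reason KNN and SVD (which borrow information across correlated markers) retained more SNPs and predicted better in the study.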

  7. Data from: Biological traits of seabirds predict extinction risk and...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Mar 16, 2021
    Cite
    Cerren Richards; Robert Cooke; Amanda Bates (2021). Biological traits of seabirds predict extinction risk and vulnerability to anthropogenic threats [Dataset]. http://doi.org/10.5061/dryad.x69p8czhd
    Dataset provided by
    Memorial University of Newfoundland
    University of Gothenburg
    Authors
    Cerren Richards; Robert Cooke; Amanda Bates
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Aim

    Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.

    Location

    Global

    Time period

    Contemporary

    Major taxa studied

    Seabirds

    Methods

    We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.

    Results

    We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.

    Main conclusions

    Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.

    Methods ​​​​Trait Selection and Data

    We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.

    Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.

    Multiple Imputation

    All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).

    We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees, a cautiously large number, to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square root of the number of variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4), a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20) to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4–10 iterations).

    Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.

    Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Jose Alexandre F. Diniz-Filho et al., 2012; José Alexandre Felizola Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. Still, these eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012), however the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.

    To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root mean squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35%, generation length = 0.6 ± 0.02%). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97%, foraging guild = 18.0 ± 1.05%, pelagic specialism = 11.2 ± 0.66%, migration status = 18.8 ± 0.58%). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. Low imputation accuracy is reflected in high out-of-bag error values; diet guild had the lowest imputation accuracy, with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more
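    The out-of-bag error metric used for the continuous traits, normalized RMSE, can be sketched in Python (illustrative numbers only, not the seabird data): the RMSE of imputed versus true values is scaled by the range of the true values so errors are comparable across traits measured on different scales.

```python
import math

def nrmse(true_vals, imputed_vals):
    """Normalized root mean squared error: RMSE divided by the range of
    the true values, so traits on different scales are comparable."""
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / len(true_vals)
    spread = max(true_vals) - min(true_vals)
    return math.sqrt(mse) / spread

# one value mis-imputed by 0.5 over a range of 3.0
err = nrmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 4.0])
```

    Normalization conventions vary (some divide by the standard deviation rather than the range); the range version is a common choice for cross-trait comparison.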

  8. Data from: Predicting Protein Corona Formation on Polylactic Acid...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Mar 5, 2025
    Cite
    Xuri Wu; Liping Huang; Lina Zhou; Yan Fang; Feng Tan (2025). Predicting Protein Corona Formation on Polylactic Acid Microplastics Pre- and Post-Photoaging: The Importance of Optimal Imputation Methods [Dataset]. http://doi.org/10.1021/acs.estlett.5c00183.s002
    Dataset provided by
    ACS Publications
    Authors
    Xuri Wu; Liping Huang; Lina Zhou; Yan Fang; Feng Tan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Micro-nanoplastics (MNPs) enter biological systems, forming a protein corona (PC) by adsorbing proteins from bodily fluids, influencing their biological effects. Mass spectrometry-based proteomics characterizes PC composition, and recent advances have leveraged protein amino acid sequence-derived features to predict PC formation using a supervised random forest (RF) classifier. However, mass spectrometry often generates substantial missing values (MVs), which may hinder the model’s predictive performance and the understanding of protein–particle interactions. This study assessed the impact of 20 imputation methods on RF classifier performance in predicting human plasma PC formation on polylactic acid (PLA) and photoaged PLA microplastics (MPs), considering their rising ecological and health concerns. The results showed that five left-censored imputation methods (Zero, Half-min, Min, QRILC, GSimp) achieved the best performance, with high accuracy (0.80–0.82), AUC (0.78–0.84), precision (0.78–0.80), and recall (0.97–0.98). Protein spatial features, including secondary sheet structure (negative) and absolute solvent-accessible area (positive), were identified as key factors influencing protein adsorption onto MPs. Additionally, UV aging increased the importance ranking of features frac_aa_S and fraction_exposed_exposed_S, highlighting altered protein–MPs interactions, likely through hydrogen bonding and electrostatic forces. This study demonstrates the potential of left-censored imputation methods in enhancing RF classifier performance for predicting PC formation.
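    Left-censored imputation assumes values are missing because they fell below the detection limit; the Half-min variant named above can be sketched in Python (toy intensities, not the study's proteomics data):

```python
def half_min_impute(intensities):
    """Left-censored imputation: replace missing values (None) with half
    the minimum observed intensity, on the assumption that the value was
    missing because it fell below the detection limit."""
    observed = [v for v in intensities if v is not None]
    fill = min(observed) / 2.0
    return [fill if v is None else v for v in intensities]

column = [8.0, None, 6.0, 10.0, None]  # one protein's intensity column
filled = half_min_impute(column)       # -> [8.0, 3.0, 6.0, 10.0, 3.0]
```

    Zero and Min imputation differ only in the fill value (0 or the observed minimum); all three encode the same below-detection-limit assumption, which is why they behave similarly in the study's benchmarks.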

  9. Feature Importance Based on MI Scores.

    • plos.figshare.com
    xls
    Updated Sep 30, 2025
    Cite
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri (2025). Feature Importance Based on MI Scores. [Dataset]. http://doi.org/10.1371/journal.pone.0330454.t008
    Dataset provided by
    PLOS ONE
    Authors
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Type 2 diabetes mellitus remains a critical global health challenge, with rising incidence rates placing immense pressure on healthcare systems worldwide. This chronic metabolic disorder affects diverse populations, including the elderly and children, leading to severe complications. Early and accurate prediction is essential to mitigate these consequences, yet traditional models often struggle with challenges such as imbalanced datasets, high-dimensional data, missing values, and outliers, resulting in limited predictive performance and interpretability. This study introduces DiabetesXpertNet, an innovative deep learning framework designed to enhance the prediction of Type 2 diabetes mellitus. Unlike existing convolutional neural network models optimized for image data, which focus on generalized attention mechanisms, DiabetesXpertNet is specifically tailored for tabular medical data. It incorporates a convolutional neural network architecture with dynamic channel attention modules to prioritize clinically significant features, such as glucose and insulin levels, and a context-aware feature enhancer to capture complex sequential relationships within structured datasets. The model employs advanced preprocessing techniques, including mean imputation for missing values, median replacement for outliers, and feature selection through mutual information and LASSO regression, to improve dataset quality and computational efficiency. Additionally, a logistic regression-based class weighting strategy addresses class imbalance, enhancing model fairness. Evaluated on the PID and Frankfurt Hospital (Germany) diabetes datasets, DiabetesXpertNet achieves an accuracy of 89.98%, AUC of 91.95%, precision of 89.08%, recall of 88.11%, and F1-score of 88.01%, outperforming existing machine learning and deep learning models.
Compared to traditional machine learning approaches, it demonstrates significant improvements in precision (+5.1%), recall (+4.8%), F1-score (+5.1%), accuracy (+6.0%), and AUC (+4.5%). Against other convolutional neural network models, it shows meaningful gains in precision (+2.2%), recall (+1.1%), F1-score (+1.2%), accuracy (+1.9%), and AUC (+0.6%). These results underscore the robustness and interpretability of DiabetesXpertNet, making it a promising tool for early Type 2 diabetes diagnosis in clinical settings.
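
The preprocessing recipe described above (mean imputation for missing values, then median replacement for outliers) can be sketched generically. The 3-standard-deviation outlier rule and the `preprocess` helper below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def preprocess(X, z=3.0):
    """Hypothetical sketch: mean imputation for missing values, then median
    replacement for outliers -- here, cells more than `z` standard deviations
    from the column median. The abstract does not give the outlier threshold,
    so z=3 is an assumption."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]                              # view into X
        col[np.isnan(col)] = np.nanmean(X[:, j])   # mean imputation
        med, sd = np.median(col), col.std()
        if sd > 0:
            col[np.abs(col - med) > z * sd] = med  # median replacement of outliers
    return X

# toy example: one missing glucose-like value in the second column
X = np.array([[1.0, np.nan], [2.0, 5.0], [3.0, 6.0], [4.0, 7.0]])
Xp = preprocess(X)
```

The missing cell is filled with the mean of the observed values in its column; outlier handling only triggers for columns with non-zero spread.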

  10.

    Final Features Selected by LassoR.

    • figshare.com
    xls
    Updated Sep 30, 2025
    Cite
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri (2025). Final Features Selected by LassoR. [Dataset]. http://doi.org/10.1371/journal.pone.0330454.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  11.

    DataSheet1_Improving the Computation of Brier Scores for Evaluating...

    • frontiersin.figshare.com
    pdf
    Updated Jun 10, 2021
    Cite
    Gayan Dharmarathne; Anca Hanea; Andrew P. Robinson (2021). DataSheet1_Improving the Computation of Brier Scores for Evaluating Expert-Elicited Judgements.PDF [Dataset]. http://doi.org/10.3389/fams.2021.669546.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 10, 2021
    Dataset provided by
    Frontiers
    Authors
    Gayan Dharmarathne; Anca Hanea; Andrew P. Robinson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structured expert judgement (SEJ) is a suite of techniques used to elicit expert predictions, e.g. probability predictions of the occurrence of events, for situations in which data are too expensive or impossible to obtain. The quality of expert predictions can be assessed using Brier scores and calibration questions. In practice, these scores are computed from data that may have a correlation structure due to sharing the effects of the same levels of grouping factors of the experimental design. For example, asking experts a common set of questions may produce correlated probability predictions through shared question effects. Furthermore, experts commonly fail to answer all the needed questions. Here, we focus on (i) improving the computation of standard error estimates of expert Brier scores by using mixed-effects models that support design-based correlation structures of observations, and (ii) imputation of missing probability predictions in computing expert Brier scores to enhance the comparability of the prediction accuracy of experts. We show that the accuracy of estimating standard errors of expert Brier scores can be improved by incorporating the within-question correlations due to asking common questions. We recommend the use of multiple imputation to correct for missing data in expert elicitation exercises. We also discuss the implications of adopting a formal experimental design approach for SEJ exercises.
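
For reference, the plain Brier score that the paper's adjustments start from is just the mean squared difference between probability forecasts and binary outcomes; the mixed-effects corrections for within-question correlation described in the abstract are not reproduced here:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes;
    lower is better, and 0 is a perfect forecast."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# a forecaster saying 0.8 and 0.9 for events that occur, 0.1 for one that doesn't
score = brier_score([0.8, 0.1, 0.9], [1, 0, 1])
```

A forecaster who always assigns probability 1 to the realized outcome scores exactly 0.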

  12.

    Overview of the PID and FHGD Datasets.

    • figshare.com
    xls
    Updated Sep 30, 2025
    Cite
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri (2025). Overview of the PID and FHGD Datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0330454.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  13.

    Performance evaluation of feature selection.

    • plos.figshare.com
    xls
    Updated Jul 2, 2024
    Cite
    Xiaobo Qi; Yachen Lu; Ying Shi; Hui Qi; Lifang Ren (2024). Performance evaluation of feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0306090.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Xiaobo Qi; Yachen Lu; Ying Shi; Hui Qi; Lifang Ren
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes is a chronic disease characterized by abnormally high blood sugar levels. It may affect various organs and tissues, and even lead to life-threatening complications. Accurate prediction of diabetes can significantly reduce its incidence. However, current prediction methods struggle to accurately capture the essential characteristics of nonlinear data, and the black-box nature of these methods hampers their clinical application. To address these challenges, we propose KCCAM_DNN, a diabetes prediction method that integrates Kendall’s correlation coefficient and an attention mechanism within a deep neural network. In KCCAM_DNN, Kendall’s correlation coefficient is first employed for feature selection, which effectively filters out key features influencing diabetes prediction. For missing values in the data, polynomial regression is utilized for imputation, ensuring data completeness. Subsequently, we construct a deep neural network based on the self-attention mechanism, which assigns greater weight to crucial features affecting diabetes and enhances the model’s predictive performance. Finally, we employ the SHAP model to analyze the impact of each feature on diabetes prediction, augmenting the model’s interpretability. Experimental results show that KCCAM_DNN exhibits superior performance on both the PIMA Indian and LMCH diabetes datasets, achieving test accuracies of 99.090% and 99.333%, respectively, approximately 2% higher than the best existing method. These results suggest that KCCAM_DNN is proficient in diabetes prediction, providing a foundation for informed decision-making in the diagnosis and prevention of diabetes.
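
Kendall's correlation coefficient, used here for feature selection, has a simple pairwise definition. A minimal tau-a sketch (no tie correction, unlike the tau-b that most statistics libraries report) is:

```python
def kendall_tau(x, y):
    """Tau-a: (concordant pairs - discordant pairs) / total pairs.
    Tied pairs count as neither; no tie correction is applied."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            conc += s > 0
            disc += s < 0
    return (conc - disc) / (n * (n - 1) / 2)

# a feature whose |tau| with the label clears some threshold would be retained;
# the values here are made up for illustration
tau = kendall_tau([85, 110, 168, 90], [0, 0, 1, 1])
```

In a selection step, each candidate feature's tau against the diagnosis label would be computed this way and the strongest features kept.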

  14.

    Hardware environment table.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Sep 24, 2025
    + more versions
    Cite
    Kai Zhang; Po-Chung Chen; YiYang Huang; Shiow-Jyu Tzou; Sheng-Tang Wu; Ta-Wei Chu; Chung-Che Wang; Jyh-Shing Roger Jang (2025). Hardware environment table. [Dataset]. http://doi.org/10.1371/journal.pone.0330184.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 24, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Kai Zhang; Po-Chung Chen; YiYang Huang; Shiow-Jyu Tzou; Sheng-Tang Wu; Ta-Wei Chu; Chung-Che Wang; Jyh-Shing Roger Jang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In response to Taiwan’s rapidly aging population and the rising demand for personalized health care, accurately assessing individual physiological aging has become an essential area of study. This research utilizes health examination data to propose a machine learning-based biological age prediction model that quantifies physiological age through residual life estimation. The model leverages LightGBM, which shows an 11.40% improvement in predictive performance (R-squared) compared to the XGBoost model. In the experiments, the use of MICE imputation for missing data significantly enhanced prediction accuracy, resulting in a 23.35% improvement in predictive performance. Kaplan-Meier (K-M) estimator survival analysis revealed that the model effectively differentiates between groups with varying health levels, underscoring the validity of biological age as a health status indicator. Additionally, the model identified the top ten biomarkers most influential in aging for both men and women, with a 69.23% overlap with Taiwan’s leading causes of death and previously identified top health-impact factors, further validating its practical relevance. Through multidimensional health recommendations based on SHAP and PCC interpretations, if the health recommendations provided by the model are implemented, 64.58% of individuals could potentially extend their life expectancy. This study provides new methodological support and data backing for precision health interventions and life extension.
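
The MICE step mentioned above can be sketched as chained least-squares regressions: fill missing cells with column means, then repeatedly regress each incomplete column on the others and refresh its imputed cells. This generic version is an illustration of the idea, not the study's implementation:

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """Chained-equations sketch (MICE-like; details assumed): initialise
    missing cells with column means, then for each incomplete column fit an
    ordinary least-squares regression on all other columns using the observed
    rows, and overwrite the imputed cells with the fitted values."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = np.nanmean(X[:, j])     # initial mean fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = miss[:, j]
            if not rows.any():
                continue
            # intercept + every other column as predictors
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ coef            # refresh imputed cells
    return X

# toy data where the second column is exactly twice the first
Xd = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
Xi = iterative_impute(Xd)
```

Because the columns are exactly linearly related in the toy data, the chained regression recovers the missing value exactly.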

  15.

    S1 File -

    • plos.figshare.com
    zip
    Updated Jan 10, 2024
    + more versions
    Cite
    Raafat M. Munshi (2024). S1 File - [Dataset]. http://doi.org/10.1371/journal.pone.0296107.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Raafat M. Munshi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cervical cancer remains a leading cause of female mortality, particularly in developing regions, underscoring the critical need for early detection and intervention guided by skilled medical professionals. While Pap smear images serve as valuable diagnostic tools, many available datasets for automated cervical cancer detection contain missing data, posing challenges for machine learning models’ efficacy. To address these hurdles, this study presents an automated system adept at managing missing information using ADASYN characteristics, resulting in exceptional accuracy. The proposed methodology integrates a voting classifier model harnessing the predictive capacity of three distinct machine learning models. It further incorporates SVM Imputer and ADASYN up-sampled features to mitigate missing value concerns, while leveraging CNN-generated features to augment the model’s capabilities. Notably, this model achieves remarkable performance metrics, boasting a 99.99% accuracy, precision, recall, and F1 score. A comprehensive comparative analysis evaluates the proposed model against various machine learning algorithms across four scenarios: original dataset usage, SVM imputation, ADASYN feature utilization, and CNN-generated features. Results indicate the superior efficacy of the proposed model over existing state-of-the-art techniques. This research not only introduces a novel approach but also offers actionable suggestions for refining automated cervical cancer detection systems. Its impact extends to benefiting medical practitioners by enabling earlier detection and improved patient care. Furthermore, the study’s findings have substantial societal implications, potentially reducing the burden of cervical cancer through enhanced diagnostic accuracy and timely intervention.
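
A voting classifier over three models typically combines them by averaging their class-probability outputs (soft voting). A library-free sketch of that combination rule, with made-up probabilities, is:

```python
def soft_vote(model_probs):
    """model_probs: one probability vector per model, all over the same
    classes. Average them and return (winning class index, averaged vector)."""
    n_models, n_classes = len(model_probs), len(model_probs[0])
    avg = [sum(m[c] for m in model_probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# two of three hypothetical models lean towards class 0
label, avg = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
```

Whether the study uses soft or hard (majority-label) voting is not stated in the abstract; soft voting is shown because it uses the models' probability outputs directly.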

  16.

    Data from: DataSheet1.docx

    • figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Fabio Cericola; Ingo Lenk; Dario Fè; Stephen Byrne; Christian S. Jensen; Morten G. Pedersen; Torben Asp; Just Jensen; Luc Janss (2023). DataSheet1.docx [Dataset]. http://doi.org/10.3389/fpls.2018.00369.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Fabio Cericola; Ingo Lenk; Dario Fè; Stephen Byrne; Christian S. Jensen; Morten G. Pedersen; Torben Asp; Just Jensen; Luc Janss
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ryegrass single plants, bi-parental family pools, and multi-parental family pools are often genotyped, based on allele-frequencies using genotyping-by-sequencing (GBS) assays. GBS assays can be performed at low-coverage depth to reduce costs. However, reducing the coverage depth leads to a higher proportion of missing data, and leads to a reduction in accuracy when identifying the allele-frequency at each locus. As a consequence of the latter, genomic relationship matrices (GRMs) will be biased. This bias in GRMs affects variance estimates and the accuracy of GBLUP for genomic prediction (GBLUP-GP). We derived equations that describe the bias from low-coverage sequencing as an effect of binomial sampling of sequence reads, and allowed for any ploidy level of the sample considered. This allowed us to combine individual and pool genotypes in one GRM, treating pool-genotypes as a polyploid genotype, equal to the total ploidy-level of the parents of the pool. Using simulated data, we verified the magnitude of the GRM bias at different coverage depths for three different kinds of ryegrass breeding material: individual genotypes from single plants, pool-genotypes from F2 families, and pool-genotypes from synthetic varieties. To better handle missing data, we also tested imputation procedures, which are suited for analyzing allele-frequency genomic data. The relative advantages of the bias-correction and the imputation of missing data were evaluated using real data. We examined a large dataset, including single plants, F2 families, and synthetic varieties genotyped in three GBS assays, each with a different coverage depth, and evaluated them for heading date, crown rust resistance, and seed yield. Cross validations were used to test the accuracy using GBLUP approaches, demonstrating the feasibility of predicting among different breeding material. Bias-corrected GRMs proved to increase predictive accuracies when compared with standard approaches to construct GRMs. 
Among the imputation methods we tested, the random forest method yielded the highest predictive accuracy. The combinations of these two methods resulted in a meaningful increase of predictive ability (up to 0.09). The possibility of predicting across individuals and pools provides new opportunities for improving ryegrass breeding schemes.
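
The low-coverage effect the authors correct arises because an allele frequency estimated from c reads is a binomial draw: unbiased in expectation but carrying extra variance of roughly p(1-p)/c, which biases the genomic relationship matrix. A small simulation (a generic illustration, not the paper's derivation) shows the effect:

```python
import random

def read_based_freq(true_p, coverage, rng):
    """Allele frequency estimated from `coverage` reads: Binomial(c, p) / c."""
    hits = sum(rng.random() < true_p for _ in range(coverage))
    return hits / coverage

rng = random.Random(42)
p, c, n = 0.3, 2, 20000          # very low coverage of 2 reads per locus
ests = [read_based_freq(p, c, rng) for _ in range(n)]
mean_est = sum(ests) / n
var_est = sum((e - mean_est) ** 2 for e in ests) / n
# theory: the estimator is unbiased, with sampling variance p*(1-p)/c = 0.105
```

The mean of the estimates stays near the true frequency, while their variance matches the binomial term that the bias-corrected GRM accounts for.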

  17.

    Results of the learning models using SVM imputer.

    • figshare.com
    xls
    Updated Jan 10, 2024
    Cite
    Raafat M. Munshi (2024). Results of the learning models using SVM imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0296107.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Raafat M. Munshi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  18.

    Time evaluation of the proposed KCCAM_DNN.

    • figshare.com
    xls
    Updated Jul 2, 2024
    Cite
    Xiaobo Qi; Yachen Lu; Ying Shi; Hui Qi; Lifang Ren (2024). Time evaluation of the proposed KCCAM_DNN. [Dataset]. http://doi.org/10.1371/journal.pone.0306090.t010
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Xiaobo Qi; Yachen Lu; Ying Shi; Hui Qi; Lifang Ren
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  19.

    Table_1_Enhancing water use efficiency in precision irrigation: data-driven...

    • frontiersin.figshare.com
    docx
    Updated Aug 22, 2023
    Cite
    Mohammad Zeynoddin; Silvio José Gumiere; Hossein Bonakdari (2023). Table_1_Enhancing water use efficiency in precision irrigation: data-driven approaches for addressing data gaps in time series.docx [Dataset]. http://doi.org/10.3389/frwa.2023.1237592.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Aug 22, 2023
    Dataset provided by
    Frontiers
    Authors
    Mohammad Zeynoddin; Silvio José Gumiere; Hossein Bonakdari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Real-time soil matric potential measurements for determining potato production's water availability are currently used in precision irrigation. It is well known that managing irrigation based on soil matric potential (SMP) helps increase water use efficiency and reduce crop environmental impact. Yet, SMP monitoring presents challenges and sometimes leads to gaps in the collected data. This research sought to address these data gaps in the SMP time series. Using meteorological and field measurements, we developed a filtering and imputation algorithm by implementing three prominent predictive models in the algorithm to estimate missing values. Over 2 months, we gathered hourly SMP values from a field north of the Péribonka River in Lac-Saint-Jean, Québec, Canada. Our study evaluated various data input combinations, including only meteorological data, SMP measurements, or a mix of both. The Extreme Learning Machine (ELM) model proved the most effective among the tested models. It outperformed the k-Nearest Neighbors (kNN) model and the Evolutionary Optimized Inverse Distance Method (gaIDW). The ELM model, with five inputs comprising SMP measurements, achieved a correlation coefficient of 0.992, a root-mean-square error of 0.164 cm, a mean absolute error of 0.122 cm, and a Nash-Sutcliffe efficiency of 0.983. The ELM model requires at least five inputs to achieve the best results in the study context. These can be meteorological inputs like relative humidity, dew temperature, land inputs, or a combination of both. The results were within 5% of the best-performing input combination we identified earlier. To mitigate the computational demands of these models, a quicker baseline model can be used for initial input filtering. With this method, we expect the output from simpler models such as gaIDW and kNN to vary by no more than 20%. Nevertheless, this discrepancy can be efficiently managed by leveraging more sophisticated models.
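
The kNN baseline for gap-filling can be sketched as a nearest-neighbour average in feature space. The Euclidean distance and the hypothetical (humidity, dew temperature) features below are illustrative assumptions; the paper's tuned kNN, gaIDW, and ELM models are not reproduced:

```python
import math

def knn_estimate(query, examples, k=3):
    """examples: (feature_vector, observed_value) pairs, e.g. meteorological
    features with their observed SMP reading. Estimate a missing value as the
    mean over the k nearest neighbours in Euclidean feature distance."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return sum(v for _, v in nearest) / len(nearest)

# hypothetical (relative humidity, dew temperature) features with observed SMP
obs = [((0.70, 5.0), -10.0), ((0.72, 5.5), -12.0), ((0.30, 15.0), -40.0)]
est = knn_estimate((0.71, 5.2), obs, k=2)
```

The gap at the query timestamp is filled from the two meteorologically closest observed readings; in practice features should be scaled so no one unit dominates the distance.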

  20.

    Parameter table.

    • plos.figshare.com
    xls
    Updated Jul 2, 2024
    + more versions
    Cite
    Xiaobo Qi; Yachen Lu; Ying Shi; Hui Qi; Lifang Ren (2024). Parameter table. [Dataset]. http://doi.org/10.1371/journal.pone.0306090.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Xiaobo Qi; Yachen Lu; Ying Shi; Hui Qi; Lifang Ren
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


Cite
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao (2023). Table_3_Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study.XLSX [Dataset]. http://doi.org/10.3389/fcvm.2022.959649.s005

Table_3_Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study.XLSX

Related Article
Explore at:
Available download formats: xlsx
Dataset updated
Jun 13, 2023
Dataset provided by
Frontiers
Authors
Dongying Zheng; Xinyu Hao; Muhanmmad Khan; Lixia Wang; Fan Li; Ning Xiang; Fuli Kang; Timo Hamalainen; Fengyu Cong; Kedong Song; Chong Qiao
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Introduction: Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models given the lack of effective treatment. Predictive models based on machine learning algorithms show promising potential, but whether machine learning methods should be preferred over traditional statistical models remains a matter of debate.

Methods: We employed logistic regression and six machine learning methods as binary predictive models on a dataset of 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After imputation of missing values, preliminary statistical description and comparison were conducted to explore the characteristics of the 73 documented variables. Correlation analysis and feature selection were then performed as preprocessing steps to filter the contributing variables for model development. The models were evaluated by multiple criteria.

Results: We first found that the influential variables screened by the preprocessing steps did not overlap with those determined by statistical differences. Second, K-Nearest Neighbor was the most accurate imputation method, and the imputation process had little effect on the performance of the developed models. Finally, the random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power, as evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration, as verified by the calibration curve.

Conclusion: Machine learning algorithms can accomplish prediction modeling and demonstrate superior discrimination, while logistic regression can be well calibrated. Statistical analysis and machine learning are two scientific domains sharing similar themes. The predictive abilities of the developed models vary with the characteristics of the datasets, and larger sample sizes and more influential predictors are still needed to accumulate evidence.
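
Discrimination as measured by the area under the ROC curve has a simple pairwise reading: the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. It can be computed directly (a generic sketch, independent of the models above):

```python
def auc(scores, labels):
    """Concordance form of ROC AUC: the fraction of (positive, negative)
    pairs ranked correctly, with tied scores counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a model that ranks every adverse-outcome case above every other case
perfect = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random ranking; calibration, the other property discussed in the abstract, is a separate question of whether the predicted probabilities match observed frequencies.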
