Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Most tokens are annotated as class O, roughly 10 times as many as the tokens carrying ADR entity labels. Learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced (and especially extremely imbalanced) datasets. Such models may be biased toward the majority class and end up predicting only that class.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The proposed weighted CRF significantly outperformed several baselines on both the Twitter and PubMed datasets. In addition, applying the weighting strategy to both softmax and CRF alleviates the imbalanced data distribution, and the weighted models thus outperformed their conventional versions by about 1.1% and 1.8% on average across the two ADR tasks.
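The weighting idea can be illustrated with a minimal sketch for the softmax case. The entry above does not state the weighting formula, so the inverse-frequency weights below are an assumption (a common heuristic, not necessarily the authors' scheme), and the toy label sequence and helper names are hypothetical. The CRF variant is not covered here.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed heuristic: weight = total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true label).

    probs is a list of dicts mapping each class to its predicted probability.
    """
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


# Hypothetical toy sequence with a 10:1 ratio of O tokens to ADR tokens.
labels = ["O"] * 10 + ["ADR"]
weights = inverse_frequency_weights(labels)

# A model that leans heavily toward the majority class on every token.
probs = [{"O": 0.9, "ADR": 0.1} for _ in labels]
uniform = {"O": 1.0, "ADR": 1.0}
plain_loss = weighted_cross_entropy(probs, labels, uniform)
weighted_loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights the single ADR token contributes as much to the loss as the ten O tokens combined, so a model that always favors O is penalized more heavily than under the unweighted loss.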
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Quantitative structure–activity relationship (QSAR) models for the in silico prediction of antileishmanial activity have so far been developed on small, limited datasets. The availability of large and diverse high-throughput screening data now gives the scientific community an opportunity to model this activity from chemical structure. In this study, we present the first automated KNIME workflow for modeling a large, diverse, and highly imbalanced dataset of compounds with antileishmanial activity. Because the data are strongly biased toward inactive compounds, a novel strategy was implemented based on the selection of different balanced training sets, followed by a consensus model that uses single decision trees as the base model and three criteria for combining outputs. The decision tree consensus was adopted after comparing its classification performance with that of consensuses built upon Gaussian-Naïve-Bayes, Support-Vector-Machine, Random-Forest, Gradient-Boost, and Multi-Layer-Perceptron base models. All these consensuses were rigorously validated using internal and external validation sets and were compared against each other using Friedman and Bonferroni–Dunn statistics. For the retained decision tree-based consensus model, at the lowest consensus level and covering 100% of the chemical space of the dataset, the overall accuracy statistics for the test and external sets were between 71 and 74% and between 71 and 76%, respectively. For a reduced chemical space (21%) and an incremental consensus level, the accuracy statistics improved substantially, with values for the test and external sets between 86 and 92% and between 88 and 92%, respectively. These results highlight the relevance of the consensus model either to prioritize a relatively small set of active compounds with high prediction sensitivity, using the Incremental Consensus at high level values, or to predict as many compounds as possible by lowering the level of Incremental Consensus.
Finally, the workflow developed eliminates human bias, improves the procedure reproducibility, and allows other researchers to reproduce our design and use it in their own QSAR problems.
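The trade-off between consensus level and chemical-space coverage described in this entry can be sketched as follows. This is an illustration under stated assumptions, not the KNIME workflow itself: the entry does not spell out its three output-combination criteria, so the rule here (predict only when a given fraction of base models agree, otherwise abstain) is assumed, and the function names and vote data are hypothetical.

```python
import random


def balanced_subsets(actives, inactives, n_sets, seed=0):
    """Pair all actives with an equally sized random draw of inactives.

    Assumes len(inactives) >= len(actives), as in a dataset biased
    toward inactive compounds.
    """
    rng = random.Random(seed)
    return [actives + rng.sample(inactives, len(actives)) for _ in range(n_sets)]


def consensus_predict(votes, level):
    """Return 1 or 0 when at least `level` of the base models agree.

    Returns None (abstain) when neither class reaches the required level.
    """
    frac_active = sum(votes) / len(votes)
    if frac_active >= level:
        return 1
    if 1 - frac_active >= level:
        return 0
    return None


def coverage(predictions):
    """Fraction of compounds that received a prediction."""
    return sum(p is not None for p in predictions) / len(predictions)


# Hypothetical votes (1 = active) from five base models for three compounds.
votes_per_compound = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]
low_level = [consensus_predict(v, 0.6) for v in votes_per_compound]
high_level = [consensus_predict(v, 1.0) for v in votes_per_compound]

# Hypothetical compound identifiers: 3 actives and 30 inactives.
subsets = balanced_subsets(["a1", "a2", "a3"], [f"i{k}" for k in range(30)], n_sets=4)
```

In this toy setting, raising the level from 0.6 to 1.0 leaves a prediction only for the compound on which all base models agree, which mirrors the reduced coverage reported at higher consensus levels.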
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kawasaki Disease (KD) is a rare febrile illness affecting infants and young children, potentially leading to coronary artery complications and, in severe cases, mortality if untreated. However, KD is frequently misdiagnosed as a common fever in clinical settings, and the inherent data imbalance further complicates accurate prediction when using traditional machine learning and statistical methods. This paper introduces two approaches to address these challenges, enhancing prediction accuracy and generalizability. The first is a stacking model termed the Disease Classifier (DC), specifically designed to recognize minority class samples within imbalanced datasets, thereby mitigating the bias toward the majority class commonly observed in traditional models. The second is a combined model, the Disease Classifier with CTGAN (CTGAN-DC), which integrates DC with Conditional Tabular Generative Adversarial Network (CTGAN) technology to further improve data balance and predictive performance. By using CTGAN-based oversampling, this model retains the original data characteristics of KD while expanding data diversity. This effectively balances positive and negative KD samples, significantly reducing model bias toward the majority class and enhancing both predictive accuracy and generalizability. Experimental evaluations indicate substantial performance gains, with the DC and CTGAN-DC models achieving notably higher predictive accuracy than individual machine learning models. Specifically, the DC model achieves sensitivity and specificity rates of 95%, while the CTGAN-DC model achieves 95% sensitivity and 97% specificity, demonstrating superior recognition capability.
Furthermore, both models exhibit strong generalizability across diverse KD datasets, particularly the CTGAN-DC model, which surpasses the JAMA model with a 3% increase in sensitivity and a 95% improvement in generalization sensitivity and specificity, effectively resolving the model collapse issue observed in the JAMA model. In sum, the proposed DC and CTGAN-DC architectures demonstrate robust generalizability across multiple KD datasets from various healthcare institutions and significantly outperform other models, including XGBoost. These findings lay a solid foundation for advancing disease prediction in the context of imbalanced medical data.
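Because this entry reports its results as sensitivity and specificity, a short sketch of how those two metrics are computed may help, along with why they are more informative than accuracy on imbalanced data. The cohort and predictions below are hypothetical and are not taken from the KD datasets.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical imbalanced cohort: 20 positive and 100 negative cases.
y_true = [1] * 20 + [0] * 100

# A classifier that always predicts the majority (negative) class.
majority_pred = [0] * 120

# A classifier that misses 1 positive case and raises 3 false alarms.
better_pred = [1] * 19 + [0] + [1] * 3 + [0] * 97

majority_scores = sensitivity_specificity(y_true, majority_pred)
better_scores = sensitivity_specificity(y_true, better_pred)
majority_accuracy = sum(t == p for t, p in zip(y_true, majority_pred)) / len(y_true)
```

The always-negative classifier is correct on 100 of the 120 cases yet detects no positive case at all, which is the majority-class bias the entry describes.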