4 datasets found
  1. Statistic results on the imbalance datasets of adverse drug reactions.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jin Wang; Liang-Chih Yu; Xuejie Zhang (2023). Statistic results on the imbalance datasets of adverse drug reactions. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010144.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Jin Wang; Liang-Chih Yu; Xuejie Zhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Most tokens are annotated as class O, which is about 10 times that of ADR with entity labels. The algorithms tend to produce unsatisfactory classifiers when faced with (even extremely) imbalanced datasets. Those models may have a bias towards classes and only predict the majority class.

  2. f

    Results comparison of different context encoders w/ and w/o weighted...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jin Wang; Liang-Chih Yu; Xuejie Zhang (2023). Results comparison of different context encoders w/ and w/o weighted mechanism for ADR detection tasks. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010144.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Jin Wang; Liang-Chih Yu; Xuejie Zhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The proposed weighted CRF significantly outperformed several baselines on both the Twitter and PubMed datasets. In addition, the weighting strategy on both softmax and CRF can alleviate the imbalanced data distribution, and they thus outperformed their conventional versions by about 1.1% and 1.8% on average across the two ADR tasks.

  3. f

    Data from: A Novel Automated Framework for QSAR Modeling of Highly...

    • acs.figshare.com
    xlsx
    Updated Jun 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Omar Casanova-Alvarez; Aliuska Morales-Helguera; Miguel Ángel Cabrera-Pérez; Reinaldo Molina-Ruiz; Christophe Molina (2023). A Novel Automated Framework for QSAR Modeling of Highly Imbalanced Leishmania High-Throughput Screening Data [Dataset]. http://doi.org/10.1021/acs.jcim.0c01439.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    ACS Publications
    Authors
    Omar Casanova-Alvarez; Aliuska Morales-Helguera; Miguel Ángel Cabrera-Pérez; Reinaldo Molina-Ruiz; Christophe Molina
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In silico prediction of antileishmanial activity using quantitative structure–activity relationship (QSAR) models has been developed on limited and small datasets. Nowadays, the availability of large and diverse high-throughput screening data provides an opportunity to the scientific community to model this activity from the chemical structure. In this study, we present the first KNIME automated workflow to modeling a large, diverse, and highly imbalanced dataset of compounds with antileishmanial activity. Because the data is strongly biased toward inactive compounds, a novel strategy was implemented based on the selection of different balanced training sets and a further consensus model using single decision trees as the base model and three criteria for output combinations. The decision tree consensus was adopted after comparing its classification performance to consensuses built upon Gaussian-Naı̈ve-Bayes, Support-Vector-Machine, Random-Forest, Gradient-Boost, and Multi-Layer-Perceptron base models. All these consensuses were rigorously validated using internal and external test validation sets and were compared against each other using Friedman and Bonferroni–Dunn statistics. For the retained decision tree-based consensus model, which covers 100% of the chemical space of the dataset and with the lowest consensus level, the overall accuracy statistics for test and external sets were between 71 and 74% and 71 and 76%, respectively, while for a reduced chemical space (21%) and with an incremental consensus level, the accuracy statistics were substantially improved with values for the test and external sets between 86 and 92% and 88 and 92%, respectively. These results highlight the relevance of the consensus model to prioritize a relatively small set of active compounds with high prediction sensitivity using the Incremental Consensus at high level values or to predict as many compounds as possible, lowering the level of Incremental Consensus. Finally, the workflow developed eliminates human bias, improves the procedure reproducibility, and allows other researchers to reproduce our design and use it in their own QSAR problems.

  4. f

    The features of the KD datasets.

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chuan-Sheng Hung; Chun-Hung Richard Lin; Jain-Shing Liu; Shi-Huang Chen; Tsung-Chi Hung; Chih-Min Tsai (2024). The features of the KD datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0314995.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Chuan-Sheng Hung; Chun-Hung Richard Lin; Jain-Shing Liu; Shi-Huang Chen; Tsung-Chi Hung; Chih-Min Tsai
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Kawasaki Disease (KD) is a rare febrile illness affecting infants and young children, potentially leading to coronary artery complications and, in severe cases, mortality if untreated. However, KD is frequently misdiagnosed as a common fever in clinical settings, and the inherent data imbalance further complicates accurate prediction when using traditional machine learning and statistical methods. This paper introduces two advanced approaches to address these challenges, enhancing prediction accuracy and generalizability. The first approach proposes a stacking model termed the Disease Classifier (DC), specifically designed to recognize minority class samples within imbalanced datasets, thereby mitigating the bias commonly observed in traditional models toward the majority class. Secondly, we introduce a combined model, the Disease Classifier with CTGAN (CTGAN-DC), which integrates DC with Conditional Tabular Generative Adversarial Network (CTGAN) technology to improve data balance and predictive performance further. Utilizing CTGAN-based oversampling techniques, this model retains the original data characteristics of KD while expanding data diversity. This effectively balances positive and negative KD samples, significantly reducing model bias toward the majority class and enhancing both predictive accuracy and generalizability. Experimental evaluations indicate substantial performance gains, with the DC and CTGAN-DC models achieving notably higher predictive accuracy than individual machine learning models. Specifically, the DC model achieves sensitivity and specificity rates of 95%, while the CTGAN-DC model achieves 95% sensitivity and 97% specificity, demonstrating superior recognition capability. Furthermore, both models exhibit strong generalizability across diverse KD datasets, particularly the CTGAN-DC model, which surpasses the JAMA model with a 3% increase in sensitivity and a 95% improvement in generalization sensitivity and specificity, effectively resolving the model collapse issue observed in the JAMA model. In sum, the proposed DC and CTGAN-DC architectures demonstrate robust generalizability across multiple KD datasets from various healthcare institutions and significantly outperform other models, including XGBoost. These findings lay a solid foundation for advancing disease prediction in the context of imbalanced medical data.

  5. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Jin Wang; Liang-Chih Yu; Xuejie Zhang (2023). Statistic results on the imbalance datasets of adverse drug reactions. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010144.t002
Organization logo

Statistic results on the imbalance datasets of adverse drug reactions.

Related Article
Explore at:
xlsAvailable download formats
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Jin Wang; Liang-Chih Yu; Xuejie Zhang
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Most tokens are annotated as class O, which is about 10 times that of ADR with entity labels. The algorithms tend to produce unsatisfactory classifiers when faced with (even extremely) imbalanced datasets. Those models may have a bias towards classes and only predict the majority class.

Search
Clear search
Close search
Google apps
Main menu