4 datasets found

Statistic results on the imbalance datasets of adverse drug reactions.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Jin Wang; Liang-Chih Yu; Xuejie Zhang (2023). Statistic results on the imbalance datasets of adverse drug reactions. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010144.t002
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pcbi.1010144.t002
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Jin Wang; Liang-Chih Yu; Xuejie Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Most tokens are annotated as class O; they are about 10 times as numerous as tokens carrying ADR entity labels. Learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced (even extremely imbalanced) datasets. The resulting models may be biased toward the majority class and predict only that class.
Results comparison of different context encoders w/ and w/o weighted...
plos.figshare.com
xls
Updated Jun 2, 2023
Cite
Jin Wang; Liang-Chih Yu; Xuejie Zhang (2023). Results comparison of different context encoders w/ and w/o weighted mechanism for ADR detection tasks. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010144.t003
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pcbi.1010144.t003
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS Computational Biology
Authors
Jin Wang; Liang-Chih Yu; Xuejie Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The proposed weighted CRF significantly outperformed several baselines on both the Twitter and PubMed datasets. In addition, the weighting strategy on both softmax and CRF can alleviate the imbalanced data distribution, and they thus outperformed their conventional versions by about 1.1% and 1.8% on average across the two ADR tasks.
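As a rough illustration of how a weighting strategy on softmax can counter an imbalanced label distribution, the sketch below scales the cross-entropy of each token by an inverse-frequency class weight. The weighting formula, the function names, and the toy 10:1 label counts are assumptions made for illustration; the dataset description does not say how the authors computed their weights, and the weighted CRF is not covered here.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def weighted_cross_entropy(logits, gold, classes, weights):
    """Softmax cross-entropy for one token, scaled by the gold class weight."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    z_sum = sum(exps)
    p_gold = exps[classes.index(gold)] / z_sum
    return -weights[gold] * math.log(p_gold)


# Toy label distribution mirroring the roughly 10:1 ratio of O to ADR tokens.
labels = ["O"] * 10 + ["ADR"]
classes = ["O", "ADR"]
weights = inverse_frequency_weights(labels)

# The same-sized mistake costs more on the rare ADR class than on O.
loss_adr = weighted_cross_entropy([2.0, 0.0], "ADR", classes, weights)
loss_o = weighted_cross_entropy([0.0, 2.0], "O", classes, weights)
print(weights, loss_adr, loss_o)
```

With these weights, an equally confident error on an ADR token contributes ten times the loss of the same error on an O token, which is what discourages a model from predicting only the majority class.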
Data from: A Novel Automated Framework for QSAR Modeling of Highly...
acs.figshare.com
xlsx
Updated Jun 10, 2023
Cite
Omar Casanova-Alvarez; Aliuska Morales-Helguera; Miguel Ángel Cabrera-Pérez; Reinaldo Molina-Ruiz; Christophe Molina (2023). A Novel Automated Framework for QSAR Modeling of Highly Imbalanced Leishmania High-Throughput Screening Data [Dataset]. http://doi.org/10.1021/acs.jcim.0c01439.s002
Explore at:
xlsx (available download formats)
Unique identifier
https://doi.org/10.1021/acs.jcim.0c01439.s002
Dataset updated
Jun 10, 2023
Dataset provided by
ACS Publications
Authors
Omar Casanova-Alvarez; Aliuska Morales-Helguera; Miguel Ángel Cabrera-Pérez; Reinaldo Molina-Ruiz; Christophe Molina
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
In silico prediction of antileishmanial activity using quantitative structure–activity relationship (QSAR) models has been developed on limited and small datasets. Nowadays, the availability of large and diverse high-throughput screening data provides an opportunity for the scientific community to model this activity from the chemical structure. In this study, we present the first KNIME automated workflow for modeling a large, diverse, and highly imbalanced dataset of compounds with antileishmanial activity. Because the data is strongly biased toward inactive compounds, a novel strategy was implemented based on the selection of different balanced training sets and a further consensus model using single decision trees as the base model and three criteria for output combinations. The decision tree consensus was adopted after comparing its classification performance to consensuses built upon Gaussian-Naïve-Bayes, Support-Vector-Machine, Random-Forest, Gradient-Boost, and Multi-Layer-Perceptron base models. All these consensuses were rigorously validated using internal and external test validation sets and were compared against each other using Friedman and Bonferroni–Dunn statistics. For the retained decision tree-based consensus model, which covers 100% of the chemical space of the dataset and with the lowest consensus level, the overall accuracy statistics for test and external sets were between 71 and 74% and 71 and 76%, respectively, while for a reduced chemical space (21%) and with an incremental consensus level, the accuracy statistics were substantially improved with values for the test and external sets between 86 and 92% and 88 and 92%, respectively. These results highlight the relevance of the consensus model to prioritize a relatively small set of active compounds with high prediction sensitivity using the Incremental Consensus at high level values or to predict as many compounds as possible, lowering the level of Incremental Consensus.
Finally, the workflow developed eliminates human bias, improves the procedure reproducibility, and allows other researchers to reproduce our design and use it in their own QSAR problems.
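The two ideas in this description, balanced training sets drawn from a mostly inactive pool and a consensus whose required agreement level trades coverage for accuracy, can be sketched roughly as follows. This is not the authors' KNIME workflow: the function names, the agreement rule, and the toy data are hypothetical, and the three output-combination criteria mentioned above are not specified in the description, so a single simple rule stands in for them.

```python
import random


def balanced_subsets(actives, inactives, n_subsets, seed=0):
    """Pair all actives with an equally sized random draw of inactives."""
    rng = random.Random(seed)
    k = len(actives)
    return [actives + rng.sample(inactives, k) for _ in range(n_subsets)]


def consensus_predict(votes, level):
    """Combine 0/1 votes from base models trained on the balanced subsets.

    Returns the majority label when the share of agreeing models reaches
    `level`, otherwise None (the compound is left unpredicted).
    """
    frac_active = sum(votes) / len(votes)
    majority = 1 if frac_active >= 0.5 else 0
    agreement = max(frac_active, 1.0 - frac_active)
    return majority if agreement >= level else None


# Toy pool: 5 actives against 100 inactives (hypothetical identifiers).
actives = [("active", i) for i in range(5)]
inactives = [("inactive", i) for i in range(100)]
subsets = balanced_subsets(actives, inactives, n_subsets=7)

votes = [1, 1, 1, 0, 1]  # hypothetical outputs of five base models
print(len(subsets), len(subsets[0]))
print(consensus_predict(votes, level=0.5))  # low level: always predicts
print(consensus_predict(votes, level=0.9))  # high level: abstains here
```

Raising `level` shrinks the set of compounds that receive a prediction, which corresponds to the reduced chemical-space coverage reported above for the higher consensus levels.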
The features of the KD datasets.
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Chuan-Sheng Hung; Chun-Hung Richard Lin; Jain-Shing Liu; Shi-Huang Chen; Tsung-Chi Hung; Chih-Min Tsai (2024). The features of the KD datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0314995.t001
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0314995.t001
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Chuan-Sheng Hung; Chun-Hung Richard Lin; Jain-Shing Liu; Shi-Huang Chen; Tsung-Chi Hung; Chih-Min Tsai
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Kawasaki Disease (KD) is a rare febrile illness affecting infants and young children, potentially leading to coronary artery complications and, in severe cases, mortality if untreated. However, KD is frequently misdiagnosed as a common fever in clinical settings, and the inherent data imbalance further complicates accurate prediction when using traditional machine learning and statistical methods. This paper introduces two advanced approaches to address these challenges, enhancing prediction accuracy and generalizability. The first approach proposes a stacking model termed the Disease Classifier (DC), specifically designed to recognize minority class samples within imbalanced datasets, thereby mitigating the bias commonly observed in traditional models toward the majority class. Secondly, we introduce a combined model, the Disease Classifier with CTGAN (CTGAN-DC), which integrates DC with Conditional Tabular Generative Adversarial Network (CTGAN) technology to improve data balance and predictive performance further. Utilizing CTGAN-based oversampling techniques, this model retains the original data characteristics of KD while expanding data diversity. This effectively balances positive and negative KD samples, significantly reducing model bias toward the majority class and enhancing both predictive accuracy and generalizability. Experimental evaluations indicate substantial performance gains, with the DC and CTGAN-DC models achieving notably higher predictive accuracy than individual machine learning models. Specifically, the DC model achieves sensitivity and specificity rates of 95%, while the CTGAN-DC model achieves 95% sensitivity and 97% specificity, demonstrating superior recognition capability. 
Furthermore, both models exhibit strong generalizability across diverse KD datasets, particularly the CTGAN-DC model, which surpasses the JAMA model with a 3% increase in sensitivity and a 95% improvement in generalization sensitivity and specificity, effectively resolving the model collapse issue observed in the JAMA model. In sum, the proposed DC and CTGAN-DC architectures demonstrate robust generalizability across multiple KD datasets from various healthcare institutions and significantly outperform other models, including XGBoost. These findings lay a solid foundation for advancing disease prediction in the context of imbalanced medical data.
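For readers unfamiliar with the metrics quoted in this description, the minimal sketch below computes sensitivity and specificity from binary labels. The labels are invented for illustration and are not taken from the KD datasets; the example only shows why these two rates are reported instead of plain accuracy when the positive class is rare.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Invented example: 2 positives among 10 cases, and a model that always
# predicts the majority (negative) class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(accuracy, sens, spec)
```

Here accuracy is 0.8 even though no positive case is detected, so sensitivity is 0.0 while specificity is 1.0; a majority-class bias of this kind is what the description says the proposed models are designed to reduce.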