73 datasets found
  1. Data from: Multitask Modeling with Confidence Using Matrix Factorization and...

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Ulf Norinder; Fredrik Svensson (2023). Multitask Modeling with Confidence Using Matrix Factorization and Conformal Prediction [Dataset]. http://doi.org/10.1021/acs.jcim.9b00027.s001
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Ulf Norinder; Fredrik Svensson
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Multitask prediction of bioactivities is often faced with challenges relating to the sparsity of data and imbalance between different labels. We propose class conditional (Mondrian) conformal predictors using underlying Macau models as a novel approach for large scale bioactivity prediction. This approach handles both high degrees of missing data and label imbalances while still producing high quality predictive models. When applied to ten assay end points from PubChem, the approach generated valid models with an efficiency of 74.0–80.1% at the 80% confidence level, with similar performance for both the minority and majority class. Also, when deleting progressively larger portions of the available data (0–80%), the performance of the models remained robust with only minor deterioration (reduction in efficiency between 5 and 10%). Compared to using Macau without conformal prediction, the method presented here significantly improves the performance on imbalanced data sets.
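
    To make the class-conditional (Mondrian) conformal procedure concrete, here is a minimal inductive sketch for a single binary end point. It substitutes a random forest for the Macau matrix-factorization models used in the paper; the 80% confidence level and the efficiency measure (fraction of single-label prediction sets) follow the description above, but all names and parameter choices are illustrative, not the authors' code.

    ```python
    # Minimal Mondrian (class-conditional) inductive conformal predictor sketch.
    # The random forest is a stand-in for the Macau models described above.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def mondrian_conformal(X, y, confidence=0.80, random_state=0):
        X_tr, X_cal, y_tr, y_cal = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=random_state)
        model = RandomForestClassifier(n_estimators=200, random_state=random_state)
        model.fit(X_tr, y_tr)

        # Nonconformity score: 1 - predicted probability of the candidate class,
        # kept separately per class (this is what makes the predictor "Mondrian").
        cal_prob = model.predict_proba(X_cal)
        cal_scores = {c: 1.0 - cal_prob[y_cal == c][:, c] for c in (0, 1)}

        def predict_set(x):
            prob = model.predict_proba(x.reshape(1, -1))[0]
            labels = []
            for c in (0, 1):
                score = 1.0 - prob[c]
                # Class-conditional p-value from the calibration scores of class c.
                p = (np.sum(cal_scores[c] >= score) + 1) / (len(cal_scores[c]) + 1)
                if p > 1.0 - confidence:
                    labels.append(c)
            return labels  # empty, {0}, {1}, or {0, 1}

        return predict_set

    # Efficiency, as used above, is the fraction of test predictions that are
    # single-label sets; validity means the true label is excluded at most 20% of the time.
    ```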

  2. Comparison of the performance evaluation of Random Forest results for...

    • plos.figshare.com
    xls
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Comparison of the performance evaluation of Random Forest results for 5-class classification on both balanced and unbalanced data set. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t008
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the performance evaluation of Random Forest results for 5-class classification on both balanced and unbalanced data set.

  3. Uplift Modeling , Marketing Campaign Data

    • kaggle.com
    zip
    Updated Nov 1, 2020
    Cite
    Möbius (2020). Uplift Modeling , Marketing Campaign Data [Dataset]. https://www.kaggle.com/arashnic/uplift-modeling
    Available download formats: zip (340156703 bytes)
    Dataset updated
    Nov 1, 2020
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a factor of roughly 590.

    Content

    The dataset was created by the Criteo AI Lab. It consists of 13M rows, each one representing a user with 12 features, a treatment indicator, and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser website during the test period (2 weeks). The global treatment ratio is 84.6%. It is usual that advertisers keep only a small control population, as it costs them in potential revenue.

    Following is a detailed description of the features:

    • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
    • treatment: treatment group (1 = treated, 0 = control)
    • conversion: whether a conversion occurred for this user (binary, label)
    • visit: whether a visit occurred for this user (binary, label)
    • exposure: treatment effect, whether the user has been effectively exposed (binary)
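
    As a minimal illustration of how these columns fit together, the sketch below fits one response model per arm (a simple two-model, or T-learner, approach) and scores per-user uplift on the visit label; the file name criteo-uplift.csv and the modeling choices are assumptions made for the example, not part of the dataset itself.

    ```python
    # Sketch: two-model (T-learner) uplift estimate on the Criteo uplift data.
    # Assumes the dataset has been downloaded as "criteo-uplift.csv" (name illustrative).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("criteo-uplift.csv")
    features = [f"f{i}" for i in range(12)]  # f0 ... f11

    treated = df[df["treatment"] == 1]
    control = df[df["treatment"] == 0]

    # One response model per arm, predicting the binary "visit" label.
    m_t = LogisticRegression(max_iter=1000).fit(treated[features], treated["visit"])
    m_c = LogisticRegression(max_iter=1000).fit(control[features], control["visit"])

    # Predicted uplift = P(visit | treated) - P(visit | control) for each user.
    df["uplift"] = (m_t.predict_proba(df[features])[:, 1]
                    - m_c.predict_proba(df[features])[:, 1])

    # Sanity check: the randomized assignment also gives a difference-in-means ATE.
    print("ATE on visits:", treated["visit"].mean() - control["visit"].mean())
    ```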

    Acknowledgement

    The data was provided for the paper "A Large Scale Benchmark for Uplift Modeling":

    https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf

    • Eustache Diemert CAIL e.diemert@criteo.com
    • Artem Betlei CAIL & Université Grenoble Alpes a.betlei@criteo.com
    • Christophe Renaudin CAIL c.renaudin@criteo.com
    • Massih-Reza Amini Université Grenoble Alpes massih-reza.amini@imag.fr

    For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.

    Inspiration

    We can foresee related usages such as but not limited to:

    • Uplift modeling
    • Interactions between features and treatment
    • Heterogeneity of treatment


  4. Under-sampled dataset.

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    + more versions
    Cite
    Seongil Han; Haemin Jung (2024). Under-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t003
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
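
    The abstract's core recipe (balance the classes by oversampling, fit a tree-based model, and report AUC, MCC, and F1) can be sketched generically as below; this uses SMOTE from imbalanced-learn and scikit-learn's gradient boosting as stand-ins and is not the authors' NATE implementation.

    ```python
    # Generic oversampling + gradient boosting pipeline with the metrics named above.
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score, matthews_corrcoef, f1_score
    from sklearn.model_selection import train_test_split

    def oversampled_gbt(X, y, random_state=0):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=random_state)
        # Oversample only the training split so the test set keeps its natural imbalance.
        X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)

        clf = GradientBoostingClassifier(random_state=random_state).fit(X_bal, y_bal)
        prob = clf.predict_proba(X_te)[:, 1]
        pred = clf.predict(X_te)
        return {"AUC": roc_auc_score(y_te, prob),
                "MCC": matthews_corrcoef(y_te, pred),
                "F1": f1_score(y_te, pred)}
    ```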

  5. Comparison of the performance evaluation of Random Forest results for binary...

    • figshare.com
    xls
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Comparison of the performance evaluation of Random Forest results for binary classification on both balanced and unbalanced data set. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t006
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the performance evaluation of Random Forest results for binary classification on both balanced and unbalanced data set.

  6. Research on Financial Distress Prediction of Listed Companies Based on...

    • scidb.cn
    Updated Feb 28, 2024
    Cite
    邢凯; 盛利琴; 张盼; 李珊 (2024). Research on Financial Distress Prediction of Listed Companies Based on Unbalanced Data Processing and Multivariable Screening Methods [Dataset]. http://doi.org/10.57760/sciencedb.j00214.00026
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Science Data Bank
    Authors
    邢凯; 盛利琴; 张盼; 李珊
    License

    https://api.github.com/licenses/agpl-3.0

    Description

    In the context of domestic supply-side structural reform, the market environment is complex and ever-changing, and corporate debt defaults occur frequently, so it is necessary to establish a timely and effective financial distress warning model. Most existing distress prediction models have not effectively solved problems such as imbalanced datasets, unstable selection of key prediction indicators, and randomness in sample matching, and are not suitable for the current complex and changing market conditions in China. Therefore, this article uses the Bootstrap resampling method to construct 1000 research samples and uses LASSO (least absolute shrinkage and selection operator) variable selection to screen key predictive factors and construct a logit model for prediction 3 years ahead. In the prediction stage, the samples are randomly split and predicted 1000 times to reduce random errors. The results indicate that the logit distress prediction model constructed by combining the Bootstrap sample construction method with LASSO has stronger predictive ability than the traditional "similar industry asset size" matching method. In addition, the embedded Bootstrap-LASSO-logit model has better predictive performance than the mainstream O-Score and Z-China Score models, with an accuracy increase of 10%, and is more suitable for China's time-varying market. The model constructed in this article can help corporate stakeholders better identify financial difficulties and make timely adjustments to reduce corporate bond default rates or avoid defaults.
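
    A schematic version of the described procedure (bootstrap resampling, LASSO screening of predictors, then a plain logit model on the stable predictors) might look like the sketch below; the selection-frequency threshold and penalty strength are illustrative assumptions, not the authors' settings.

    ```python
    # Schematic Bootstrap + LASSO + logit pipeline (illustrative, not the authors' code).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def bootstrap_lasso_logit(X, y, n_boot=1000, keep_freq=0.5, C=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        selected = np.zeros(p)

        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
            lasso = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            lasso.fit(X[idx], y[idx])
            selected += (lasso.coef_.ravel() != 0)

        # Keep predictors chosen in at least `keep_freq` of the bootstrap fits.
        keep = np.flatnonzero(selected / n_boot >= keep_freq)

        # Final logit model for distress prediction on the stable predictors.
        final = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
        return keep, final
    ```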

  7. Sample size requirements with unbalanced subgroups in latent growth models

    • ssh.datastations.nl
    • datacatalogue.cessda.eu
    pdf, txt, zip
    Updated May 8, 2014
    Cite
    M. A. J. Zondervan-Zwijnenburg; M. A. J. Zondervan-Zwijnenburg (2014). Sample size requirements with unbalanced subgroups in latent growth models [Dataset]. http://doi.org/10.17026/DANS-ZD4-QMCS
    Available download formats: zip (17837), txt (1384), txt (864), txt (1144), txt (1232), pdf (1059165), txt (11638), txt (1604), txt (1627), txt (1372)
    Dataset updated
    May 8, 2014
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    M. A. J. Zondervan-Zwijnenburg; M. A. J. Zondervan-Zwijnenburg
    License

    https://doi.org/10.17026/fp39-0x58

    Description

    Syntax to replicate the simulation study that is described by the following abstract: In the social and behavioral sciences, a general interest exists in the comparison of development between groups, especially when one of the groups is exceptional and abnormal development is expected. Multiple group latent growth models enable these comparisons. However, the combination of a smaller subgroup with a larger reference group has been shown to cause issues with power and Type I errors. The current study explores the limits of the subsample sizes in latent growth modeling (LGM) that can and cannot be analyzed with Maximum Likelihood and Bayesian estimation, where Bayesian estimation was examined not only with uninformed, but also with informed priors. The results show that Bayesian estimation resolves computational issues that occur with ML estimation, and that the addition of prior information can be the key to achieving sufficient power to detect a small growth difference between groups. Prior information has to be acquired, especially with respect to the exceptional group, to promote statistical power.

  8. Comparison of the performance evaluation of Random Forest results for...

    • figshare.com
    xls
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Comparison of the performance evaluation of Random Forest results for 3-class classification on both balanced and unbalanced data set. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t007
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the performance evaluation of Random Forest results for 3-class classification on both balanced and unbalanced data set.

  9. GMSC dataset (IR: Imbalance Ratio).

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). GMSC dataset (IR: Imbalance Ratio). [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t001
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.

  10. Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

    • acs.figshare.com
    zip
    Updated Apr 15, 2025
    Cite
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    ACS Publications
    Authors
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and class weights of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
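
    For reference, the external balancing strategies compared above can be sketched with scikit-learn and imbalanced-learn as below; GHOST and the AutoML internals are not reproduced, and the F1-maximizing cutoff shown is only a simple stand-in for the threshold-optimization step.

    ```python
    # Illustrative stand-ins for the compared imbalance-handling strategies.
    import numpy as np
    from imblearn.combine import SMOTETomek
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import precision_recall_curve

    # (b) algorithm-level: class weighting inside the classifier
    rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=0)

    # (c) data-level: SMOTETomek resampling before fitting
    def fit_with_smotetomek(X_tr, y_tr):
        X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)
        return RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

    # (a) threshold optimization: pick the probability cutoff that maximizes F1
    #     along the precision-recall curve (simple stand-in for GHOST / AUPR tuning).
    def best_f1_threshold(y_true, y_prob):
        precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
        f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
        return thresholds[np.argmax(f1[:-1])]
    ```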

  11. Validation of the validity of the ccr index.

    • figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Validation of the validity of the ccr index. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t001
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced data is a problem in which the number of samples in different categories or target value ranges varies greatly. Data imbalance poses great challenges to machine learning and pattern recognition. The performance of machine learning models tends to be biased towards the majority samples in an imbalanced dataset, which further degrades the effectiveness of the model. The imbalanced data problem includes both imbalanced classification and imbalanced regression. Many studies have been developed to address the issue of imbalanced classification data. Nevertheless, the imbalanced regression problem has not been well researched. In order to solve the problem of unbalanced regression data, we define an RNGRU model that can simultaneously learn the regression characteristics and neighbor characteristics of regression samples. To obtain the most comprehensive sample information of the regression samples, the model uses an adversarial approach to determine the proportion between the regression characteristics and neighbor characteristics of the original samples. Based on the regression characteristics of the regression samples, an index, ccr (correlation change rate), is proposed to evaluate the similarity between the generated samples and the original samples. On this basis, an RNGAN model is proposed that again uses an adversarial approach to reduce the similarity between the generated samples and the original samples.

  12. S1 Text -

    • plos.figshare.com
    • figshare.com
    text/x-tex
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). S1 Text - [Dataset]. http://doi.org/10.1371/journal.pone.0320955.s001
    Available download formats: text/x-tex
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material. This file contains Tables S1 to S9. S1 to S3 are Tables 9–11, showing the results for all classes for Model 1. S4 to S6 are Tables 12–14, showing the results for all classes for Model 2. S7 to S9 are Tables 15–17, showing the results for all classes for Model 3. (TEX)

  13. Data from: Transfer Learning with a Graph Attention Network and Weighted...

    • acs.figshare.com
    xlsx
    Updated Dec 16, 2024
    Cite
    Haobo Wang; Wenjia Liu; Jingwen Chen; Shengshe Ji (2024). Transfer Learning with a Graph Attention Network and Weighted Loss Function for Screening of Persistent, Bioaccumulative, Mobile, and Toxic Chemicals [Dataset]. http://doi.org/10.1021/acs.est.4c11085.s002
    Available download formats: xlsx
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    ACS Publications
    Authors
    Haobo Wang; Wenjia Liu; Jingwen Chen; Shengshe Ji
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In silico methods for screening hazardous chemicals are necessary for sound management. Persistent, bioaccumulative, mobile, and toxic (PBMT) chemicals persist in the environment and have high mobility in aquatic environments, posing risks to human and ecological health. However, lack of experimental data for the vast number of chemicals hinders identification of PBMT chemicals. Through an extensive search of measured chemical mobility data, as well as persistent, bioaccumulative, and toxic (PBT) chemical inventories, this study constructed comprehensive data sets on PBMT chemicals. To address the limited volume of the PBMT chemical data set, a transfer learning (TL) framework based on graph attention network (GAT) architecture was developed to construct models for screening PBMT chemicals, designating the PBT chemical inventories as source domains and the PBMT chemical data set as target domains. A weighted loss (LW) function was proposed and proved to mitigate the negative impact of imbalanced data on the model performance. Results indicate the TL-GAT models outperformed GAT models, along with large coverage of applicability domains and interpretability. The constructed models were employed to identify PBMT chemicals from inventories consisting of about 1 × 10⁶ chemicals. The developed TL-GAT framework with the LW function holds broad applicability across diverse tasks, especially those involving small and imbalanced data sets.
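
    The weighted-loss idea, down-weighting the abundant class so that the scarce positives are not drowned out, is commonly implemented as a class-weighted cross-entropy; a generic PyTorch sketch with weights inversely proportional to class frequency is shown below. This is an assumption-laden stand-in, not the paper's exact LW formulation.

    ```python
    # Generic class-weighted cross-entropy for an imbalanced classification loss
    # (not the paper's exact LW function).
    import torch
    import torch.nn as nn

    def weighted_ce_loss(train_labels: torch.Tensor) -> nn.CrossEntropyLoss:
        # Weight each class by the inverse of its frequency in the training labels.
        counts = torch.bincount(train_labels)
        weights = counts.sum() / (len(counts) * counts.float())
        return nn.CrossEntropyLoss(weight=weights)

    # usage: criterion = weighted_ce_loss(train_labels)
    #        loss = criterion(model_outputs, batch_labels)
    ```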

  14. Class distribution for binary classes.

    • plos.figshare.com
    xls
    Updated May 15, 2025
    + more versions
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Class distribution for binary classes. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t004
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.
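
    A minimal way to reproduce the kind of balanced-versus-unbalanced comparison described above is sketched below: the same random forest is trained once on the raw training split and once on a SMOTE-balanced copy, and accuracy plus macro-averaged F1 are compared. The balancing method and split sizes are assumptions for illustration, not the study's protocol.

    ```python
    # Sketch: random forest on unbalanced vs. SMOTE-balanced training data,
    # scored with accuracy and macro F1 (works for binary or multiclass labels).
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split

    def compare_balancing(X, y, random_state=0):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=random_state)
        variants = {
            "unbalanced": (X_tr, y_tr),
            "balanced": SMOTE(random_state=random_state).fit_resample(X_tr, y_tr),
        }
        results = {}
        for name, (Xt, yt) in variants.items():
            pred = RandomForestClassifier(random_state=random_state).fit(Xt, yt).predict(X_te)
            results[name] = {"accuracy": accuracy_score(y_te, pred),
                             "macro_F1": f1_score(y_te, pred, average="macro")}
        return results
    ```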

  15. Class distribution for 3-class classification.

    • plos.figshare.com
    • figshare.com
    xls
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Class distribution for 3-class classification. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t003
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.

  16. Searching space for hyperparameters in Table 7.

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Searching space for hyperparameters in Table 7. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t006
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.

  17. Hyperparameter settings of BERT model.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings of BERT model. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t007
    Available download formats: xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Currently, the related research on text classification tends to focus on applications in fields such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. Subsequently, the proposed algorithm is respectively combined with seven commonly used classifiers (DT, SVM, LR, NB, MLP, RF, and KNN), known for their good performance, to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing the overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry. Acc, macro-F1, and micro-F1 values are respectively 2.29%, 5.55%, and 2.90% higher. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness. In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that this algorithm is practical. The professional domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories) and a high degree of similarity between categories. However, the proposed algorithm can efficiently implement the classification of multiple subcategories of this type of text set, which is a beneficial exploration of the application research of complex Chinese text datasets in specific fields, and provides a useful reference for the vector expression and classification of text datasets with similar content.
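
    The core of the proposed representation, weighting Word2Vec word vectors by (an enhanced) TF-IDF before averaging them into a document vector, can be sketched generically as below using gensim and scikit-learn; the paper's additional part-of-speech and category-discriminability terms are not reproduced, so this is only the baseline idea.

    ```python
    # Generic TF-IDF-weighted average of Word2Vec vectors as a document representation.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_weighted_doc_vectors(tokenized_docs, vector_size=100):
        w2v = Word2Vec(sentences=tokenized_docs, vector_size=vector_size,
                       window=5, min_count=1, workers=1, seed=0)
        tfidf = TfidfVectorizer(analyzer=lambda doc: doc)  # docs are pre-tokenized lists
        weights = tfidf.fit_transform(tokenized_docs)
        vocab = tfidf.vocabulary_

        doc_vecs = np.zeros((len(tokenized_docs), vector_size))
        for i, tokens in enumerate(tokenized_docs):
            total = 0.0
            for tok in tokens:
                if tok in vocab and tok in w2v.wv:
                    w = weights[i, vocab[tok]]
                    doc_vecs[i] += w * w2v.wv[tok]
                    total += w
            if total > 0:
                doc_vecs[i] /= total  # weighted mean of the word vectors
        return doc_vecs
    ```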

  18. ML algorithms used in this study.

    • figshare.com
    xls
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). ML algorithms used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t005
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.

  19. Depression severity level with corresponding PHQ-9 scores

    • plos.figshare.com
    xls
    Updated May 15, 2025
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Depression severity level with corresponding PHQ-9 scores [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t001
    Available download formats: xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression severity level with corresponding PHQ-9 scores

  20. Values of the evaluation measures for the reference model derived from the...

    • plos.figshare.com
    xls
    Updated Apr 10, 2025
    Cite
    Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik (2025). Values of the evaluation measures for the reference model derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class. [Dataset]. http://doi.org/10.1371/journal.pone.0321661.t002
    Available download formats: xls
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Barbara Więckowska; Katarzyna B. Kubiak; Przemysław Guzik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Values of the evaluation measures for the reference model derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class.
