Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Multitask prediction of bioactivities is often faced with challenges relating to the sparsity of data and imbalance between different labels. We propose class-conditional (Mondrian) conformal predictors using underlying Macau models as a novel approach for large-scale bioactivity prediction. This approach handles both high degrees of missing data and label imbalance while still producing high-quality predictive models. When applied to ten assay end points from PubChem, the approach generated valid models with an efficiency of 74.0–80.1% at the 80% confidence level, with similar performance for both the minority and majority class. Also, when deleting progressively larger portions of the available data (0–80%), the performance of the models remained robust, with only minor deterioration (a reduction in efficiency of between 5 and 10%). Compared to using Macau without conformal prediction, the method presented here significantly improves performance on imbalanced data sets.
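As a rough illustration of the class-conditional (Mondrian) construction described above, here is a minimal sketch for a binary task. The data, scores, and function name are invented for illustration; this is not the paper's Macau-based implementation.

```python
def mondrian_predict_sets(cal_scores, cal_labels, test_scores, alpha=0.2):
    """Class-conditional (Mondrian) conformal prediction for a binary task.

    `cal_scores` are calibration-set probabilities for class 1. Calibration
    nonconformity scores are grouped per class, so validity holds for each
    class separately, which is what protects the minority class.
    """
    noncf = {
        0: [s for s, y in zip(cal_scores, cal_labels) if y == 0],
        1: [1 - s for s, y in zip(cal_scores, cal_labels) if y == 1],
    }
    sets = []
    for s in test_scores:
        pred = set()
        for c, a in ((0, s), (1, 1 - s)):
            # p-value: fraction of same-class calibration nonconformities
            # at least as large as the test nonconformity
            p = (sum(v >= a for v in noncf[c]) + 1) / (len(noncf[c]) + 1)
            if p > alpha:
                pred.add(c)
        sets.append(pred)
    return sets
```

At alpha = 0.2 this corresponds to the 80% confidence level reported above; efficiency would then be the fraction of prediction sets containing exactly one class.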
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance evaluation of Random Forest results for 5-class classification on both balanced and unbalanced data sets.
https://creativecommons.org/publicdomain/zero/1.0/
Uplift modeling is an important yet novel area of research in machine learning which aims to explain and estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a factor of roughly 590.
The dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user with 12 features, a treatment indicator, and 2 binary labels (visits and conversions). A positive label means the user visited/converted on the advertiser's website during the test period (2 weeks). The global treatment ratio is 84.6%; it is usual for advertisers to keep only a small control population, as it costs them potential revenue.
Following is a detailed description of the features:
Data provided for the paper: "A Large Scale Benchmark for Uplift Modeling"
https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf
For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.
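Given the dataset's shape (features, a treatment indicator, and two binary labels), the simplest uplift estimate is the per-segment difference between treated and control conversion rates. A stdlib sketch, with segment keys and rows invented for illustration:

```python
from collections import defaultdict

def uplift_by_segment(rows):
    """rows: (segment, treated, converted) triples -> uplift per segment,
    i.e. treated conversion rate minus control conversion rate."""
    stats = defaultdict(lambda: [0, 0, 0, 0])  # t_conv, t_n, c_conv, c_n
    for seg, treated, converted in rows:
        s = stats[seg]
        if treated:
            s[0] += converted
            s[1] += 1
        else:
            s[2] += converted
            s[3] += 1
    # only segments with both treated and control users have a defined uplift
    return {seg: s[0] / s[1] - s[2] / s[3]
            for seg, s in stats.items() if s[1] and s[3]}
```

Model-based uplift methods (e.g. two-model approaches) generalize this by replacing the segment key with a learned function of the 12 features.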
We can foresee related usages such as, but not limited to:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
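NATE's non-parametric oversampling procedure is not specified in this summary. As a minimal placeholder for the class-balancing step it describes, here is plain random oversampling; SMOTE-style interpolation between minority neighbors would replace the simple duplication below.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until both classes are equally sized.

    A stand-in for the balancing step: real oversamplers synthesize new
    minority points rather than repeating existing ones.
    """
    rng = random.Random(seed)
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

The balanced output would then be passed to a tree-based classifier such as gradient boosting, as in the experiments above.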
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance evaluation of Random Forest results for binary classification on both balanced and unbalanced data sets.
https://api.github.com/licenses/agpl-3.0
In the context of domestic supply-side structural reform, the market environment is complex and ever-changing, and corporate debt defaults occur frequently, so it is necessary to establish a timely and effective financial distress warning model. Most existing distress prediction models have not effectively solved problems such as imbalanced datasets, unstable selection of key prediction indicators, and randomness in sample matching, and they are not suitable for the current complex and changing market conditions in China. Therefore, this article uses the Bootstrap resampling method to construct 1000 research samples and uses LASSO (least absolute shrinkage and selection operator) variable selection to screen key predictive factors, constructing a logit model that predicts distress three years ahead. In the prediction stage, the samples are randomly split and predicted 1000 times to reduce random error. The results indicate that the logit distress prediction model constructed by combining the Bootstrap sample construction method with LASSO has stronger predictive ability than the traditional "similar industry and asset size" matching method. In addition, the embedded Bootstrap-LASSO-logit model has better predictive performance than the mainstream O-Score and ZChina Score models, with an accuracy increase of 10%, and is more suitable for China's time-varying market. The model constructed in this article can help corporate stakeholders better identify financial difficulties and make timely adjustments to reduce corporate bond default rates or avoid corporate defaults.
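The Bootstrap construction step described above can be sketched as follows; the replicate count, seed, and function name are illustrative. Each index set would feed one LASSO-penalized logit fit, and predictors selected in most replicates become the stable key indicators.

```python
import random

def bootstrap_samples(n_obs, n_samples=1000, seed=0):
    """One index set per bootstrap replicate, drawn with replacement.

    Refitting the LASSO-logit on each replicate and keeping variables
    selected in most replicates stabilizes the indicator selection.
    """
    rng = random.Random(seed)
    return [[rng.randrange(n_obs) for _ in range(n_obs)]
            for _ in range(n_samples)]
```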
https://doi.org/10.17026/fp39-0x58
Syntax to replicate the simulation study that is described by the following abstract: In the social and behavioral sciences, a general interest exists in the comparison of development between groups, especially when one of the groups is exceptional and abnormal development is expected. Multiple-group latent growth models enable these comparisons. However, the combination of a smaller subgroup with a larger reference group has been shown to cause issues with power and Type I errors. The current study explores the limits of the subsample sizes in latent growth modeling (LGM) that can and cannot be analyzed with maximum likelihood and Bayesian estimation, where Bayesian estimation was examined not only with uninformative but also with informative priors. The results show that Bayesian estimation resolves computational issues that occur with ML estimation, and that the addition of prior information can be the key to achieving sufficient power to detect a small growth difference between groups. Prior information has to be acquired, especially with respect to the exceptional group, to promote statistical power.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance evaluation of Random Forest results for 3-class classification on both balanced and unbalanced data sets.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375%, 33.33%, and 450% over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33%, 37.25%, and 533.33% over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern for balanced accuracy is that the percentage improvement increases as the class ratio is systematically decreased from 0.5 to 0.1, while for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study could be carried out considering a sizable number of ML methods and AutoML libraries.
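As an illustration of the threshold-optimization family of techniques evaluated above (not the GHOST procedure itself), here is a minimal sketch that picks the decision threshold maximizing F1 on held-out scores; the data is invented.

```python
def best_f1_threshold(scores, labels):
    """Scan candidate thresholds and return the one maximizing F1.

    Ranking metrics (AUC, AUPR) are unchanged by this step, since it only
    moves the cutoff applied to the scores, not the scores themselves.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```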
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imbalanced data is a problem in which the number of samples in different categories or target value ranges varies greatly. Data imbalance poses great challenges to machine learning and pattern recognition: the performance of machine learning models tends to be biased towards the majority of samples in an imbalanced dataset, which further degrades the usefulness of the model. The imbalanced data problem includes both imbalanced classification and imbalanced regression. Many studies have addressed imbalanced classification; the imbalanced regression problem, however, has not been well researched. To solve the problem of imbalanced regression data, we define an RNGRU model that simultaneously learns the regression characteristics and neighbor characteristics of regression samples. To obtain the most comprehensive sample information, the model uses an adversarial approach to determine the proportion between the regression characteristics and neighbor characteristics of the original samples. Based on the regression characteristics of the samples, an index, ccr (correlation change rate), is proposed to evaluate the similarity between generated and original samples. On this basis, an RNGAN model is proposed that reduces the similarity between generated and original samples, again using an adversarial approach.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary material. This file contains Tables S1 to S9. S1 to S3 are Tables 9–11 showing the results for all classes for Model 1. S4 to S6 are Tables 12–14 showing the results for all classes for Model 2. S7 to S9 are Tables 15–17 showing the results for all classes for Model 3. (TEX)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In silico methods for screening hazardous chemicals are necessary for sound management. Persistent, bioaccumulative, mobile, and toxic (PBMT) chemicals persist in the environment and have high mobility in aquatic environments, posing risks to human and ecological health. However, lack of experimental data for the vast number of chemicals hinders identification of PBMT chemicals. Through an extensive search of measured chemical mobility data, as well as persistent, bioaccumulative, and toxic (PBT) chemical inventories, this study constructed comprehensive data sets on PBMT chemicals. To address the limited volume of the PBMT chemical data set, a transfer learning (TL) framework based on graph attention network (GAT) architecture was developed to construct models for screening PBMT chemicals, designating the PBT chemical inventories as source domains and the PBMT chemical data set as target domains. A weighted loss (LW) function was proposed and proved to mitigate the negative impact of imbalanced data on the model performance. Results indicate the TL-GAT models outperformed GAT models, along with large coverage of applicability domains and interpretability. The constructed models were employed to identify PBMT chemicals from inventories consisting of about 1 × 10^6 chemicals. The developed TL-GAT framework with the LW function holds broad applicability across diverse tasks, especially those involving small and imbalanced data sets.
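The paper's exact LW formulation is not given in this summary; the following sketch only illustrates the generic class-weighted loss idea it builds on. The weights, probabilities, and function name are invented for illustration.

```python
import math

def weighted_bce(probs, labels, w_pos=5.0, w_neg=1.0):
    """Class-weighted binary cross-entropy.

    Up-weighting the rare (positive) class makes its errors count more
    in training, mitigating the pull of the majority class.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(probs)
```

In a GAT training loop this would replace the unweighted loss term, with the weight ratio set from the observed class imbalance.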
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed algorithm is then combined with each of seven commonly used classifiers known for their good performance (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of the algorithm are validated by comparing its overall performance, specific-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: Acc, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that the algorithm is practical. The professional-domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories) and a high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify the multiple subcategories of this type of text set. This is a useful exploration of the application of complex Chinese text datasets in specific fields and provides a helpful reference for the vector representation and classification of text datasets with similar content.
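As context for the weighting scheme above, here is a minimal sketch of the plain TF-IDF baseline that the paper's TF-IDF-CRF-POS variant extends; the toy tokenized documents are invented, and the CRF/POS terms are not reproduced here.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over pre-tokenized documents.

    tf is the relative term frequency within a document; idf is
    log(N / df). The paper's variant adds category-discriminability
    and part-of-speech factors on top of weights like these.
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights
```

In the combined representation, each word's embedding would be scaled by its (extended) weight before averaging into a document vector.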
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression severity level with corresponding PHQ-9 scores
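Assuming the dataset uses the standard PHQ-9 severity bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe), the mapping can be sketched as follows; the function name is illustrative.

```python
def phq9_severity(score):
    """Map a PHQ-9 total (0-27) to the standard severity band."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 totals range from 0 to 27")
    for cutoff, label in ((4, "minimal"), (9, "mild"), (14, "moderate"),
                          (19, "moderately severe")):
        if score <= cutoff:
            return label
    return "severe"
```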
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Values of the evaluation measures for the reference model derived from the training and test datasets, across event-class prevalence ranging from 1% to 99%.