Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Multitask prediction of bioactivities is often faced with challenges relating to the sparsity of data and imbalance between different labels. We propose class-conditional (Mondrian) conformal predictors using underlying Macau models as a novel approach for large-scale bioactivity prediction. This approach handles both high degrees of missing data and label imbalance while still producing high-quality predictive models. When applied to ten assay end points from PubChem, the approach generated valid models with an efficiency of 74.0–80.1% at the 80% confidence level, with similar performance for both the minority and majority class. Also, when deleting progressively larger portions of the available data (0–80%), the performance of the models remained robust, with only minor deterioration (a reduction in efficiency of between 5 and 10%). Compared to using Macau without conformal prediction, the method presented here significantly improves performance on imbalanced data sets.
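As a rough illustration of the class-conditional (Mondrian) construction described above, here is a minimal sketch for a binary task. The data, scores, and function name are invented for illustration; this is not the paper's Macau-based implementation.

```python
def mondrian_predict_sets(cal_scores, cal_labels, test_scores, alpha=0.2):
    """Class-conditional (Mondrian) conformal prediction for a binary task.

    `cal_scores` are calibration-set probabilities for class 1. Calibration
    nonconformity scores are grouped per class, so validity holds for each
    class separately, which is what protects the minority class.
    """
    noncf = {
        0: [s for s, y in zip(cal_scores, cal_labels) if y == 0],
        1: [1 - s for s, y in zip(cal_scores, cal_labels) if y == 1],
    }
    sets = []
    for s in test_scores:
        pred = set()
        for c, a in ((0, s), (1, 1 - s)):
            # p-value: fraction of same-class calibration nonconformities
            # at least as large as the test nonconformity
            p = (sum(v >= a for v in noncf[c]) + 1) / (len(noncf[c]) + 1)
            if p > alpha:
                pred.add(c)
        sets.append(pred)
    return sets
```

At alpha = 0.2 this corresponds to the 80% confidence level reported above; efficiency would then be the fraction of prediction sets containing exactly one class.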
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance evaluation of Random Forest results for 5-class classification on both balanced and unbalanced data sets.
https://creativecommons.org/publicdomain/zero/1.0/
Uplift modeling is an important yet novel area of research in machine learning which aims to explain and estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a factor of roughly 590.
The dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user with 12 features, a treatment indicator, and 2 binary labels (visits and conversions). A positive label means the user visited/converted on the advertiser's website during the test period (2 weeks). The global treatment ratio is 84.6%; it is usual for advertisers to keep only a small control population, as it costs them potential revenue.
Following is a detailed description of the features:
Data provided for the paper: "A Large Scale Benchmark for Uplift Modeling"
https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf
For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.
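Given the dataset's shape (features, a treatment indicator, and two binary labels), the simplest uplift estimate is the per-segment difference between treated and control conversion rates. A stdlib sketch, with segment keys and rows invented for illustration:

```python
from collections import defaultdict

def uplift_by_segment(rows):
    """rows: (segment, treated, converted) triples -> uplift per segment,
    i.e. treated conversion rate minus control conversion rate."""
    stats = defaultdict(lambda: [0, 0, 0, 0])  # t_conv, t_n, c_conv, c_n
    for seg, treated, converted in rows:
        s = stats[seg]
        if treated:
            s[0] += converted
            s[1] += 1
        else:
            s[2] += converted
            s[3] += 1
    # only segments with both treated and control users have a defined uplift
    return {seg: s[0] / s[1] - s[2] / s[3]
            for seg, s in stats.items() if s[1] and s[3]}
```

Model-based uplift methods (e.g. two-model approaches) generalize this by replacing the segment key with a learned function of the 12 features.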
We can foresee related usages such as, but not limited to:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
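NATE's non-parametric oversampling procedure is not specified in this summary. As a minimal placeholder for the class-balancing step it describes, here is plain random oversampling; SMOTE-style interpolation between minority neighbors would replace the simple duplication below.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until both classes are equally sized.

    A stand-in for the balancing step: real oversamplers synthesize new
    minority points rather than repeating existing ones.
    """
    rng = random.Random(seed)
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

The balanced output would then be passed to a tree-based classifier such as gradient boosting, as in the experiments above.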
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance evaluation of Random Forest results for binary classification on both balanced and unbalanced data sets.
https://api.github.com/licenses/agpl-3.0
In the context of domestic supply-side structural reform, the market environment is complex and ever-changing, and corporate debt defaults occur frequently, so it is necessary to establish a timely and effective financial distress warning model. Most existing distress prediction models have not effectively solved problems such as imbalanced datasets, unstable selection of key prediction indicators, and randomness in sample matching, and they are not suitable for the current complex and changing market conditions in China. Therefore, this article uses the Bootstrap resampling method to construct 1000 research samples and uses LASSO (least absolute shrinkage and selection operator) variable selection to screen key predictive factors, constructing a logit model that predicts distress three years ahead. In the prediction stage, the samples are randomly split and predicted 1000 times to reduce random error. The results indicate that the logit distress prediction model constructed by combining the Bootstrap sample construction method with LASSO has stronger predictive ability than the traditional "similar industry and asset size" matching method. In addition, the embedded Bootstrap-LASSO-logit model has better predictive performance than the mainstream O-Score and ZChina Score models, with an accuracy increase of 10%, and is more suitable for China's time-varying market. The model constructed in this article can help corporate stakeholders better identify financial difficulties and make timely adjustments to reduce corporate bond default rates or avoid corporate defaults.
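The Bootstrap construction step described above can be sketched as follows; the replicate count, seed, and function name are illustrative. Each index set would feed one LASSO-penalized logit fit, and predictors selected in most replicates become the stable key indicators.

```python
import random

def bootstrap_samples(n_obs, n_samples=1000, seed=0):
    """One index set per bootstrap replicate, drawn with replacement.

    Refitting the LASSO-logit on each replicate and keeping variables
    selected in most replicates stabilizes the indicator selection.
    """
    rng = random.Random(seed)
    return [[rng.randrange(n_obs) for _ in range(n_obs)]
            for _ in range(n_samples)]
```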
https://doi.org/10.17026/fp39-0x58
Syntax to replicate the simulation study that is described by the following abstract: In the social and behavioral sciences, a general interest exists in the comparison of development between groups, especially when one of the groups is exceptional and abnormal development is expected. Multiple-group latent growth models enable these comparisons. However, the combination of a smaller subgroup with a larger reference group has been shown to cause issues with power and Type I errors. The current study explores the limits of the subsample sizes in latent growth modeling (LGM) that can and cannot be analyzed with maximum likelihood and Bayesian estimation, where Bayesian estimation was examined not only with uninformative but also with informative priors. The results show that Bayesian estimation resolves computational issues that occur with ML estimation, and that the addition of prior information can be the key to achieving sufficient power to detect a small growth difference between groups. Prior information has to be acquired, especially with respect to the exceptional group, to promote statistical power.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the performance evaluation of Random Forest results for 3-class classification on both balanced and unbalanced data sets.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375%, 33.33%, and 450% over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33%, 37.25%, and 533.33% over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern for balanced accuracy is that the percentage improvement increases as the class ratio is systematically decreased from 0.5 to 0.1, while for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study could be carried out considering a sizable number of ML methods and AutoML libraries.
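As an illustration of the threshold-optimization family of techniques evaluated above (not the GHOST procedure itself), here is a minimal sketch that picks the decision threshold maximizing F1 on held-out scores; the data is invented.

```python
def best_f1_threshold(scores, labels):
    """Scan candidate thresholds and return the one maximizing F1.

    Ranking metrics (AUC, AUPR) are unchanged by this step, since it only
    moves the cutoff applied to the scores, not the scores themselves.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```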
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imbalanced data is a problem in which the number of samples in different categories or target value ranges varies greatly. Data imbalance poses great challenges to machine learning and pattern recognition: the performance of machine learning models tends to be biased towards the majority of samples in an imbalanced dataset, which further degrades the usefulness of the model. The imbalanced data problem includes both imbalanced classification and imbalanced regression. Many studies have addressed imbalanced classification; the imbalanced regression problem, however, has not been well researched. To solve the problem of imbalanced regression data, we define an RNGRU model that simultaneously learns the regression characteristics and neighbor characteristics of regression samples. To obtain the most comprehensive sample information, the model uses an adversarial approach to determine the proportion between the regression characteristics and neighbor characteristics of the original samples. Based on the regression characteristics of the samples, an index, ccr (correlation change rate), is proposed to evaluate the similarity between generated and original samples. On this basis, an RNGAN model is proposed that reduces the similarity between generated and original samples, again using an adversarial approach.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary material. This file contains Tables S1 to S9. S1 to S3 are Tables 9–11 showing the results for all classes for Model 1. S4 to S6 are Tables 12–14 showing the results for all classes for Model 2. S7 to S9 are Tables 15–17 showing the results for all classes for Model 3. (TEX)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In silico methods for screening hazardous chemicals are necessary for sound management. Persistent, bioaccumulative, mobile, and toxic (PBMT) chemicals persist in the environment and have high mobility in aquatic environments, posing risks to human and ecological health. However, lack of experimental data for the vast number of chemicals hinders identification of PBMT chemicals. Through an extensive search of measured chemical mobility data, as well as persistent, bioaccumulative, and toxic (PBT) chemical inventories, this study constructed comprehensive data sets on PBMT chemicals. To address the limited volume of the PBMT chemical data set, a transfer learning (TL) framework based on graph attention network (GAT) architecture was developed to construct models for screening PBMT chemicals, designating the PBT chemical inventories as source domains and the PBMT chemical data set as target domains. A weighted loss (LW) function was proposed and proved to mitigate the negative impact of imbalanced data on the model performance. Results indicate the TL-GAT models outperformed GAT models, along with large coverage of applicability domains and interpretability. The constructed models were employed to identify PBMT chemicals from inventories consisting of about 1 × 10^6 chemicals. The developed TL-GAT framework with the LW function holds broad applicability across diverse tasks, especially those involving small and imbalanced data sets.
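The paper's exact LW formulation is not given in this summary; the following sketch only illustrates the generic class-weighted loss idea it builds on. The weights, probabilities, and function name are invented for illustration.

```python
import math

def weighted_bce(probs, labels, w_pos=5.0, w_neg=1.0):
    """Class-weighted binary cross-entropy.

    Up-weighting the rare (positive) class makes its errors count more
    in training, mitigating the pull of the majority class.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(probs)
```

In a GAT training loop this would replace the unweighted loss term, with the weight ratio set from the observed class imbalance.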
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed algorithm is then combined with each of seven commonly used classifiers known for their good performance (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of the algorithm are validated by comparing its overall performance, specific-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: Acc, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that the algorithm is practical. The professional-domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories) and a high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify the multiple subcategories of this type of text set. This is a useful exploration of the application of complex Chinese text datasets in specific fields and provides a helpful reference for the vector representation and classification of text datasets with similar content.
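As context for the weighting scheme above, here is a minimal sketch of the plain TF-IDF baseline that the paper's TF-IDF-CRF-POS variant extends; the toy tokenized documents are invented, and the CRF/POS terms are not reproduced here.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over pre-tokenized documents.

    tf is the relative term frequency within a document; idf is
    log(N / df). The paper's variant adds category-discriminability
    and part-of-speech factors on top of weights like these.
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights
```

In the combined representation, each word's embedding would be scaled by its (extended) weight before averaging into a document vector.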
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression severity level with corresponding PHQ-9 scores
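Assuming the dataset uses the standard PHQ-9 severity bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe), the mapping can be sketched as follows; the function name is illustrative.

```python
def phq9_severity(score):
    """Map a PHQ-9 total (0-27) to the standard severity band."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 totals range from 0 to 27")
    for cutoff, label in ((4, "minimal"), (9, "mild"), (14, "moderate"),
                          (19, "moderately severe")):
        if score <= cutoff:
            return label
    return "severe"
```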
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Values of the evaluation measures for the reference model derived from the training and test datasets, across event-class prevalence ranging from 1% to 99%.