100+ datasets found

f
Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
i
Tackling Class Imbalance with Ranking - Dataset - CKAN
rdm.inesctec.pt
Updated Feb 20, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2017). Tackling Class Imbalance with Ranking - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/nis-2017-002
Explore at:
Dataset updated
Feb 20, 2017
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The dataset comes originally from UCI Machine Learning. The multiclass datasets were transformed in binary classification as mentioned in the paper. Ranking methods were applied to improve class imbalance. The datasets are divided in 30 folds so that other class imbalance methods can be compared to the methods in the paper. The code used in the paper is also provided.
f
Over-sampled dataset.
figshare.com
xls
Updated Dec 31, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seongil Han; Haemin Jung (2024). Over-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t004
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
f
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
f
Performance comparison of machine learning models across accuracy, AUC, MCC,...
plos.figshare.com
xls
Updated Dec 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t005
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
n
Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Feb 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.2ngf1vhwk
Dataset updated
Feb 22, 2024
Dataset provided by
Nagoya University
Osaka University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024).Please see README for the details of the datasets.
Credit Card Fraud Detection Dataset
kaggle.com
Updated May 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 15, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ghanshyam Saini
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

About the Dataset:

This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

Content of the Data:

Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

The only features that have not been transformed by PCA are:

Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.

Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

The target variable for this classification task is:

Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

Important Note on Evaluation:

Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).

How to Use This Dataset:

Download the dataset file (likely in CSV format).

Load the data using libraries like Pandas.

Understand the class imbalance: Be aware that fraudulent transactions are rare.

Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).

Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.

Build and train binary classification models to predict the 'Class' variable.

Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.

Acknowledgements and Citation:

This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

When using this dataset in your research or projects, please cite the following works as appropriate:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon.

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE.

Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi).

Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing.

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
f
Data from: Isometric Stratified Ensembles: A Partial and Incremental...
acs.figshare.com
xlsx
Updated Jun 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux (2023). Isometric Stratified Ensembles: A Partial and Incremental Adaptive Applicability Domain and Consensus-Based Classification Strategy for Highly Imbalanced Data Sets with Application to Colloidal Aggregation [Dataset]. http://doi.org/10.1021/acs.jcim.2c00293.s004
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.2c00293.s004
Dataset updated
Jun 7, 2023
Dataset provided by
ACS Publications
Authors
Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize classification provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and level of applicability domain. An internal cross-validation set allows to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions with the aim of better hit ranking and selection. During training, isometric stratified ensembles (ISE) applies a recursive decorrelated variable selection and considers the cardinal ratio of classes to balance training sets and thus avoid bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators and complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page were employed. Statistics obtained using this new strategy show outperforming results compared to former published tools, with and without a classical applicability domain. ISE performance on classifying colloidal aggregators shows from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88, when its highest confidence isometric stratum is retained.
h
resampled_IDS_datasets
huggingface.co
Updated Jul 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Le (2025). resampled_IDS_datasets [Dataset]. http://doi.org/10.57967/hf/4961
Explore at:
Unique identifier
https://doi.org/10.57967/hf/4961
Dataset updated
Jul 17, 2025
Authors
Le
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for resampled_IDS_datasets

Intrusion Detection Systems (IDS) play a crucial role in securing computer networks against malicious activities. However, their efficacy is consistently hindered by the persistent challenge of class imbalance in real-world datasets. While various methods, such as resampling techniques, ensemble methods, cost-sensitive learning, data augmentation, and so on, have individually addressed imbalance classification issues, there exists a notable… See the full description on the dataset page: https://huggingface.co/datasets/Thi-Thu-Huong/resampled_IDS_datasets.
f
The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
Datasets for: A Study on Machine Vision Techniques...
zenodo.org
Updated Jan 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alice Stazio; Alice Stazio; Juan G Victores; Juan G Victores; David Estevez; David Estevez; Carlos Balaguer; Carlos Balaguer (2020). Datasets for: A Study on Machine Vision Techniques... [Dataset]. http://doi.org/10.5281/zenodo.3251898
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3251898
Dataset updated
Jan 21, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alice Stazio; Alice Stazio; Juan G Victores; Juan G Victores; David Estevez; David Estevez; Carlos Balaguer; Carlos Balaguer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for: A Study on Machine Vision Techniques...
Avian point-counts from Rhode Island and Connecticut used to test species...
zenodo.org
datadryad.org
csv, txt
Updated Jun 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valerie Steen; Valerie Steen; Morgan Tingley; Peter Paton; Chris Elphick; Morgan Tingley; Peter Paton; Chris Elphick (2022). Avian point-counts from Rhode Island and Connecticut used to test species distribution models [Dataset]. http://doi.org/10.5061/dryad.8cz8w9gnp
Explore at:
txt, csvAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.8cz8w9gnp
Dataset updated
Jun 3, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Valerie Steen; Valerie Steen; Morgan Tingley; Peter Paton; Chris Elphick; Morgan Tingley; Peter Paton; Chris Elphick
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Rhode Island, Connecticut
Description
Spatial-biases are a common feature of presence-absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non-detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modeling technique. To explore the consequences of spatial bias and class imbalance in presence-absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing, and majority-only thinning (i.e., retaining all samples of the minority class). We created SDMs using two parametric or semi-parametric techniques (generalized linear models and generalized additive models) and two machine-learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision-recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modeling technique, performance metric, and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence < 0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes, but hindered model calibration. Baseline sample prevalence, sample size, modeling approach, and the intended application of SDM output – whether discrimination or calibration – should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis-à-vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
Jab technique prediction with Body Landmark data
kaggle.com
Updated May 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bill OHanlon (2021). Jab technique prediction with Body Landmark data [Dataset]. https://www.kaggle.com/billohanlon/jab-technique-prediction-with-body-landmark-data/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 19, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bill OHanlon
Description
Context

Dataset of body landmark data, extracted from raw video jab strikes.

Content

These five datasets were used to train sklearn classifiers to identify five contributors to jabbing technique in beginners.

The five identified contributing factors to the strike are as follows:

1. Failing to sufficiently cover the chin with the left shoulder when striking (Arm_protection). 2. Leaving the chin exposed on the non jabbing arm side, by failing to cover the chin with lead hand (Chin_protected). 3. Allowing the jabbing arm elbow to come in from the side and not from underneath the jab (Jab_elbows). 4. Overcommitting to the jab by leaning too far forward (Overcommitting). 5. Not protecting the body with the non jabbing arm (Rightelbowin).

This dataset was created using MediaPipe's pose estimation solution, BlazePose: [https://google.github.io/mediapipe/solutions/pose.html]

Each directory contains a .tsv file containing all data extracted using BlazePose (i.e. all rows of 33 * 3 dimensional arrays) and a more refined serialised dataset which has been normalised, checked for class imbalance and unnecessary body landmarks removed. This later datatset also does not contain visibility as any row with low visibility landmarks was removed.
f
Classification result classifiers using TF-IDF with SMOTE.
plos.figshare.com
xls
Updated May 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t007
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification result classifiers using TF-IDF with SMOTE.
f
GMSC dataset (IR: Imbalance Ratio).
plos.figshare.com
xls
Updated Dec 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seongil Han; Haemin Jung (2024). GMSC dataset (IR: Imbalance Ratio). [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t001
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
m
Data from: Mental issues, internet addiction and quality of life predict...
data.mendeley.com
Updated Jul 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andras Matuz (2024). Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis [Dataset]. http://doi.org/10.17632/2yy4j7rgvg.1
Explore at:
Unique identifier
https://doi.org/10.17632/2yy4j7rgvg.1
Dataset updated
Jul 12, 2024
Authors
Andras Matuz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by support vector machine with SMOTE-ENN (AUC = .942; balanced accuracy = .868, sensitivity = .898; specificity = .837). The best predictors of burnout were Beck’s Depression Inventory scores, Athen’s Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status. Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
f
Comparison of classification results obtained through class imbalance...
figshare.com
xls
Updated Jun 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dapeng Xiong; Fen Xiao; Li Liu; Kai Hu; Yanping Tan; Shunmin He; Xieping Gao (2023). Comparison of classification results obtained through class imbalance learning method with the optimal feature subsets by 5-fold cross validation. [Dataset]. http://doi.org/10.1371/journal.pone.0043126.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0043126.t003
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS ONE
Authors
Dapeng Xiong; Fen Xiao; Li Liu; Kai Hu; Yanping Tan; Shunmin He; Xieping Gao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of classification results obtained through class imbalance learning method with the optimal feature subsets by 5-fold cross validation.
f
S1 Data -
plos.figshare.com
zip
Updated Dec 13, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JiaMing Gong; MingGang Dong (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s007
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0311133.s007
Dataset updated
Dec 13, 2024
Dataset provided by
PLOS ONE
Authors
JiaMing Gong; MingGang Dong
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, current works account for issues of class imbalance and concept drift. And only few works have considered these issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) to consider data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balanced strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in the information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method to improve the accuracy of classifying minority class samples into high quality samples and common samples via the density of similar samples. In this manner high quality and common samples are randomly selected for training the classifier. Finally, to solve the issue of concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of the sub-classifier according to the performance on the arrived data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms considering four synthetic and one real-world data streams.
f
Data from: Effective Feature Selection Method for Class-Imbalance Datasets...
acs.figshare.com
xlsx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aurelio Antelo-Collado; Ramón Carrasco-Velar; Nicolás García-Pedrajas; Gonzalo Cerruela-García (2023). Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction [Dataset]. http://doi.org/10.1021/acs.jcim.0c00908.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.0c00908.s002
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Aurelio Antelo-Collado; Ramón Carrasco-Velar; Nicolás García-Pedrajas; Gonzalo Cerruela-García
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure–activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
Exploration of machine learning techniques in predicting multiple sclerosis...
plos.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yijun Zhao; Brian C. Healy; Dalia Rotstein; Charles R. G. Guttmann; Rohit Bakshi; Howard L. Weiner; Carla E. Brodley; Tanuja Chitnis (2023). Exploration of machine learning techniques in predicting multiple sclerosis disease course [Dataset]. http://doi.org/10.1371/journal.pone.0174866
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0174866
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yijun Zhao; Brian C. Healy; Dalia Rotstein; Charles R. G. Guttmann; Rohit Bakshi; Howard L. Weiner; Carla E. Brodley; Tanuja Chitnis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ObjectiveTo explore the value of machine learning methods for predicting multiple sclerosis disease course.Methods1693 CLIMB study patients were classified as increased EDSS≥1.5 (worsening) or not (non-worsening) at up to five years after baseline visit. Support vector machines (SVM) were used to build the classifier, and compared to logistic regression (LR) using demographic, clinical and MRI data obtained at years one and two to predict EDSS at five years follow-up.ResultsBaseline data alone provided little predictive value. Clinical observation for one year improved overall SVM sensitivity to 62% and specificity to 65% in predicting worsening cases. The addition of one year MRI data improved sensitivity to 71% and specificity to 68%. Use of non-uniform misclassification costs in the SVM model, weighting towards increased sensitivity, improved predictions (up to 86%). Sensitivity, specificity, and overall accuracy improved minimally with additional follow-up data. Predictions improved within specific groups defined by baseline EDSS. LR performed more poorly than SVM in most cases. Race, family history of MS, and brain parenchymal fraction, ranked highly as predictors of the non-worsening group. Brain T2 lesion volume ranked highly as predictive of the worsening group.InterpretationSVM incorporating short-term clinical and brain MRI data, class imbalance corrective measures, and classification costs may be a promising means to predict MS disease course, and for selection of patients suitable for more aggressive treatment regimens.

Facebook

Twitter

Click to copy link

Link copied

Cite

Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX

Explore at:

docxAvailable download formats

Unique identifier

https://doi.org/10.3389/fninf.2021.715421.s002

Dataset updated

Jun 1, 2023

Dataset provided by

Frontiers

Authors

Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.

Clear search

Close search

Google apps

Main menu

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

Tackling Class Imbalance with Ranking - Dataset - CKAN

Over-sampled dataset.

Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

Performance comparison of machine learning models across accuracy, AUC, MCC,...

Data from: Exploring deep learning techniques for wild animal behaviour...

Credit Card Fraud Detection Dataset

Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

Data from: Isometric Stratified Ensembles: A Partial and Incremental...

resampled_IDS_datasets

The definition of a confusion matrix.

Datasets for: A Study on Machine Vision Techniques...

Avian point-counts from Rhode Island and Connecticut used to test species...

Jab technique prediction with Body Landmark data

Context

Content

Classification result classifiers using TF-IDF with SMOTE.

GMSC dataset (IR: Imbalance Ratio).

Data from: Mental issues, internet addiction and quality of life predict...

Comparison of classification results obtained through class imbalance...

S1 Data -

Data from: Effective Feature Selection Method for Class-Imbalance Datasets...

Exploration of machine learning techniques in predicting multiple sclerosis...

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX