100+ datasets found

Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed this shortcoming in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight of machine learning methods, and (c) data balancing using SMOTETomek. We generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern for balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML tools; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external nor the internal techniques outperform the others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
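The metrics named in this description can be illustrated with a small sketch. It computes F1 score, MCC, and balanced accuracy from confusion counts and then picks a decision threshold by a plain grid search over F1. The grid search is an assumption made for illustration; it is not the GHOST procedure or the AUPR-based method used in the study, and the toy labels and scores are invented. Because AUC and AUPR depend only on how the scores rank the samples, moving the threshold leaves them unchanged, which is consistent with finding (i).

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn); label 1 is the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def predict(scores, threshold):
    """Turn scores into hard labels: positive when score >= threshold."""
    return [int(s >= threshold) for s in scores]


def best_f1_threshold(y_true, scores, thresholds):
    """Grid search (illustrative only): the threshold with the highest F1."""
    return max(
        thresholds,
        key=lambda t: f1_score(*confusion_counts(y_true, predict(scores, t))),
    )


# Toy data (hypothetical): 2 positives among 10 samples, class ratio 0.2.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.35, 0.20, 0.10, 0.10, 0.05, 0.05, 0.02, 0.01]

default_counts = confusion_counts(y_true, predict(scores, 0.5))
best_t = best_f1_threshold(y_true, scores, [0.1, 0.2, 0.3, 0.4, 0.5])
tuned_counts = confusion_counts(y_true, predict(scores, best_t))

print("default 0.5:", f1_score(*default_counts), balanced_accuracy(*default_counts))
print("tuned", best_t, ":", f1_score(*tuned_counts), mcc(*tuned_counts),
      balanced_accuracy(*tuned_counts))
```

At the default threshold of 0.5 every sample in this toy set is predicted negative, so plain accuracy is 0.8 while F1 is 0 and balanced accuracy is 0.5. This is the kind of gap these metrics are meant to expose.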
Imbalanced Data
ieee-dataport.org
Updated Aug 23, 2023
Cite
Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
Dataset updated
Aug 23, 2023
Authors
Blessa Binolin M
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification learning on non-stationary data may face dynamic changes from time to time. The major problems are class imbalance and the high cost of labeling instances despite drifts. Imbalance arises when the minority class has fewer samples than the majority class. Imbalanced data results in the misclassification of data points.
Imbalanced dataset for benchmarking
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Jan 24, 2020
Cite
Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2020). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.5281/zenodo.61452
Unique identifier
https://doi.org/10.5281/zenodo.61452
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
License
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Imbalanced dataset for benchmarking
=======================

The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets, which are more or less balanced. This benchmark was proposed in [1]. The following section presents the main characteristics of this benchmark.

Characteristics
-------------------

|ID |Name |Repository & Target |Ratio |# samples| # features |
|:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
|1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
|2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
|3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
|4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
|5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
|6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
|7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
|8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
|9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
|10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
|11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
|12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
|13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
|14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
|15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
|16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
|17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
|18 |OIL |UCI, target: minority |22:1 |937 |49 |
|19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
|20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
|21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
|22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
|23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
|24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
|25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
|26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
|27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
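
The Ratio column above reads as majority:minority. A minimal sketch of how such a ratio can be computed from a label vector, and how it maps to the fraction of minority samples, is given below. The function names are hypothetical, and the label counts (35 target samples out of 336) are an assumption chosen to be arithmetically consistent with row 1, not values taken from the dataset files.

```python
from collections import Counter


def imbalance_ratio(labels, minority_label):
    """Ratio of non-target (majority) samples to target (minority) samples."""
    n_minority = Counter(labels)[minority_label]
    if n_minority == 0:
        raise ValueError("minority label not present in labels")
    return (len(labels) - n_minority) / n_minority


def minority_fraction(ratio):
    """Convert a majority:minority ratio r:1 into the minority share 1 / (1 + r)."""
    return 1.0 / (1.0 + ratio)


# Hypothetical label vector consistent with row 1 (336 samples, ratio 8.6:1).
labels = ["imU"] * 35 + ["other"] * 301

ratio = imbalance_ratio(labels, "imU")
fraction = minority_fraction(ratio)
print(f"{ratio:.1f}:1, minority share {fraction:.3f}")
```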

References
----------
[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
imbalanced data
ieee-dataport.org
Updated Dec 14, 2022
Cite
ZHI WANG (2022). imbalanced data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data
Dataset updated
Dec 14, 2022
Authors
ZHI WANG
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset file is used for the study of imbalanced data and contains 6 imbalanced datasets
Performance comparison of machine learning models across accuracy, AUC, MCC,...
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t005
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.
Data from: S1 Datasets -
plos.figshare.com
bin
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with the number of SMOTE neighbors set to 5.
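For orientation, the plain SMOTE interpolation step that CRN-SMOTE builds on can be sketched as follows. This is a minimal sketch of basic SMOTE only, written with the standard library; it does not include the cluster-based noise reduction described above. The function name `smote_sketch` and the toy minority points are hypothetical. The default `k=5` mirrors the neighbor count quoted in the description.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical two-feature minority class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (2.0, 2.0)]
synthetic = smote_sketch(minority, n_new=10)
print(len(synthetic), "synthetic samples")
```

Each synthetic point lies on the segment between two existing minority samples, so noisy minority samples get propagated as well. That is the motivation for pairing SMOTE with a noise reduction step.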
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...
plos.figshare.com
xls
Updated Nov 16, 2023
Cite
Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
Unique identifier
https://doi.org/10.1371/journal.pone.0290581.t007
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS ONE
Authors
Alaa Alomari; Hossam Faris; Pedro A. Castillo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.
Unbalanced data sets
ieee-dataport.org
Updated Dec 4, 2022
Cite
Yuchen liu (2022). Unbalanced data sets [Dataset]. https://ieee-dataport.org/documents/unbalanced-data-sets
Dataset updated
Dec 4, 2022
Authors
Yuchen liu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Derived from public unbalanced data sets
Under-sampled dataset.
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Under-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t003
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t003
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Predict students' dropout and academic success
zenodo.org
data.niaid.nih.gov
Updated Mar 14, 2023
Cite
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
Unique identifier
https://doi.org/10.5281/zenodo.5777340
Dataset updated
Mar 14, 2023
Dataset provided by
Zenodo http://zenodo.org/
Authors
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
Description
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal"
Results of machine learning experiments for "Multi-classifier prediction of...
data.ncl.ac.uk
tar
Updated Oct 30, 2019
Cite
Paweł Widera (2019). Results of machine learning experiments for "Multi-classifier prediction of knee osteoarthritis progression from incomplete imbalanced longitudinal data" [Dataset]. http://doi.org/10.25405/data.ncl.10043060
Unique identifier
https://doi.org/10.25405/data.ncl.10043060
Dataset updated
Oct 30, 2019
Dataset provided by
Newcastle University
Authors
Paweł Widera
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The archive file includes results of machine learning experiments performed for the article "Multi-classifier prediction of knee osteoarthritis progression from incomplete imbalanced longitudinal data". The hypothesis of the article is that prediction models trained on historical data will be more effective at identifying fast progressing knee OA patients than conventional inclusion criteria.

For all experiments the first level folder hierarchy indicates the method used. Where parameter tuning is performed, the second level folders indicate algorithm parameters. Each experiment output is stored in an xz-compressed text file in JSON format.

In experiments measuring the learning curves (training-*), each results file describes:
* experiment setup (algorithm, number of subsets, down-sampled class size)
* list of training set sizes
* performance measure statistics for all subsets at each training size (flat list) including min, median and max score, and median deviation from median (mad), given for both test and training set instances

In parameter tuning experiments (prediction-multi-*), each results file contains:
* experiment setup (method / algorithm, number of CV repeats, number of model runs)
* imputer parameters (not important, kept constant in all experiments)
* classifier parameters (for random forest)
* true class for each instance
* class predictions by the median model from each CV-repeat
* class probabilities estimated by the median model from each CV-repeat
* performance measure statistics for each CV-repeat including min, median and max score, and median deviation from median (mad)

In RFE experiments (prediction-multi-rfe-*) the results additionally include:
* scores for all RFE steps for each CV-repeat
* number of times each feature was selected (across all folds and CV-repeats)
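
Reading one of these xz-compressed JSON result files, and recomputing the kind of summary statistics listed above, might look like the sketch below. It uses only the Python standard library. The file name, the `"scores"` key, and the score values are assumptions made for illustration; the real key names are documented by the archive itself. The round trip writes a throwaway file so the example is self-contained.

```python
import json
import lzma
import os
import statistics
import tempfile


def load_result(path):
    """Read an xz-compressed JSON text file into Python objects."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(scores):
    """Min, median, max, and median absolute deviation from the median (mad)."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    return {"min": min(scores), "median": med, "max": max(scores), "mad": mad}


# Round trip with a temporary file; "scores" is a hypothetical key name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json.xz")
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        json.dump({"scores": [0.5, 0.75, 1.0]}, handle)
    summary = summarize(load_result(path)["scores"])

print(summary)
```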
Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated May 30, 2023
Cite
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner (2023). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. http://doi.org/10.3389/fchem.2018.00362.s001
Unique identifier
https://doi.org/10.3389/fchem.2018.00362.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for prediction of the activity as well as the safety profiles of the chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over a non-sampling method to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study has achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
Classification results for: Hellinger Distance Trees for Imbalanced Streams
figshare.com
application/gzip
Updated May 31, 2023
Cite
Robert Lyon (2023). Classification results for: Hellinger Distance Trees for Imbalanced Streams [Dataset]. http://doi.org/10.6084/m9.figshare.1534549.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.1534549.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare http://figshare.com/
Authors
Robert Lyon
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data sets supporting the results reported in the paper: Hellinger Distance Trees for Imbalanced Streams, R. J. Lyon, J.M. Brooke, J.D. Knowles, B.W Stappers, 22nd International Conference on Pattern Recognition (ICPR), p.1969 - 1974, 2014. DOI: 10.1109/ICPR.2014.344

Contained in this distribution are results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper, Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008.

The data sets used for these experiments include:
MAGIC Gamma Telescope Data Set : https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
MiniBooNE particle identification Data Set : https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
Skin Segmentation Data Set : https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
Letter Recognition Data Set : https://archive.ics.uci.edu/ml/datasets/Letter+Recognition
Pen-Based Recognition of Handwritten Digits Data Set : https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
Statlog (Landsat Satellite) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
Statlog (Image Segmentation) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation)

A further data set used is not publicly available at present. However, we are in the process of releasing it for public use. Please get in touch if you'd like to use it.

A readme file accompanies the data describing it in more detail.
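
As background for the tree-based method named in the title, the Hellinger distance used as a split criterion can be sketched as below. This is a minimal sketch assuming the usual form for binary-class trees, the square root of the summed squared differences between the square roots of the per-branch class proportions, without a 1/sqrt(2) normalization. The function name and the example counts are hypothetical, and the code is not taken from the distributed results.

```python
import math


def hellinger_split_distance(pos_counts, neg_counts):
    """Hellinger distance between the positive-class and negative-class
    distributions over the branches of a candidate split.

    pos_counts[j] and neg_counts[j] are the numbers of positive and
    negative examples that fall into branch j.
    """
    n_pos = sum(pos_counts)
    n_neg = sum(neg_counts)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return math.sqrt(sum(
        (math.sqrt(p / n_pos) - math.sqrt(n / n_neg)) ** 2
        for p, n in zip(pos_counts, neg_counts)
    ))


# Hypothetical splits on a 10-vs-1000 imbalanced node.
perfect = hellinger_split_distance([10, 0], [0, 1000])
useless = hellinger_split_distance([5, 5], [500, 500])
print(perfect, useless)
```

Because each class is normalized by its own size, the value does not change when the class ratio changes. This property is commonly given as the reason the criterion suits imbalanced data.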
Safety dataset
data.mendeley.com
Updated Jul 16, 2025
Cite
Kai lai Sun (2025). Safety dataset [Dataset]. http://doi.org/10.17632/m8rwjx67bk.1
Unique identifier
https://doi.org/10.17632/m8rwjx67bk.1
Dataset updated
Jul 16, 2025
Authors
Kai lai Sun
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Three imbalanced safety datasets are established and shared: a 9-year nationwide construction accident dataset in Singapore, an accident and safety management dataset in a major development project in Singapore, and a US truck driver safety climate survey dataset. The paper link: https://doi.org/10.1016/j.knosys.2025.114120.
Appendix: Data Analysis and Machine Learning Experiments
zenodo.org
bin, zip
Updated Apr 24, 2025
Cite
Jan Gerling; Mauricio Aniche (2025). Appendix: Data Analysis and Machine Learning Experiments [Dataset]. http://doi.org/10.5281/zenodo.4267824
Unique identifier
https://doi.org/10.5281/zenodo.4267824
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo http://zenodo.org/
Authors
Jan Gerling; Mauricio Aniche
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The plots and statistics generated for the data analysis are given in this data set.

Furthermore, this data set contains the models, feature sets, scaler, prediction results and visualizations for the machine learning experiments conducted.

Reproduction Experiment

Multiple Commit Thresholds Experiment

Imbalanced Training Experiment
Data for: A hybrid machine learning approach to cerebral stroke prediction...
data.mendeley.com
Updated Nov 11, 2019
Cite
Tianyu Liu (2019). Data for: A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets [Dataset]. http://doi.org/10.17632/x8ygrw87jw.1
Unique identifier
https://doi.org/10.17632/x8ygrw87jw.1
Dataset updated
Nov 11, 2019
Authors
Tianyu Liu
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
basic dataset of stroke prediction
Additional file 1 of Prediction of low Apgar score at five minutes following...
springernature.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Feb 9, 2024
Cite
Clifford Silver Tarimo; Soumitra S. Bhuyan; Yizhen Zhao; Weicun Ren; Akram Mohammed; Quanman Li; Marilyn Gardner; Michael Johnson Mahande; Yuhui Wang; Jian Wu (2024). Additional file 1 of Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania [Dataset]. http://doi.org/10.6084/m9.figshare.19498594.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.19498594.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Clifford Silver Tarimo; Soumitra S. Bhuyan; Yizhen Zhao; Weicun Ren; Akram Mohammed; Quanman Li; Marilyn Gardner; Michael Johnson Mahande; Yuhui Wang; Jian Wu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Tanzania
Description
Additional file 1.
Cerebral Stroke Prediction-Imbalanced Dataset
kaggle.com
Updated Aug 22, 2021
Cite
Shashwat Tiwari (2021). Cerebral Stroke Prediction-Imbalanced Dataset [Dataset]. https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset/discussion
Dataset updated
Aug 22, 2021
Dataset provided by
Kaggle http://kaggle.com/
Authors
Shashwat Tiwari
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
Context

A stroke, also known as a cerebrovascular accident or CVA is when part of the brain loses its blood supply and the part of the body that the blood-deprived brain cells control stops working. This loss of blood supply can be ischemic because of lack of blood flow, or hemorrhagic because of bleeding into brain tissue. A stroke is a medical emergency because strokes can lead to death or permanent disability. There are opportunities to treat ischemic strokes but that treatment needs to be started in the first few hours after the signs of a stroke begin.

Content

The cerebral Stroke dataset consists of 12 features including the target column which is imbalanced.

Acknowledgements

Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), “Data for A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets”, Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1. The dataset is sourced from this Mendeley Data record.
Replication Data for: Less Annotating, More Classifying: Addressing the Data...
search.dataone.org
Updated Nov 8, 2023
Cite
Laurer, Moritz; van Atteveldt, Wouter; Casas, Andreu; Welbers, Kasper (2023). Replication Data for: Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI [Dataset]. http://doi.org/10.7910/DVN/8ACDTT
Unique identifier
https://doi.org/10.7910/DVN/8ACDTT
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Laurer, Moritz; van Atteveldt, Wouter; Casas, Andreu; Welbers, Kasper
Description
Supervised machine learning is an increasingly popular tool for analysing large political text corpora. The main disadvantage of supervised machine learning is the need for thousands of manually annotated training data points. This issue is particularly important in the social sciences where most new research questions require the automation of a new task with new and imbalanced training data. This paper analyses how deep transfer learning can help address this challenge by accumulating ‘prior knowledge’ in algorithms. Pre-training algorithms like BERT creates representations of statistical language patterns (‘language knowledge’), and training on universal tasks like Natural Language Inference (NLI) reduces reliance on task-specific data (‘task knowledge’). We systematically show the benefits of transfer learning on a wide range of eight tasks. Across these eight tasks, BERT-NLI fine-tuned on 100 to 2500 data points performs on average 10.7 to 18.3 percentage points better than classical algorithms without transfer learning. Our study indicates that BERT-NLI trained on 500 data points achieves similar average performance as classical algorithms trained on around 5000 data points. Moreover, we show that transfer learning works particularly well on imbalanced data. We conclude by discussing limitations of transfer learning and by outlining new opportunities for political science research.
Supplementary Information for Owen et al. 2025. What is 'accuracy'?...
zenodo.org
bin
Updated Jul 15, 2025
Cite
Bianca Owen (2025). Supplementary Information for Owen et al. 2025. What is 'accuracy'? Rethinking machine learning classifier performance metrics for highly imbalanced, high variance, zero-inflated species count data [Dataset]. http://doi.org/10.5281/zenodo.15913476
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15913476
Dataset updated
Jul 15, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Bianca Owen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Microsoft Excel file showing the datasets and working for all findings in Owen et al. 2025. What is ‘accuracy’? Rethinking machine learning classifier performance metrics for highly imbalanced, high variance, zero-inflated species count data

Cite

Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001

Data from: Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML

Explore at:

Available download formats: zip

Unique identifier

https://doi.org/10.1021/acs.jcim.5c00023.s001

Dataset updated

Apr 15, 2025

Dataset provided by

ACS Publications

Authors

Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models, and they can be categorized as data-level, algorithm-level, and hybrid methods. To the best of our knowledge, however, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed this shortcoming in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek. We generated 27 data sets covering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, percentage improvements of up to 375, 33.33, and 450 can be achieved over all the data sets for F1 score, MCC, and balanced accuracy, respectively, which are metrics suitable for performance evaluation on imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, percentage improvements of up to 383.33, 37.25, and 533.33 can be achieved over all the data sets for F1 score, MCC, and balanced accuracy, respectively; (iv) the general pattern for balanced accuracy is that the percentage improvement increases as the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, at handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploring multiple data balancing techniques is recommended when classifying imbalanced data sets to achieve optimal performance, as neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
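Finding (i) follows from how the metrics are defined: AUC depends only on the ordering of the predicted scores, while F1 score, MCC, and balanced accuracy depend on where the decision threshold is placed. The sketch below illustrates this on a small made-up example. It is not the authors' code and does not implement GHOST; the toy labels and scores, the function names, and the simple scan over observed scores as candidate thresholds are all assumptions made for illustration.

```python
from math import sqrt

def confusion(y_true, scores, threshold):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def f1_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    No threshold appears here, so threshold tuning cannot change it."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(y_true, scores, metric):
    """Scan each observed score as a candidate threshold (illustrative only)."""
    candidates = sorted(set(scores))
    return max(((t, metric(*confusion(y_true, scores, t))) for t in candidates),
               key=lambda pair: pair[1])

# Hypothetical toy data with class ratio 0.2 (2 positives out of 10 samples).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.35, 0.45, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.12]

if __name__ == "__main__":
    print("AUC (threshold-free):", roc_auc(y_true, scores))
    print("F1 at default 0.5:", f1_score(*confusion(y_true, scores, 0.5)))
    print("Best (threshold, F1):", best_threshold(y_true, scores, f1_score))
```

In this toy case the default threshold of 0.5 predicts no positives at all, so F1 is zero, while a lower threshold recovers both positives. The AUC is computed from the score ranking alone and is therefore identical before and after tuning.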

