100+ datasets found
  1. Data from: GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning

    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker (2023). GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning [Dataset]. http://doi.org/10.1021/acs.jcim.1c00160.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Carmen Esposito; Gregory A. Landrum; Nadine Schneider; Nikolaus Stiefl; Sereina Riniker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
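
    The thresholding idea is straightforward to prototype. Below is a minimal sketch (not the authors' GHOST implementation, whose code accompanies the paper): it scans candidate thresholds on a random forest's out-of-bag probabilities and keeps the one maximizing Cohen's kappa, so no retraining or resampling is needed.

```python
# Minimal sketch of decision-threshold tuning for an imbalanced binary task.
# Not the authors' GHOST code: it scans thresholds on a random forest's
# out-of-bag probabilities and keeps the one maximizing Cohen's kappa.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_probs = rf.oob_decision_function_[:, 1]            # OOB P(class 1)
thresholds = np.arange(0.05, 0.55, 0.05)
kappas = [cohen_kappa_score(y, (oob_probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(kappas))]
print(f"tuned threshold: {best_t:.2f} (default: 0.50)")

# At prediction time, apply the tuned threshold instead of 0.5.
y_pred = (rf.predict_proba(X)[:, 1] >= best_t).astype(int)
```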

  2. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Explore at:
    Available download formats: text/x-tex
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias: traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have also been studied recently. These methods suffer from various problems, such as overfitting, the omission of some information, and long computation times, and they do not apply to all kinds of datasets. To address these issues, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance-handling methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better as the degree of data imbalance increases, gradually outperforming the other methods.
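
    As a concrete example of the baselines mentioned above, a one-class SVM is typically fit on the majority class only, and anything it rejects is predicted as the minority class. A minimal sketch on synthetic data (not the paper's ViLa method):

```python
# One-class SVM baseline: fit on majority-class samples only, then treat
# points the model rejects as minority-class predictions.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_tr[y_tr == 0])  # majority only
y_pred = (ocsvm.predict(X_te) == -1).astype(int)  # -1 (outlier) -> minority
print(classification_report(y_te, y_pred, digits=3))
```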

  3. Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of BILSTM for rare classes for the imbalanced dataset with different reweighting factors.

  4. Real Time Bidding

    • kaggle.com
    zip
    Updated Feb 27, 2017
    Cite
    Ricky (2017). Real Time Bidding [Dataset]. https://www.kaggle.com/zurfer/rtb
    Explore at:
    Available download formats: zip (144371473 bytes)
    Dataset updated
    Feb 27, 2017
    Authors
    Ricky
    Description

    Context

    This is real-world real-time bidding (RTB) data, used to predict whether an advertiser should bid for a marketing slot, e.g. a banner on a webpage. Explanatory variables are things like the browser, the operating system, the time of day the user is online, the marketplaces their identifiers were traded on earlier, etc. The column 'convert' is 1 when the person clicked on the ad, and 0 otherwise.

    Content

    Unfortunately, the data had to be anonymized, so you basically can't do a lot of feature engineering. I just applied PCA and kept the components explaining 99% of the variance. However, I think it's still really interesting data for testing your general algorithms on imbalanced data. ;)

    Inspiration

    Since this is heavily imbalanced data, it doesn't make sense to train for accuracy; instead, try to obtain a good AUC, F1 score, MCC, or recall rate by cross-validating your data. It's interesting to compare different models (logistic regression, decision trees, SVMs, ...) on these metrics and to see the impact your train/test split has on the results.
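
    A minimal sketch of that evaluation loop, cross-validating a couple of models over several imbalance-aware metrics (the file name "rtb.csv" is a placeholder; the actual columns are PCA components plus 'convert'):

```python
# Cross-validated comparison of models on imbalance-aware metrics.
# "rtb.csv" is a placeholder for the dataset's CSV file.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("rtb.csv")
X, y = df.drop(columns="convert"), df["convert"]

scoring = ["roc_auc", "f1", "matthews_corrcoef", "recall"]
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    scores = cross_validate(model, X, y, cv=5, scoring=scoring)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring}
    print(type(model).__name__, summary)
```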

    It might be a good strategy to follow these tactics to combat imbalanced classes.

  5. Imbalanced Cifar-10

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
    Explore at:
    Available download formats: zip (807146485 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Akhil Theerthala
    Description

    This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. CIFAR 10 typically consists of 60,000 32x32 color images in 10 classes, with 5000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows:

    [Image: bar chart of per-class image counts in the imbalanced CIFAR-10 training set]

    The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

    The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

    This dataset is beneficial for:
    - Developing and testing strategies for handling imbalanced datasets.
    - Investigating the effects of class imbalance on model performance.
    - Comparing different machine learning algorithms' performance under class imbalance.

    Usage Information:

    The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder, and you can work with it in Python using popular libraries like NumPy and PyTorch.
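
    A minimal loading sketch under those assumptions (the root path is a placeholder), adding a WeightedRandomSampler as one common way to counter the skew at sampling time:

```python
# Load the imbalanced CIFAR-10 folders with torchvision's ImageFolder and
# counter the skew at sampling time with a WeightedRandomSampler.
from collections import Counter

from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

ds = datasets.ImageFolder("imbalanced-cifar-10/train",   # placeholder path
                          transform=transforms.ToTensor())

counts = Counter(ds.targets)                       # images per class
weights = [1.0 / counts[t] for t in ds.targets]    # rarer class -> higher weight
sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)
loader = DataLoader(ds, batch_size=128, sampler=sampler)
```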

    License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

    Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.

  6. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE's number of neighbors set to 5.
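
    CRN-SMOTE itself is the paper's contribution, but the plain SMOTE step it builds on is readily available in imbalanced-learn. A minimal sketch on synthetic data, with the same neighbor setting as above:

```python
# Plain SMOTE oversampling, the building block of CRN-SMOTE (the cluster-based
# noise reduction step is the paper's own contribution and is not shown here).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```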

  7. Data from: Imbalanced dataset for benchmarking

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Lemaitre, Guillaume; Nogueira, Fernando; Aridas, Christos K.; Oliveira, Dayvid V. R. (2020). Imbalanced dataset for benchmarking [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_61452
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Universite de Bourgogne, Universitat de Girona
    ShoppeAI
    University of Patras
    Universidade Federal de Pernambuco
    Authors
    Lemaitre, Guillaume; Nogueira, Fernando; Aridas, Christos K.; Oliveira, Dayvid V. R.
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Imbalanced dataset for benchmarking

    The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets with varying degrees of class imbalance. This benchmark was proposed in [1]. The following section presents its main characteristics.
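
    This is the same benchmark exposed by imbalanced-learn's fetch_datasets helper; assuming a recent version of the library, it can be pulled directly:

```python
# Fetch part of this Zenodo benchmark through imbalanced-learn.
from imblearn.datasets import fetch_datasets

benchmark = fetch_datasets(filter_data=("ecoli", "abalone_19"))
for name, bunch in benchmark.items():
    X, y = bunch.data, bunch.target    # target is +1 (minority) / -1 (majority)
    print(name, X.shape, "minority samples:", int((y == 1).sum()))
```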

    Characteristics

    | ID | Name | Repository & Target | Ratio | # samples | # features |
    |----|------|---------------------|-------|-----------|------------|
    | 1 | Ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
    | 2 | Optical Digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
    | 3 | SatImage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
    | 4 | Pen Digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
    | 5 | Abalone | UCI, target: 7 | 9.7:1 | 4,177 | 8 |
    | 6 | Sick Euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 25 |
    | 7 | Spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
    | 8 | Car_Eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 6 |
    | 9 | ISOLET | UCI, target: A, B | 12:1 | 7,797 | 617 |
    | 10 | US Crime | UCI, target: >0.65 | 12:1 | 1,994 | 122 |
    | 11 | Yeast_ML8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
    | 12 | Scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
    | 13 | Libras Move | UCI, target: 1 | 14:1 | 360 | 90 |
    | 14 | Thyroid Sick | UCI, target: sick | 15:1 | 3,772 | 28 |
    | 15 | Coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
    | 16 | Arrhythmia | UCI, target: 06 | 17:1 | 452 | 279 |
    | 17 | Solar Flare M0 | UCI, target: M->0 | 19:1 | 1,389 | 10 |
    | 18 | OIL | UCI, target: minority | 22:1 | 937 | 49 |
    | 19 | Car_Eval_4 | UCI, target: vgood | 26:1 | 1,728 | 6 |
    | 20 | Wine Quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
    | 21 | Letter Img | UCI, target: Z | 26:1 | 20,000 | 16 |
    | 22 | Yeast_ME2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
    | 23 | Webpage | LIBSVM, w7a, target: minority | 33:1 | 49,749 | 300 |
    | 24 | Ozone Level | UCI, ozone, data | 34:1 | 2,536 | 72 |
    | 25 | Mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
    | 26 | Protein homo. | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
    | 27 | Abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 8 |

    References

    [1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

    [2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

    [3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

    [4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

  8. Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Aug 28, 2018
    Cite
    Dehnbostel, Frederic O.; Banerjee, Priyanka; Preissner, Robert (2018). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000631456
    Explore at:
    Dataset updated
    Aug 28, 2018
    Authors
    Dehnbostel, Frederic O.; Banerjee, Priyanka; Preissner, Robert
    Description

    The increase in the number of new chemicals synthesized in recent decades has resulted in constant growth in the development and application of computational models for predicting the activity as well as the safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over a non-sampling method for achieving a well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug-Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
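
    A minimal sketch of this kind of pipeline (SMOTE applied only to the training folds via imbalanced-learn's Pipeline, with sensitivity and specificity read off the confusion matrix); synthetic stand-in data, not the DILI set:

```python
# SMOTE + random forest in an imbalanced-learn Pipeline (SMOTE is applied only
# during fit), with sensitivity/specificity from the confusion matrix.
# Synthetic stand-in data; the DILI set itself is not bundled here.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))]).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, pipe.predict(X_te)).ravel()
print(f"sensitivity={tp / (tp + fn):.3f}  specificity={tn / (tn + fp):.3f}")
```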

  9. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem, and a lot of work has been done comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested across a wide variety of datasets, without considering the performance on each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

    Results: Both kinds of resampling procedures showed improved performances with respect to the original dataset. The oversampling procedures were found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. The undersampling approaches were more robust than the oversampling ones across the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.

    Conclusions: The application of machine learning techniques that take into consideration the balance of classes by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
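
    A minimal sketch of such a comparison, reduced to one classifier, two samplers, and balanced accuracy on synthetic data (the study itself sweeps far more combinations):

```python
# Toy version of the comparison: no resampling vs. ADASYN (oversampling)
# vs. random undersampling, scored with cross-validated balanced accuracy.
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, weights=[0.9], random_state=0)
for sampler in (None, ADASYN(random_state=0), RandomUnderSampler(random_state=0)):
    steps = ([sampler] if sampler is not None else []) + [LogisticRegression(max_iter=1000)]
    pipe = make_pipeline(*steps)
    score = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy").mean()
    print(type(sampler).__name__ if sampler is not None else "baseline", round(score, 3))
```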

  10. Data Balancing for Model Training Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 3, 2025
    Cite
    Growth Market Reports (2025). Data Balancing for Model Training Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-balancing-for-model-training-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Oct 3, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Balancing for Model Training Market Outlook



    According to our latest research, the global Data Balancing for Model Training market size in 2024 is valued at USD 1.37 billion, with a robust CAGR of 19.8% expected during the forecast period. By 2033, the market is forecasted to reach USD 6.59 billion. The primary growth factor driving this market is the exponential increase in demand for high-quality, unbiased machine learning models across industries, fueled by the rapid digital transformation and adoption of artificial intelligence.



    One of the most significant growth drivers for the Data Balancing for Model Training market is the surging need for accurate and reliable AI models in critical sectors such as healthcare, finance, and retail. As organizations increasingly leverage AI and machine learning for decision-making, the importance of balanced datasets becomes paramount to ensure model fairness, accuracy, and compliance. Data imbalance, if not addressed, can lead to biased predictions and suboptimal business outcomes, making data balancing solutions essential for organizations aiming to deploy trustworthy and high-performing models. Furthermore, regulatory pressures and ethical considerations are compelling enterprises to adopt advanced data balancing techniques, further accelerating market growth.



    Another key factor propelling the market is the proliferation of big data and the complexity of modern datasets. With the explosion of data sources and the diversity of data types, organizations are facing unprecedented challenges in managing and processing imbalanced datasets. This complexity necessitates the adoption of sophisticated data balancing solutions such as oversampling, undersampling, hybrid methods, and synthetic data generation. These solutions not only enhance model performance but also streamline the data preparation process, enabling faster and more efficient model training cycles. The growing integration of automated machine learning (AutoML) platforms is also contributing to the adoption of data balancing tools, as these platforms increasingly embed balancing techniques to democratize AI development.



    The ongoing digital transformation across industries, coupled with the rise of Industry 4.0, is further boosting the demand for data balancing solutions. Enterprises in manufacturing, IT & telecommunications, and retail are deploying AI-powered applications at scale, which rely heavily on balanced training data to deliver accurate insights and automation. The expanding use of Internet of Things (IoT) devices and connected systems is generating vast volumes of imbalanced data, necessitating robust data balancing frameworks. Additionally, advancements in synthetic data generation are opening new avenues for addressing data scarcity and imbalance, especially in sensitive domains like healthcare where data privacy is a concern.



    From a regional perspective, North America leads the Data Balancing for Model Training market, driven by early adoption of AI technologies, strong presence of tech giants, and significant investments in AI research and development. Europe follows closely, supported by stringent regulatory frameworks and a growing focus on ethical AI. The Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT infrastructure, and increasing adoption of AI in emerging economies such as China and India. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with increasing awareness and investments in AI-driven solutions.





    Solution Type Analysis



    The Solution Type segment of the Data Balancing for Model Training market encompasses Oversampling, Undersampling, Hybrid Methods, Synthetic Data Generation, and Others. Oversampling remains one of the most widely adopted techniques, particularly in scenarios where minority class data is scarce but critical for accurate model predictions. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and its variants are extensively used to generate synthetic samples, thereby improving the representation of minority classes.

  11. Cerebral Stroke Prediction-Imbalanced Dataset

    • kaggle.com
    zip
    Updated Aug 22, 2021
    Cite
    Shashwat Tiwari (2021). Cerebral Stroke Prediction-Imbalanced Dataset [Dataset]. https://www.kaggle.com/shashwatwork/cerebral-stroke-predictionimbalaced-dataset
    Explore at:
    Available download formats: zip (573312 bytes)
    Dataset updated
    Aug 22, 2021
    Authors
    Shashwat Tiwari
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Context

    A stroke, also known as a cerebrovascular accident or CVA, occurs when part of the brain loses its blood supply and the part of the body controlled by the blood-deprived brain cells stops working. This loss of blood supply can be ischemic, because of a lack of blood flow, or hemorrhagic, because of bleeding into brain tissue. A stroke is a medical emergency because strokes can lead to death or permanent disability. There are treatments for ischemic strokes, but they need to be started in the first few hours after the signs of a stroke begin.

    Content

    The Cerebral Stroke dataset consists of 12 features, including the target column, which is imbalanced.
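
    Given the imbalanced target, a quick cost-sensitive baseline can rely on scikit-learn's built-in class weighting instead of resampling. A minimal sketch; the file and column names are assumptions based on the description above:

```python
# Cost-sensitive baseline via class_weight="balanced"; file and column names
# are assumptions based on the dataset description.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cerebral_stroke.csv").dropna()      # placeholder file name
X = pd.get_dummies(df.drop(columns="stroke"))         # "stroke" column assumed
y = df["stroke"]

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
print("ROC-AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```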

    Acknowledgements

    Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), “Data for A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets”, Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1. The dataset is sourced from Mendeley Data.

  12. Is this a good customer?

    • kaggle.com
    zip
    Updated Apr 16, 2020
    Cite
    podsyp (2020). Is this a good customer? [Dataset]. https://www.kaggle.com/podsyp/is-this-a-good-customer
    Explore at:
    Available download formats: zip (19523 bytes)
    Dataset updated
    Apr 16, 2020
    Authors
    podsyp
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

    Content

    Standard accuracy no longer reliably measures performance, which makes model training much trickier. Imbalanced classes appear in many domains, including:
    - Antifraud
    - Antispam
    - ...

    Inspiration

    5 tactics for handling imbalanced classes in machine learning:
    - Up-sample the minority class
    - Down-sample the majority class
    - Change your performance metric
    - Penalize algorithms (cost-sensitive training)
    - Use tree-based algorithms
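
    The first two tactics need nothing more than sklearn.utils.resample; a minimal sketch on toy data:

```python
# Tactic 1: up-sample the minority class with sklearn.utils.resample.
# Down-sampling the majority class (tactic 2) is the mirror image.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})  # toy data
majority, minority = df[df.label == 0], df[df.label == 1]

upsampled = resample(minority, replace=True, n_samples=len(majority),
                     random_state=0)
balanced = pd.concat([majority, upsampled])
print(balanced.label.value_counts())
```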

  13. Predict students' dropout and academic success

    • zenodo.org
    • data-staging.niaid.nih.gov
    Updated Mar 14, 2023
    Cite
    Valentim Realinho; Valentim Realinho; Jorge Machado; Jorge Machado; Luís Baptista; Luís Baptista; Mónica V. Martins; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
    Explore at:
    Dataset updated
    Mar 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valentim Realinho; Valentim Realinho; Jorge Machado; Jorge Machado; Luís Baptista; Luís Baptista; Mónica V. Martins; Mónica V. Martins
    Description

    A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

    The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

    The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

    Funding
    We acknowledge the support of this work by the program "SATDAP - Capacitação da Administração Pública" under grant POCI-05-5762-FSE-000191, Portugal.

  14. Financial Transaction Fraud Detection

    • kaggle.com
    zip
    Updated Aug 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abhi pratap (2025). Financial Transaction Fraud Detection [Dataset]. https://www.kaggle.com/datasets/abhipratapsingh/fraud-detection
    Explore at:
    Available download formats: zip (186385507 bytes)
    Dataset updated
    Aug 20, 2025
    Authors
    Abhi pratap
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a valuable resource for building and evaluating machine learning models to predict fraudulent transactions in an e-commerce environment. With 6.3 million rows, it provides a rich, real-world scenario for data science tasks.

    The data is an excellent case study for several key challenges in machine learning, including:

    • Handling Imbalanced Data: The dataset is highly imbalanced, as legitimate transactions vastly outnumber fraudulent ones. This necessitates the use of specialized techniques like SMOTE or advanced models like XGBoost that can handle class imbalance effectively.

    • Feature Engineering: The raw data provides an opportunity to create new, more powerful features, such as transaction velocity or the ratio of account balances, which can improve model performance.

    • Model Evaluation: Traditional metrics like accuracy are misleading for this type of dataset. The project requires a deeper analysis using metrics such as Precision, Recall, F1-Score, and the Precision-Recall AUC to truly understand the model's effectiveness.
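
    On the evaluation point above, the precision-recall AUC is computed from predicted scores rather than hard labels. A minimal sketch on synthetic data of comparable imbalance:

```python
# Precision-recall AUC (average precision) computed from predicted
# probabilities, alongside F1 from hard labels. Synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
print("PR-AUC:", round(average_precision_score(y_te, probs), 3))
print("F1:    ", round(f1_score(y_te, clf.predict(X_te)), 3))
```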

    Key Features: The dataset includes a variety of anonymized transaction details:

    • amount: The value of the transaction.

    • type: The type of transaction (e.g., TRANSFER, CASH_OUT).

    • oldbalance & newbalance: The balances of the origin and destination accounts before and after the transaction.

    • isFraud: The target variable, a binary flag indicating a fraudulent transaction.

  15. Data Balance Optimization AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Data Balance Optimization AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-balance-optimization-ai-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Balance Optimization AI Market Outlook




    According to our latest research, the global Data Balance Optimization AI market size in 2024 stands at USD 2.18 billion, with a robust compound annual growth rate (CAGR) of 23.7% projected from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 17.3 billion. This substantial growth is driven by the surging demand for AI-powered analytics and increasing adoption of data-intensive applications across industries, establishing Data Balance Optimization AI as a critical enabler for enterprise digital transformation.




    One of the primary growth factors fueling the Data Balance Optimization AI market is the exponential surge in data generation across various sectors. Organizations are increasingly leveraging digital technologies, IoT devices, and cloud platforms, resulting in vast, complex, and often imbalanced datasets. The need for advanced AI solutions that can optimize, balance, and manage these datasets has become paramount to ensure high-quality analytics, accurate machine learning models, and improved business decision-making. Enterprises recognize that imbalanced data can severely skew AI outcomes, leading to biases and reduced operational efficiency. Consequently, the demand for Data Balance Optimization AI tools is accelerating as businesses strive to extract actionable insights from diverse and voluminous data sources.




    Another critical driver is the rapid evolution of AI and machine learning algorithms, which require balanced and high-integrity datasets for optimal performance. As industries such as healthcare, finance, and retail increasingly rely on predictive analytics and automation, the integrity of underlying data becomes a focal point. Data Balance Optimization AI technologies are being integrated into data pipelines to automatically detect and correct imbalances, ensuring that AI models are trained on representative and unbiased data. This not only enhances model accuracy but also helps organizations comply with stringent regulatory requirements related to data fairness and transparency, further reinforcing the market’s upward trajectory.




    The proliferation of cloud computing and the shift toward hybrid IT infrastructures are also significant contributors to market growth. Cloud-based Data Balance Optimization AI solutions offer scalability, flexibility, and cost-effectiveness, making them attractive to both large enterprises and small and medium-sized businesses. These solutions facilitate seamless integration with existing data management systems, enabling real-time optimization and balancing of data across distributed environments. Furthermore, the rise of data-centric business models in sectors such as e-commerce, telecommunications, and manufacturing is amplifying the need for robust data optimization frameworks, propelling further adoption of Data Balance Optimization AI technologies globally.




    From a regional perspective, North America currently dominates the Data Balance Optimization AI market, accounting for the largest share due to its advanced technological infrastructure, high investment in AI research, and the presence of leading technology firms. However, the Asia Pacific region is poised to experience the fastest growth during the forecast period, driven by rapid digitalization, expanding IT ecosystems, and increasing adoption of AI-powered solutions in emerging economies such as China, India, and Southeast Asia. Europe also presents significant opportunities, particularly in regulated industries such as finance and healthcare, where data integrity and compliance are paramount. Collectively, these regional trends underscore the global momentum behind Data Balance Optimization AI adoption.



    Component Analysis




    The Data Balance Optimization AI market by component is segmented into software, hardware, and services, each playing a pivotal role in the overall ecosystem. The software segment commands the largest market share, driven by the continuous evolution of AI algorithms, data preprocessing tools, and machine learning frameworks designed to address data imbalance challenges. Organizations are increasingly investing in advanced software solutions that automate data balancing, cleansing, and augmentation processes, ensuring the reliability of AI-driven analytics. These software platforms often integrate seamlessly with existing data management systems.

  16. Stroke Risk Synthetic 2025

    • kaggle.com
    zip
    Updated Sep 26, 2025
    Cite
    Imaad Mahmood (2025). Stroke Risk Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/stroke-risk-synthetic-2025
    Explore at:
    Available download formats: zip (2288 bytes)
    Dataset updated
    Sep 26, 2025
    Authors
    Imaad Mahmood
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    StrokeRiskSynthetic2025 Dataset

    Overview

    The StrokeRiskSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on predicting stroke risk. Created in September 2025, it simulates realistic patient profiles based on established stroke risk factors, drawing inspiration from medical literature and existing healthcare datasets. With 1,000 records and a realistically imbalanced target (approximately 5% stroke cases) that reflects real-world stroke prevalence, it is well suited for binary classification, feature engineering, and practicing imbalanced-data handling in educational and research settings.

    Data Description

    • Rows: 1,000
    • Columns: 12
    • Target Variable: stroke (binary: 0 = No stroke, 1 = Stroke)
    • File Format: CSV
    • Size: Approximately 60 KB

    Columns

    | Column Name | Type | Description |
    |-------------|------|-------------|
    | id | Integer | Unique identifier for each record (1 to 1,000). |
    | gender | Categorical | Patient gender: Male, Female, Other. |
    | age | Integer | Patient age in years (0 to 100, skewed toward older adults). |
    | hypertension | Binary | Hypertension status: 0 = No, 1 = Yes (~30% prevalence). |
    | heart_disease | Binary | Heart disease status: 0 = No, 1 = Yes (~5-10% prevalence). |
    | ever_married | Categorical | Marital status: Yes, No (correlated with age). |
    | work_type | Categorical | Employment type: children, Govt_job, Never_worked, Private, Self-employed. |
    | Residence_type | Categorical | Residence: Urban, Rural (balanced distribution). |
    | avg_glucose_level | Float | Average blood glucose level in mg/dL (50 to 300, mean ~100). |
    | bmi | Float | Body Mass Index (10 to 60, mean ~25). |
    | smoking_status | Categorical | Smoking history: formerly smoked, never smoked, smokes, Unknown. |
    | stroke | Binary | Target variable: 0 = No stroke, 1 = Stroke (~5% positive cases). |

    Key Features

    • Realistic Distributions: Reflects real-world stroke risk factors (e.g., age, hypertension, glucose levels) based on 2025 medical data, with ~5% stroke prevalence to mimic imbalanced healthcare datasets.
    • Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education.
    • Versatility: Suitable for binary classification, feature importance analysis (e.g., SHAP), data preprocessing (e.g., imputation, scaling), and handling imbalanced data (e.g., SMOTE).
    • No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for preprocessing practice.

    Use Cases

    • Machine Learning: Train models like Logistic Regression, Random Forest, or XGBoost for stroke prediction.
    • Data Analysis: Explore correlations between risk factors (e.g., age, hypertension) and stroke outcomes.
    • Educational Projects: Ideal for learning EDA, feature engineering, and model deployment (e.g., Flask apps).
    • Healthcare Research: Simulate clinical scenarios for studying stroke risk without real patient data.

    Source and Inspiration

    This dataset is inspired by stroke risk factors outlined in medical literature (e.g., CDC, WHO) and existing datasets like the Kaggle Stroke Prediction Dataset (2021) and Mendeley’s Synthetic Stroke Prediction Dataset (2025). It incorporates 2025 trends in healthcare ML, such as handling imbalanced data and feature importance analysis.

    Usage Notes

    • Preprocessing: Numerical features (age, avg_glucose_level, bmi) may require scaling; categorical features (gender, work_type, etc.) need encoding (e.g., one-hot, label).
    • Imbalanced Data: The ~5% stroke prevalence requires techniques like SMOTE, oversampling, or class weighting for effective modeling.
    • Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.
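
    A minimal preprocessing sketch along the lines of these notes: scale the numeric columns, one-hot encode the categoricals, then oversample with SMOTE (the CSV file name is a placeholder):

```python
# Scale numeric columns, one-hot encode categoricals, then oversample with
# SMOTE, per the usage notes above. The CSV file name is a placeholder.
from collections import Counter

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("stroke_risk_synthetic_2025.csv")
X, y = df.drop(columns=["id", "stroke"]), df["stroke"]

numeric = ["age", "avg_glucose_level", "bmi"]
binary = ["hypertension", "heart_disease"]
categorical = [c for c in X.columns if c not in numeric + binary]

prep = ColumnTransformer(
    [("num", StandardScaler(), numeric),
     ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough")                 # keep the binary flags as-is
X_res, y_res = SMOTE(random_state=0).fit_resample(prep.fit_transform(X), y)
print("before:", Counter(y), "after:", Counter(y_res))
```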

    License

    This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Contact

    For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.

  17. Lending Club Loan Data

    • kaggle.com
    zip
    Updated Nov 8, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset
    Explore at:
    Available download formats: zip (218250 bytes)
    Dataset updated
    Nov 8, 2020
    Authors
    Sweta Shetye
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted a highly imbalanced dataset to share with others, and LendingClub has the perfect one for us.

    Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

    For example, in this dataset there are way more samples of fully paid borrowers than of borrowers who did not fully pay.

    Full LendingClub data available from their site.

    Content

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.
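
    A minimal PyTorch sketch of such a model, handling the imbalance by weighting the positive class in the loss rather than resampling (file and column names are placeholders, not the dataset's actual schema):

```python
# Tiny feed-forward default-prediction model; the imbalance is handled by
# weighting the positive class in BCEWithLogitsLoss. File and column names
# are placeholders, not the dataset's actual schema.
import pandas as pd
import torch
import torch.nn as nn

df = pd.read_csv("loan_data.csv")                     # placeholder file name
y = torch.tensor(df["not_fully_paid"].values, dtype=torch.float32)
X = torch.tensor(pd.get_dummies(df.drop(columns="not_fully_paid"))
                   .values.astype("float32"))

model = nn.Sequential(nn.Linear(X.shape[1], 32), nn.ReLU(), nn.Linear(32, 1))
pos_weight = (y == 0).sum() / (y == 1).sum()          # majority/minority ratio
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(20):                                   # toy training loop
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    opt.step()
```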

  18. Fraudulent Financial Transaction Prediction

    • kaggle.com
    zip
    Updated Feb 15, 2025
    Cite
    Younus_Mohamed (2025). Fraudulent Financial Transaction Prediction [Dataset]. https://www.kaggle.com/datasets/younusmohamed/fraudulent-financial-transaction-prediction
    Explore at:
    Available download formats: zip (41695207 bytes)
    Dataset updated
    Feb 15, 2025
    Authors
    Younus_Mohamed
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Fraud Detection with Imbalanced Data

    Overview
    This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.

    Key Points
    - Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
    - Multiple Feature Files: Combine them by matching on id or Group.
    - Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
    - Goal: Predict which transactions in test_share.csv might be fraudulent.

    Files in this Dataset

    1. train.csv

      • Rows: 227,845 (example size)
      • Columns: 28
      • Description: Contains historical transaction data for training a fraud detection model.
      • Important: The Target column (0 = Clean, 1 = Fraud).
    2. test_share.csv

      • Rows: 56,962 (example size)
      • Columns: 27
      • Description: Test dataset, with the same structure as train.csv but without the Target column.
    3. Geo_scores.csv

      • Columns: (id, geo_score)
      • Description: Location-based geospatial scores for each transaction.
    4. Lambda_wts.csv

      • Columns: (Group, lambda_wt)
      • Description: Proprietary “lambda” weights associated with each Group.
    5. Qset_tats.csv

      • Columns: (id, qsets_normalized_tat)
      • Description: Network turn-around times (TAT) for each transaction.
    6. instance_scores.csv

      • Columns: (id, instance_scores)
      • Description: Vulnerability or risk qualification scores for each transaction.

    Suggested Usage

    1. Load all CSVs into dataframes.
    2. Merge additional files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching id or Group.
    3. Explore the severe class imbalance in train.csv (Target ~1% is fraud).
    4. Train any suitable classification model (Random Forest, XGBoost, etc.) on train.csv.
    5. Predict on test_share.csv or your own external data.

    Possible Tools:
    - Python: pandas, NumPy, scikit-learn
    - Imbalance Handling: SMOTE, Random Oversampler, or class weights
    - Metrics: Precision, Recall, F1-score, ROC-AUC, etc.

    Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
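
    Following the suggested usage above, a minimal pandas sketch for steps 1-3 (file names match the list above; it assumes train.csv carries the Group column that Lambda_wts.csv keys on):

```python
# Steps 1-3: load the CSVs, merge the feature files onto train.csv by
# id / Group, and confirm the class imbalance.
import pandas as pd

train = pd.read_csv("train.csv")
geo = pd.read_csv("Geo_scores.csv")
lam = pd.read_csv("Lambda_wts.csv")
tat = pd.read_csv("Qset_tats.csv")
inst = pd.read_csv("instance_scores.csv")

df = (train.merge(geo, on="id", how="left")
           .merge(tat, on="id", how="left")
           .merge(inst, on="id", how="left")
           .merge(lam, on="Group", how="left"))

print(df["Target"].value_counts(normalize=True))   # ~1% fraud expected
```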

    Tags

    • fraud-detection
    • classification
    • imbalanced-data
    • financial-transactions
    • machine-learning
    • python
    • beginner-friendly

    License: CC BY-NC-SA 4.0

  19. Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

  20. Additional file 1 of Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Apr 2, 2022
    Cite
    Tarimo, Clifford Silver; Li, Quanman; Wang, Yuhui; Zhao, Yizhen; Mohammed, Akram; Gardner, Marilyn; Ren, Weicun; Wu, Jian; Bhuyan, Soumitra S.; Mahande, Michael Johnson (2022). Additional file 1 of Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000421685
    Explore at:
    Dataset updated
    Apr 2, 2022
    Authors
    Tarimo, Clifford Silver; Li, Quanman; Wang, Yuhui; Zhao, Yizhen; Mohammed, Akram; Gardner, Marilyn; Ren, Weicun; Wu, Jian; Bhuyan, Soumitra S.; Mahande, Michael Johnson
    Description

    Additional file 1.
