82 datasets found
  1. Imbalanced classification metric.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    + more versions
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). Imbalanced classification metric. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t003
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
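
    A minimal sketch of how an evaluation along these lines could be set up with off-the-shelf tools: plain imblearn SMOTE with k_neighbors=5 stands in for the proposed CRN-SMOTE (which is not available in imblearn), and a synthetic imbalanced dataset stands in for ILPD/QSAR/Blood/Maternal Health Risk.

    # python
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (cohen_kappa_score, matthews_corrcoef, f1_score,
                                 precision_score, recall_score)
    from sklearn.model_selection import train_test_split

    # Hypothetical imbalanced data in place of the four benchmark datasets.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Oversample the minority class in the training split only (k_neighbors=5).
    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    y_pred = clf.predict(X_te)

    # The five metrics named in the abstract.
    print("Kappa:", cohen_kappa_score(y_te, y_pred))
    print("MCC:", matthews_corrcoef(y_te, y_pred))
    print("F1:", f1_score(y_te, y_pred))
    print("Precision:", precision_score(y_te, y_pred))
    print("Recall:", recall_score(y_te, y_pred))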

  2. Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

  3. Data from: High impact bug report identification with imbalanced learning strategies

    • researchdata.smu.edu.sg
    zip
    Updated Jun 1, 2023
    Cite
    YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies"; the full text is available from: https://ink.library.smu.edu.sg/sis_research/3702

    In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is no easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.

    Supplementary code and data available from GitHub:
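
    The two winning variants named above map directly onto imblearn pipelines. A hedged sketch, assuming TF-IDF text features (the paper's exact feature extraction is not restated here) and hypothetical lists of report texts and 0/1 impact labels:

    # python
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier

    # Surprise-bug variant: SMOTE + KNN on vectorized report text.
    surprise_clf = make_pipeline(TfidfVectorizer(), SMOTE(random_state=0),
                                 KNeighborsClassifier())

    # Breakage-bug variant: random under-sampling (RUS) + naive Bayes.
    breakage_clf = make_pipeline(TfidfVectorizer(), RandomUnderSampler(random_state=0),
                                 MultinomialNB())

    # Usage with hypothetical data:
    # surprise_clf.fit(train_texts, train_labels)
    # predictions = surprise_clf.predict(test_texts)

    Resampling steps inside an imblearn pipeline are applied only during fitting, so test reports are never resampled.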

  4. Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data

    • plos.figshare.com
    txt
    Updated Jun 18, 2023
    Cite
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
    Available download formats: txt
    Dataset updated
    Jun 18, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, in which the positive samples constitute only a minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to empower the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets where the first method breaks down, and we find them more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times than the brute-force method.
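
    As a rough illustration of tuning SMOTE's parameters against a validation score, the sketch below uses a plain random search in place of the particle swarm and bat optimizers described above; the two tuned parameters are assumed to be the oversampling ratio and the neighbourhood size.

    # python
    import random

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for an imbalanced clinical dataset.
    X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=1)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=1)

    random.seed(1)
    best_auc, best_params = 0.0, None
    for _ in range(20):                        # candidate parameter pairs
        ratio = random.uniform(0.2, 1.0)       # minority/majority ratio after SMOTE
        k = random.randint(3, 9)               # SMOTE neighbourhood size
        X_res, y_res = SMOTE(sampling_strategy=ratio, k_neighbors=k,
                             random_state=1).fit_resample(X_tr, y_tr)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
        if auc > best_auc:
            best_auc, best_params = auc, {"sampling_strategy": ratio, "k_neighbors": k}
    print(best_auc, best_params)

    A swarm optimizer would replace the random draws with guided updates of the candidate parameter pairs.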

  5. Indian Liver Patient Dataset (ILPD)

    • kaggle.com
    Updated Sep 27, 2021
    Cite
    Saumya Mohandas N (2021). Indian Liver Patient Dataset (ILPD) [Dataset]. https://www.kaggle.com/datasets/saumyamohandas/indian-liver-patient-dataset-ilpd/discussion
    Available download format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saumya Mohandas N
    Description

    Dataset

    This dataset was created by Saumya Mohandas N


  6. DataSheet1_Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water.docx

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann (2023). DataSheet1_Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water.docx [Dataset]. http://doi.org/10.3389/fenvs.2021.701288.s001
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
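
    A hedged sketch of the core resampling comparison (none vs. random oversampling vs. SMOTE, each feeding a random forest); the agricultural water features are not reproduced here, so a synthetic imbalanced dataset stands in.

    # python
    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from imblearn.pipeline import make_pipeline
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)

    resamplers = {"none": None,
                  "oversampling": RandomOverSampler(random_state=0),
                  "SMOTE": SMOTE(random_state=0)}

    for name, sampler in resamplers.items():
        if sampler is None:
            model = RandomForestClassifier(random_state=0)
        else:
            # imblearn pipelines resample only the training folds of each CV split.
            model = make_pipeline(sampler, RandomForestClassifier(random_state=0))
        score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
        print(f"{name}: mean F1 = {score:.3f}")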

  7. Classification result classifiers using TF-IDF with SMOTE.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 28, 2024
    + more versions
    Cite
    Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
    Available download formats: xls
    Dataset updated
    May 28, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Khaled Alnowaiser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification result classifiers using TF-IDF with SMOTE.

  8. Parameters of machine learning models.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    + more versions
    Cite
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran (2024). Parameters of machine learning models. [Dataset]. http://doi.org/10.1371/journal.pone.0301263.t002
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diagnosing human knee abnormalities with machine learning from the surface electromyography (sEMG) signal obtained from lower limb muscles is a major problem due to the noisy nature of the sEMG signal and the imbalance between data from healthy and knee-abnormal subjects. To address this challenge, a combination of wavelet decomposition (WD) with ensemble empirical mode decomposition (EEMD) and the Synthetic Minority Oversampling Technique, termed S-WD-EEMD, is proposed. In this study, the hybrid WD-EEMD is used to minimize the noise introduced into the sEMG signal during collection, while the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the data by increasing the minority class samples during the training of the machine learning techniques. The findings indicate that hybrid WD-EEMD with SMOTE oversampling enhances the efficacy of the examined classifiers when employed on the imbalanced sEMG data. The F-score of the Extra Tree Classifier, when utilizing WD-EEMD signal processing with SMOTE oversampling, is 98.4%, whereas without SMOTE oversampling it is 95.1%.
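
    A minimal sketch of the balancing step only (the WD-EEMD signal-processing stage is out of scope here): SMOTE is applied to the training split alone, then an extra-trees classifier is scored on the untouched test split, mirroring the F-score comparison above.

    # python
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Placeholder features standing in for sEMG-derived features of healthy
    # vs. knee-abnormal subjects.
    X, y = make_classification(n_samples=1200, weights=[0.7, 0.3], random_state=2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

    X_bal, y_bal = SMOTE(random_state=2).fit_resample(X_tr, y_tr)  # balance training data only
    clf = ExtraTreesClassifier(random_state=2).fit(X_bal, y_bal)
    print("F1 on the untouched test split:", f1_score(y_te, clf.predict(X_te)))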

  9. Data from: CreditCardTransactions Dataset

    • cubig.ai
    Updated Jul 8, 2025
    Cite
    CUBIG (2025). CreditCardTransactions Dataset [Dataset]. https://cubig.ai/store/products/554/creditcardtransactions-dataset
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction • The Credit_Card_Transactions Dataset is representative sample data for building fraud detection models, including anonymized real-world transaction records such as financial transaction type, amount, sender/receiver account balances, and fraud indicators.

    2) Data Utilization

    (1) The Credit_Card_Transactions Dataset has the following characteristics:

    • The dataset provides individual transaction records on a row-by-row basis, reflecting the real-world class imbalance problem through the extremely low percentage of fraudulent transactions (isFraud=1).

    • It is raw, unprocessed data, so key variables such as transaction time, amount, and account balance changes can be used directly.

    (2) The Credit_Card_Transactions Dataset can be used for:

    • Binary classification modeling: fraud detection models can be developed by applying imbalance-handling techniques such as SMOTE and undersampling together with appropriate evaluation metrics such as F1-score and ROC-AUC (see the sketch below).

    • Real-time anomaly detection: the data can support a real-time anomaly detection system built on analysis of transaction patterns (amount, frequency, account balance changes).
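
    A minimal sketch of the binary classification use noted above, assuming hypothetical column names (type, amount, oldbalanceOrg, isFraud) and a tiny synthetic frame in place of the real file: random under-sampling paired with F1 and ROC-AUC evaluation.

    # python
    import numpy as np
    import pandas as pd
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 5000
    df = pd.DataFrame({                                   # assumed schema
        "type": rng.choice(["TRANSFER", "CASH_OUT", "PAYMENT"], n),
        "amount": rng.exponential(200, n),
        "oldbalanceOrg": rng.exponential(1000, n),
        "isFraud": (rng.random(n) < 0.02).astype(int),    # rare positive class
    })
    X = pd.get_dummies(df.drop(columns=["isFraud"]))      # one-hot the transaction type
    y = df["isFraud"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    model = make_pipeline(RandomUnderSampler(random_state=0),
                          LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    print("F1:", f1_score(y_te, model.predict(X_te)))
    print("ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))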

  10. Historical Loan Records with Default Status

    • kaggle.com
    Updated Jul 16, 2025
    Cite
    Abhishek Nishad (2025). Historical Loan Records with Default Status [Dataset]. https://www.kaggle.com/datasets/abhisheknishad8988/defaulter-data/versions/1
    Available download format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2025
    Dataset provided by
    Kaggle
    Authors
    Abhishek Nishad
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Loan Default Prediction Dataset

    This dataset contains preprocessed loan records from a large-scale financial dataset, designed for loan default prediction modeling. It includes a wide range of features related to borrower profiles, loan terms, and historical repayment behavior.

    This dataset has been cleaned, preprocessed, and structured for use in loan default prediction modeling. It includes mostly normalized numerical features and a binary target column indicating whether or not a loan defaulted.

    Want to Improve or Customize It?

    Users are encouraged to:

    • Apply additional feature engineering (e.g., create debt-to-income ratio, rolling averages)
    • Encode categorical variables using different techniques (e.g., one-hot encoding, target encoding)
    • Handle class imbalance with oversampling (SMOTE) or undersampling (a minimal sketch follows this list)
    • Perform train/test splitting, cross-validation, or scaling
    • Add derived features based on domain knowledge
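
    A minimal sketch of the list above, under assumed column names (the listing does not spell out the schema): an engineered debt-to-income feature, a stratified train/test split, scaling fit on the training portion only, and SMOTE applied only to the training data.

    # python
    import numpy as np
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n = 2000
    df = pd.DataFrame({                                  # assumed feature/target names
        "loan_amnt": rng.normal(10_000, 3_000, n),
        "annual_inc": rng.normal(60_000, 15_000, n),
        "int_rate": rng.normal(12, 4, n),
        "defaulted": (rng.random(n) < 0.15).astype(int),
    })
    df["dti"] = df["loan_amnt"] / df["annual_inc"]       # example engineered feature

    X, y = df.drop(columns=["defaulted"]), df["defaulted"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2,
                                               random_state=0)
    scaler = StandardScaler().fit(X_tr)                  # fit scaling on training data only
    X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_s, y_tr)
    # X_bal/y_bal feed any classifier; X_te_s/y_te stay untouched for evaluation.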

    Use Cases:

    • Binary classification for loan default prediction
    • Credit risk modeling
    • Financial machine learning
    • Loan approval system development
    • Model benchmarking and testing

  11. Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated May 30, 2023
    Cite
    Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner (2023). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. http://doi.org/10.3389/fchem.2018.00362.s001
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for predicting the activity as well as the safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over a non-sampling method to achieve well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00% and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug-Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
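
    A hedged sketch of the headline configuration (SMOTE sampling plus a random forest) reporting sensitivity, specificity, and AUC; the DILI chemical descriptors are not included here, so a synthetic imbalanced dataset stands in.

    # python
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=3)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

    X_bal, y_bal = SMOTE(random_state=3).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=3).fit(X_bal, y_bal)

    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    print("Sensitivity:", tp / (tp + fn))
    print("Specificity:", tn / (tn + fp))
    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))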

  12. CerditCard fraud dataset

    • kaggle.com
    Updated Aug 2, 2025
    Cite
    Wasiq Ali (2025). CerditCard fraud dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/cerditcard-fraud-dataset
    Available download format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasiq Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset

    Uncover fraudulent transactions with this anonymized, PCA-transformed dataset. Perfect for building and testing fraud detection algorithms!

    Dataset Overview

    • Objective: Detect fraudulent credit card transactions using anonymized features

    • Samples: 1,000 transactions

    • Features: 7 columns (5 PCA components + Transaction Amount + Target)

    Class Distribution:

    • Legit (Class 0): 993 transactions (~99.3%)

    • Fraud (Class 1): 7 transactions (~0.7%)

    • Key Challenge: Extreme class imbalance – realistic representation of fraud patterns

    Features Description

    | Feature | Description | Characteristics |
    |---------|-------------|-----------------|
    | V1-V5 | Anonymized principal components | PCA-transformed numerical features; preserve transaction patterns while hiding sensitive details |
    | Amount | Transaction value | Highly variable (min: $0.20, max: $1,916.06); critical for fraud analysis |
    | Class | Target variable | Binary labels: 0 = legitimate transaction, 1 = fraudulent transaction |

    Key Insights & Patterns

    Fraud Indicators:

    • Fraudulent transactions occur across diverse amounts (low: $1.83 → high: $1,916)

    • No obvious amount threshold for fraud – requires nuanced modeling

    Sample fraud cases:

    1. V1:0.579, V2:-0.384, Amount:1916.06

    2. V1:1.023, V2:-0.638, Amount:1094.42

    Data Characteristics:

    V1-V5 Distributions:

    • V1: Concentrated near zero (mean ≈ -0.1)

    • V2: Wider spread (mean ≈ 0.05)

    • V3-V5: Asymmetric distributions

    Amount Distribution:

    • Right-skewed – most transactions < $500

    • Fraud cases span low and high values

    Class Imbalance:

    • Severe skew: 993:7 legit-to-fraud ratio

    • Models must optimize for recall/precision over accuracy
    
    Analysis Challenges

    ⚠️ Class Imbalance: Standard accuracy metrics misleading

    🔍 Feature Interpretation: PCA components lack real-world context

    📊 Non-linear Patterns: Complex interactions between V1-V5

    ⚡ High Stakes: False negatives (missed fraud) costlier than false positives

    Recommended Applications

    Fraud Detection Models:

    Logistic Regression (with class weighting)

    Random Forests / XGBoost (handle non-linearities)

    Isolation Forests (anomaly detection)

    Evaluation Focus:

    Precision-Recall Curves > ROC-AUC

    F2-Score (prioritize recall)

    Confusion matrix analysis
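
    A small sketch of that evaluation focus, using tiny dummy labels and scores (in practice these come from a fitted model such as the pipeline sketched further below): the F2 score weights recall over precision, and the area under the precision-recall curve replaces ROC-AUC.

    # python
    import numpy as np
    from sklearn.metrics import auc, fbeta_score, precision_recall_curve

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])            # dummy labels
    y_score = np.array([.1, .2, .15, .3, .05, .4, .35, .8, .6, .45])
    y_pred = (y_score >= 0.5).astype(int)

    print("F2:", fbeta_score(y_true, y_pred, beta=2))            # prioritizes recall
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    print("PR-AUC:", auc(rec, prec))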

    Advanced Techniques:

    SMOTE/ADASYN for oversampling

    Autoencoders for anomaly detection

    Feature engineering: Amount-to-Var ratios

    Dataset Source & Ethics

    Origin: Synthetic dataset mirroring real-world financial patterns

    Anonymization: Original features transformed via PCA for privacy compliance

    Bias Consideration: Geographic/cultural biases possible in source data

    Potential Use Cases

    🏦 Banking: Real-time transaction monitoring systems

    📱 FinTech Apps: Fraud detection APIs for payment gateways

    🎓 Education: Imbalanced classification tutorials

    🏆 Kaggle Competitions: Lightweight fraud detection challenge

    Example Project Idea "Minimalist Fraud Detector":

    # python
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import RobustScaler

    # Robust scaling, then SMOTE oversampling of the minority class to a 0.3
    # ratio, then a class-weighted random forest.
    model = make_pipeline(
        RobustScaler(),
        SMOTE(sampling_strategy=0.3),
        RandomForestClassifier(class_weight={0: 1, 1: 15}),
    )
    # Optimize for: recall at precision > 0.85
    

    Dataset Summary

    | Feature | Mean | Std | Min | Max |
    |---------|------|-----|-----|-----|
    | V1 | -0.11 | 1.02 | -3.24 | 3.85 |
    | V2 | 0.05 | 1.01 | -2.94 | 2.60 |
    | V3 | 0.02 | 0.98 | -3.02 | 2.95 |
    | Amount | 250.32 | 190.19 | 0.20 | 1916.06 |

  13. Data from: Variable description.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 7, 2023
    Cite
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan (2023). Variable description. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001031498
    Dataset updated
    Dec 7, 2023
    Authors
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan
    Description

    Studies in the past have examined asthma prevalence and the associated risk factors in the United States using data from national surveys. However, the findings of these studies may not be relevant to specific states because of the different environmental and socioeconomic factors that vary across regions. The 2019 Behavioral Risk Factor Surveillance System (BRFSS) showed that Michigan had higher asthma prevalence rates than the national average. In this regard, we employ various modern machine learning techniques to predict asthma and identify risk factors associated with asthma among Michigan adults using the 2019 BRFSS data. After data cleaning, a sample of 10,337 individuals was selected for analysis, out of which 1,118 individuals (10.8%) reported having asthma during the survey period. Typical machine learning techniques often perform poorly due to imbalanced data issues. To address this challenge, we employed two synthetic data generation techniques, namely the Random Over-Sampling Examples (ROSE) and Synthetic Minority Over-Sampling Technique (SMOTE) and compared their performances. The overall performance of machine learning algorithms was improved using both methods, with ROSE performing better than SMOTE. Among the ROSE-adjusted models, we found that logistic regression, partial least squares, gradient boosting, LASSO, and elastic net had comparable performance, with sensitivity at around 50% and area under the curve (AUC) at around 63%. Due to ease of interpretability, logistic regression is chosen for further exploration of risk factors. Presence of chronic obstructive pulmonary disease, lower income, female sex, financial barrier to see a doctor due to cost, taken flu shot/spray in the past 12 months, 18–24 age group, Black, non-Hispanic group, and presence of diabetes are identified as asthma risk factors. This study demonstrates the potentiality of machine learning coupled with imbalanced data modeling approaches for predicting asthma from a large survey dataset. We conclude that the findings could guide early screening of at-risk asthma patients and designing appropriate interventions to improve care practices.

  14. Performance of models using CNN features.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 8, 2023
    Cite
    Umer, Muhammad; Mohamed, Abdullah; Abuzinadah, Nihal; Ishaq, Abid; Eshmawi, Ala’ Abdulmajid; Alsubai, Shtwai; Ashraf, Imran; Al Hejaili, Abdullah (2023). Performance of models using CNN features. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000971153
    Dataset updated
    Nov 8, 2023
    Authors
    Umer, Muhammad; Mohamed, Abdullah; Abuzinadah, Nihal; Ishaq, Abid; Eshmawi, Ala’ Abdulmajid; Alsubai, Shtwai; Ashraf, Imran; Al Hejaili, Abdullah
    Description

    Predicting student performance automatically is of utmost importance, due to the substantial volume of data within educational databases. Educational data mining (EDM) devises techniques to uncover insights from data originating in educational settings. Artificial intelligence (AI) can mine educational data to predict student performance and provide measures to help students avoid failing and learn better. Learning platforms complement traditional learning settings by analyzing student performance, which can help reduce the chance of student failure. Existing methods for student performance prediction in educational data mining faced challenges such as limited accuracy, imbalanced data, and difficulties in feature engineering. These issues hindered effective adaptability and generalization across diverse educational contexts. This study proposes a machine learning-based system with deep convoluted features for the prediction of students’ academic performance. The proposed framework is employed to predict student academic performance using both imbalanced datasets and datasets balanced with the synthetic minority oversampling technique (SMOTE). In addition, the performance is also evaluated using the original and deep convoluted features. Experimental results indicate that the use of deep convoluted features provides improved prediction accuracy compared to original features. Results obtained using the extra tree classifier with convoluted features show the highest classification accuracy of 99.9%. In comparison with the state-of-the-art approaches, the proposed approach achieved higher performance. This research introduces a powerful AI-driven system for student performance prediction, offering substantial advancements in accuracy compared to existing approaches.

  15. A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.

  16. Performance of machine learning models using SMOTE-balanced dataset.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Nov 8, 2023
    Cite
    Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t004
    Available download formats: xls
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of machine learning models using SMOTE-balanced dataset.

  17. Acronym table with description.

    • datasetcatalog.nlm.nih.gov
    Updated Nov 8, 2023
    + more versions
    Cite
    Alsubai, Shtwai; Umer, Muhammad; Abuzinadah, Nihal; Eshmawi, Ala’ Abdulmajid; Al Hejaili, Abdullah; Ishaq, Abid; Ashraf, Imran; Mohamed, Abdullah (2023). Acronym table with description. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000971159
    Dataset updated
    Nov 8, 2023
    Authors
    Alsubai, Shtwai; Umer, Muhammad; Abuzinadah, Nihal; Eshmawi, Ala’ Abdulmajid; Al Hejaili, Abdullah; Ishaq, Abid; Ashraf, Imran; Mohamed, Abdullah
    Description

    Predicting student performance automatically is of utmost importance, due to the substantial volume of data within educational databases. Educational data mining (EDM) devises techniques to uncover insights from data originating in educational settings. Artificial intelligence (AI) can mine educational data to predict student performance and provide measures to help students avoid failing and learn better. Learning platforms complement traditional learning settings by analyzing student performance, which can help reduce the chance of student failure. Existing methods for student performance prediction in educational data mining faced challenges such as limited accuracy, imbalanced data, and difficulties in feature engineering. These issues hindered effective adaptability and generalization across diverse educational contexts. This study proposes a machine learning-based system with deep convoluted features for the prediction of students’ academic performance. The proposed framework is employed to predict student academic performance using both imbalanced datasets and datasets balanced with the synthetic minority oversampling technique (SMOTE). In addition, the performance is also evaluated using the original and deep convoluted features. Experimental results indicate that the use of deep convoluted features provides improved prediction accuracy compared to original features. Results obtained using the extra tree classifier with convoluted features show the highest classification accuracy of 99.9%. In comparison with the state-of-the-art approaches, the proposed approach achieved higher performance. This research introduces a powerful AI-driven system for student performance prediction, offering substantial advancements in accuracy compared to existing approaches.

  18. The selected explanatory variables.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    + more versions
    Cite
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
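
    A hedged sketch of the multi-step preprocessing framework described above: tree-based imputation of missing values, SMOTE oversampling, random-forest feature selection, and a G-mean score. The TIFA 2010 crash records are not bundled here, so a synthetic table with injected missing values stands in.

    # python
    import numpy as np
    from imblearn.metrics import geometric_mean_score
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesRegressor, RandomForestClassifier
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.feature_selection import SelectFromModel
    from sklearn.impute import IterativeImputer
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15],
                               random_state=4)
    rng = np.random.default_rng(4)
    X[rng.random(X.shape) < 0.05] = np.nan                  # inject 5% missing values
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

    # Tree-based imputation (a stand-in for the paper's decision-tree imputer).
    imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10,
                                                             random_state=4),
                               random_state=4).fit(X_tr)
    X_tr, X_te = imputer.transform(X_tr), imputer.transform(X_te)

    # Minority oversampling, then RF-importance-based feature selection.
    X_bal, y_bal = SMOTE(random_state=4).fit_resample(X_tr, y_tr)
    selector = SelectFromModel(RandomForestClassifier(random_state=4)).fit(X_bal, y_bal)
    X_bal_s, X_te_s = selector.transform(X_bal), selector.transform(X_te)

    clf = RandomForestClassifier(random_state=4).fit(X_bal_s, y_bal)
    print("G-mean:", geometric_mean_score(y_te, clf.predict(X_te_s)))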

  19. Area under the ROC and PR curves for medicated patients, using the original and SMOTE sample data.

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    João Albuquerque; Ana Margarida Medeiros; Ana Catarina Alves; Mafalda Bourbon; Marília Antunes (2023). Area under the ROC and PR curves for medicated patients, using the original and SMOTE sample data. [Dataset]. http://doi.org/10.1371/journal.pone.0269713.t003
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    João Albuquerque; Ana Margarida Medeiros; Ana Catarina Alves; Mafalda Bourbon; Marília Antunes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Area under the ROC and PR curves for medicated patients, using the original and SMOTE sample data.

  20. BILSTM using SMOTE and ADASYN oversampling techniques.

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). BILSTM using SMOTE and ADASYN oversampling techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t002
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BILSTM using SMOTE and ADASYN oversampling techniques.
