100+ datasets found
  1. Classifier in terms of different performance metrics with different...

    • plos.figshare.com
    xls
    Updated May 31, 2024
    Cite
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran (2024). Classifier in terms of different performance metrics with different pre-processing techniques with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0301263.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classifier in terms of different performance metrics with different pre-processing techniques with SMOTE.

  2. Performance of machine learning models using SMOTE-balanced dataset.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 8, 2023
    Cite
    Umer, Muhammad; Alsubai, Shtwai; Ishaq, Abid; Ashraf, Imran; Abuzinadah, Nihal; Eshmawi, Ala’ Abdulmajid; Al Hejaili, Abdullah; Mohamed, Abdullah (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000971150
    Explore at:
    Dataset updated
    Nov 8, 2023
    Authors
    Umer, Muhammad; Alsubai, Shtwai; Ishaq, Abid; Ashraf, Imran; Abuzinadah, Nihal; Eshmawi, Ala’ Abdulmajid; Al Hejaili, Abdullah; Mohamed, Abdullah
    Description

    Performance of machine learning models using SMOTE-balanced dataset.

  3. Performance of machine learning models on test set using the SMOTE-adjusted...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 7, 2023
    Cite
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan (2023). Performance of machine learning models on test set using the SMOTE-adjusted balanced training set. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001031532
    Explore at:
    Dataset updated
    Dec 7, 2023
    Authors
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan
    Description

    Performance of machine learning models on test set using the SMOTE-adjusted balanced training set.

  4. Classification results using ML algorithms after applying SMOTE and feature...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 3, 2024
    + more versions
    Cite
    Abu Marjan; Al Mamun, Abdulla; Islam, Rashedul; Uddin, Palash; Nitu, Adiba Mahjabin; Ibn Afjal, Masud (2024). Classification results using ML algorithms after applying SMOTE and feature engineering, training, and testing ratios is 50:50. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001300526
    Explore at:
    Dataset updated
    Sep 3, 2024
    Authors
    Abu Marjan; Al Mamun, Abdulla; Islam, Rashedul; Uddin, Palash; Nitu, Adiba Mahjabin; Ibn Afjal, Masud
    Description

    All values represent the mean value of 5 trials of experiments.

  5. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    + more versions
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE's number of neighbors set to 5.
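    The SMOTE step at the core of this family of methods can be sketched in a few lines. The following is a minimal NumPy illustration of the interpolation idea only (the `smote_oversample` helper and its parameters are my own illustration, not the authors' implementation; the cluster-based noise reduction of CRN-SMOTE is omitted, and production code would typically use a library implementation instead):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest minority neighbors, and take a random
    point on the segment between them (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-neighbors
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors per sample
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))           # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Usage: synthesize 90 minority samples to balance a 100-vs-10 split.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(10, 4))
X_synthetic = smote_oversample(X_minority, n_new=90, k=5)
print(X_synthetic.shape)  # (90, 4)
```

    Because every synthetic point is an interpolation between two real minority samples, it stays inside the per-feature range of the minority class, which is also why noisy minority samples propagate into the synthetic data (the motivation for the noise reduction step described above).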

  6. UIC GII ML SMOTE

    • ieee-dataport.org
    Updated Aug 2, 2025
    Cite
    Ezenwa Nwanesi (2025). UIC GII ML SMOTE [Dataset]. https://ieee-dataport.org/documents/uic-gii-ml-smote
    Explore at:
    Dataset updated
    Aug 2, 2025
    Authors
    Ezenwa Nwanesi
    Description

    their implementation in Africa is limited.

  7. Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

  8. Classification results of machine learning models using TF-IDF with SMOTE.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 29, 2022
    Cite
    Choi, Gyu Sang; Rustam, Furqan; Sadiq, Saima; Saad, Eysha; Ashraf, Imran; Mehmood, Arif; Jamil, Ramish (2022). Classification results of machine learning models using TF-IDF with SMOTE. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000223631
    Explore at:
    Dataset updated
    Jun 29, 2022
    Authors
    Choi, Gyu Sang; Rustam, Furqan; Sadiq, Saima; Saad, Eysha; Ashraf, Imran; Mehmood, Arif; Jamil, Ramish
    Description

    Classification results of machine learning models using TF-IDF with SMOTE.

  9. A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.

  10. DermaEvolve - Skin Disease Pred. - SMOTE Balanced

    • kaggle.com
    zip
    Updated Nov 2, 2024
    Cite
    Lokesh Bhaskar (2024). DermaEvolve - Skin Disease Pred. - SMOTE Balanced [Dataset]. https://www.kaggle.com/datasets/lokeshbhaskarnr/synthetic-images-unprocessed
    Explore at:
    Available download formats: zip (167941761 bytes)
    Dataset updated
    Nov 2, 2024
    Authors
    Lokesh Bhaskar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    DermaEvolve Dataset

    Overview

    The DermaEvolve dataset is a comprehensive collection of skin lesion images, sourced from publicly available datasets and extended with additional rare diseases. This dataset aims to aid in the development and evaluation of machine learning models for dermatological diagnosis.

    Sources

    The dataset is primarily derived from:

    • HAM10000 (Kaggle link) – A collection of dermatoscopic images with various skin lesion types.
    • ISIC Archive (Kaggle link) – A dataset of skin cancer images categorized into multiple classes.
    • Dermnet NZ (https://dermnetnz.org/) – Used to source additional rare diseases for dataset extension.
    • Google Database – Images

    Categories

    The dataset includes images of the following skin conditions:

    Common Categories:

    • Basal Cell Carcinoma
    • Squamous Cell Carcinoma
    • Melanoma
    • Actinic Keratosis
    • Pigmented Benign Keratosis
    • Seborrheic Keratosis
    • Vascular Lesion
    • Melanocytic Nevus
    • Dermatofibroma

    Rare Diseases (Extended):

    To enhance diversity, the following rare skin conditions were added from Dermnet NZ:

    • Elastosis Perforans Serpiginosa
    • Lentigo Maligna
    • Nevus Sebaceus
    • Blue Naevus

    [Image: smote]

    Dataset Characteristics

    • Class Imbalance Handled: the dataset has a uniform class distribution after SMOTE balancing.
    • Image Size: 64 × 64, resized to fit within Kaggle memory limits.

    Resizing and augmentation were applied to my previously uploaded raw dataset: https://www.kaggle.com/datasets/lokeshbhaskarnr/dermaevolve-original-unprocessed/data

    Acknowledgements

    Special thanks to the authors of the original datasets:

    • HAM10000 – Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
    • ISIC Archive – International Skin Imaging Collaboration (ISIC), a repository for dermatology imaging.
    • Dermnet NZ – A valuable resource for dermatological images.

    Usage

    This dataset can be used for:

    • Training deep learning models for skin lesion classification.
    • Research on dermatological image analysis.
    • Development of computer-aided diagnostic tools.

    Please cite the original datasets if you use this resource in your work.

    NOTE:

    Check out the GitHub repository for the Streamlit application that focuses on skin disease prediction: https://github.com/LokeshBhaskarNR/DermaEvolve---An-Advanced-Skin-Disease-Predictor.git

    Streamlit Application Link: https://dermaevolve.streamlit.app/

    Kindly check out my notebooks for the processed models and code, covering multiple models trained on this dataset.

  11. ml_smote

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Alexis Moraga (2025). ml_smote [Dataset]. https://www.kaggle.com/senoratiramisu/ml-smote
    Explore at:
    Available download formats: zip (1428 bytes)
    Dataset updated
    Nov 5, 2025
    Authors
    Alexis Moraga
    Description

    Dataset

    This dataset was created by Alexis Moraga

    Contents

  12. A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier.

  13. The selected explanatory variables.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    + more versions
    Cite
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. 
This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
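    The G-mean measure used in the study above to rank classifiers is the geometric mean of per-class recalls, which collapses to zero when any class is entirely missed. A small illustrative helper (the function name `g_mean` is mine, not from the paper):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; 0 if any class is fully missed.
    Unlike plain accuracy, it penalizes ignoring a minority class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# A classifier that always predicts the majority class gets G-mean 0,
# even though its accuracy on a 90:10 split would be 0.9.
y_true = [0] * 9 + [1]
print(g_mean(y_true, [0] * 10))        # 0.0
print(g_mean(y_true, [0] * 9 + [1]))   # 1.0
```

    This is why the paper's reported worst-class recall (34.4% without oversampling vs 90.3% with it) moves G-mean so sharply even when overall accuracy barely changes.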

  14. Classification result classifiers using TF-IDF with SMOTE.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 28, 2024
    + more versions
    Cite
    Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    May 28, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Khaled Alnowaiser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification result classifiers using TF-IDF with SMOTE.

  15. DermaEvolve - Original Unprocessed

    • kaggle.com
    zip
    Updated Mar 11, 2025
    Cite
    Lokesh Bhaskar (2025). DermaEvolve - Original Unprocessed [Dataset]. https://www.kaggle.com/datasets/lokeshbhaskarnr/dermaevolve-original-unprocessed
    Explore at:
    Available download formats: zip (3287235366 bytes)
    Dataset updated
    Mar 11, 2025
    Authors
    Lokesh Bhaskar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    DermaEvolve Dataset

    Overview

    The DermaEvolve dataset is a comprehensive collection of skin lesion images, sourced from publicly available datasets and extended with additional rare diseases. This dataset aims to aid in the development and evaluation of machine learning models for dermatological diagnosis.

    Sources

    The dataset is primarily derived from:

    • HAM10000 (Kaggle link) – A collection of dermatoscopic images with various skin lesion types.
    • ISIC Archive (Kaggle link) – A dataset of skin cancer images categorized into multiple classes.
    • Dermnet NZ (https://dermnetnz.org/) – Used to source additional rare diseases for dataset extension.
    • Google Database – Images

    Categories

    The dataset includes images of the following skin conditions:

    Common Categories:

    • Basal Cell Carcinoma
    • Squamous Cell Carcinoma
    • Melanoma
    • Actinic Keratosis
    • Pigmented Benign Keratosis
    • Seborrheic Keratosis
    • Vascular Lesion
    • Melanocytic Nevus
    • Dermatofibroma

    Rare Diseases (Extended):

    To enhance diversity, the following rare skin conditions were added from Dermnet NZ:

    • Elastosis Perforans Serpiginosa
    • Lentigo Maligna
    • Nevus Sebaceus
    • Blue Naevus

    [Image: Original dataset distribution]

    Dataset Characteristics

    • Unprocessed: The dataset consists of raw, unprocessed images.
    • Variable Image Sizes: Image dimensions vary as they have not been standardized.

    Acknowledgements

    Special thanks to the authors of the original datasets:

    • HAM10000 – Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
    • ISIC Archive – International Skin Imaging Collaboration (ISIC), a repository for dermatology imaging.
    • Dermnet NZ – A valuable resource for dermatological images.

    Usage

    This dataset can be used for:

    • Training deep learning models for skin lesion classification.
    • Research on dermatological image analysis.
    • Development of computer-aided diagnostic tools.

    Please cite the original datasets if you use this resource in your work.

    NOTE:

    Check out the GitHub repository for the Streamlit application that focuses on skin disease prediction: https://github.com/LokeshBhaskarNR/DermaEvolve---An-Advanced-Skin-Disease-Predictor.git

    Streamlit Application Link: https://dermaevolve.streamlit.app/

    Kindly check out my notebooks for the processed models and code, covering multiple models trained on this dataset.

  16. A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.

  17. Data from: Dataset for classification of signaling proteins based on...

    • portalinvestigacion.udc.gal
    • portalcientifico.sergas.es
    • +1 more
    Updated 2015
    Cite
    Fernandez-Lozano, Carlos; Munteanu, Cristian Robert (2015). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. https://portalinvestigacion.udc.gal/documentos/668fc447b9e7c03b01bd8975
    Explore at:
    Dataset updated
    2015
    Authors
    Fernandez-Lozano, Carlos; Munteanu, Cristian Robert
    Description

    The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Data Bank (Berman et al., 2000) by using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å, and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB: 608 are signaling proteins and 2077 are non-signaling peptides. Such unbalanced data is not well suited as input for learning algorithms, because the results would show high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this, a pre-processing stage is needed to obtain a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE balances the dataset by expanding the minority class: it creates new samples by interpolating between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains).

    Paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038

    Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038 (http://www.sciencedirect.com/science/article/pii/S0022519315003999)

  18. Predictive Models on the 2013 NCDB Colon Cancer Data

    • elsevier.digitalcommonsdata.com
    Updated May 4, 2021
    Cite
    Grey Leonard (2021). Predictive Models on the 2013 NCDB Colon Cancer Data [Dataset]. http://doi.org/10.17632/jg44fgspzk.1
    Explore at:
    Dataset updated
    May 4, 2021
    Authors
    Grey Leonard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The attached file contains R code which encompasses and describes the process of loading data, cleaning data, selecting variables, imputing missing values, creating training and test sets, model building and evaluation. Additionally, the code contains the process to create graphs and tables for data and model evaluation.

    The goal was to build a logistic regression model to predict outcomes after surgery for colon cancer and to compare its performance with machine learning algorithms. An XGBoost model, a Random Forest model, and an XGBoost model trained on data oversampled with SMOTE were built and compared with logistic regression. Overall, the machine learning algorithms had improved AUC.

  19. Data from: Variable description.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 7, 2023
    Cite
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan (2023). Variable description. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001031498
    Explore at:
    Dataset updated
    Dec 7, 2023
    Authors
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan
    Description

    Studies in the past have examined asthma prevalence and the associated risk factors in the United States using data from national surveys. However, the findings of these studies may not be relevant to specific states because of the different environmental and socioeconomic factors that vary across regions. The 2019 Behavioral Risk Factor Surveillance System (BRFSS) showed that Michigan had higher asthma prevalence rates than the national average. In this regard, we employ various modern machine learning techniques to predict asthma and identify risk factors associated with asthma among Michigan adults using the 2019 BRFSS data. After data cleaning, a sample of 10,337 individuals was selected for analysis, out of which 1,118 individuals (10.8%) reported having asthma during the survey period. Typical machine learning techniques often perform poorly due to imbalanced data issues. To address this challenge, we employed two synthetic data generation techniques, namely the Random Over-Sampling Examples (ROSE) and Synthetic Minority Over-Sampling Technique (SMOTE) and compared their performances. The overall performance of machine learning algorithms was improved using both methods, with ROSE performing better than SMOTE. Among the ROSE-adjusted models, we found that logistic regression, partial least squares, gradient boosting, LASSO, and elastic net had comparable performance, with sensitivity at around 50% and area under the curve (AUC) at around 63%. Due to ease of interpretability, logistic regression is chosen for further exploration of risk factors. Presence of chronic obstructive pulmonary disease, lower income, female sex, financial barrier to see a doctor due to cost, taken flu shot/spray in the past 12 months, 18–24 age group, Black, non-Hispanic group, and presence of diabetes are identified as asthma risk factors. 
    This study demonstrates the potential of machine learning coupled with imbalanced-data modeling approaches for predicting asthma from a large survey dataset. We conclude that the findings could guide early screening of at-risk asthma patients and the design of appropriate interventions to improve care practices.

  20. Stroke Risk Synthetic 2025

    • kaggle.com
    zip
    Updated Sep 26, 2025
    Cite
    Imaad Mahmood (2025). Stroke Risk Synthetic 2025 [Dataset]. https://www.kaggle.com/datasets/imaadmahmood/stroke-risk-synthetic-2025
    Explore at:
    Available download formats: zip (2288 bytes)
    Dataset updated
    Sep 26, 2025
    Authors
    Imaad Mahmood
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    StrokeRiskSynthetic2025 Dataset

    Overview

    The StrokeRiskSynthetic2025 dataset is a synthetically generated dataset designed for machine learning and data analysis tasks focused on predicting stroke risk. Created in September 2025, it simulates realistic patient profiles based on established stroke risk factors, drawing inspiration from medical literature and existing healthcare datasets. With 1,000 records and a deliberately imbalanced target (approximately 5% stroke cases) that reflects real-world stroke prevalence, it is well suited for binary classification, feature engineering, and practicing imbalanced-data handling in educational and research settings.

    Data Description

    • Rows: 1,000
    • Columns: 12
    • Target Variable: stroke (binary: 0 = No stroke, 1 = Stroke)
    • File Format: CSV
    • Size: Approximately 60 KB

    Columns

    • id (Integer): Unique identifier for each record (1 to 1,000).
    • gender (Categorical): Patient gender: Male, Female, Other.
    • age (Integer): Patient age in years (0 to 100, skewed toward older adults).
    • hypertension (Binary): Hypertension status: 0 = No, 1 = Yes (~30% prevalence).
    • heart_disease (Binary): Heart disease status: 0 = No, 1 = Yes (~5-10% prevalence).
    • ever_married (Categorical): Marital status: Yes, No (correlated with age).
    • work_type (Categorical): Employment type: children, Govt_job, Never_worked, Private, Self-employed.
    • Residence_type (Categorical): Residence: Urban, Rural (balanced distribution).
    • avg_glucose_level (Float): Average blood glucose level in mg/dL (50 to 300, mean ~100).
    • bmi (Float): Body Mass Index (10 to 60, mean ~25).
    • smoking_status (Categorical): Smoking history: formerly smoked, never smoked, smokes, Unknown.
    • stroke (Binary): Target variable: 0 = No stroke, 1 = Stroke (~5% positive cases).

    Key Features

    • Realistic Distributions: Reflects real-world stroke risk factors (e.g., age, hypertension, glucose levels) based on 2025 medical data, with ~5% stroke prevalence to mimic imbalanced healthcare datasets.
    • Synthetic Data: Generated to avoid privacy concerns, ensuring ethical use for research and education.
    • Versatility: Suitable for binary classification, feature importance analysis (e.g., SHAP), data preprocessing (e.g., imputation, scaling), and handling imbalanced data (e.g., SMOTE).
    • No Missing Values: Clean dataset for straightforward analysis, though users can introduce missingness for preprocessing practice.

    Use Cases

    • Machine Learning: Train models like Logistic Regression, Random Forest, or XGBoost for stroke prediction.
    • Data Analysis: Explore correlations between risk factors (e.g., age, hypertension) and stroke outcomes.
    • Educational Projects: Ideal for learning EDA, feature engineering, and model deployment (e.g., Flask apps).
    • Healthcare Research: Simulate clinical scenarios for studying stroke risk without real patient data.

    Source and Inspiration

    This dataset is inspired by stroke risk factors outlined in medical literature (e.g., CDC, WHO) and existing datasets like the Kaggle Stroke Prediction Dataset (2021) and Mendeley’s Synthetic Stroke Prediction Dataset (2025). It incorporates 2025 trends in healthcare ML, such as handling imbalanced data and feature importance analysis.

    Usage Notes

    • Preprocessing: Numerical features (age, avg_glucose_level, bmi) may require scaling; categorical features (gender, work_type, etc.) need encoding (e.g., one-hot, label).
    • Imbalanced Data: The ~5% stroke prevalence requires techniques like SMOTE, oversampling, or class weighting for effective modeling.
    • Scalability: Contact the creator to generate larger datasets (e.g., 10,000+ rows) if needed.
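    For the ~5% prevalence noted above, class weighting is a lightweight alternative to SMOTE. A minimal sketch of the common "balanced" heuristic, n_samples / (n_classes × class_count), which is the same formula scikit-learn applies for class_weight="balanced" (the helper name `balanced_class_weights` is mine):

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count), so a
    ~5% positive class counts roughly 19x more than the majority class."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 950 non-stroke vs 50 stroke records, matching the ~5% prevalence above.
y = np.array([0] * 950 + [1] * 50)
print(balanced_class_weights(y))  # {0: 0.526..., 1: 10.0}
```

    Passing such weights to a classifier's loss (or using a library's built-in balanced mode) rebalances training without generating synthetic rows, which can be preferable when synthetic samples risk distorting categorical features.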

    License

    This dataset is provided for educational and research purposes under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Contact

    For questions or to request expanded datasets, contact the creator via the platform where this dataset is hosted.
