100+ datasets found
  1. f

    Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...

    • plos.figshare.com
    txt
    Updated Jun 18, 2023
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
    Available download formats: txt
    Dataset updated
    Jun 18, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, where the positive samples make up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and bat algorithm, and apply them to strengthen the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former methods do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets while the first method is invalid. We also find them more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorten the run time compared with the brute-force method.

  2. f

    Data from: S1 Datasets -

    • plos.figshare.com
    bin
    Updated Feb 10, 2025
    + more versions
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
    Available download formats: bin
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
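
    The five metrics named in this description can all be computed from a single binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; it is not taken from the dataset or the paper, and the function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return kappa, MCC, F1, precision and recall for one binary confusion matrix.

    tp, fp, fn, tn are counts of true positives, false positives,
    false negatives and true negatives (hypothetical inputs).
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"kappa": kappa, "mcc": mcc, "f1": f1,
            "precision": precision, "recall": recall}


# Made-up counts, for illustration only.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(example)
```

    Kappa and MCC use all four cells of the matrix, which is one reason they are often reported alongside precision and recall on imbalanced data.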

  3. f

    Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted...

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

  4. f

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.

  5. f

    Performance of machine learning models using SMOTE-balanced dataset.

    • plos.figshare.com
    xls
    Updated Nov 8, 2023
    Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t004
    Available download formats: xls
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of machine learning models using SMOTE-balanced dataset.

  6. t

    Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer...

    • service.tib.eu
    Updated Dec 3, 2024
    Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer (2024). Dataset: SMOTE: Synthetic Minority Over-Sampling Technique. https://doi.org/10.57702/tq0zp0i3 [Dataset]. https://service.tib.eu/ldmservice/dataset/smote--synthetic-minority-over-sampling-technique
    Dataset updated
    Dec 3, 2024
    Description

    SMOTE: synthetic minority over-sampling technique.

  7. s

    Data from: High impact bug report identification with imbalanced learning...

    • researchdata.smu.edu.sg
    zip
    Updated Jun 1, 2023
    YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies"; the full text is available from: https://ink.library.smu.edu.sg/sis_research/3702

    In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, the term high-impact bugs refers to bugs which appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.

    Supplementary code and data available from GitHub:

  8. m

    Synthetic oversampling for credit card default prediction

    • data.mendeley.com
    Updated Mar 8, 2023
    Fransiscus Pratikto (2023). Synthetic oversampling for credit card default prediction [Dataset]. http://doi.org/10.17632/jrss9jdjz9.1
    Dataset updated
    Mar 8, 2023
    Authors
    Fransiscus Pratikto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains more than 17,000 records of credit card holders with 20 predictor variables and 1 binary target variable. The corresponding R code for comparing several proposed (density-based) and existing (SMOTE-based) synthetic oversampling methods is also provided.

  9. i

    Korean Voice Phishing Detection Dataset with Multilingual Back-Translation...

    • ieee-dataport.org
    Updated Nov 11, 2024
    MILANDU KEITH MOUSSAVOU BOUSSOUGOU (2024). Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations [Dataset]. https://ieee-dataport.org/documents/korean-voice-phishing-detection-dataset-multilingual-back-translation-and-smote
    Dataset updated
    Nov 11, 2024
    Authors
    MILANDU KEITH MOUSSAVOU BOUSSOUGOU
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Chinese

  10. f

    Data from: Dataset for classification of signaling proteins based on...

    • figshare.com
    • portalcientifico.sergas.es
    txt
    Updated Jan 19, 2016
    Carlos Fernandez-Lozano; Cristian Robert Munteanu (2016). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. http://doi.org/10.6084/m9.figshare.1330132.v1
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    figshare
    Authors
    Carlos Fernandez-Lozano; Cristian Robert Munteanu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Databank (Berman et al., 2000) by using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB databank: 608 are signaling proteins and 2077 are non-signaling peptides. This kind of unbalanced data is not the most suitable input for learning algorithms because the results would present a high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to get a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the smaller class with new samples created by interpolating between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains). Paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038

    Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038. (http://www.sciencedirect.com/science/article/pii/S0022519315003999)
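
    The interpolation step that this description attributes to SMOTE can be sketched in a few lines. This is only an illustration of the general idea (pick a minority sample, pick one of its nearest minority neighbours, and place a new point on the segment between them); it is not the code used to build this dataset, and the function name `smote_sketch`, its parameters, and the toy points are assumptions. It needs at least two minority samples.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature lists (hypothetical toy data).
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the segment between base and the chosen neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D minority class, for illustration only.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=8, k=2)
print(len(new_points), "synthetic samples")
```

    Because each synthetic point is a convex combination of two real minority samples, it always stays inside the region spanned by the original minority data.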

  11. d

    Data from: Skin Cancer Diagnostics with an All-Inclusive Smartphone...

    • search.dataone.org
    Updated Nov 8, 2023
    Pandey, Santosh (2023). Skin Cancer Diagnostics with an All-Inclusive Smartphone Application [Dataset]. http://doi.org/10.7910/DVN/HUQK9R
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Pandey, Santosh
    Description

    Among the different types of skin cancer, melanoma is considered to be the deadliest and is difficult to treat at advanced stages. Detection of melanoma at earlier stages can lead to reduced mortality rates. Desktop-based computer-aided systems have been developed to assist dermatologists with early diagnosis. However, there is significant interest in developing portable, at-home melanoma diagnostic systems which can assess the risk of cancerous skin lesions. Here, we present a smartphone application that combines image capture capabilities with preprocessing and segmentation to extract the Asymmetry, Border irregularity, Color variegation, and Diameter (ABCD) features of a skin lesion. Using the feature sets, classification of malignancy is achieved through support vector machine classifiers. By using adaptive algorithms in the individual data-processing stages, our approach is made computationally light, user friendly, and reliable in discriminating melanoma cases from benign ones. Images of skin lesions are either captured with the smartphone camera or imported from public datasets. The entire process from image capture to classification runs on an Android smartphone equipped with a detachable 10x lens, and processes an image in less than a second. The overall performance metrics are evaluated on a public database of 200 images with Synthetic Minority Over-sampling Technique (SMOTE) (80% sensitivity, 90% specificity, 88% accuracy, and 0.85 area under curve (AUC)) and without SMOTE (55% sensitivity, 95% specificity, 90% accuracy, and 0.75 AUC). The evaluated performance metrics and computation times are comparable or better than previous methods. This all-inclusive smartphone application is designed to be easy-to-download and easy-to-navigate for the end user, which is imperative for the eventual democratization of such medical diagnostic systems.

  12. c

    Data from: CreditCardTransactions Dataset

    • cubig.ai
    Updated Jul 7, 2025
    CUBIG (2025). CreditCardTransactions Dataset [Dataset]. https://cubig.ai/store/products/554/creditcardtransactions-dataset
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Credit_Card_Transactions Dataset provides representative sample data for building fraud detection models, including anonymized real-world transaction data such as financial transaction type, amount, sender/receiver account balance, and fraud indicators.

    2) Data Utilization
    (1) Characteristics of the Credit_Card_Transactions Dataset:
    • This dataset provides individual transaction records on a row-by-row basis, reflecting the real-world class imbalance problem with the extremely low percentage of fraudulent transactions (isFraud=1).
    • It is an unprocessed raw data structure that allows you to directly utilize key variables such as transaction time, amount, and account change.
    (2) Uses of the Credit_Card_Transactions Dataset:
    • Binary classification modeling: Fraud transaction detection models can be developed by applying imbalance processing techniques such as SMOTE and undersampling, and appropriate evaluation indicators such as F1-score and ROC-AUC.
    • Real-time anomaly detection: It can be used to build a real-time anomaly signal detection system through analysis of transaction patterns (amount, frequency, account change).

  13. f

    Number of instances increased by SMOTE technique.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr (2023). Number of instances increased by SMOTE technique. [Dataset]. http://doi.org/10.1371/journal.pone.0179805.t003
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of instances increased by SMOTE technique.

  14. m

    MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat...

    • data.mendeley.com
    Updated Mar 20, 2025
    ABDERRAHMANE AQACHTOUL (2025). MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat Detection in MQTT Networks [Dataset]. http://doi.org/10.17632/jfttfjn6tr.1
    Dataset updated
    Mar 20, 2025
    Authors
    ABDERRAHMANE AQACHTOUL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the research article on MQTTEEB-D and is intended for public use in cybersecurity research. The MQTTEEB-D dataset is a practical real-world dataset for improving intrusion detection in Message Queuing Telemetry Transport (MQTT)-based Internet of Things (IoT) networks. In contrast to existing datasets that are constructed from simulated network traffic, MQTTEEB-D is obtained from a real-time IoT deployment at the International University of Rabat (UIR), Morocco. Using MySignals IoT health sensors, Raspberry Pi 4, and an MQTT broker server, this dataset represents the actual complexity of the active IoT communication process, which synthetic data fails to offer. To narrow the gap between simulated and real-world attack scenarios, various cyberattacks including Denial of Service (DoS), Slow DoS against Internet of Things Environments (SlowITe), Malformed Data Injection, Brute Force, and MQTT publish flooding were carried out in real time, permitting close monitoring of network traffic anomalies. The data was captured using the Python wrapper for tshark (PyShark) and organized into multiple Comma-Separated Values (CSV) files. To ensure high data quality, we performed pre-processing steps such as outlier removal, normalization, standardization, and class balancing. Several processed forms of this dataset (raw, cleaned, normalized, standardized, and balanced with the Synthetic Minority Over-sampling Technique (SMOTE)) are provided, along with detailed metadata to facilitate ease of use in cybersecurity research. This dataset provides an opportunity for researchers to develop and validate intrusion detection models in a real-world MQTT environment, a critical ingredient in Artificial Intelligence (AI)-driven cybersecurity solutions for IoT networks. The dataset will support future research in the IoT security and anomaly detection domains.
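
    The description lists normalization and standardization as separate processed forms. As a rough illustration of the usual difference between the two (min-max scaling to [0, 1] versus z-scoring to zero mean and unit variance), here is a minimal sketch. It assumes these common definitions, which the dataset description does not spell out; the function names and the `packet_sizes` values are hypothetical and are not taken from the dataset.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up feature column, for illustration only.
packet_sizes = [60.0, 120.0, 90.0, 1500.0]
normalized = min_max_normalize(packet_sizes)
standardized = standardize(packet_sizes)
print(normalized)
print(standardized)
```

    Min-max scaling is sensitive to a single extreme value such as the 1500 above, which is one reason outlier removal is often done first.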

  15. Z

    Data from: Date Fruit classification using a wide range of classifiers

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 16, 2023
    Yandre M. G. Costa (2023). Date Fruit classification using a wide range of classifiers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7938712
    Dataset updated
    May 16, 2023
    Dataset provided by
    Yandre M. G. Costa
    André F. R. Cordeiro
    Edson OliveiraJr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets generated by the techniques RUS and SMOTE. These datasets were used in the paper Date Fruit classification using a wide range of classifiers, accepted for publication in the International Conference on Systems, Signals and Image Processing (IWSSIP) 2023.

  16. h

    Language_Indentification_v2

    • huggingface.co
    Updated Mar 18, 2025
    ProcessVenue (2025). Language_Indentification_v2 [Dataset]. https://huggingface.co/datasets/Process-Venue/Language_Indentification_v2
    Dataset updated
    Mar 18, 2025
    Dataset authored and provided by
    ProcessVenue
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Language Identification Dataset

    Sample Notebook:
    https://www.kaggle.com/code/rishabhbhartiya/indian-language-classification-smote-resampled

    Kaggle Dataset link:
    https://www.kaggle.com/datasets/processvenue/indian-language-identification

    Dataset Summary

    A comprehensive dataset for Indian language identification and text classification. The dataset contains text samples across 18 major Indian languages, making it suitable for… See the full description on the dataset page: https://huggingface.co/datasets/Process-Venue/Language_Indentification_v2.

  17. h

    ml_data_test_detection_bank_transaction_frauds_unbalanced

    • huggingface.co
    Updated Jun 19, 2023
    Roberto Armas (2023). ml_data_test_detection_bank_transaction_frauds_unbalanced [Dataset]. https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced
    Dataset updated
    Jun 19, 2023
    Authors
    Roberto Armas
    Description

    ML Data Test Detection Bank Transaction Frauds Unbalanced

    The project provides a quick and accessible dataset designed for learning and experimenting with machine learning algorithms, specifically in the context of detecting fraudulent bank transactions. It is intended for practicing and applying concepts such as Random Forest, Support Vector Machines (SVM), and Synthetic Minority Over-sampling Technique (SMOTE) to address unbalanced classification problems. Note: This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced.

  18. f

    A comparative analysis of earlier studies.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Praveen Talari; Bharathiraja N; Gaganpreet Kaur; Hani Alshahrani; Mana Saleh Al Reshan; Adel Sulaiman; Asadullah Shaikh (2024). A comparative analysis of earlier studies. [Dataset]. http://doi.org/10.1371/journal.pone.0292100.t001
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Praveen Talari; Bharathiraja N; Gaganpreet Kaur; Hani Alshahrani; Mana Saleh Al Reshan; Adel Sulaiman; Asadullah Shaikh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes prediction is an ongoing research topic in which medical specialists are attempting to forecast the condition with greater precision. Diabetes typically stays dormant, and if patients are diagnosed with another illness, such as damage to the kidney vessels, problems with the retina of the eye, or a heart condition, it can cause metabolic problems and various complications in the body. Various ensemble learning techniques, including voting, boosting, and bagging, have been applied in this study. The Synthetic Minority Oversampling Technique (SMOTE), along with the K-fold cross-validation approach, was used to achieve class balancing and validate the findings. The Pima Indian Diabetes (PID) dataset, obtained from the UCI Machine Learning (UCI ML) repository, was chosen for this study. A feature engineering technique was used to calculate the influence of lifestyle factors. A two-phase classification model has been developed to predict insulin resistance using the Sequential Minimal Optimisation (SMO) and SMOTE approaches together. The SMOTE technique is used to preprocess data in the model’s first phase, while SMO classes are used in the second phase. All other classification techniques were outperformed by bagging decision trees in terms of misclassification error rate, accuracy, specificity, precision, recall, F1 measure, and ROC curve. The model was created using a combined SMOTE and SMO strategy, which achieved 99.07% accuracy with 0.1 ms of runtime. The proposed system is intended to enhance the classifier’s performance in detecting illness early.

  19. f

    Classification result classifiers using TF-IDF with SMOTE.

    • plos.figshare.com
    xls
    Updated May 28, 2024
    Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
    Available download formats: xls
    Dataset updated
    May 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Khaled Alnowaiser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification result classifiers using TF-IDF with SMOTE.

  20. f

    Results of Bioassay 1608 dataset in experiment 2.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Results of Bioassay 1608 dataset in experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0180830.t011
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of Bioassay 1608 dataset in experiment 2.

Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830

Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data

25 scholarly articles cite this dataset (Google Scholar).