100+ datasets found
  1. Imbalanced Data

    • ieee-dataport.org
    Updated Aug 23, 2023
    Cite
    Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
    Dataset updated
    Aug 23, 2023
    Authors
    Blessa Binolin M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification learning on non-stationary data may face dynamic changes over time. The major problems are class imbalance and the high cost of labeling instances despite drifts. Imbalance arises when the minority class has fewer samples than the majority class, and it results in the misclassification of data points.

  2. Performance comparison of machine learning models across accuracy, AUC, MCC,...

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.

  3. imbalanced data

    • ieee-dataport.org
    Updated Dec 14, 2022
    Cite
    ZHI WANG (2022). imbalanced data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data
    Dataset updated
    Dec 14, 2022
    Authors
    ZHI WANG
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset file is used for the study of imbalanced data and contains 6 imbalanced datasets.

  4. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

    Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

    Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
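
    The random undersampling (RUS) procedure named in the results can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `random_undersample`, the toy data, and the choice to shrink every class to the size of the rarest class are assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(features, labels, seed=0):
    """Randomly drop rows so that every class keeps as many rows as the rarest class.

    Hypothetical helper for illustration only; not taken from the listed study.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)

    # Size of the smallest class decides how many rows each class keeps.
    target = min(len(rows) for rows in by_class.values())

    new_features, new_labels = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):
            new_features.append(row)
            new_labels.append(label)
    return new_features, new_labels


# Toy data (assumed): 8 majority samples (label 0) and 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

    After resampling, both classes hold the same number of rows, and every minority sample is retained because only the majority class is reduced.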

  5. Predict students' dropout and academic success

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    Updated Mar 14, 2023
    + more versions
    Cite
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
    Dataset updated
    Mar 14, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
    Description

    A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

    The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

    The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

    Funding
    We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública" under grant POCI-05-5762-FSE-000191, Portugal.

  6. Data from: Imbalanced dataset for benchmarking

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Lemaitre, Guillaume (2020). Imbalanced dataset for benchmarking [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_61452
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Oliveira, Dayvid V. R.
    Aridas, Christos K.
    Lemaitre, Guillaume
    Nogueira, Fernando
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Imbalanced dataset for benchmarking

    The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which are more or less balanced. This benchmark was proposed in [1]. The following section presents the main characteristics of this benchmark.

    Characteristics

    ID  Name            Repository & Target            Ratio  # samples  # features
    1   Ecoli           UCI, target: imU               8.6:1  336        7
    2   Optical Digits  UCI, target: 8                 9.1:1  5,620      64
    3   SatImage        UCI, target: 4                 9.3:1  6,435      36
    4   Pen Digits      UCI, target: 5                 9.4:1  10,992     16
    5   Abalone         UCI, target: 7                 9.7:1  4,177      8
    6   Sick Euthyroid  UCI, target: sick euthyroid    9.8:1  3,163      25
    7   Spectrometer    UCI, target: >=44              11:1   531        93
    8   Car_Eval_34     UCI, target: good, v good      12:1   1,728      6
    9   ISOLET          UCI, target: A, B              12:1   7,797      617
    10  US Crime        UCI, target: >0.65             12:1   1,994      122
    11  Yeast_ML8       LIBSVM, target: 8              13:1   2,417      103
    12  Scene           LIBSVM, target: >one label     13:1   2,407      294
    13  Libras Move     UCI, target: 1                 14:1   360        90
    14  Thyroid Sick    UCI, target: sick              15:1   3,772      28
    15  Coil_2000       KDD, CoIL, target: minority    16:1   9,822      85
    16  Arrhythmia      UCI, target: 06                17:1   452        279
    17  Solar Flare M0  UCI, target: M->0              19:1   1,389      10
    18  OIL             UCI, target: minority          22:1   937        49
    19  Car_Eval_4      UCI, target: vgood             26:1   1,728      6
    20  Wine Quality    UCI, wine, target: <=4         26:1   4,898      11
    21  Letter Img      UCI, target: Z                 26:1   20,000     16
    22  Yeast_ME2       UCI, target: ME2               28:1   1,484      8
    23  Webpage         LIBSVM, w7a, target: minority  33:1   49,749     300
    24  Ozone Level     UCI, ozone, data               34:1   2,536      72
    25  Mammography     UCI, target: minority          42:1   11,183     6
    26  Protein homo.   KDD CUP 2004, minority         111:1  145,751    74
    27  Abalone_19      UCI, target: 19                130:1  4,177      8
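
    The Ratio column gives the number of majority-class samples per minority-class sample. A minimal sketch of that calculation follows; the helper name `imbalance_ratio` is hypothetical, and the 301/35 label split is an assumption chosen to be consistent with the Ecoli row (336 samples, ratio 8.6:1), not a figure stated in the listing.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Assumed split: 301 majority + 35 minority = 336 samples.
labels = [0] * 301 + [1] * 35
ratio = imbalance_ratio(labels)
print(f"{ratio:.1f}:1")
```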

    References

    [1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

    [2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

    [3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

    [4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

  7. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
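
    The five metrics listed in this description can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function name `binary_metrics` and the example counts are assumptions for illustration and are not taken from the study. It also assumes every denominator is nonzero.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, MCC and Cohen's kappa from confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)

    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "kappa": kappa}


# Assumed counts for an imbalanced problem: 12 positives, 88 negatives.
metrics = binary_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    With these counts, plain accuracy would be 0.94 even though a third of the positives are missed, which is why the chance-corrected and minority-focused metrics are the ones reported for imbalanced data.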

  8. Data from: High impact bug report identification with imbalanced learning...

    • researchdata.smu.edu.sg
    zip
    Updated Jun 1, 2023
    Cite
    YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies" and the full-text is available from: https://ink.library.smu.edu.sg/sis_research/3702

    In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedule and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected time or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.

    Supplementary code and data available from GitHub:

  9. Lending Club Loan Data

    • kaggle.com
    Updated Nov 8, 2020
    Cite
    Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset/code
    Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Sweta Shetye
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted a highly imbalanced dataset to share with others. It has the perfect one for us.

    Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

    For example, in this dataset there are far more samples of fully paid borrowers than of not fully paid borrowers.

    Full LendingClub data available from their site.

    Content

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.

  10. Under-sampled dataset.

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    + more versions
    Cite
    Seongil Han; Haemin Jung (2024). Under-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t003
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
    These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.

  11. Imbalanced Class

    • data.mendeley.com
    Updated Jan 10, 2022
    Cite
    Siti Mutrofin (2022). Imbalanced Class [Dataset]. http://doi.org/10.17632/grckc9gnn5.1
    Dataset updated
    Jan 10, 2022
    Authors
    Siti Mutrofin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced Class

  12. Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

    • acs.figshare.com
    zip
    Updated Apr 15, 2025
    Cite
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    ACS Publications
    Authors
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.

    The important findings of our studies are as follows:
    (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek;
    (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets;
    (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy;
    (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3;
    (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score;
    (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML;
    (vii) AutoML tools perform as well as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques.

    In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
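
    Threshold optimization, one of the techniques named in this description, can be sketched as a simple sweep over candidate decision thresholds. This is a generic illustration and not the GHOST procedure or the authors' code: the function names, the validation scores, and the choice of F1 as the objective are all assumptions.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when samples with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest F1 on held-out data."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Assumed validation data: 3 positives out of 10 samples (class ratio 0.3).
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

chosen = best_threshold(scores, labels, candidates)
print("F1 at default 0.5:", f1_at_threshold(scores, labels, 0.5))
print("chosen threshold:", chosen, "F1:", f1_at_threshold(scores, labels, chosen))
```

    Because the sweep only moves the cut-off and leaves the ordering of the scores untouched, ranking metrics such as AUC are unchanged by it, which is consistent with finding (i).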

  13. Unbalanced data sets

    • ieee-dataport.org
    Updated Dec 4, 2022
    Cite
    Yuchen liu (2022). Unbalanced data sets [Dataset]. https://ieee-dataport.org/documents/unbalanced-data-sets
    Dataset updated
    Dec 4, 2022
    Authors
    Yuchen liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Derived from public unbalanced data sets

  14. Dataset for Class Imbalance Classification Problem

    • kaggle.com
    Updated Jan 26, 2021
    Cite
    Akalya Subramanian (2021). Dataset for Class Imbalance Classification Problem [Dataset]. https://www.kaggle.com/akalyasubramanian/dataset-for-class-imbalance-classification-problem/activity
    Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Akalya Subramanian
    Description

    Dataset

    This dataset was created by Akalya Subramanian

  15. imbalanced-data-practice

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    CodeSignal (2025). imbalanced-data-practice [Dataset]. https://huggingface.co/datasets/codesignal/imbalanced-data-practice
    Dataset updated
    Apr 18, 2025
    Dataset authored and provided by
    CodeSignal: https://codesignal.com/
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    codesignal/imbalanced-data-practice dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Data from: swissALPS: a Dataset for Imbalanced Land Cover Mapping in Alpine...

    • ieee-dataport.org
    Updated Jan 29, 2024
    Cite
    Valerie Zermatten (2024). swissALPS: a Dataset for Imbalanced Land Cover Mapping in Alpine Regions [Dataset]. https://ieee-dataport.org/documents/swissalps-dataset-imbalanced-land-cover-mapping-alpine-regions
    Dataset updated
    Jan 29, 2024
    Authors
    Valerie Zermatten
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    scree

  17. Simulation Results on the Effect of Ensemble on Data Imbalance

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Yang, Yu (2023). Simulation Results on the Effect of Ensemble on Data Imbalance [Dataset]. http://doi.org/10.7910/DVN/ONTUOL
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Yang, Yu
    Description

    This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. It shows the result of ensemble models with threshold from [0, 0.05, 0.1, ..., 0.95, 1.0], in terms of the overall AP/AR and discrete (continuous) specific AP/AR. This dataset serves as a reference for practitioners to find the appropriate ensemble threshold that fits their business needs the best.

  18. Replication Data: Leveraging Researcher Domain Expertise to Annotate...

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Markus, Dror (2023). Replication Data: Leveraging Researcher Domain Expertise to Annotate Concepts within Imbalanced Data [Dataset]. http://doi.org/10.7910/DVN/IEX083
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Markus, Dror
    Description

    In this manuscript, we describe a method to utilize researcher domain expertise to annotate concepts efficiently and accurately within an imbalanced dataset. This folder contains two scripts that run two variations of the simulation referred to in our paper. Additionally, we included two separate datasets that were utilized in the simulations. For each, we shared the list of document embeddings used for classification, together with a corresponding CSV which holds the categorical labels for each embedding. We recommend first reading the "README" text file, before running the scripts.

  19. Youtube Videos Dataset (~3400 videos)

    • opendatabay.com
    Updated Jun 10, 2025
    + more versions
    Cite
    Datasimple (2025). Youtube Videos Dataset (~3400 videos) [Dataset]. https://www.opendatabay.com/data/ai-ml/fef9b558-dda7-42c6-83e3-048d99e5135b
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    YouTube, Social Media and Networking
    Description

    Context 📃 I wanted to practice text classification using NLP techniques, so I thought why not practice it by generating the data myself! This way, I brushed up on my scraping techniques using Selenium, collected the data, cleaned it, and then started working on it. You can take a peek at my work Github Repository For This Dataset and Trained Models/ Results

    Content 📰 The total number of videos scraped was 3600. I scraped the following things from each video:

    link title description category Video ID Category for which the video was scraped Description of the video Category for which the video was scraped.

    I queried the videos for 4 categories:

    • Travel Vlogs 🧳
    • Food 🥑
    • Art and Music 🎨 🎻
    • History 📜

    Acknowledgements 🙏 I could have used a ready made API, but just for the fun of it, I scraped the data from Youtube using Selenium.

    Inspiration 🦋 The data is not clean (for your enjoyment of cleaning the data!), has some missing values, and is imbalanced. To practice text classification on this dataset, you will have to learn different techniques, e.g., how to handle imbalanced classes. While working on this dataset, you will learn a lot of different things and get an opportunity to apply them.

    Original Data Source: Youtube Videos Dataset (~3400 videos)

  20. Results of machine learning experiments for "Multi-classifier prediction of...

    • data.ncl.ac.uk
    tar
    Updated Oct 30, 2019
    Cite
    Paweł Widera (2019). Results of machine learning experiments for "Multi-classifier prediction of knee osteoarthritis progression from incomplete imbalanced longitudinal data" [Dataset]. http://doi.org/10.25405/data.ncl.10043060
    Available download formats: tar
    Dataset updated
    Oct 30, 2019
    Dataset provided by
    Newcastle University
    Authors
    Paweł Widera
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The archive file includes results of machine learning experiments performed for the article "Multi-classifier prediction of knee osteoarthritis progression from incomplete imbalanced longitudinal data". The hypothesis of the article is that prediction models trained on historical data will be more effective at identifying fast progressing knee OA patients than conventional inclusion criteria.

    For all experiments the first level folder hierarchy indicates the method used. Where parameter tuning is performed, the second level folders indicate algorithm parameters. Each experiment output is stored in a xz compressed text file in JSON format.

    In experiments measuring the learning curves (training-*), each results file describes:
    * experiment setup (algorithm, number of subsets, down-sampled class size)
    * list of training set sizes
    * performance measure statistics for all subsets at each training size (flat list) including min, median and max score, and median deviation from median (mad), given for both test and training set instances

    In parameter tuning experiments (prediction-multi-*), each results file contains:
    * experiment setup (method / algorithm, number of CV repeats, number of model runs)
    * imputer parameters (not important, kept constant in all experiments)
    * classifier parameters (for random forest)
    * true class for each instance
    * class predictions by the median model from each CV-repeat
    * class probabilities estimated by the median model from each CV-repeat
    * performance measure statistics for each CV-repeat including min, median and max score, and median deviation from median (mad)

    In RFE experiments (prediction-multi-rfe-*) the results additionally include:
    * scores for all RFE steps for each CV-repeat
    * number of times each feature was selected (across all folds and CV-repeats)
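
    The summary statistics named in this description (min, median, max, and median deviation from median) can be computed as in the sketch below. It assumes that "mad" means the median of absolute deviations from the median; the function name `summarize` and the example scores are hypothetical and do not come from the archive.

```python
import statistics


def summarize(scores):
    """Return min, median, max and median absolute deviation of a list of scores."""
    med = statistics.median(scores)
    # Assumption: "mad" is the median of |score - median|.
    mad = statistics.median(abs(s - med) for s in scores)
    return {"min": min(scores), "median": med, "max": max(scores), "mad": mad}


# Assumed scores from five cross-validation folds.
summary = summarize([0.70, 0.74, 0.75, 0.79, 0.90])
print(summary)
```

    Unlike the standard deviation, this spread measure is barely affected by the single outlying score of 0.90.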
