41 datasets found
  1. Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

    • acs.figshare.com
    zip
    Updated Apr 15, 2025
    Cite
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
    Available download formats: zip
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    ACS Publications
    Authors
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques outperform the others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
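
    As a rough illustration of the threshold-optimization idea in this abstract, the following plain-Python sketch sweeps candidate decision thresholds over toy scores and keeps the one with the best F1. It is not GHOST and not the authors' code; the helper names (`confusion`, `best_threshold`) and the toy labels and scores are hypothetical. Thresholding never reorders the scores, so ranking metrics such as AUC are untouched by it, which is consistent with finding (i).

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2


def best_threshold(y_true, scores, metric=f1_score):
    """Try every distinct score as a threshold; keep the first one that maximizes the metric."""
    best_t, best_value = 0.5, -1.0
    for t in sorted(set(scores)):
        y_pred = [1 if s >= t else 0 for s in scores]
        value = metric(*confusion(y_true, y_pred))
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value


# Toy data (hypothetical): 2 positives out of 10, all scores below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45]

default_pred = [1 if s >= 0.5 else 0 for s in scores]
default_f1 = f1_score(*confusion(y_true, default_pred))
tuned_t, tuned_f1 = best_threshold(y_true, scores)

print("F1 at threshold 0.5:", default_f1)
print("best threshold:", tuned_t, "F1:", tuned_f1)
```

    In practice the threshold would be chosen on a validation split rather than on the data used for the final evaluation.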

  2. Imbalanced Data

    • ieee-dataport.org
    Updated Aug 23, 2023
    Cite
    Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
    Dataset updated
    Aug 23, 2023
    Authors
    Blessa Binolin M
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification learning on non-stationary data may face dynamic changes from time to time. The major problems are class imbalance and the high cost of labeling instances despite drifts. Imbalance arises when the minority class has fewer samples than the majority class. Imbalanced data results in the misclassification of data points.

  3. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

    Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

    Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
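
    Random Undersampling (RUS), named in the results above, can be sketched in a few lines of plain Python. This is only an illustration of the general technique on made-up data, not the study's pipeline; the function name `random_undersample` and the toy feature values are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class has as many samples as the smallest class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)

    n_min = min(len(idxs) for idxs in by_class.values())

    # Keep n_min randomly chosen samples per class, preserving original order.
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))
    keep.sort()

    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 9 majority samples and 3 minority samples.
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3

X_bal, y_bal = random_undersample(X, y)
counts = Counter(y_bal)
print(counts)
```

    Resampling of this kind is normally applied to the training split only, so that the test split keeps the original class distribution.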

  4. Data from: S1 Datasets -

    • plos.figshare.com
    bin
    Updated Feb 10, 2025
    + more versions
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
    Available download formats: bin
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
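
    For orientation, the SMOTE step that CRN-SMOTE builds on can be sketched as follows. This shows only plain SMOTE interpolation with 5 neighbors, as an assumption-laden toy; it does not implement the cluster-based noise reduction described above, and the function name `smote` and the sample points are hypothetical.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating towards a random near neighbour.

    Plain SMOTE only: no noise filtering or clustering is performed here.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)

        # Rank the other minority points by squared Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))

        # Pick one of the k nearest neighbours and interpolate at a random position.
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical 2-D points).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]]
new_points = smote(minority, n_new=4, k=5)

for point in new_points:
    print(point)
```

    Each synthetic point lies on the segment between two existing minority points, which is why noisy minority samples can propagate and why noise-reduction variants exist.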

  5. Data from: Handling Imbalanced Classification Problems by Weighted...

    • tandf.figshare.com
    ai
    Updated Dec 16, 2024
    Cite
    Chen Dou; Yan Lv; Zhen Wang; Lan Bai (2024). Handling Imbalanced Classification Problems by Weighted Generalization Memorization Machine [Dataset]. http://doi.org/10.6084/m9.figshare.25858505.v1
    Available download formats: ai
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Chen Dou; Yan Lv; Zhen Wang; Lan Bai
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced classification problems are of great practical significance, and many methods have been used to deal with them, e.g. eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Decision Trees (DT), and Support Vector Machine (SVM). Recently, a novel Generalization-Memorization Machine (GMM) was proposed to maintain good generalization ability with zero empirical risk for binary classification. This paper proposes a Weighted Generalization Memorization Machine (WGMM) for imbalanced classification. By improving the memory cost function and memory influence function of GMM, our WGMM also maintains zero empirical risk with good generalization ability for imbalanced classification learning. The new adaptive memory influence function in our WGMM ensures that samples are described individually and are not affected by training samples from different categories. We conduct experiments on 31 datasets and compare the WGMM with some other classification methods. The results exhibit the effectiveness of the WGMM.

  6. Predict students' dropout and academic success

    • zenodo.org
    Updated Mar 14, 2023
    + more versions
    Cite
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
    Dataset updated
    Mar 14, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
    Description

    A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

    The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

    The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

    Funding
    We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal"

  7. Data from: Financial Fraud Detection Dataset

    • opendatabay.com
    Updated Jun 25, 2025
    Cite
    Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Review Nexus
    License

    Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Fraud Detection & Risk Management
    Description

    This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

    Performance Note:

    Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.

    Features:

    • Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.
    • Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.
    • Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.
    • Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.

    Usage:

    • Machine learning model training for fraud detection.
    • Evaluation of anomaly detection and imbalanced classification methods.
    • Development of cost-sensitive learning approaches using the Amount variable.

    Data Summary:

    • Total Records: 284,807
    • Fraud Cases: 492
    • Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.
    • Columns: 31 total (28 PCA features, plus Time, Amount, and Class)
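
    The performance note above can be checked with simple arithmetic from the counts in the data summary. The sketch below scores a trivial model that labels every transaction as legitimate; the variable names are illustrative and no actual data file is read.

```python
# Counts taken from the data summary above.
total = 284_807
fraud = 492

fraud_rate = fraud / total  # share of the positive (fraud) class

# A trivial "model" that predicts legitimate (class 0) for every transaction.
tp, fp = 0, 0
fn, tn = fraud, total - fraud

accuracy = (tp + tn) / total
recall = tp / (tp + fn)  # fraction of fraud cases actually caught

print(f"fraud share: {fraud_rate:.4%}")
print(f"accuracy of always predicting legitimate: {accuracy:.4%}")
print(f"recall on the fraud class: {recall:.0%}")
```

    The baseline scores above 99.8% accuracy while catching no fraud at all, which is why metrics such as AUPRC or F1 are more informative here.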

    License:

    The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

    Acknowledgements

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

    Please cite the following works:

    Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

    Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

    Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

    Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

    Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

    Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

    Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

    Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics

  8. Comparison of prediction errors of different models.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Comparison of prediction errors of different models. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t005
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of prediction errors of different models.

  9. Data from: WikiChurches – A Fine-Grained Dataset of Architectural Styles...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, json, pdf, txt +1
    Updated Jul 18, 2024
    Cite
    Björn Barz; Joachim Denzler (2024). WikiChurches – A Fine-Grained Dataset of Architectural Styles with Real-World Challenges [Dataset]. http://doi.org/10.5281/zenodo.5166987
    Available download formats: pdf, txt, json, bin, zip
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Björn Barz; Joachim Denzler
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WikiChurches is a dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available.

    Please refer to the README.md file for information about the different files contained in this dataset.

  10. Similarity measures results of different datasets on different models.

    • figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Similarity measures results of different datasets on different models. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t003
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Similarity measures results of different datasets on different models.

  11. ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 27, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-sentiment-analysis-of-commodity-news-gold-732f/e3232de2/?iid=002-045&v=presentation
    Dataset updated
    Sep 27, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).

    Content

    The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance - if a headline is a price related news then what is the direction of price movements it is talking about; whether the news headline is talking about the past or future; whether the news item is talking about asset comparison; etc.

    Acknowledgements

    Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.

    https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)

    We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).

    Inspiration

    Commodity prices are known to be quite volatile. Machine learning models that understand commodity news well will be able to provide an additional input to short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.

    Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.

    --- Original source retains full ownership of the source dataset ---

  12. The Depression Dataset

    • kaggle.com
    Updated Feb 6, 2021
    Cite
    Möbius (2021). The Depression Dataset [Dataset]. https://www.kaggle.com/arashnic/the-depression-dataset/metadata
    Dataset updated
    Feb 6, 2021
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Depression is a severe mental disorder with characteristic symptoms like sadness, the feeling of emptiness, anxiety and sleep disturbance, as well as general loss of initiative and interest in activities. Additionally, features like the feeling of guilt or worthlessness, reduced energy, concentration problems, suicidality and psychotic symptoms might be present. The severity of a depression is determined by the quantity of symptoms, their seriousness and duration, as well as the consequences on social and occupational function. Depressions are also common in Bipolar disorder, another severe psychiatric disorder. The main difference between uni-polar depression and bipolar disorder is the periodic occurrence of mania in the latter, a state associated with inflated self-esteem, impulsivity, increased activity, reduced sleep and goal-directed actions. Both diseases are genetic disorders, and might be understood as a genetic vulnerability to the environment disturbing the internal biological state and potentially triggering mood episodes. Depression is associated with disrupted biological rhythms caused by environmental disturbance like seasonal change in daylight, alteration of social rhythms due to, for instance, shift-work or longitude traveling; besides linked to lifestyles associated with diurnal rhythms inconsistent with the natural daylight cycle. The appearance of depressive symptoms relates furthermore to physical health issues, medical side effects, life events and social factors, besides alcohol and substance abuse, and such factors might also potentially cause symptoms of depression in all humans. The global lifetime prevalence of depression is roughly 15%, but the incidences of episodes with a severity level not meeting the requirements for a depressive diagnosis are far more prevalent. Actigraph recordings of motor activity are considered an objective method for observing depression, although this topic is far from exhaustively studied within psychiatric research.

    Content

    The dataset contains two folders, one containing the data for the controls and one for the condition group. For each patient a csv file has been provided containing the actigraph data collected over time. The columns are: timestamp (one minute intervals), date (date of measurement), activity (activity measurement from the actigraph watch). In addition, the MADRS scores are provided in the file "scores.csv". It contains the following columns: number (patient identifier), days (number of days of measurements), gender (1 or 2 for female or male), age (age in age groups), afftype (1: bipolar II, 2: unipolar depressive, 3: bipolar I), melanch (1: melancholia, 2: no melancholia), inpatient (1: inpatient, 2: outpatient), edu (education grouped in years), marriage (1: married or cohabiting, 2: single), work (1: working or studying, 2: unemployed/sick leave/pension), madrs1 (MADRS score when measurement started), madrs2 (MADRS when measurement stopped).

    More : https://datasets.simula.no/depresjon/#dataset-details

    Starter Code

    Enrique Garcia-Ceja, Michael Riegler, Petter Jakobsen, Jim Tørresen, Tine Nordgreen, Ketil J. Oedegaard, Ole Bernt Fasmer, Depresjon: A Motor Activity Database of Depression Episodes in Unipolar and Bipolar Patients, In MMSys'18 Proceedings of the 9th ACM on Multimedia Systems Conference, Amsterdam, The Netherlands, June 12 - 15, 2018. paper: https://dl.acm.org/doi/pdf/10.1145/3204949.3208125

    Inspiration

    The available data may eventually help researchers to develop systems capable of automatically detecting depression states based on sensor data. This dataset is suitable for (but not limited to) the following applications:

    • (i) Use machine learning for depression states classification
    • (ii) MADRS score prediction based on motor activity data
    • (iii) Sleep pattern analysis of depressed vs. non-depressed participants

    This dataset can be used as the basis for evaluating different machine learning methods and approaches such as cost-sensitive classification and oversampling techniques for imbalanced class problems. This dataset is also suitable for comparing different machine learning classification approaches such as feature based and deep learning based methods like convolutional neural networks and recurrent neural networks for time series.

  13. Validation of the validity of the ccr index.

    • figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Validation of the validity of the ccr index. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t001
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imbalanced data is a problem in which the number of samples in different categories or target value ranges varies greatly. Data imbalance poses great challenges to machine learning and pattern recognition. The performance of machine learning models tends to be biased toward the majority samples in an imbalanced dataset, which further degrades the effectiveness of the model. The imbalanced data problem includes both an imbalanced classification problem and an imbalanced regression problem. Many studies have addressed imbalanced classification data. Nevertheless, the imbalanced regression problem has not been well researched. To solve the problem of imbalanced regression data, we define an RNGRU model that can simultaneously learn the regression characteristics and neighbor characteristics of regression samples. To obtain the most comprehensive sample information of regression samples, the model uses the idea of confrontation to determine the proportion between the regression characteristics and neighbor characteristics of the original samples. Based on the regression characteristics of the regression samples, an index, ccr (correlation change rate), is proposed to evaluate the similarity between the generated samples and the original samples. On this basis, an RNGAN model is proposed to reduce the similarity between the generated samples and the original samples by using the idea of confrontation.

  14. ml_data_test_detection_bank_transaction_frauds_unbalanced

    • huggingface.co
    Updated Jun 19, 2023
    Cite
    Roberto Armas (2023). ml_data_test_detection_bank_transaction_frauds_unbalanced [Dataset]. https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced
    Dataset updated
    Jun 19, 2023
    Authors
    Roberto Armas
    Description

    ML Data Test Detection Bank Transaction Frauds Unbalanced

    The project provides a quick and accessible dataset designed for learning and experimenting with machine learning algorithms, specifically in the context of detecting fraudulent bank transactions. It is intended for practicing and applying concepts such as Random Forest, Support Vector Machines (SVM), and Synthetic Minority Over-sampling Technique (SMOTE) to address unbalanced classification problems. Note: This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced.
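As a rough illustration of the SMOTE idea named in the description above, the following is a minimal sketch in plain Python. It is not the imbalanced-learn implementation and does not use this dataset: the function name `smote_like_oversample`, its parameters, and the toy minority points are all hypothetical. The sketch assumes numeric feature vectors and creates each synthetic sample by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points (hypothetical helper).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # All other minority samples, sorted by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != base_idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy two-feature minority samples, invented for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only.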

  15. Number of datasets on which a combination of machine learning and sampling...

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    Misuk Kim; Kyu-Baek Hwang (2023). Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t001
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Misuk Kim; Kyu-Baek Hwang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve.

  16. Credit Card Fraud Detection

    • kaggle.com
    • test.researchdata.tuwien.ac.at
    zip
    Updated Mar 23, 2018
    Cite
    Machine Learning Group - ULB (2018). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/mlg-ulb/creditcardfraud
    Available download formats: zip (69155672 bytes)
    Dataset updated
    Mar 23, 2018
    Dataset authored and provided by
    Machine Learning Group - ULB
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

    Content

    The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

    It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

    Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
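To make the recommendation above concrete, here is a minimal sketch in plain Python. It uses only the two counts quoted in the description (492 frauds out of 284,807 transactions); the toy labels and scores are invented for illustration and are not drawn from the dataset, and the helper names are hypothetical. It shows that a classifier which never flags fraud still reaches very high accuracy, while a step-wise area under the precision-recall curve (average precision) separates a useful ranking from a useless one.

```python
# Counts quoted in the dataset description above.
N_TOTAL = 284_807
N_FRAUD = 492


def always_negative_accuracy(n_total, n_positive):
    """Accuracy of a classifier that labels every transaction as non-fraud."""
    return (n_total - n_positive) / n_total


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Items are ranked by descending score; each positive contributes the
    precision measured at its own rank. Ties keep input order (a simplification).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos if n_pos else 0.0


prevalence = N_FRAUD / N_TOTAL
baseline_accuracy = always_negative_accuracy(N_TOTAL, N_FRAUD)

# Hypothetical toy scores, not taken from the dataset.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
good_scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.8, 0.1, 0.3, 0.2]
bad_scores = [1.0 - s for s in good_scores]  # ranks both frauds last

ap_good = average_precision(toy_labels, good_scores)
ap_bad = average_precision(toy_labels, bad_scores)

print(f"fraud prevalence:         {prevalence:.3%}")
print(f"always-negative accuracy: {baseline_accuracy:.3%}")
print(f"average precision (good): {ap_good:.3f}")
print(f"average precision (bad):  {ap_bad:.3f}")
```

The always-negative baseline detects no fraud at all, yet its accuracy is above 99.8%, which is why accuracy alone says little here. In practice a library routine such as scikit-learn's `average_precision_score` would typically replace the hand-written helper.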

    Update (03/05/2021)

    A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.

    Acknowledgements

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.

    Please cite the following works:

    Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

    Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

    Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

    Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

    Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

    Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

    Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

    Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics

  17. Number of datasets on which a combination of machine learning and sampling...

    • plos.figshare.com
    xls
    Updated Jun 16, 2023
    Cite
    Misuk Kim; Kyu-Baek Hwang (2023). Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the receiver operating characteristics curve. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t002
    Available download formats: xls
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Misuk Kim; Kyu-Baek Hwang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the receiver operating characteristics curve.

  18. The adjustment process of step length T of RNGAN model.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). The adjustment process of step length T of RNGAN model. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t002
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The adjustment process of step length T of RNGAN model.

  19. ContractNLI (ContractNLI: A Dataset for Document-level Natural Language...

    • opendatalab.com
    zip
    Updated Sep 22, 2022
    Cite
    Stanford University (2022). ContractNLI (ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts) [Dataset]. https://opendatalab.com/OpenDataLab/ContractNLI
    Available download formats: zip (83759009 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Hitachi America LTD
    Stanford University
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ContractNLI is a dataset for document-level natural language inference (NLI) on contracts, whose goal is to automate or support the time-consuming procedure of contract review. In this task, a system is given a set of hypotheses (such as “Some obligations of Agreement may survive termination.”) and a contract, and it is asked to classify whether each hypothesis is entailed by, contradicted by, or not mentioned by (neutral to) the contract, as well as to identify evidence for the decision as spans in the contract. ContractNLI is the first dataset to utilize NLI for contracts and is also the largest corpus of annotated contracts (as of September 2021). ContractNLI is an interesting challenge from a machine learning perspective (the label distribution is imbalanced and the task is naturally multi-task, while training data is scarce) and from a linguistic perspective (linguistic characteristics of contracts, particularly negations by exceptions, make the problem difficult).

  20. Metrics with the dataset.

    • plos.figshare.com
    xls
    Updated May 13, 2025
    Cite
    André G. Coimbra; Cleiane G. Oliveira; Matheus P. Libório; Hasheem Mannan; Laercio I. Santos; Elisa Fusco; Marcos F.S.V. D’Angelo (2025). Metrics with the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0320966.t003
    Available download formats: xls
    Dataset updated
    May 13, 2025
    Dataset provided by
    PLOS ONE
    Authors
    André G. Coimbra; Cleiane G. Oliveira; Matheus P. Libório; Hasheem Mannan; Laercio I. Santos; Elisa Fusco; Marcos F.S.V. D’Angelo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning has increasingly gained prominence in the healthcare sector due to its ability to address various challenges. However, a significant issue remains unresolved in this field: the handling of imbalanced data. This process is crucial for ensuring the efficiency of algorithms that utilize classification techniques, which are commonly applied in risk management, monitoring, diagnosis, and prognosis of patient health. This study conducts a comparative analysis of techniques for handling imbalanced data and evaluates their effectiveness in combination with a set of classification algorithms, specifically focusing on stroke prediction. Additionally, a new approach based on Particle Swarm Optimization (PSO) and Naive Bayes was proposed. This approach was applied to the real problem of Chagas disease. The application of these techniques aims to improve the quality of life for individuals, reduce healthcare costs, and allocate available resources more efficiently, making it a preventive action.

Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001

Data from: Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML

Available download formats: zip
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our studies are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern for balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
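Finding (i) can be illustrated with a minimal sketch in plain Python. The labels and scores below are hypothetical toy values, not the study's data, and the threshold search is a plain F1-maximizing scan rather than GHOST or the authors' AUPR-based procedure. The `percent_improvement` helper assumes that "percentage improvement" means relative change over the baseline, which the description does not spell out.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at_threshold(labels, scores, threshold):
    """F1 score when every score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, scores):
    """Scan every observed score as a candidate threshold; keep the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at_threshold(labels, scores, t))


def percent_improvement(baseline, improved):
    """Relative change in percent (assumed definition)."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical toy data with a class ratio of 0.3 (3 positives out of 10).
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.40, 0.60]

auc = roc_auc(labels, scores)                       # no threshold involved
f1_default = f1_at_threshold(labels, scores, 0.5)   # conventional cut-off
best_t = best_f1_threshold(labels, scores)
f1_best = f1_at_threshold(labels, scores, best_t)

print(f"AUC = {auc:.3f}")
print(f"F1 at 0.5 = {f1_default:.3f}, F1 at {best_t} = {f1_best:.3f}")
print(f"F1 improvement = {percent_improvement(f1_default, f1_best):.1f}%")
```

The threshold never enters `roc_auc`, which depends only on how the scores order the samples, so moving the cut-off changes F1 but leaves AUC untouched. Class weighting and resampling retrain the model and therefore change the scores themselves, which is why they can move ranking metrics. In a real workflow the threshold would be chosen on held-out validation data rather than on the data used for evaluation.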
