41 datasets found

f
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
i
Imbalanced Data
ieee-dataport.org
Updated Aug 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
Explore at:
Dataset updated
Aug 23, 2023
Authors
Blessa Binolin M
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.
f
Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
f
Data from: S1 Datasets -
plos.figshare.com
bin
Updated Feb 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
f
Data from: Handling Imbalanced Classification Problems by Weighted...
tandf.figshare.com
ai
Updated Dec 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chen Dou; Yan Lv; Zhen Wang; Lan Bai (2024). Handling Imbalanced Classification Problems by Weighted Generalization Memorization Machine [Dataset]. http://doi.org/10.6084/m9.figshare.25858505.v1
Explore at:
aiAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25858505.v1
Dataset updated
Dec 16, 2024
Dataset provided by
Taylor & Francis
Authors
Chen Dou; Yan Lv; Zhen Wang; Lan Bai
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Imbalanced classification problems are of great significance in life, and there have been many methods to deal with them, e.g. eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Decision Trees (DT), and Support Vector Machine (SVM). Recently, a novel Generalization-Memorization Machine (GMM) was proposed to maintain good generalization ability with zero empirical for binary classification. This paper proposes a Weighted Generalization Memorization Machine (WGMM) for imbalanced classification. By improving the memory cost function and memory influence function of GMM, our WGMM also maintains zero empirical risk with well generalization ability for imbalanced classification learning. The new adaptive memory influence function in our WGMM achieves that samples are described individually and not affected by other training samples from different category. We conduct experiments on 31 datasets and compare the WGMM with some other classification methods. The results exhibit the effectiveness of the WGMM.
Predict students' dropout and academic success
zenodo.org
Updated Mar 14, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valentim Realinho; Valentim Realinho; Jorge Machado; Jorge Machado; Luís Baptista; Luís Baptista; Mónica V. Martins; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5777340
Dataset updated
Mar 14, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Valentim Realinho; Valentim Realinho; Jorge Machado; Jorge Machado; Luís Baptista; Luís Baptista; Mónica V. Martins; Mónica V. Martins
Description
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal"
o
Data from: Financial Fraud Detection Dataset
opendatabay.com
.undefined
Updated Jun 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
Explore at:
.undefinedAvailable download formats
Dataset updated
Jun 25, 2025
Dataset authored and provided by
Review Nexus
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Area covered
Fraud Detection & Risk Management
Description
This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

Performance Note:

Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.

Features:

Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.

Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.

Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.

Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.

Usage:

Machine learning model training for fraud detection.

Evaluation of anomaly detection and imbalanced classification methods.

Development of cost-sensitive learning approaches using the Amount variable.

Data Summary:

Total Records: 284,807

Fraud Cases: 492

Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.

Columns: 31 total (28 PCA features, plus Time, Amount, and Class)

License:

The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, IInternational Journal of Data Science and Analytics
f
Comparison of prediction errors of different models.
plos.figshare.com
xls
Updated Jan 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Comparison of prediction errors of different models. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0291656.t005
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS ONE
Authors
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of prediction errors of different models.
Data from: WikiChurches – A Fine-Grained Dataset of Architectural Styles...
zenodo.org
explore.openaire.eu
+1more
bin, json, pdf, txt +1
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Björn Barz; Björn Barz; Joachim Denzler; Joachim Denzler (2024). WikiChurches – A Fine-Grained Dataset of Architectural Styles with Real-World Challenges [Dataset]. http://doi.org/10.5281/zenodo.5166987
Explore at:
pdf, txt, json, bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5166987
Dataset updated
Jul 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Björn Barz; Björn Barz; Joachim Denzler; Joachim Denzler
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
WikiChurches is a dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available.

Please refer to the README.md file for information about the different files contained in this dataset.
f
Similarity measures results of different datasets on different models.
figshare.com
xls
Updated Jan 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Similarity measures results of different datasets on different models. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0291656.t003
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS ONE
Authors
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Similarity measures results of different datasets on different models.
A
‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2
analyst-2.ai
Updated Sep 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-sentiment-analysis-of-commodity-news-gold-732f/e3232de2/?iid=002-045&v=presentation
Explore at:
Dataset updated
Sep 27, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.

--- Dataset description provided by original source is as follows ---

Context

This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).

Content

The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance - if a headline is a price related news then what is the direction of price movements it is talking about; whether the news headline is talking about the past or future; whether the news item is talking about asset comparison; etc.

Acknowledgements

Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.

https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)

We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).

Inspiration

Commodity prices are known to be quite volatile. Machine learning models that understand the commodity news well, will be able to provide an additional input to the short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.

Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.

--- Original source retains full ownership of the source dataset ---
The Depression Dataset
kaggle.com
Updated Feb 6, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Möbius (2021). The Depression Dataset [Dataset]. https://www.kaggle.com/arashnic/the-depression-dataset/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 6, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Möbius
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Depression is a severe mental disorder with characteristic symptoms like sadness, the feeling of emptiness, anxiety and sleep disturbance, as well as general loss of initiative and interest in activities. Additionally, features like the feeling of guilt or worthlessness, reduced energy, concentration problems, suicidality and psychotic symptoms might be present. The severity of a depression is determined by the quantity of symptoms, their seriousness and duration, as well as the consequences on social and occupational function. Depressions are also common in Bipolar disorder, another severe psychiatric disorder. The main difference between uni-polar depression and bipolar disorder is the periodic occurrence of mania in the latter, a state associated with inﬂated self-esteem, impulsivity, increased activity, reduced sleep and goal-directed actions. Both diseases are genetic disorders, and might be understood as a genetic vulnerability to the environment disturbing the internal biological state and potentially trigger mood episodes. Depression is associated with disrupted biological rhythms caused by environmental disturbance like seasonal change in daylight, alteration of social rhythms due to for instance shift-work or longitude traveling; besides linked to lifestyles associated with diurnal rhythms inconsistent with the natural daylight cycle. The appearance of depressive symptoms relates furthermore to physical health issues, medical side effects, life events and social factors, besides alcohol and substance abuse, and such factors might also potentially cause symptoms of depression in all humans. The global lifetime prevalence of depression is roughly 15%, but the incidences of episodes with a severity level not meeting the requirements for a depressive diagnosis are far more prevalent. Actigraph recordings of motor activity are considered an objective method for observing depression, although this topic is far from exhaustive studied within psychiatric research.

Content

The dataset contains two folders, whereas one contains the data for the controls and one for the condition group. For each patient a csv file has been provided containing the actigraph data collected over time. The columns are: timestamp (one minute intervals), date (date of measurement), activity (activity measurement from the actigraph watch). In addition, the MADRS scores provided in the file "scores.csv". It contains the following columns; number (patient identifier), days (number of days of measurements), gender (1 or 2 for female or male), age (age in age groups), afftype (1: bipolar II, 2: unipolar depressive, 3: bipolar I), melanch (1: melancholia, 2: no melancholia), inpatient (1: inpatient, 2: outpatient), edu (education grouped in years), marriage (1: married or cohabiting, 2: single), work (1: working or studying, 2: unemployed/sick leave/pension), madrs1 (MADRS score when measurement started), madrs2 (MADRS when measurement stopped).

More : https://datasets.simula.no/depresjon/#dataset-details

Starter Code

@docxian : Evaluate motor activity of depression patients (and healthy control group) ### Acknowledgements

Enrique Garcia-Ceja, Michael Riegler, Petter Jakobsen, Jim Tørresen, Tine Nordgreen, Ketil J. Oedegaard, Ole Bernt Fasmer, Depresjon: A Motor Activity Database of Depression Episodes in Unipolar and Bipolar Patients, In MMSys'18 Proceedings of the 9th ACM on Multimedia Systems Conference, Amsterdam, The Netherlands, June 12 - 15, 2018. paper: https://dl.acm.org/doi/pdf/10.1145/3204949.3208125

Inspiration

The available data may eventually help researchers to develop systems capable of automatically detecting depression states based on sensor data. This dataset can be suitable (but not limited to) for the following applications:

(i) Use machine learning for depression states classification

(ii) MADRS score prediction based on motor activity data

(iii) Sleep pattern analysis of depressed v.s. non-depressed participants #
This dataset can be used as the basis for evaluating different machine learning methods and approaches such as: cost-sensitive classification and oversampling techniques for imbalanced class problems. This dataset is also suitable for comparing different machine learning classification approaches such as feature based and deep learning based methods like convolutional neural networks and recurrent neural networks for time series.
f
Validation of the validity of the ccr index.
figshare.com
xls
Updated Jan 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). Validation of the validity of the ccr index. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0291656.t001
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS ONE
Authors
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Imbalanced data is a problem in that the number of samples in different categories or target value ranges varies greatly. Data imbalance imposes excellent challenges to machine learning and pattern recognition. The performance of machine learning models leans to be partially towards the majority of samples in the imbalanced dataset, which will further affect the effect of the model. The imbalanced data problem includes an imbalanced categorical problem and an imbalanced regression problem. Many studies have been developed to address the issue of imbalanced classification data. Nevertheless, the imbalanced regression problem has not been well-researched. In order to solve the problem of unbalanced regression data, we define an RNGRU model that can simultaneously learn the regression characteristics and neighbor characteristics of regression samples. To obtain the most comprehensive sample information of regression samples, the model uses the idea of confrontation to determine the proportion between the regression characteristics and neighbor characteristics of the original samples. According to the regression characteristics of the regression samples, an index ccr (correlation change rate) is proposed to evaluate the similarity between the generated samples and the original samples. And on this basis, an RNGAN model is proposed to reduce the similarity between the generated samples and the original samples by using the idea of confrontation.
h
ml_data_test_detection_bank_transaction_frauds_unbalanced
huggingface.co
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roberto Armas (2023). ml_data_test_detection_bank_transaction_frauds_unbalanced [Dataset]. https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced
Explore at:
Dataset updated
Jun 19, 2023
Authors
Roberto Armas
Description
ML Data Test Detection Bank Transaction Frauds Unbalanced

The project provides a quick and accessible dataset designed for learning and experimenting with machine learning algorithms, specifically in the context of detecting fraudulent bank transactions. It is intended for practicing and applying concepts such as Random Forest, Support Vector Machines (SVM), and Synthetic Minority Over-sampling Technique (SMOTE) to address unbalanced classification problems. Note: This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced.
f
Number of datasets on which a combination of machine learning and sampling...
plos.figshare.com
xls
Updated Jun 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Misuk Kim; Kyu-Baek Hwang (2023). Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0271260.t001
Dataset updated
Jun 14, 2023
Dataset provided by
PLOS ONE
Authors
Misuk Kim; Kyu-Baek Hwang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve.
Credit Card Fraud Detection
kaggle.com
test.researchdata.tuwien.ac.at
+1more
zip
Updated Mar 23, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Machine Learning Group - ULB (2018). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/mlg-ulb/creditcardfraud
Explore at:
zip(69155672 bytes)Available download formats
Dataset updated
Mar 23, 2018
Dataset authored and provided by
Machine Learning Group - ULB
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

Update (03/05/2021)

A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.

Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, IInternational Journal of Data Science and Analytics
f
Number of datasets on which a combination of machine learning and sampling...
plos.figshare.com
xls
Updated Jun 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Misuk Kim; Kyu-Baek Hwang (2023). Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the receiver operating characteristics curve. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0271260.t002
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Misuk Kim; Kyu-Baek Hwang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the receiver operating characteristics curve.
f
The adjustment process of step length T of RNGAN model.
plos.figshare.com
xls
Updated Jan 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng (2024). The adjustment process of step length T of RNGAN model. [Dataset]. http://doi.org/10.1371/journal.pone.0291656.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0291656.t002
Dataset updated
Jan 18, 2024
Dataset provided by
PLOS ONE
Authors
Weinan Jia; Ming Lu; Qing Shen; Chunzhi Tian; Xuyang Zheng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The adjustment process of step length T of RNGAN model.
O
ContractNLI (ContractNLI: A Dataset for Document-level Natural Language...
opendatalab.com
zip
Updated Sep 22, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stanford University (2022). ContractNLI (ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts) [Dataset]. https://opendatalab.com/OpenDataLab/ContractNLI
Explore at:
zip(83759009 bytes)Available download formats
Dataset updated
Sep 22, 2022
Dataset provided by
Hitachi America LTD
Stanford University
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ContractNLI is a dataset for document-level natural language inference (NLI) on contracts whose goal is to automate/support a time-consuming procedure of contract review. In this task, a system is given a set of hypotheses (such as “Some obligations of Agreement may survive termination.”) and a contract, and it is asked to classify whether each hypothesis is entailed by, contradicting to or not mentioned by (neutral to) the contract as well as identifying evidence for the decision as spans in the contract. ContractNLI is the first dataset to utilize NLI for contracts and is also the largest corpus of annotated contracts (as of September 2021). ContractNLI is an interesting challenge to work on from a machine learning perspective (the label distribution is imbalanced and it is naturally multi-task, all the while training data being scarce) and from a linguistic perspective (linguistic characteristics of contracts, particularly negations by exceptions, make the problem difficult).
f
Metrics with the dataset.
plos.figshare.com
xls
Updated May 13, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
André G. Coimbra; Cleiane G. Oliveira; Matheus P. Libório; Hasheem Mannan; Laercio I. Santos; Elisa Fusco; Marcos F.S.V. D’Angelo (2025). Metrics with the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0320966.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0320966.t003
Dataset updated
May 13, 2025
Dataset provided by
PLOS ONE
Authors
André G. Coimbra; Cleiane G. Oliveira; Matheus P. Libório; Hasheem Mannan; Laercio I. Santos; Elisa Fusco; Marcos F.S.V. D’Angelo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning has increasingly gained prominence in the healthcare sector due to its ability to address various challenges. However, a significant issue remains unresolved in this field: the handling of imbalanced data. This process is crucial for ensuring the efficiency of algorithms that utilize classification techniques, which are commonly applied in risk management, monitoring, diagnosis, and prognosis of patient health. This study conducts a comparative analysis of techniques for handling imbalanced data and evaluates their effectiveness in combination with a set of classification algorithms, specifically focusing on stroke prediction. Additionally, a new approach based on Particle Swarm Optimization (PSO) and Naive Bayes was proposed. This approach was applied to the real problem of Chagas disease. The application of these techniques aims to improve the quality of life for individuals, reduce healthcare costs, and allocate available resources more efficiently, making it a preventive action.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001

Data from: Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.1021/acs.jcim.5c00023.s001

Dataset updated

Apr 15, 2025

Dataset provided by

ACS Publications

Authors

Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques(a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomekand generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.

Clear search

Close search

Google apps

Main menu

Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

Imbalanced Data

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

Data from: S1 Datasets -

Data from: Handling Imbalanced Classification Problems by Weighted...

Predict students' dropout and academic success

Data from: Financial Fraud Detection Dataset

Performance Note:

Features:

Usage:

Data Summary:

License:

Acknowledgements

Comparison of prediction errors of different models.

Data from: WikiChurches – A Fine-Grained Dataset of Architectural Styles...

Similarity measures results of different datasets on different models.

‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2

Context

Content

Acknowledgements

Inspiration

The Depression Dataset

Context

Content

Starter Code

Inspiration

Validation of the validity of the ccr index.

ml_data_test_detection_bank_transaction_frauds_unbalanced

Number of datasets on which a combination of machine learning and sampling...

Credit Card Fraud Detection

Context

Content

Update (03/05/2021)

Acknowledgements

Number of datasets on which a combination of machine learning and sampling...

The adjustment process of step length T of RNGAN model.

ContractNLI (ContractNLI: A Dataset for Document-level Natural Language...

Metrics with the dataset.

Data from: Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML