100+ datasets found
  1. Imbalanced Data

    • ieee-dataport.org
    Updated Aug 23, 2023
    Cite
    Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
    Explore at:
    Dataset updated
    Aug 23, 2023
    Authors
    Blessa Binolin M
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification learning on non-stationary data must cope with dynamic changes over time. The major problems are class imbalance and the high cost of labeling instances in the presence of drift. Imbalance arises when the minority class has far fewer samples than the majority class, and it leads to the misclassification of data points.

  2. Performance comparison of machine learning models across accuracy, AUC, MCC,...

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t005
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of machine learning models across accuracy, AUC, MCC, and F1 score on GMSC dataset.

  3. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to address this problem, and much work has been done comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has typically been tested across a wide variety of datasets, without considering the performance on each specific dataset. In this study, we compare the performance of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery.

    Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six ensemble methods specific to the imbalanced domain were also tested. To compare performance, the area under the ROC curve (AUC), F-measure, geometric mean, and balanced accuracy were considered.

    Results: Both types of resampling procedures improved performance with respect to the original dataset. The oversampling procedures were more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performance. The undersampling approaches were more robust than the oversampling ones across the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic of the resampling methods.

    Conclusions: The application of machine learning techniques that take class balance into consideration by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the classification method that is used together with the resampling to maximize the benefit to the outcome.
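
    For readers who want to reproduce a comparison of this kind, the sketch below (not the authors' code) contrasts one oversampler (ADASYN) and one undersampler (RUS) from imbalanced-learn on a synthetic stand-in for the SEEG-derived features, reporting AUC, F-measure and balanced accuracy.

    from imblearn.over_sampling import ADASYN
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the imbalanced SEEG features (10% minority class).
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [("ADASYN", ADASYN(random_state=0)),
                          ("RUS", RandomUnderSampler(random_state=0))]:
        X_res, y_res = sampler.fit_resample(X_tr, y_tr)   # resample the training split only
        clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
        pred, proba = clf.predict(X_te), clf.predict_proba(X_te)[:, 1]
        print(name, "AUC=%.3f" % roc_auc_score(y_te, proba),
              "F1=%.3f" % f1_score(y_te, pred),
              "BalAcc=%.3f" % balanced_accuracy_score(y_te, pred))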

  4. Data from: Addressing Imbalanced Classification Problems in Drug Discovery...

    • acs.figshare.com
    zip
    Updated Apr 15, 2025
    Cite
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    ACS Publications
    Authors
    Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. To the best of our knowledge, however, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools.

    To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek. We generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers, and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.

    The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable metrics for the evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern is that the percentage improvement in balanced accuracy increases as the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques.

    In summary, exploring multiple data balancing techniques is recommended when classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study could be carried out considering a sizable number of ML methods and AutoML libraries.
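
    As an illustration of the threshold-optimization idea discussed above (a sketch under assumptions, not the paper's GHOST implementation), the snippet below tunes the decision threshold of a cost-sensitive random forest on the precision-recall curve; moving the threshold changes F1, MCC and balanced accuracy but leaves ranking metrics such as AUC and AUPR untouched. The threshold is selected on a validation split rather than on test data.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=1)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=1)

    clf = RandomForestClassifier(class_weight="balanced", random_state=1).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_val)[:, 1]

    prec, rec, thr = precision_recall_curve(y_val, proba)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)   # F1 at every candidate threshold
    best_thr = thr[np.argmax(f1[:-1])]                     # last PR point has no threshold

    print("F1 at default threshold 0.5:", f1_score(y_val, proba >= 0.5))
    print("F1 at optimized threshold  :", f1_score(y_val, proba >= best_thr))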

  5. Dataset: The effects of class balance on the training energy consumption of...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 18, 2024
    Cite
    Maria Gutierrez; Maria Gutierrez; Coral Calero; Coral Calero; Félix García; Félix García; Mª Ángeles Moraga; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
    Explore at:
    csvAvailable download formats
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Maria Gutierrez; Maria Gutierrez; Coral Calero; Coral Calero; Félix García; Félix García; Mª Ángeles Moraga; Mª Ángeles Moraga
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They have the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one has only 4.04% of its instances belonging to class c0. This set of datasets is therefore primarily meant to study how class balance influences the behaviour of a machine learning model.
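
    A rough sketch of the same experimental idea, assuming scikit-learn's make_classification as a stand-in for WEKA's Random RBF generator and wall-clock training time as a crude proxy for training energy:

    import time
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    for label, weights in [("balanced", [0.52, 0.48]), ("unbalanced", [0.96, 0.04])]:
        X, y = make_classification(n_samples=100_000, n_features=185,
                                   weights=weights, random_state=0)
        start = time.perf_counter()
        LogisticRegression(max_iter=1000).fit(X, y)
        print(label, "training time: %.2f s" % (time.perf_counter() - start))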

  6. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    image/*
    Updated 2020
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    image/*Available download formats
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Area covered
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data. Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.

    This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeled.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The csv files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).

    The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).

    The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the FLUXES1 experiment with RV Sarmiento de Gamboa.

    The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss. In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. The labels used are operational and not yet matched to the respective EcoTaxa classes.
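
    A minimal sketch (assuming the archive has been unpacked locally) of inspecting the class imbalance in the EcoTaxa-validated training split, using the documented columns object_id, image_fn and label:

    import pandas as pd

    train = pd.read_csv("training.csv")             # 392k rows, one per training image
    counts = train["label"].value_counts()
    print(counts.head(10))                          # the most populated classes

    top_10_percent = counts.iloc[: max(1, len(counts) // 10)].sum()
    print("share of objects in the 10%% most populated classes: %.1f%%"
          % (100 * top_10_percent / len(train)))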

  7. HDSNE Chest X-ray Dataset

    • paperswithcode.com
    Updated Feb 25, 2025
    + more versions
    Cite
    (2025). HDSNE Chest X-ray Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/hdsne-chest-x-ray-dataset
    Explore at:
    Dataset updated
    Feb 25, 2025
    Description

    The continuous release of medical image databases, often featuring overlapping or identical categories, poses a significant challenge for the development of autonomous Computer-Aided Diagnostics (CAD) systems. These systems are essential for creating truly comprehensive medical diagnostics. However, one of the main obstacles lies in the frequent bulk release of datasets, which commonly suffer from two critical issues: image duplication and data corruption.

    The Problem of Dataset Redundancy

    Repeated releases of the same categories often fail to integrate or deduplicate similar images across databases, which can severely impact the effectiveness of machine learning models. Data duplication not only reduces the efficiency of learning models but also leads to overfitting, wastes computational resources, and increases the carbon footprint due to the energy required for training complex models.

    Proposed Solution: Global Data Aggregation Model

    In response to these challenges, we introduce a global data aggregation model that intelligently combines data from six distinct and reputable medical imaging databases. Each database was carefully curated to ensure the elimination of redundancies while preserving data diversity. Two robust algorithms were employed:

    Hash MD5 Algorithm: This algorithm generates unique hash values for each image, helping in the effective detection and elimination of duplicate images.

    t-SNE Algorithm: This technique is used for dimensionality reduction, with a tunable perplexity parameter to ensure accurate representation of high-dimensional data.
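
    An illustrative sketch (not the dataset authors' code) of the MD5-based duplicate detection step: identical files produce identical hashes, so only the first occurrence of each hash is kept. The root folder name is hypothetical.

    import hashlib
    from pathlib import Path

    seen, duplicates = {}, []
    for path in sorted(Path("merged_xray_images").rglob("*.png")):   # hypothetical root folder
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append((path, seen[digest]))   # exact byte-level duplicate
        else:
            seen[digest] = path
    print(len(duplicates), "duplicate files found")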

    Dataset Categories

    The final dataset includes an equal number of samples from three key categories of chest X-ray images:

    Normal, Pneumonia, COVID-19

    This uniform distribution ensures that the dataset is balanced, avoiding class imbalance—a common issue that can skew results in medical image analysis.

    Dataset Application & Model Evaluation

    The dataset was applied to the Inception V3 pre-trained model, a leading convolutional neural network (CNN) architecture known for its excellence in image classification tasks. The evaluation was conducted using the following performance metrics:

    Accuracy: An exceptional accuracy rate of 98.48% was achieved.

    Precision, Recall, and F1-score: The dataset showed strong performance across these metrics, reducing both false positives and false negatives.

    Statistical Validation: A t-test was conducted to validate the results, and the t-values and p-values confirm the statistical significance of the model’s performance.

    Conclusion

    The HDSNE Chest X-ray Dataset offers a novel and effective approach to data aggregation, tackling the issues of redundancy and data duplication that have long plagued the field of medical imaging. By maintaining a balanced class distribution and eliminating unnecessary data, this dataset provides a cleaner and more efficient resource for training machine learning models.

    This dataset is sourced from Kaggle.

  8. Additional file 3 of Impact of random oversampling and random undersampling...

    • springernature.figshare.com
    xlsx
    Updated Aug 18, 2024
    Cite
    Cynthia Yang; Egill A. Fridgeirsson; Jan A. Kors; Jenna M. Reps; Peter R. Rijnbeek (2024). Additional file 3 of Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data [Dataset]. http://doi.org/10.6084/m9.figshare.26660464.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Cynthia Yang; Egill A. Fridgeirsson; Jan A. Kors; Jenna M. Reps; Peter R. Rijnbeek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. Candidate predictors per database.

  9. Lending Club Loan Data

    • kaggle.com
    Updated Nov 8, 2020
    Cite
    Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 8, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Sweta Shetye
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I wanted a highly imbalanced dataset to share with others, and this Lending Club data is a perfect example.

    Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

    For example, in this dataset there are far more samples of fully paid borrowers than of not fully paid borrowers.

    Full LendingClub data available from their site.

    Content

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015; you can use it to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features, which makes this problem more challenging.
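
    A minimal sketch of a cost-sensitive baseline on such data, assuming the Kaggle CSV has been downloaded locally; the file and column names below (loan_data.csv, not_fully_paid) are illustrative and should be adapted to the actual export.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    loans = pd.read_csv("loan_data.csv")                     # hypothetical file name
    y = loans["not_fully_paid"]                              # hypothetical binary target
    X = pd.get_dummies(loans.drop(columns=["not_fully_paid"]), drop_first=True)
    print(y.value_counts(normalize=True))                    # confirm the class imbalance

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))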

  10. Dataset for Transient Stability Assessment of IEEE 39-Bus System

    • data.mendeley.com
    Updated Dec 20, 2024
    Cite
    Živko Sokolović (2024). Dataset for Transient Stability Assessment of IEEE 39-Bus System [Dataset]. http://doi.org/10.17632/p992nhb8ss.1
    Explore at:
    Dataset updated
    Dec 20, 2024
    Authors
    Živko Sokolović
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 50 features and was generated through 12,852 time-domain simulations performed on the IEEE New England 39 bus system test case using DIgSILENT PowerFactory and Python automation. The simulations span diverse operating conditions by varying the generation/load profile from 80% to 120% in 5% increments. For each condition, three-phase short-circuit faults were applied at seven distinct locations (0%, 10%, 20%, 50%, 80%, 90%, 100%) along all transmission lines, with fault clearing times ranging from 0.1s to 0.3s.

    Key features captured for each of the 10 generators (G02 is the reference machine) include:

    P in MW - Active Power
    ut in p.u. - Terminal Voltage
    ie in p.u. - Excitation Current
    xspeed in p.u. - Rotor Speed
    firel in deg - Rotor Angle (relative to G02)

    Simulations lasted 10 seconds to ensure accurate transient stability assessment. Post-fault data was sampled every 0.01s from fault clearance up to 0.6s afterward, labeling the stability state as 1 (stable) or 0 (unstable). The dataset generation process took 5,840 seconds. The dataset exhibits a class imbalance, with 42% of cases belonging to the unstable class. All simulation data were exported to .csv files and subsequently unified into a single pickle file (tsa_data.pkl).

    Helper scripts are provided:

    dataset_loader.py: Includes the load_tsa_data function to load the dataset.
    usage.py: Demonstrates how to use the loader module.

    This dataset serves as a comprehensive foundation for machine learning applications in transient stability assessment (TSA), offering insights into system behavior under dynamic conditions.
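
    The provided dataset_loader.load_tsa_data function is the documented way to load the data; as a generic alternative, the sketch below assumes tsa_data.pkl holds a pandas DataFrame with a binary stability label (the column name used here is a guess and may differ).

    import pandas as pd

    df = pd.read_pickle("tsa_data.pkl")
    label_col = "stability"                            # hypothetical column name
    print(df.shape)
    print(df[label_col].value_counts(normalize=True))  # expect roughly 58% stable vs 42% unstable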

  11. Classification result classifiers using TF-IDF with SMOTE.

    • plos.figshare.com
    xls
    Updated May 28, 2024
    Cite
    Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Khaled Alnowaiser
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification result classifiers using TF-IDF with SMOTE.
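
    A sketch of a comparable setup (assumptions, not the paper's code): TF-IDF features with SMOTE applied only inside each training fold via an imbalanced-learn pipeline, so that synthetic samples never leak into evaluation folds.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Toy, imbalanced text data standing in for the real corpus.
    texts = ["payment failed again", "meeting at noon", "project update sent", "lunch tomorrow"] * 50
    labels = [1, 0, 0, 0] * 50

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("smote", SMOTE(random_state=0)),          # resampling happens per CV training fold
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    print("mean F1:", cross_val_score(pipe, texts, labels, cv=5, scoring="f1").mean())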

  12. Balanced Emotic Dataset (75×75, RGB)

    • kaggle.com
    Updated Apr 29, 2025
    Cite
    dolly prajapati 182 (2025). Balanced Emotic Dataset (75×75, RGB) [Dataset]. https://www.kaggle.com/datasets/dollyprajapati182/balance-emotic/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    dolly prajapati 182
    Description

    The Balanced Emotic Dataset is a uniformly processed, class-balanced, and augmented version of the original Emotic emotion dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.

    🎯 Purpose The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.

    🧾 Dataset Characteristics Source: Based on the Emotic Dataset

    Image Format: RGB .png

    Image Size: 75 × 75 pixels

    Emotion Classes:

    angry, disgusted, fearful, happy, neutral, sad, surprised

    ⚙️ Preprocessing Pipeline Each image in the dataset has been preprocessed using the following steps:

    ✅ Converted to RGB

    ✅ Resized to 75×75 pixels

    ✅ Augmented using:

    Random rotation

    Horizontal flip

    Brightness adjustment

    Contrast enhancement

    Sharpness modification

    This results in a clean, uniform, and diverse dataset ideal for FER tasks.
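
    A sketch of a comparable preprocessing and augmentation step with Pillow; the exact parameters used by the dataset authors are not published here, so the ranges below are illustrative.

    import random
    from PIL import Image, ImageEnhance, ImageOps

    def preprocess_and_augment(path: str) -> Image.Image:
        img = Image.open(path).convert("RGB").resize((75, 75))        # RGB, 75x75
        img = img.rotate(random.uniform(-15, 15))                     # random rotation
        if random.random() < 0.5:
            img = ImageOps.mirror(img)                                # horizontal flip
        img = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2))
        img = ImageEnhance.Sharpness(img).enhance(random.uniform(0.8, 1.5))
        return img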

    Testing (10%): 898 images

    Training (80% of remainder): 6472 images

    Validation (20% of remainder): 1618 images

    ✅ Advantages ⚖️ Balanced Classes: Equal images across all seven emotions

    🧠 Model-Friendly: Uniform RGB, resized format reduces preprocessing overhead

    🚀 Augmented: Improves model generalization and robustness

    📦 Split Ready: Train/Val/Test folders structured per class

    📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER

  13. Data from: Isometric Stratified Ensembles: A Partial and Incremental...

    • acs.figshare.com
    xlsx
    Updated Jun 7, 2023
    Cite
    Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux (2023). Isometric Stratified Ensembles: A Partial and Incremental Adaptive Applicability Domain and Consensus-Based Classification Strategy for Highly Imbalanced Data Sets with Application to Colloidal Aggregation [Dataset]. http://doi.org/10.1021/acs.jcim.2c00293.s004
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    ACS Publications
    Authors
    Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize the classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every stratum of the 2D map and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, isometric stratified ensembles (ISE) apply a recursive decorrelated variable selection and consider the cardinal ratio of classes to balance training sets and thus avoid bias due to possible class imbalance. To exemplify the interest of this strategy, three highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform formerly published tools, with and without a classical applicability domain. ISE performance in classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88 when only its highest-confidence isometric stratum is retained.

  14. Data from: Exploring deep learning techniques for wild animal behaviour...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Feb 23, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Explore at:
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    Time period covered
    Jan 1, 2023
    Description

    1: Machine learning-based behaviour classification using acceleration data is a powerful tool in bio-logging research. Deep learning architectures such as convolutional neural networks (CNN), long short-term memory (LSTM), and the self-attention mechanism, as well as related training techniques, have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration-based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in the data due to differences in individual behaviour and in where the loggers were attached, and complexity in the data due to complex animal-specific behaviours, all of which may have limited the application of deep learning techniques in this area. 2: To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup, and pre-training of deep learning models with unlabelled data [...]

    Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers

    This repository contains the datasets of two seabird species (streaked shearwaters and black-tailed gulls) used in the following paper (Otsuka et al., 2024).

    Otsuka, R., Yoshimura, N., Tanigaki, K., Koyama, S., Mizutani, Y., Yoda, K., & Maekawa, T. (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers. Methods in Ecology and Evolution.

    The paper aimed to classify the behaviour of these two seabird species using tri-axial acceleration data and deep learning. It explored the effectiveness of deep learning models and related training techniques, such as data augmentation.

    Description of the data and file structure

    The directory structure of the data is as follows:
    (After unzipping the data-v1.0.0.zip file, you will see the following directories and files.)

    data
     ├─id-files/*.csv
     ...
    
  15. Brain cancer gene expression - CuMiDa

    • kaggle.com
    Updated Feb 1, 2020
    Cite
    Bruno Grisci (2020). Brain cancer gene expression - CuMiDa [Dataset]. https://www.kaggle.com/datasets/brunogrisci/brain-cancer-gene-expression-cumida/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2020
    Dataset provided by
    Kaggle
    Authors
    Bruno Grisci
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset GSE50161 on brain cancer gene expression from CuMiDa

    • 5 classes
    • 54676 genes
    • 130 samples

    About

    Here we present the Curated Microarray Database (CuMiDa), a repository containing 78 handpicked cancer microarray datasets, extensively curated from 30,000 studies from the Gene Expression Omnibus (GEO), solely for machine learning. The aim of CuMiDa is to offer homogeneous and state-of-the-art biological preprocessing of these datasets, together with numerous 3-fold cross-validation benchmark results, to propel machine learning studies focused on cancer research. The database makes available various download options to be employed by other programs, as well as PCA and t-SNE results. CuMiDa stands apart from existing databases by offering newer datasets, manually and carefully curated (covering sample quality, unwanted probes, background correction, and normalization), to create a more reliable source of data for computational research.

    http://sbcb.inf.ufrgs.br/cumida

    References

    • Feltes, B.C.; Chandelier, E.B.; Grisci, B.I.; Dorn, M. (2019) CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research. Journal of Computational Biology, 26 (4), 376-386. [https://doi.org/10.1089/cmb.2018.0238]

    • Grisci, B. I., Feltes, B. C., & Dorn, M. (2019). Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of biomedical informatics, 89, 122-133. [https://doi.org/10.1016/j.jbi.2018.11.013]

    Inspiration

    • How to deal with class imbalance for classification?
    • How to identify the most important genes for the classification of each cancer subtype?
    • Is it possible to discover subtypes?
    • How to beat the classification and clustering benchmarks for this dataset listed on the CuMiDa website?

  16. Table1_A comparative study in class imbalance mitigation when working with...

    • frontiersin.figshare.com
    pdf
    Updated Mar 26, 2024
    Cite
    Rawan S. Abdulsadig; Esther Rodriguez-Villegas (2024). Table1_A comparative study in class imbalance mitigation when working with physiological signals.pdf [Dataset]. http://doi.org/10.3389/fdgth.2024.1377165.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Frontiers
    Authors
    Rawan S. Abdulsadig; Esther Rodriguez-Villegas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Class imbalance is a common challenge that is often faced when dealing with classification tasks aiming to detect medical events that are particularly infrequent. Apnoea is an example of such events. This challenge can however be mitigated using class rebalancing algorithms. This work investigated 10 widely used data-level class imbalance mitigation methods aiming towards building a random forest (RF) model that attempts to detect apnoea events from photoplethysmography (PPG) signals acquired from the neck. Those methods are random undersampling (RandUS), random oversampling (RandOS), condensed nearest-neighbors (CNNUS), edited nearest-neighbors (ENNUS), Tomek’s links (TomekUS), synthetic minority oversampling technique (SMOTE), Borderline-SMOTE (BLSMOTE), adaptive synthetic oversampling (ADASYN), SMOTE with TomekUS (SMOTETomek) and SMOTE with ENNUS (SMOTEENN). Feature-space transformation using PCA and KernelPCA was also examined as a potential way of providing better representations of the data for the class rebalancing methods to operate. This work showed that RandUS is the best option for improving the sensitivity score (up to 11%). However, it could hinder the overall accuracy due to the reduced amount of training data. On the other hand, augmenting the data with new artificial data points was shown to be a non-trivial task that needs further development, especially in the presence of subject dependencies, as was the case in this work.
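
    The ten data-level methods named above all have implementations in imbalanced-learn; a minimal sketch (not the paper's code) of wiring them up in front of a random forest:

    from imblearn.combine import SMOTEENN, SMOTETomek
    from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
    from imblearn.under_sampling import (CondensedNearestNeighbour, EditedNearestNeighbours,
                                         RandomUnderSampler, TomekLinks)
    from sklearn.ensemble import RandomForestClassifier

    samplers = {
        "RandUS": RandomUnderSampler(), "RandOS": RandomOverSampler(),
        "CNNUS": CondensedNearestNeighbour(), "ENNUS": EditedNearestNeighbours(),
        "TomekUS": TomekLinks(), "SMOTE": SMOTE(), "BLSMOTE": BorderlineSMOTE(),
        "ADASYN": ADASYN(), "SMOTETomek": SMOTETomek(), "SMOTEENN": SMOTEENN(),
    }

    def fit_with(name, X_train, y_train):
        X_res, y_res = samplers[name].fit_resample(X_train, y_train)   # resample training data only
        return RandomForestClassifier(random_state=0).fit(X_res, y_res)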

  17. Learning Privacy from Visual Entities - Curated data sets and pre-computed...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.
    [arxiv][code]

    Curated image privacy data sets

    In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

    Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. splits created with a stratified K-Fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide the links to the images in bash scripts that download the images. Another bash script re-organises the images into sub-folders with at most 1000 images in each folder.

    Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, private events. Images were annotated with a binary label denoting if the content was deemed to be public or private. As the images are publicly available, their label is mostly public. These datasets have therefore a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images already limited in PicAlert. Further details in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464

    List of datasets and their original source:

    Notes:

    • For PicAlert and PrivacyAlert, only urls to the original locations in Flickr are available in the Zenodo record
    • Collector and authors of the PrivacyAlert dataset selected the images from Flickr under Public Domain license
    • Owners of the photos on Flick could have removed the photos from the social media platform
    • Running the bash scripts to download the images can result in the "429 Too Many Requests" status code

    Pre-computed visual entities

    Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding re-computing them on their own or for each epoch during the training of a model (faster training).

    For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.

    Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.

    Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)

    Enquiries, questions and comments

    If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.

  18. Volcanic Lithology Logging Identification Based on ADASYN-KNN-Random Forest...

    • data.mendeley.com
    Updated Apr 10, 2025
    Cite
    Jiayuan Mou (2025). Volcanic Lithology Logging Identification Based on ADASYN-KNN-Random Forest Ensemble Model Taking the Carboniferous System on the Hanging Wall of Kebai Fault Zone as an Example [Dataset]. http://doi.org/10.17632/dsv68j6jg2.1
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    Jiayuan Mou
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the following files to support the research paper "Volcanic Lithology Logging Identification Based on ADASYN-KNN-Random Forest Ensemble Model Taking the Carboniferous System on the Hanging Wall of Kebai Fault Zone as an Example":

    Raw_Data.xlsx:

    Thin-Section Data: High-resolution measurements/images of volcanic rock samples with lithology labels (e.g., basalt, andesite).

    Logging Data: Corresponding well-logging responses (gamma ray, density, neutron porosity) for each sample.

    Columns: Sample_ID, Depth (m), GR (API), DEN (g/cm³), CNL (%), Lithology_Label, Mineral_Composition (%).

    ADASYN_Resampled_Data.xlsx:

    Balanced dataset generated after applying ADASYN (Adaptive Synthetic Sampling) oversampling to address class imbalance.

    Includes synthetic samples for minority lithology classes.

    ML_Code.zip:

    ADASYN_Oversampling.py: Python script for adaptive oversampling (uses imbalanced-learn).

    KNN_RF_Classification.py: Combined script for KNN and Random Forest training/prediction.

    Requirements.txt: Dependencies (e.g., Python 3.13, pandas, scikit-learn).
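
    An illustrative sketch of the ADASYN-then-classify workflow these scripts implement (not the authors' code; the sheet layout is an assumption, the column names follow the list above):

    import pandas as pd
    from imblearn.over_sampling import ADASYN
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    data = pd.read_excel("Raw_Data.xlsx", sheet_name=0)               # sheet layout assumed
    X = data[["GR (API)", "DEN (g/cm³)", "CNL (%)"]]
    y = data["Lithology_Label"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    X_res, y_res = ADASYN(random_state=0).fit_resample(X_tr, y_tr)    # balance minority lithologies

    for clf in (KNeighborsClassifier(), RandomForestClassifier(random_state=0)):
        clf.fit(X_res, y_res)
        print(type(clf).__name__)
        print(classification_report(y_te, clf.predict(X_te)))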

  19. Data from: Mental issues, internet addiction and quality of life predict...

    • data.mendeley.com
    Updated Jul 31, 2024
    + more versions
    Cite
    Andras Matuz (2024). Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis [Dataset]. http://doi.org/10.17632/2yy4j7rgvg.2
    Explore at:
    Dataset updated
    Jul 31, 2024
    Authors
    Andras Matuz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers.

    Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample.

    Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by random forest with class weights (AUC = .811; balanced accuracy = .745, sensitivity = .765; specificity = .726). The best predictors of burnout were Beck’s Depression Inventory scores, Athens Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status.

    Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples, suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
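
    A sketch of the winning configuration described above (cost-sensitive random forest tuned by grid search with 5-fold cross-validation and evaluated on a holdout set); the synthetic data below merely stands in for the survey features.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score, roc_auc_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic stand-in: ~20% positive class, roughly mirroring the 19.7% burnout rate.
    X, y = make_classification(n_samples=1576, n_features=30, weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

    grid = GridSearchCV(
        RandomForestClassifier(class_weight="balanced", random_state=0),
        param_grid={"n_estimators": [200, 500], "max_depth": [None, 10]},
        scoring="roc_auc", cv=5,
    ).fit(X_tr, y_tr)

    best = grid.best_estimator_
    print("holdout AUC:", roc_auc_score(y_te, best.predict_proba(X_te)[:, 1]))
    print("holdout balanced accuracy:", balanced_accuracy_score(y_te, best.predict(X_te)))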

  20. CODEBRIM: COncrete DEfect BRidge IMage Dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, zip
    Updated Jan 24, 2020
    Cite
    Martin Mundt; Sagnik Majumder; Sreenivas Murali; Panagiotis Panetsos; Visvanathan Ramesh; Martin Mundt; Sagnik Majumder; Sreenivas Murali; Panagiotis Panetsos; Visvanathan Ramesh (2020). CODEBRIM: COncrete DEfect BRidge IMage Dataset [Dataset]. http://doi.org/10.5281/zenodo.2620293
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Martin Mundt; Sagnik Majumder; Sreenivas Murali; Panagiotis Panetsos; Visvanathan Ramesh; Martin Mundt; Sagnik Majumder; Sreenivas Murali; Panagiotis Panetsos; Visvanathan Ramesh
    Description

    CODEBRIM: COncrete DEfect BRidge IMage Dataset for multi-target multi-class concrete defect classification in computer vision and machine learning.

    Dataset as presented and detailed in our CVPR 2019 publication: http://openaccess.thecvf.com/content_CVPR_2019/html/Mundt_Meta-Learning_Convolutional_Neural_Architectures_for_Multi-Target_Concrete_Defect_Classification_With_CVPR_2019_paper.html or https://arxiv.org/abs/1904.08486 . If you make use of the dataset please cite it as follows:

    "Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, Visvanathan Ramesh. Meta-learning Convolutional Neural Architectures for Multi-target Concrete Defect Classification with the COncrete DEfect BRidge IMage Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019"

    We offer a supplementary GitHub repository with code to reproduce the paper and data loaders: https://github.com/ccc-frankfurt/meta-learning-CODEBRIM

    For ease of use we provide the dataset in multiple different versions.

    Files contained:
    * CODEBRIM_original_images: contains the original full-resolution images and bounding box annotations
    * CODEBRIM_cropped_dataset: contains the extracted crops/patches with corresponding class labels from the bounding boxes
    * CODEBRIM_classification_dataset: contains the cropped patches with corresponding class labels split into training, validation and test sets for machine learning
    * CODEBRIM_classification_balanced_dataset: similar to "CODEBRIM_classification_dataset" but with the exact replication of training images to balance the dataset in order to reproduce results obtained in the paper.
