License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Aim: In neuroscience research, data are often characterized by an imbalanced distribution between majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to address this problem, and much work has been done comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has typically been tested across a wide variety of datasets, without considering performance on each specific dataset. In this study, we compare the performance of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsy who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsy, for a supervised classification problem aimed at distinguishing between epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Six ensemble methods specific to the imbalanced domain were also tested. To compare performance, the area under the ROC curve (AUC), F-measure, geometric mean, and balanced accuracy were considered.
Results: Both resampling strategies improved performance with respect to the original dataset. Oversampling was more sensitive to the type of classifier employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performance. The undersampling approaches were more robust across classifiers than oversampling, with Random Undersampling (RUS) performing best despite being the simplest and most basic resampling method.
Conclusions: Applying machine learning techniques that account for class balance through resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the classification method that is used together with resampling to maximize the benefit to the outcome.
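Several entries in this collection revolve around the same resampling toolbox, so a minimal sketch may help; this is not the study's pipeline, just an illustration using scikit-learn and imbalanced-learn, where the synthetic data, classifier choice, and parameters are all assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, f1_score
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import geometric_mean_score

# Synthetic stand-in for the SEEG-derived network features (9:1 imbalance).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = [("none", None),
            ("ADASYN", ADASYN(random_state=0)),
            ("RUS", RandomUnderSampler(random_state=0))]
for name, sampler in samplers:
    # Resample only the training data, never the held-out test set.
    X_res, y_res = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    pred, proba = clf.predict(X_te), clf.predict_proba(X_te)[:, 1]
    print(f"{name:7s} AUC={roc_auc_score(y_te, proba):.3f} "
          f"F1={f1_score(y_te, pred):.3f} "
          f"G-mean={geometric_mean_score(y_te, pred):.3f} "
          f"BalAcc={balanced_accuracy_score(y_te, pred):.3f}")
```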
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They are the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one only has 4.04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
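A minimal sketch of how such a pair might be used, assuming hypothetical file names balanced.arff and unbalanced.arff and that the class attribute is the last column (both assumptions):

```python
import pandas as pd
from scipy.io import arff
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

for path in ["balanced.arff", "unbalanced.arff"]:  # hypothetical file names
    data, _ = arff.loadarff(path)
    df = pd.DataFrame(data)
    # Nominal ARFF values load as bytes; decode and mark class c0 as positive.
    y = (df.iloc[:, -1].str.decode("utf-8") == "c0").astype(int)
    X = df.iloc[:, :-1].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(path)
    print(classification_report(y_te, clf.predict(X_te)))
```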
The Balanced KDEF Dataset is a uniformly processed, class-balanced, and augmented version of the FER2013-KDEF composite dataset. This curated version is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to enhance model performance and comparability.
🎯 Purpose
The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics
Source: Based on the KDEF dataset
Image Format: Grayscale .png
Image Size: 75 × 75 pixels
Emotion Classes:
Angry
Disgust
Fear
Happy
Sad
Surprise
Neutral
Total Images: 62,923
Images per Class: 8,989
⚙️ Preprocessing Pipeline
Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
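A minimal torchvision sketch of the steps listed above (the same pipeline is described for the other balanced FER datasets in this collection); the specific parameter values are illustrative assumptions, not the dataset's actual settings:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),            # convert to grayscale
    transforms.Resize((75, 75)),                            # standardize size
    transforms.RandomRotation(degrees=15),                  # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                 # horizontal flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),  # sharpness
    transforms.ToTensor(),
])
```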
📂 Dataset Split
Testing (10%): 6,292 images
Training (80% of remainder): 45,305 images
Validation (20% of remainder): 11,326 images
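The split arithmetic (10% held out for testing, then an 80/20 train/validation split of the remainder) can be sketched as follows; function and variable names are illustrative:

```python
from sklearn.model_selection import train_test_split

def two_stage_split(paths, labels, seed=0):
    """10% test, then 80/20 train/validation on the remainder, stratified."""
    X_rem, X_test, y_rem, y_test = train_test_split(
        paths, labels, test_size=0.10, stratify=labels, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rem, y_rem, test_size=0.20, stratify=y_rem, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# With 62,923 images this yields roughly 6,292 test, 45,305 train,
# and 11,326 validation images, matching the counts above.
```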
✅ Advantages
⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
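This is not the NATE implementation; as a rough sketch of the idea it describes (oversampling combined with a tree-based classifier, plus a first look at feature contributions), using synthetic data and arbitrary parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, matthews_corrcoef, f1_score
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for a credit dataset with a rare default class.
X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training data, then fit a gradient-boosted model.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
model = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)

pred = model.predict(X_te)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("MCC:", matthews_corrcoef(y_te, pred), "F1:", f1_score(y_te, pred))
# A crude interpretability cue: rank features by their contribution.
print("Top features:", model.feature_importances_.argsort()[::-1][:5])
```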
The Balanced Caer-S Dataset is a uniformly processed, class-balanced, and augmented version of the original Caer-S Emotion Dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.
🎯 Purpose
The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics
Source: Based on the Caer-S dataset
Image Size: 75 × 75 pixels
Emotion Classes:
Angry
Disgust
Fear
Happy
Sad
Surprise
Neutral
Total Images: 63,000
⚙️ Preprocessing Pipeline
Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
📂 Dataset Split (figures are per emotion class; 9,000 images per class)
Testing (10%): 900 images
Training (80% of remainder): 6,480 images
Validation (20% of remainder): 1,620 images
✅ Advantages
⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Sunflower Growth Stage Image Dataset for Phenological Classification was collected from agricultural fields in Bangladesh, focusing on the identification and classification of sunflower growth stages. Images were captured directly in the field using a Redmi Note 11 smartphone, under natural daylight and varying weather conditions to reflect real-world environments. This dataset is meant to aid research in deep learning, computer vision, and plant phenology by providing data for automated classification of growth stages.
A total of 1,255 original images were gathered, each with a high resolution of 12,288 × 16,320 pixels and approximately 25 MB in size. The images are divided into five classes: Stage1 (Young_Bud) with 238 images, Stage2 (Mature_Bud) with 272 images, Stage3 (Early_Bloom) with 218 images, Stage4 (Full_Bloom) with 213 images, and Stage5 (Wilted) with 314 images. To balance the dataset for training, each class was augmented to have 500 images, resulting in a final balanced collection of 2,500 images.
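A hedged sketch of topping a class folder up to 500 images with simple augmentations; the directory layout, file extension, and transform parameters are all assumptions, not the authors' procedure:

```python
import random
from pathlib import Path
from PIL import Image, ImageEnhance

def augment_once(img):
    """Apply one random rotation/flip/brightness perturbation."""
    img = img.rotate(random.uniform(-20, 20))
    if random.random() < 0.5:
        img = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
    return ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))

def balance_class(class_dir, target=500):
    """Write augmented copies until the folder reaches `target` images."""
    files = sorted(Path(class_dir).glob("*.jpg"))  # extension assumed
    for i in range(max(0, target - len(files))):
        src = random.choice(files)
        augment_once(Image.open(src)).save(src.with_name(f"aug_{i}_{src.name}"))
```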
Validation of the dataset was carried out by a Sub-Assistant Agriculture Officer from the Department of Agricultural Extension (DAE), Bangladesh, ensuring its reliability. The data was collected at two main sites: Daffodil International University (Ashulia Campus) and Model Town Nursery, Ashulia, Bangladesh. The camera used for capturing the images was a Redmi Note 11, with 24-bit color depth, an aperture of f/1.8, and images saved in JPEG format.
Example metadata for an image shows it was taken on 2025-05-22 at 17:47 using the MediaTek Camera Application. The image's dimensions are 12,288 × 16,320 pixels at 72 dpi with 24-bit sRGB color representation. The camera details include Xiaomi as the maker, model 23117RA86G, f-stop f/1.6, exposure time 1/100 sec, ISO 200, focal length 6 mm, and auto white balance. GPS coordinates recorded were Latitude 23.5247046, Longitude 90.1918097, Altitude 34.5 m. The example image file, IMG_20250522_174724.jpg, is a JPEG of size 26.1 MB.
Attribution Notice: This dataset also includes 24 images derived from the publicly available dataset: Sagor, Saifuddin; Hossan, Md. Faysal; Ahmed, Faruk; Reyad, Md. Zamirul Islam (2025), "Sunflower Plant Health and Growth Stage Image Dataset for Agricultural Machine Learning Applications", Mendeley Data, V1, doi: 10.17632/y3ygk98ngr.1
These images were incorporated because the number of collected field images was insufficient for the Stage4 (Full_Bloom) Class. After inclusion, a portion of these images was further augmented to increase the dataset size and maintain class balance. Any modifications or augmentations applied to the derived images are the responsibility of the present authors.
The original dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
📂 Dataset Overview
This dataset contains 1,500 real-world vehicle accident videos labeled according to severity levels: minor, moderate, and major, with 500 videos per class. Each video is manually reviewed and assigned a severity class based on visual cues such as collision intensity, vehicle damage, and traffic disruption. To ensure balanced class distribution, additional samples were generated for underrepresented categories through data augmentation. While specific techniques are not detailed, this process helped mitigate class imbalance and improve model generalization without compromising label consistency.
Unlike frame-based datasets, this collection emphasizes video-level analysis, making it suitable for tasks like:
- 🚗 Accident detection
- 📊 Severity classification
- 🧠 Scene understanding
- ⏱️ Event-level prediction
⚡ The dataset is ideal for researchers and developers working on:
- 🎬 Video classification
- 🤖 Deep learning applications in traffic surveillance
- 🕵️ Object detection in dynamic scenes
- 🚦 Intelligent transportation systems
- 🛣️ Road safety analytics
📁 Structure & Format
The dataset is organized into three main splits: train, val, and test, following standard machine learning conventions:
Balanced Accident Video Dataset/
├── train/
│   ├── minor/
│   ├── moderate/
│   └── major/
├── val/
│   ├── minor/
│   ├── moderate/
│   └── major/
└── test/
    ├── minor/
    ├── moderate/
    └── major/
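As a quick illustrative check of the layout above (the video file extension is an assumption):

```python
from pathlib import Path

root = Path("Balanced Accident Video Dataset")
for split in ("train", "val", "test"):
    for cls in ("minor", "moderate", "major"):
        n = len(list((root / split / cls).glob("*.mp4")))  # extension assumed
        print(f"{split}/{cls}: {n} videos")  # totals per class should sum to 500
```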
📌 Additional Notes
All videos are sourced from publicly available traffic footage and curated to ensure class balance and real-world diversity. The dataset supports multi-class classification and can be extended for temporal modeling, action recognition, or video summarization tasks.
📜 License
This dataset is released under the CC BY-NC 4.0 license, allowing academic and personal use with attribution. Commercial use is not permitted.
🌍 Source
All videos are collected from the official website of the Republic of Türkiye General Directorate of Security – Traffic Department: 🔗 https://www.trafik.gov.tr/kgys-goruntuleri
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Number of datasets on which a combination of machine learning and sampling methods performed best in terms of the area under the receiver operating characteristic curve.
According to our latest research, the global Data Balancing for Model Training market size in 2024 is valued at USD 1.37 billion, with a robust CAGR of 19.8% expected during the forecast period. By 2033, the market is forecasted to reach USD 6.59 billion. The primary growth factor driving this market is the exponential increase in demand for high-quality, unbiased machine learning models across industries, fueled by the rapid digital transformation and adoption of artificial intelligence.
One of the most significant growth drivers for the Data Balancing for Model Training market is the surging need for accurate and reliable AI models in critical sectors such as healthcare, finance, and retail. As organizations increasingly leverage AI and machine learning for decision-making, the importance of balanced datasets becomes paramount to ensure model fairness, accuracy, and compliance. Data imbalance, if not addressed, can lead to biased predictions and suboptimal business outcomes, making data balancing solutions essential for organizations aiming to deploy trustworthy and high-performing models. Furthermore, regulatory pressures and ethical considerations are compelling enterprises to adopt advanced data balancing techniques, further accelerating market growth.
Another key factor propelling the market is the proliferation of big data and the complexity of modern datasets. With the explosion of data sources and the diversity of data types, organizations are facing unprecedented challenges in managing and processing imbalanced datasets. This complexity necessitates the adoption of sophisticated data balancing solutions such as oversampling, undersampling, hybrid methods, and synthetic data generation. These solutions not only enhance model performance but also streamline the data preparation process, enabling faster and more efficient model training cycles. The growing integration of automated machine learning (AutoML) platforms is also contributing to the adoption of data balancing tools, as these platforms increasingly embed balancing techniques to democratize AI development.
The ongoing digital transformation across industries, coupled with the rise of Industry 4.0, is further boosting the demand for data balancing solutions. Enterprises in manufacturing, IT & telecommunications, and retail are deploying AI-powered applications at scale, which rely heavily on balanced training data to deliver accurate insights and automation. The expanding use of Internet of Things (IoT) devices and connected systems is generating vast volumes of imbalanced data, necessitating robust data balancing frameworks. Additionally, advancements in synthetic data generation are opening new avenues for addressing data scarcity and imbalance, especially in sensitive domains like healthcare where data privacy is a concern.
From a regional perspective, North America leads the Data Balancing for Model Training market, driven by early adoption of AI technologies, strong presence of tech giants, and significant investments in AI research and development. Europe follows closely, supported by stringent regulatory frameworks and a growing focus on ethical AI. The Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT infrastructure, and increasing adoption of AI in emerging economies such as China and India. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with increasing awareness and investments in AI-driven solutions.
The Solution Type segment of the Data Balancing for Model Training market encompasses Oversampling, Undersampling, Hybrid Methods, Synthetic Data Generation, and Others. Oversampling remains one of the most widely adopted techniques, particularly in scenarios where minority class data is scarce but critical for accurate model predictions. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and its variants are extensively used to generate synthetic samples, thereby improving the representation of minority classes.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for predicting the activity and safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data, and it is a genuine challenge to construct a classifier from an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over the non-sampling approach for achieving well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. This study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug-Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
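A minimal sketch of the reported recipe (SMOTE followed by a Random Forest), not the authors' code; imbalanced-learn's Pipeline applies SMOTE during fitting only, so held-out folds stay untouched. The synthetic data stands in for the chemical descriptors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Placeholder for a descriptor matrix and imbalanced DILI labels.
X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.8, 0.2], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),  # resamples training folds only
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
print("CV AUC:", cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```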
One of the primary challenges in utilizing deep learning models is the scarcity of datasets of sufficient size to effectively train these networks, and the hurdles in acquiring them. This is particularly significant in object detection, shape completion, and fracture assembly. Instead of scanning a large number of real-world fragments, it is possible to generate massive datasets of synthetic pieces. However, realistic fragmentation is computationally intensive in both preparation (e.g., pre-fractured models) and generation. Simpler algorithms such as Voronoi diagrams provide faster processing at the expense of realism. Hence, computational efficiency and realism must be balanced when generating large datasets for machine learning.
We propose a GPU-based fragmentation method that improves on the baseline Discrete Voronoi Chain to complete this dataset generation task. The dataset in this repository includes voxelized fragments from high-resolution 3D models, curated to be used as training sets for machine learning models. More specifically, these models come from an archaeological collection, yielding more than 1M fragments from 1,052 Iberian vessels. Fragments are not stored individually; instead, the fragmented voxelizations are provided in a compressed binary file (.rle.zip). Once uncompressed, each fragment is represented by a different number in the grid. The class to which each vessel belongs is also included in class.csv. The GPU-based pipeline that generated this dataset is explained at https://doi.org/10.1016/j.cag.2024.104104.
Please note that this dataset originally provided voxel data, point clouds, and triangle meshes. However, we opted to include only voxel data because 1) the original dataset is too large to be uploaded to Zenodo, and 2) the original intent of our paper is to generate implicit data in the form of voxels. If you are interested in the whole dataset (450 GB), please visit the web page of our research institute.
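The exact binary layout of the .rle.zip files is not documented here, so the following loader is purely hypothetical: it assumes flattened (fragment_id, run_length) pairs of little-endian int32 in grid order, and would need to be adapted to the real format described in the paper:

```python
import zipfile
import numpy as np

def load_rle_grid(path, shape):
    """Hypothetical decoder: (value, run_length) int32 pairs, grid order."""
    with zipfile.ZipFile(path) as zf:
        raw = zf.read(zf.namelist()[0])
    pairs = np.frombuffer(raw, dtype="<i4").reshape(-1, 2)
    grid = np.repeat(pairs[:, 0], pairs[:, 1])  # expand runs to voxels
    return grid.reshape(shape)

# fragments = load_rle_grid("vessel_0001.rle.zip", (256, 256, 256))  # names assumed
```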
The Balanced Stock2FER Dataset is a uniformly processed, class-balanced, and augmented version of the original Emotion Dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.
🎯 Purpose
The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics
Source: Based on the Stock2FER dataset
Image Format: RGB .png
Image Size: 75 × 75 pixels
Emotion Classes:
Angry
Disgust
Fear
Happy
Sad
Surprise
Neutral
Total Images: 210
⚙️ Preprocessing Pipeline
Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
📂 Dataset Split
Testing (10%): 3 images
Training (80% of remainder): 22 images
Validation (20% of remainder): 6 images
✅ Advantages
⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automatized ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical research scenario in social science research: a relatively small labeled dataset with infrequent occurrence of categories of interest, which is a part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset including 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate the performance of methods in automatically classifying tweets based on whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
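A minimal sketch of the kind of comparison described (classic classifiers on TF-IDF features for binary tweet classification); the toy tweets and labels are placeholders for the real annotated data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for labeled tweets (1 = about climate change).
tweets = ["climate change is accelerating", "great game last night"]
labels = [1, 0]

for clf in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0)):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(tweets, labels)
    print(type(clf).__name__, pipe.predict(["warming oceans threaten coasts"]))
```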
TimeSpec4LULC is an open-source global dataset of multi-spectral time series for 29 Land Use and Land Cover (LULC) classes, ready to train machine learning models. It was built from the seven spectral bands of the MODIS sensors at 500 m resolution from 2000 to 2021 (262 observations in each time series), and was annotated using spatio-temporal agreement across the 15 global LULC products available in Google Earth Engine (GEE).
TimeSpec4LULC contains two datasets: the original dataset, distributed over 6,076,531 pixels, and a balanced subset of the original distributed over 29,000 pixels. The original dataset contains 30 folders: "Metadata" plus 29 folders corresponding to the 29 LULC classes. The "Metadata" folder holds 29 CSV files describing the metadata of the 29 LULC classes. The remaining 29 folders contain the time series data for the 29 LULC classes; each holds 262 CSV files corresponding to the 262 months. Inside each CSV file, we provide the values of the seven spectral bands as well as the coordinates of all pixels of that LULC class. The balanced subset contains the metadata and time series data for 1,000 pixels per class, representative of the globe; it holds 29 JSON files named after the 29 LULC classes.
The features of the dataset are:
- ".geo": the geometry and coordinates (longitude and latitude) of the pixel center.
- "ADM0_Code": the GAUL country code.
- "ADM1_Code": the GAUL first-level administrative unit code.
- "GHM_Index": the average of the global human modification index.
- "Products_Agreement_Percentage": the agreement percentage over the 15 global LULC products available in GEE.
- "Temporal_Availability_Percentage": the percentage of non-missing values in each band.
- "Pixel_TS": the time series values of the seven spectral bands.
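A hedged sketch of reading one class from the balanced subset; the JSON file name follows the class-name convention described above but is an assumption, and the read orientation may need adjusting to the actual files:

```python
import pandas as pd

# Hypothetical class file name; the balanced subset holds one JSON per class.
df = pd.read_json("Evergreen_Needleleaf_Forest.json")
print(df.columns.tolist())   # expect .geo, ADM0_Code, ..., Pixel_TS
ts = df.loc[0, "Pixel_TS"]   # 262-month, 7-band time series of one pixel
```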
CODEBRIM: COncrete DEfect BRidge IMage Dataset for multi-target multi-class concrete defect classification in computer vision and machine learning.
Dataset as presented and detailed in our CVPR 2019 publication: http://openaccess.thecvf.com/content_CVPR_2019/html/Mundt_Meta-Learning_Convolutional_Neural_Architectures_for_Multi-Target_Concrete_Defect_Classification_With_CVPR_2019_paper.html or https://arxiv.org/abs/1904.08486. If you make use of the dataset, please cite it as follows:
"Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, Visvanathan Ramesh. Meta-learning Convolutional Neural Architectures for Multi-target Concrete Defect Classification with the COncrete DEfect BRidge IMage Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019"
We offer a supplementary GitHub repository with code to reproduce the paper and data loaders: https://github.com/ccc-frankfurt/meta-learning-CODEBRIM
For ease of use we provide the dataset in multiple different versions.
Files contained:
* CODEBRIM_original_images: contains the original full-resolution images and bounding box annotations
* CODEBRIM_cropped_dataset: contains the extracted crops/patches with corresponding class labels from the bounding boxes
* CODEBRIM_classification_dataset: contains the cropped patches with corresponding class labels split into training, validation and test sets for machine learning
* CODEBRIM_classification_balanced_dataset: similar to "CODEBRIM_classification_dataset" but with the exact replication of training images to balance the dataset in order to reproduce results obtained in the paper.
The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Data Bank (Berman et al., 2000) using the "Molecular Function Browser" in the "Advanced Search Interface" ("Signaling (GO ID23052)", protein identity cut-off = 30%). The negative group of 2,077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & Dunbrack, 2003) (November 19th, 2012), using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å, and R-factor 0.25. The full dataset contains 2,685 FASTA sequences of protein chains from the PDB: 608 signaling proteins and 2,077 non-signaling proteins. This kind of unbalanced data is poorly suited as input for learning algorithms, because the results would show high sensitivity and low specificity: learning algorithms would tend to classify most samples into the most common group. To avoid this situation, a pre-processing stage is needed to obtain a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the minority class with new samples interpolated between existing minority-class samples. After this pre-processing, the final dataset is composed of 1,824 positive samples (signaling protein chains) and 2,432 negative samples (non-signaling protein chains). The paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038. Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038 (http://www.sciencedirect.com/science/article/pii/S0022519315003999).
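The interpolation at the heart of SMOTE, a synthetic sample x_new = x_i + u * (x_nn - x_i) between a minority sample and one of its minority-class nearest neighbours, can be written in a few lines. This is an illustrative re-implementation, not the code used in the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]   # a random minority neighbour
        u = rng.random()
        out.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.asarray(out)
```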
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is a traffic dataset containing a balanced mix of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. The dataset is secondary CSV feature data composed from five public traffic datasets. It was composed according to three criteria. The first criterion is to combine public datasets widely used in existing work that contain both encrypted malicious and legitimate traffic, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., balance between malicious and legitimate network traffic and a similar amount of traffic contributed by each individual dataset. Thus, approximately equal proportions of malicious and legitimate traffic are extracted from each selected public dataset using random sampling, and no single dataset contributes far more traffic than the others. The third criterion is that the dataset includes encrypted malicious and legitimate traffic from both conventional and IoT devices, as these devices are increasingly deployed together in the same environments, such as offices, homes, and other smart city settings.
Based on these criteria, five public datasets were selected. After data pre-processing, details of each selected public dataset and the final composed dataset are given in the "Dataset Statistic Analysis Document". The document summarizes the malicious and legitimate traffic sizes selected from each public dataset, the proportion of each public dataset's selected traffic relative to the total composed dataset (% w.r.t. the composed dataset), the proportion of encrypted traffic selected from each public dataset (% of the selected public dataset), and the total traffic size of the composed dataset. From the table, we observe that each public dataset contributes approximately 20% of the composed dataset, except for CICIDS-2012 (due to its limited amount of encrypted malicious traffic). This achieves balance across the individual datasets and reduces bias towards traffic from any one dataset during learning. The sizes of the malicious and legitimate traffic are also almost the same, achieving class balance. The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning model training, sample train and test sets are also provided; they are split at a 1:4 ratio with stratification applied during the split. These sets can be used directly for machine or deep learning model training on the selected features.
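A sketch of the stated stratified 1:4 split (interpreted here as test:train = 1:4, i.e. a 20% test fraction); the CSV file and label column names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("composed_traffic_features.csv")  # hypothetical file name
X, y = df.drop(columns=["label"]), df["label"]     # "label" column assumed

# 1:4 test:train ratio, preserving the malicious/legitimate class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
```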
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The BD Sports-10 Dataset is a comprehensive collection of 3,000 high-resolution videos (1920×1080 pixels at 30 frames per second) showcasing ten culturally and traditionally significant Bangladeshi sports. It is designed to support research in action recognition, cultural heritage preservation, sports video classification, and machine learning applications. The BD_Sports_10 folder contains two subfolders: Annotation and Dataset. The Dataset folder includes 10 subfolders, each corresponding to a sports class. Each sports category comprises 300 videos, ensuring a balanced distribution for supervised learning tasks. The dataset includes the following Bangladeshi sports: Hari Vanga, Joldanga, Kanamachi, Lathim, Morog Lorai, Toilakto Kolagach Arohon (Kolagach), Nouka Baich, Kabaddi, Kho Kho, and Lathi Khela.