License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Aim: In neuroscience research, data are often characterized by an imbalanced distribution between majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to address this problem, and much work has been done comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has typically been tested across a wide variety of datasets, without considering performance on each specific dataset. In this study, we compare the performance of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsy who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsy, for a supervised classification problem aimed at distinguishing between epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Six ensemble methods specific to the imbalanced domain were also tested. To compare performance, the area under the ROC curve (AUC), F-measure, geometric mean, and balanced accuracy were considered.
Results: Both resampling strategies improved performance with respect to the original dataset. Oversampling was more sensitive to the type of classifier employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performance. The undersampling approaches were more robust across classifiers than oversampling, with Random Undersampling (RUS) performing best despite being the simplest and most basic resampling method.
Conclusions: Applying machine learning techniques that account for class balance through resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the classification method that is used together with resampling to maximize the benefit to the outcome.
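Several entries in this collection revolve around the same resampling toolbox, so a minimal sketch may help; this is not the study's pipeline, just an illustration using scikit-learn and imbalanced-learn, where the synthetic data, classifier choice, and parameters are all assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, f1_score
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import geometric_mean_score

# Synthetic stand-in for the SEEG-derived network features (9:1 imbalance).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = [("none", None),
            ("ADASYN", ADASYN(random_state=0)),
            ("RUS", RandomUnderSampler(random_state=0))]
for name, sampler in samplers:
    # Resample only the training data, never the held-out test set.
    X_res, y_res = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    pred, proba = clf.predict(X_te), clf.predict_proba(X_te)[:, 1]
    print(f"{name:7s} AUC={roc_auc_score(y_te, proba):.3f} "
          f"F1={f1_score(y_te, pred):.3f} "
          f"G-mean={geometric_mean_score(y_te, pred):.3f} "
          f"BalAcc={balanced_accuracy_score(y_te, pred):.3f}")
```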
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They are the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one only has 4.04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
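A minimal sketch of how such a pair might be used, assuming hypothetical file names balanced.arff and unbalanced.arff and that the class attribute is the last column (both assumptions):

```python
import pandas as pd
from scipy.io import arff
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

for path in ["balanced.arff", "unbalanced.arff"]:  # hypothetical file names
    data, _ = arff.loadarff(path)
    df = pd.DataFrame(data)
    # Nominal ARFF values load as bytes; decode and mark class c0 as positive.
    y = (df.iloc[:, -1].str.decode("utf-8") == "c0").astype(int)
    X = df.iloc[:, :-1].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(path)
    print(classification_report(y_te, clf.predict(X_te)))
```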
The Balanced KDEF Dataset is a uniformly processed, class-balanced, and augmented version of the FER2013-KDEF composite dataset. This curated version is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to enhance model performance and comparability.
🎯 Purpose
The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics
Source: Based on the KDEF dataset
Image Format: Grayscale .png
Image Size: 75 × 75 pixels
Emotion Classes:
Angry
Disgust
Fear
Happy
Sad
Surprise
Neutral
Total Images: 62,923
Images per Class: 8,989
⚙️ Preprocessing Pipeline
Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
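A minimal torchvision sketch of the steps listed above (the same pipeline is described for the other balanced FER datasets in this collection); the specific parameter values are illustrative assumptions, not the dataset's actual settings:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),            # convert to grayscale
    transforms.Resize((75, 75)),                            # standardize size
    transforms.RandomRotation(degrees=15),                  # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                 # horizontal flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),  # sharpness
    transforms.ToTensor(),
])
```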
📂 Dataset Split
Testing (10%): 6,292 images
Training (80% of remainder): 45,305 images
Validation (20% of remainder): 11,326 images
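The split arithmetic (10% held out for testing, then an 80/20 train/validation split of the remainder) can be sketched as follows; function and variable names are illustrative:

```python
from sklearn.model_selection import train_test_split

def two_stage_split(paths, labels, seed=0):
    """10% test, then 80/20 train/validation on the remainder, stratified."""
    X_rem, X_test, y_rem, y_test = train_test_split(
        paths, labels, test_size=0.10, stratify=labels, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rem, y_rem, test_size=0.20, stratify=y_rem, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# With 62,923 images this yields roughly 6,292 test, 45,305 train,
# and 11,326 validation images, matching the counts above.
```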
✅ Advantages
⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
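This is not the NATE implementation; as a rough sketch of the idea it describes (oversampling combined with a tree-based classifier, plus a first look at feature contributions), using synthetic data and arbitrary parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, matthews_corrcoef, f1_score
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for a credit dataset with a rare default class.
X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training data, then fit a gradient-boosted model.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
model = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)

pred = model.predict(X_te)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("MCC:", matthews_corrcoef(y_te, pred), "F1:", f1_score(y_te, pred))
# A crude interpretability cue: rank features by their contribution.
print("Top features:", model.feature_importances_.argsort()[::-1][:5])
```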
The Balanced Caer-S Dataset is a uniformly processed, class-balanced, and augmented version of the original Caer-S Emotion Dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.
🎯 Purpose
The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics
Source: Based on the Caer-S dataset
Image Size: 75 × 75 pixels
Emotion Classes:
Angry
Disgust
Fear
Happy
Sad
Surprise
Neutral
Total Images: 63,000
⚙️ Preprocessing Pipeline
Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
📂 Dataset Split (figures are per emotion class; 9,000 images per class)
Testing (10%): 900 images
Training (80% of remainder): 6,480 images
Validation (20% of remainder): 1,620 images
✅ Advantages
⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Sunflower Growth Stage Image Dataset for Phenological Classification was collected from agricultural fields in Bangladesh, focusing on the identification and classification of sunflower growth stages. Images were captured directly in the field using a Redmi Note 11 smartphone, under natural daylight and varying weather conditions to reflect real-world environments. This dataset is meant to aid research in deep learning, computer vision, and plant phenology by providing data for automated classification of growth stages.
A total of 1,255 original images were gathered, each with a high resolution of 12,288 × 16,320 pixels and approximately 25 MB in size. The images are divided into five classes: Stage1 (Young_Bud) with 238 images, Stage2 (Mature_Bud) with 272 images, Stage3 (Early_Bloom) with 218 images, Stage4 (Full_Bloom) with 213 images, and Stage5 (Wilted) with 314 images. To balance the dataset for training, each class was augmented to have 500 images, resulting in a final balanced collection of 2,500 images.
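A hedged sketch of topping a class folder up to 500 images with simple augmentations; the directory layout, file extension, and transform parameters are all assumptions, not the authors' procedure:

```python
import random
from pathlib import Path
from PIL import Image, ImageEnhance

def augment_once(img):
    """Apply one random rotation/flip/brightness perturbation."""
    img = img.rotate(random.uniform(-20, 20))
    if random.random() < 0.5:
        img = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
    return ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))

def balance_class(class_dir, target=500):
    """Write augmented copies until the folder reaches `target` images."""
    files = sorted(Path(class_dir).glob("*.jpg"))  # extension assumed
    for i in range(max(0, target - len(files))):
        src = random.choice(files)
        augment_once(Image.open(src)).save(src.with_name(f"aug_{i}_{src.name}"))
```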
Validation of the dataset was carried out by a Sub-Assistant Agriculture Officer from the Department of Agricultural Extension (DAE), Bangladesh, ensuring its reliability. The data was collected at two main sites: Daffodil International University (Ashulia Campus) and Model Town Nursery, Ashulia, Bangladesh. The camera used for capturing the images was a Redmi Note 11, with 24-bit color depth, an aperture of f/1.8, and images saved in JPEG format.
Example metadata for an image shows it was taken on 2025-05-22 at 17:47 using the MediaTek Camera Application. The image's dimensions are 12,288 × 16,320 pixels at 72 dpi with 24-bit sRGB color representation. The camera details include Xiaomi as the maker, model 23117RA86G, f-stop f/1.6, exposure time 1/100 sec, ISO 200, focal length 6 mm, and auto white balance. GPS coordinates recorded were Latitude 23.5247046, Longitude 90.1918097, Altitude 34.5 m. The example image file, IMG_20250522_174724.jpg, is a JPEG of size 26.1 MB.
Attribution Notice: This dataset also includes 24 images derived from the publicly available dataset: Sagor, Saifuddin; Hossan, Md. Faysal; Ahmed, Faruk; Reyad, Md. Zamirul Islam (2025), "Sunflower Plant Health and Growth Stage Image Dataset for Agricultural Machine Learning Applications", Mendeley Data, V1, doi: 10.17632/y3ygk98ngr.1
These images were incorporated because the number of collected field images was insufficient for the Stage4 (Full_Bloom) Class. After inclusion, a portion of these images was further augmented to increase the dataset size and maintain class balance. Any modifications or augmentations applied to the derived images are the responsibility of the present authors.
The original dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
📂 Dataset Overview
This dataset contains 1,500 real-world vehicle accident videos labeled according to severity levels: minor, moderate, and major, with 500 videos per class. Each video is manually reviewed and assigned a severity class based on visual cues such as collision intensity, vehicle damage, and traffic disruption. To ensure balanced class distribution, additional samples were generated for underrepresented categories through data augmentation. While specific techniques are not detailed, this process helped mitigate class imbalance and improve model generalization without compromising label consistency.
Unlike frame-based datasets, this collection emphasizes video-level analysis, making it suitable for tasks like:
- 🚗 Accident detection
- 📊 Severity classification
- 🧠 Scene understanding
- ⏱️ Event-level prediction
⚡ The dataset is ideal for researchers and developers working on:
- 🎬 Video classification
- 🤖 Deep learning applications in traffic surveillance
- 🕵️ Object detection in dynamic scenes
- 🚦 Intelligent transportation systems
- 🛣️ Road safety analytics
📁 Structure & Format
The dataset is organized into three main splits: train, val, and test, following standard machine learning conventions:
Balanced Accident Video Dataset/
├── train/
│   ├── minor/
│   ├── moderate/
│   └── major/
├── val/
│   ├── minor/
│   ├── moderate/
│   └── major/
└── test/
    ├── minor/
    ├── moderate/
    └── major/
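As a quick illustrative check of the layout above (the video file extension is an assumption):

```python
from pathlib import Path

root = Path("Balanced Accident Video Dataset")
for split in ("train", "val", "test"):
    for cls in ("minor", "moderate", "major"):
        n = len(list((root / split / cls).glob("*.mp4")))  # extension assumed
        print(f"{split}/{cls}: {n} videos")  # totals per class should sum to 500
```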
📌 Additional Notes
All videos are sourced from publicly available traffic footage and curated to ensure class balance and real-world diversity. The dataset supports multi-class classification and can be extended for temporal modeling, action recognition, or video summarization tasks.
📜 License
This dataset is released under the CC BY-NC 4.0 license, allowing academic and personal use with attribution. Commercial use is not permitted.
🌍 Source
All videos are collected from the official website of the Republic of Türkiye General Directorate of Security – Traffic Department: 🔗 https://www.trafik.gov.tr/kgys-goruntuleri
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Number of datasets on which a combination of machine learning and sampling methods performed best in terms of the area under the receiver operating characteristic curve.
According to our latest research, the global Data Balancing for Model Training market size in 2024 is valued at USD 1.37 billion, with a robust CAGR of 19.8% expected during the forecast period. By 2033, the market is forecasted to reach USD 6.59 billion. The primary growth factor driving this market is the exponential increase in demand for high-quality, unbiased machine learning models across industries, fueled by the rapid digital transformation and adoption of artificial intelligence.
One of the most significant growth drivers for the Data Balancing for Model Training market is the surging need for accurate and reliable AI models in critical sectors such as healthcare, finance, and retail. As organizations increasingly leverage AI and machine learning for decision-making, the importance of balanced datasets becomes paramount to ensure model fairness, accuracy, and compliance. Data imbalance, if not addressed, can lead to biased predictions and suboptimal business outcomes, making data balancing solutions essential for organizations aiming to deploy trustworthy and high-performing models. Furthermore, regulatory pressures and ethical considerations are compelling enterprises to adopt advanced data balancing techniques, further accelerating market growth.
Another key factor propelling the market is the proliferation of big data and the complexity of modern datasets. With the explosion of data sources and the diversity of data types, organizations are facing unprecedented challenges in managing and processing imbalanced datasets. This complexity necessitates the adoption of sophisticated data balancing solutions such as oversampling, undersampling, hybrid methods, and synthetic data generation. These solutions not only enhance model performance but also streamline the data preparation process, enabling faster and more efficient model training cycles. The growing integration of automated machine learning (AutoML) platforms is also contributing to the adoption of data balancing tools, as these platforms increasingly embed balancing techniques to democratize AI development.
The ongoing digital transformation across industries, coupled with the rise of Industry 4.0, is further boosting the demand for data balancing solutions. Enterprises in manufacturing, IT & telecommunications, and retail are deploying AI-powered applications at scale, which rely heavily on balanced training data to deliver accurate insights and automation. The expanding use of Internet of Things (IoT) devices and connected systems is generating vast volumes of imbalanced data, necessitating robust data balancing frameworks. Additionally, advancements in synthetic data generation are opening new avenues for addressing data scarcity and imbalance, especially in sensitive domains like healthcare where data privacy is a concern.
From a regional perspective, North America leads the Data Balancing for Model Training market, driven by early adoption of AI technologies, strong presence of tech giants, and significant investments in AI research and development. Europe follows closely, supported by stringent regulatory frameworks and a growing focus on ethical AI. The Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT infrastructure, and increasing adoption of AI in emerging economies such as China and India. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with increasing awareness and investments in AI-driven solutions.
The Solution Type segment of the Data Balancing for Model Training market encompasses Oversampling, Undersampling, Hybrid Methods, Synthetic Data Generation, and Others. Oversampling remains one of the most widely adopted techniques, particularly in scenarios where minority class data is scarce but critical for accurate model predictions. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and its variants are extensively used to generate synthetic samples, thereby improving the representation of minority classes.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for predicting the activity and safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data, and it is a genuine challenge to construct a classifier from an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over the non-sampling approach for achieving well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. This study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug-Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
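A minimal sketch of the reported recipe (SMOTE followed by a Random Forest), not the authors' code; imbalanced-learn's Pipeline applies SMOTE during fitting only, so held-out folds stay untouched. The synthetic data stands in for the chemical descriptors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Placeholder for a descriptor matrix and imbalanced DILI labels.
X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.8, 0.2], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),  # resamples training folds only
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
print("CV AUC:", cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```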
One of the primary challenges in utilizing deep learning models is the scarcity of datasets of sufficient size to effectively train these networks, and the hurdles in acquiring them. This is particularly significant in object detection, shape completion, and fracture assembly. Instead of scanning a large number of real-world fragments, it is possible to generate massive datasets of synthetic pieces. However, realistic fragmentation is computationally intensive in both preparation (e.g., pre-fractured models) and generation. Simpler algorithms such as Voronoi diagrams provide faster processing at the expense of realism. Hence, computational efficiency and realism must be balanced when generating large datasets for machine learning.
We propose a GPU-based fragmentation method that improves on the baseline Discrete Voronoi Chain to complete this dataset generation task. The dataset in this repository includes voxelized fragments from high-resolution 3D models, curated to be used as training sets for machine learning models. More specifically, these models come from an archaeological collection, yielding more than 1M fragments from 1,052 Iberian vessels. Fragments are not stored individually; instead, the fragmented voxelizations are provided in a compressed binary file (.rle.zip). Once uncompressed, each fragment is represented by a different number in the grid. The class to which each vessel belongs is also included in class.csv. The GPU-based pipeline that generated this dataset is explained at https://doi.org/10.1016/j.cag.2024.104104.
Please note that this dataset originally provided voxel data, point clouds, and triangle meshes. However, we opted to include only voxel data because 1) the original dataset is too large to be uploaded to Zenodo, and 2) the original intent of our paper is to generate implicit data in the form of voxels. If you are interested in the whole dataset (450 GB), please visit the web page of our research institute.
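The exact binary layout of the .rle.zip files is not documented here, so the following loader is purely hypothetical: it assumes flattened (fragment_id, run_length) pairs of little-endian int32 in grid order, and would need to be adapted to the real format described in the paper:

```python
import zipfile
import numpy as np

def load_rle_grid(path, shape):
    """Hypothetical decoder: (value, run_length) int32 pairs, grid order."""
    with zipfile.ZipFile(path) as zf:
        raw = zf.read(zf.namelist()[0])
    pairs = np.frombuffer(raw, dtype="<i4").reshape(-1, 2)
    grid = np.repeat(pairs[:, 0], pairs[:, 1])  # expand runs to voxels
    return grid.reshape(shape)

# fragments = load_rle_grid("vessel_0001.rle.zip", (256, 256, 256))  # names assumed
```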
The Balanced Stock2FER Dataset is a uniformly processed, class-balanced, and augmented version of the original Emotion Dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.
🎯 Purpose
The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics
Source: Based on the Stock2FER dataset
Image Format: RGB .png
Image Size: 75 × 75 pixels
Emotion Classes:
Angry
Disgust
Fear
Happy
Sad
Surprise
Neutral
Total Images: 210
⚙️ Preprocessing Pipeline
Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
📂 Dataset Split
Testing (10%): 3 images
Training (80% of remainder): 22 images
Validation (20% of remainder): 6 images
✅ Advantages
⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automatized ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical research scenario in social science research: a relatively small labeled dataset with infrequent occurrence of categories of interest, which is a part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset including 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate the performance of methods in automatically classifying tweets based on whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
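A minimal sketch of the kind of comparison described (classic classifiers on TF-IDF features for binary tweet classification); the toy tweets and labels are placeholders for the real annotated data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for labeled tweets (1 = about climate change).
tweets = ["climate change is accelerating", "great game last night"]
labels = [1, 0]

for clf in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0)):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(tweets, labels)
    print(type(clf).__name__, pipe.predict(["warming oceans threaten coasts"]))
```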
TimeSpec4LULC is an open-source global dataset of multi-spectral time series for 29 Land Use and Land Cover (LULC) classes, ready to train machine learning models. It was built from the seven spectral bands of the MODIS sensors at 500 m resolution from 2000 to 2021 (262 observations in each time series), and was annotated using spatio-temporal agreement across the 15 global LULC products available in Google Earth Engine (GEE).
TimeSpec4LULC contains two datasets: the original dataset, distributed over 6,076,531 pixels, and a balanced subset of the original distributed over 29,000 pixels. The original dataset contains 30 folders: "Metadata" plus 29 folders corresponding to the 29 LULC classes. The "Metadata" folder holds 29 CSV files describing the metadata of the 29 LULC classes. The remaining 29 folders contain the time series data for the 29 LULC classes; each holds 262 CSV files corresponding to the 262 months. Inside each CSV file, we provide the values of the seven spectral bands as well as the coordinates of all pixels of that LULC class. The balanced subset contains the metadata and time series data for 1,000 pixels per class, representative of the globe; it holds 29 JSON files named after the 29 LULC classes.
The features of the dataset are:
- ".geo": the geometry and coordinates (longitude and latitude) of the pixel center.
- "ADM0_Code": the GAUL country code.
- "ADM1_Code": the GAUL first-level administrative unit code.
- "GHM_Index": the average of the global human modification index.
- "Products_Agreement_Percentage": the agreement percentage over the 15 global LULC products available in GEE.
- "Temporal_Availability_Percentage": the percentage of non-missing values in each band.
- "Pixel_TS": the time series values of the seven spectral bands.
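A hedged sketch of reading one class from the balanced subset; the JSON file name follows the class-name convention described above but is an assumption, and the read orientation may need adjusting to the actual files:

```python
import pandas as pd

# Hypothetical class file name; the balanced subset holds one JSON per class.
df = pd.read_json("Evergreen_Needleleaf_Forest.json")
print(df.columns.tolist())   # expect .geo, ADM0_Code, ..., Pixel_TS
ts = df.loc[0, "Pixel_TS"]   # 262-month, 7-band time series of one pixel
```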
CODEBRIM: COncrete DEfect BRidge IMage Dataset for multi-target multi-class concrete defect classification in computer vision and machine learning.
Dataset as presented and detailed in our CVPR 2019 publication: http://openaccess.thecvf.com/content_CVPR_2019/html/Mundt_Meta-Learning_Convolutional_Neural_Architectures_for_Multi-Target_Concrete_Defect_Classification_With_CVPR_2019_paper.html or https://arxiv.org/abs/1904.08486. If you make use of the dataset, please cite it as follows:
"Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, Visvanathan Ramesh. Meta-learning Convolutional Neural Architectures for Multi-target Concrete Defect Classification with the COncrete DEfect BRidge IMage Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019"
We offer a supplementary GitHub repository with code to reproduce the paper and data loaders: https://github.com/ccc-frankfurt/meta-learning-CODEBRIM
For ease of use we provide the dataset in multiple different versions.
Files contained:
* CODEBRIM_original_images: contains the original full-resolution images and bounding box annotations
* CODEBRIM_cropped_dataset: contains the extracted crops/patches with corresponding class labels from the bounding boxes
* CODEBRIM_classification_dataset: contains the cropped patches with corresponding class labels split into training, validation and test sets for machine learning
* CODEBRIM_classification_balanced_dataset: similar to "CODEBRIM_classification_dataset" but with the exact replication of training images to balance the dataset in order to reproduce results obtained in the paper.
The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Data Bank (Berman et al., 2000) using the "Molecular Function Browser" in the "Advanced Search Interface" ("Signaling (GO ID23052)", protein identity cut-off = 30%). The negative group of 2,077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & Dunbrack, 2003) (November 19th, 2012), using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å, and R-factor 0.25. The full dataset contains 2,685 FASTA sequences of protein chains from the PDB: 608 signaling proteins and 2,077 non-signaling proteins. This kind of unbalanced data is poorly suited as input for learning algorithms, because the results would show high sensitivity and low specificity: learning algorithms would tend to classify most samples into the most common group. To avoid this situation, a pre-processing stage is needed to obtain a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the minority class with new samples interpolated between existing minority-class samples. After this pre-processing, the final dataset is composed of 1,824 positive samples (signaling protein chains) and 2,432 negative samples (non-signaling protein chains). The paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038. Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038 (http://www.sciencedirect.com/science/article/pii/S0022519315003999).
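The interpolation at the heart of SMOTE, a synthetic sample x_new = x_i + u * (x_nn - x_i) between a minority sample and one of its minority-class nearest neighbours, can be written in a few lines. This is an illustrative re-implementation, not the code used in the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]   # a random minority neighbour
        u = rng.random()
        out.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.asarray(out)
```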
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is a traffic dataset containing a balanced mix of encrypted malicious and legitimate traffic for encrypted malicious traffic detection. The dataset is secondary CSV feature data composed from five public traffic datasets. It was composed according to three criteria. The first criterion is to combine public datasets widely used in existing work that contain both encrypted malicious and legitimate traffic, such as the Malware Capture Facility Project dataset and the CICIDS-2017 dataset. The second criterion is to ensure data balance, i.e., balance between malicious and legitimate network traffic and a similar amount of traffic contributed by each individual dataset. Thus, approximately equal proportions of malicious and legitimate traffic are extracted from each selected public dataset using random sampling, and no single dataset contributes far more traffic than the others. The third criterion is that the dataset includes encrypted malicious and legitimate traffic from both conventional and IoT devices, as these devices are increasingly deployed together in the same environments, such as offices, homes, and other smart city settings.
Based on these criteria, five public datasets were selected. After data pre-processing, details of each selected public dataset and the final composed dataset are given in the "Dataset Statistic Analysis Document". The document summarizes the malicious and legitimate traffic sizes selected from each public dataset, the proportion of each public dataset's selected traffic relative to the total composed dataset (% w.r.t. the composed dataset), the proportion of encrypted traffic selected from each public dataset (% of the selected public dataset), and the total traffic size of the composed dataset. From the table, we observe that each public dataset contributes approximately 20% of the composed dataset, except for CICIDS-2012 (due to its limited amount of encrypted malicious traffic). This achieves balance across the individual datasets and reduces bias towards traffic from any one dataset during learning. The sizes of the malicious and legitimate traffic are also almost the same, achieving class balance. The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning model training, sample train and test sets are also provided; they are split at a 1:4 ratio with stratification applied during the split. These sets can be used directly for machine or deep learning model training on the selected features.
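A sketch of the stated stratified 1:4 split (interpreted here as test:train = 1:4, i.e. a 20% test fraction); the CSV file and label column names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("composed_traffic_features.csv")  # hypothetical file name
X, y = df.drop(columns=["label"]), df["label"]     # "label" column assumed

# 1:4 test:train ratio, preserving the malicious/legitimate class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
```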
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The BD Sports-10 Dataset is a comprehensive collection of 3,000 high-resolution videos (1920×1080 pixels at 30 frames per second) showcasing ten culturally and traditionally significant Bangladeshi sports. It is designed to support research in action recognition, cultural heritage preservation, sports video classification, and machine learning applications. The BD_Sports_10 folder contains two subfolders: Annotation and Dataset. The Dataset folder includes 10 subfolders, each corresponding to a sports class. Each sports category comprises 300 videos, ensuring a balanced distribution for supervised learning tasks. The dataset includes the following Bangladeshi sports: Hari Vanga, Joldanga, Kanamachi, Lathim, Morog Lorai, Toilakto Kolagach Arohon (Kolagach), Nouka Baich, Kabaddi, Kho Kho, and Lathi Khela.