100+ datasets found

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.
Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
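Two of the metrics named in this abstract, Balanced Accuracy and Geometric Mean, can be computed directly from a binary confusion matrix. The following is a minimal sketch using the standard definitions; the function name and the counts are hypothetical and are not taken from the study.

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """Return (balanced_accuracy, geometric_mean) from binary confusion counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    balanced_accuracy = (sensitivity + specificity) / 2
    g_mean = math.sqrt(sensitivity * specificity)
    return balanced_accuracy, g_mean

# Hypothetical counts: 10 positive and 90 negative samples.
bal_acc, g_mean = imbalance_metrics(tp=8, fn=2, tn=81, fp=9)
```

Both values depend only on per-class recall, so neither is inflated when a classifier simply favours the majority class.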
Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...
plos.figshare.com
txt
Updated Jun 18, 2023
Cite
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pone.0180830
Dataset updated
Jun 18, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, where the positive samples are the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to strengthen the effect of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets, where the first approach fails. We also find them more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and a shorter run time compared to the brute-force method.
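For orientation, the core SMOTE step creates a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration, not the paper's implementation; the function name, the toy points, and the choice of `k` and `n_new` as the tunable knobs are assumptions (the abstract does not say which two SMOTE parameters were optimized).

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies within the range already covered by the minority class.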
Imbalanced Data
ieee-dataport.org
Updated Aug 23, 2023
Cite
Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
Dataset updated
Aug 23, 2023
Authors
Blessa Binolin M
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification learning on non-stationary data may face dynamic changes from time to time. The major problem in it is the class imbalance and high cost of labeling instances despite drifts. Imbalance is due to lower number of samples in the minority class than the majority class. Imbalanced data results in the misclassification of data points.
Data from: High impact bug report identification with imbalanced learning...
researchdata.smu.edu.sg
zip
Updated Jun 1, 2023
Cite
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
Available download formats: zip
Unique identifier
https://doi.org/10.25440/smu.12062763.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies" and the full-text is available from: https://ink.library.smu.edu.sg/sis_research/3702
In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedule and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected time or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.
Supplementary code and data available from GitHub:
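Random under-sampling (RUS), one of the strategies named above, can be sketched in a few lines: every class is cut down to the size of the smallest class by sampling without replacement. This is a generic illustration, not the authors' code; the function name, report identifiers, and labels are hypothetical.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class keeps as many
    items as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        for x in rng.sample(items, n_min):  # sampling without replacement
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical data: 3 high-impact reports among 20.
reports = [f"report-{i}" for i in range(20)]
labels = ["high-impact"] * 3 + ["other"] * 17
bal_x, bal_y = random_undersample(reports, labels)
```

The balanced set is then passed to the classifier; the discarded majority samples are the price paid for balance.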
Classification of typical algorithms for imbalanced sampling and...
plos.figshare.com
xls
Updated Jun 8, 2023
Cite
Mingyang Deng; Yingshi Guo; Chang Wang; Fuwei Wu (2023). Classification of typical algorithms for imbalanced sampling and representative literature. [Dataset]. http://doi.org/10.1371/journal.pone.0259227.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0259227.t001
Dataset updated
Jun 8, 2023
Dataset provided by
PLOS ONE
Authors
Mingyang Deng; Yingshi Guo; Chang Wang; Fuwei Wu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification of typical algorithms for imbalanced sampling and representative literature.
Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods...
frontiersin.figshare.com
pdf
Updated May 30, 2023
Cite
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner (2023). Data_Sheet 1_Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets.PDF [Dataset]. http://doi.org/10.3389/fchem.2018.00362.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fchem.2018.00362.s001
Dataset updated
May 30, 2023
Dataset provided by
Frontiers
Authors
Priyanka Banerjee; Frederic O. Dehnbostel; Robert Preissner
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for prediction of activity as well as safety profiles of the chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over the non-sampling method, to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study has achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00% and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
Number of datasets on which a combination of machine learning and sampling...
plos.figshare.com
xls
Updated Jun 16, 2023
Cite
Misuk Kim; Kyu-Baek Hwang (2023). Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the receiver operating characteristics curve. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0271260.t002
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Misuk Kim; Kyu-Baek Hwang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the receiver operating characteristics curve.
Number of datasets on which a combination of machine learning and sampling...
plos.figshare.com
xls
Updated Jun 14, 2023
Cite
Misuk Kim; Kyu-Baek Hwang (2023). Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0271260.t001
Dataset updated
Jun 14, 2023
Dataset provided by
PLOS ONE
Authors
Misuk Kim; Kyu-Baek Hwang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of datasets on which a combination of machine learning and sampling methods performed the best in terms of the area under the precision-recall curve.
Subsampling reveals that unbalanced sampling affects STRUCTURE results in a...
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 10, 2018
Cite
Patrick G. Meirmans (2018). Subsampling reveals that unbalanced sampling affects STRUCTURE results in a multi-species dataset [Dataset]. http://doi.org/10.5061/dryad.nh4366s
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.nh4366s
Dataset updated
Jul 10, 2018
Dataset provided by
University of Amsterdam
Authors
Patrick G. Meirmans
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Area covered
Alps, Carpathians, Europe
Description
Studying the genetic population structure of species can reveal important insights into several key evolutionary, historical, demographic, and anthropogenic processes. One of the most important statistical tools for inferring genetic clusters is the program STRUCTURE. Recently, several papers have pointed out that STRUCTURE may show a bias when the sampling design is unbalanced, resulting in spurious joining of underrepresented populations and spurious separation of overrepresented populations. Suggestions to overcome this bias include subsampling and changing the ancestry model, but the performance of these two methods has not yet been tested on actual data. Here, I use a dataset of twelve high-alpine plant species to test whether unbalanced sampling affects the STRUCTURE inference of population differentiation between the European Alps and the Carpathians. For four of the twelve species, subsampling of the Alpine populations –to match the sample size between the Alps and the Carpathians– resulted in a drastically different clustering than the full dataset. On the other hand, STRUCTURE results with the alternative ancestry model were indistinguishable from the results with the default model. Based on these results, the subsampling strategy seems a more viable approach to overcome the bias than the alternative ancestry model. However, subsampling is only possible when there is an a priori expectation of what constitute the main clusters. Though these results do not mean that the use of STRUCTURE should be discarded, it does indicate that users of the software should be cautious about the interpretation of the results when sampling is unbalanced.
Research on Financial Distress Prediction of Listed Companies Based on...
scidb.cn
Updated Feb 28, 2024
Cite
邢凯; 盛利琴; 张盼; 李珊 (2024). Research on Financial Distress Prediction of Listed Companies Based on Unbalanced Data Processing and Multivariable Screening Methods [Dataset]. http://doi.org/10.57760/sciencedb.j00214.00026
Available formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.j00214.00026
Dataset updated
Feb 28, 2024
Dataset provided by
Science Data Bank
Authors
邢凯; 盛利琴; 张盼; 李珊
License
AGPL-3.0: https://api.github.com/licenses/agpl-3.0
Description
In the context of domestic supply-side structural reform, the market environment is complex and ever-changing, and corporate debt defaults occur frequently. It is necessary to establish a timely and effective financial distress warning model. Most of the existing distress prediction models have not effectively solved problems such as imbalanced datasets, unstable selection of key prediction indicators, and randomness in sample matching, and are not suitable for the current complex and changing market conditions in China. Therefore, this article uses the Bootstrap resampling method to construct 1000 research samples, and uses LASSO (least absolute shrinkage and selection operator) variable selection to screen key predictive factors and construct a logit model that predicts distress 3 years ahead. In the prediction stage, the samples are randomly split and predicted 1000 times to reduce random errors. The results indicate that the logit distress prediction model constructed by combining the Bootstrap sample construction method with LASSO has stronger predictive ability than the traditional "similar industry asset size" matching method. In addition, the embedded Bootstrap-LASSO-logit model has better predictive performance than the mainstream O-Score and ZChina Score models, with an accuracy increase of 10%, and is more suitable for China's time-varying market. The model constructed in this article can help corporate stakeholders better identify financial difficulties and make timely adjustments to reduce corporate bond default rates or avoid corporate defaults.
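As a rough illustration of the resampling step, a plain bootstrap draws each research sample with replacement from the original rows, keeping the original size. The sketch below assumes this textbook form; the description does not say how the authors stratify or match firms, so the function name, the firm identifiers, and the sample size of 50 are hypothetical.

```python
import random

def bootstrap_samples(rows, n_samples=1000, seed=0):
    """Draw n_samples bootstrap replicates, each the same size as rows,
    by sampling row indices with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    return [[rows[rng.randrange(n)] for _ in range(n)] for _ in range(n_samples)]

# Hypothetical firm identifiers standing in for firm-year observations.
firms = list(range(50))
samples = bootstrap_samples(firms, n_samples=1000)
```

A variable-selection step such as LASSO would then be run on each replicate, and predictors that are selected consistently across replicates would be retained.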
Data from: S1 Datasets -
plos.figshare.com
bin
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0317396.s001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.s001
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.
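Two of the five metrics listed here, Cohen's kappa and the Matthews correlation coefficient (MCC), are less familiar than precision and recall. The following minimal sketch computes both from a binary confusion matrix using their standard definitions; the function name and counts are hypothetical and unrelated to the reported results.

```python
import math

def kappa_and_mcc(tp, fn, tn, fp):
    """Return (Cohen's kappa, MCC) for a binary confusion matrix."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n  # observed agreement (plain accuracy)
    # Agreement expected by chance from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return kappa, mcc

# Hypothetical counts: 10 positive and 90 negative samples.
kappa, mcc = kappa_and_mcc(tp=8, fn=2, tn=81, fp=9)
```

Both metrics correct for chance agreement, which is why they are often preferred over accuracy on imbalanced data.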
S3 Dataset -
plos.figshare.com
xlsx
Updated Dec 13, 2024
Cite
JiaMing Gong; MingGang Dong (2024). S3 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s003
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0311133.s003
Dataset updated
Dec 13, 2024
Dataset provided by
PLOS ONE
Authors
JiaMing Gong; MingGang Dong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, current works account for issues of class imbalance and concept drift. And only few works have considered these issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) to consider data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balanced strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in the information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method to improve the accuracy of classifying minority class samples into high quality samples and common samples via the density of similar samples. In this manner high quality and common samples are randomly selected for training the classifier. Finally, to solve the issue of concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of the sub-classifier according to the performance on the arrived data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms considering four synthetic and one real-world data streams.
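The balancing strategy described above relies on the information entropy of the class distribution within a data chunk. The sketch below shows only that building block, the Shannon entropy of a chunk's labels; it does not reproduce EDAC's pairing or sampling logic, and the function name and example chunks are hypothetical.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a data chunk."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

balanced_chunk = [0, 1] * 50       # 50/50 split: maximum entropy for two classes
skewed_chunk = [0] * 95 + [1] * 5  # 95/5 split: much lower entropy
h_balanced = class_entropy(balanced_chunk)
h_skewed = class_entropy(skewed_chunk)
```

A chunk whose entropy falls well below the two-class maximum of 1 bit is strongly imbalanced, which is the kind of signal an entropy-based strategy can act on.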
ml_data_test_detection_bank_transaction_frauds_unbalanced
huggingface.co
Updated Jun 19, 2023
Cite
Roberto Armas (2023). ml_data_test_detection_bank_transaction_frauds_unbalanced [Dataset]. https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced
Dataset updated
Jun 19, 2023
Authors
Roberto Armas
Description
ML Data Test Detection Bank Transaction Frauds Unbalanced

The project provides a quick and accessible dataset designed for learning and experimenting with machine learning algorithms, specifically in the context of detecting fraudulent bank transactions. It is intended for practicing and applying concepts such as Random Forest, Support Vector Machines (SVM), and Synthetic Minority Over-sampling Technique (SMOTE) to address unbalanced classification problems. Note: This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/roberto-armas/ml_data_test_detection_bank_transaction_frauds_unbalanced.
Acoustic features as a tool to visualize and explore marine soundscapes:...
data.niaid.nih.gov
datadryad.org
zip
Updated Feb 15, 2024
Cite
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2024). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.3bk3j9kn8
Dataset updated
Feb 15, 2024
Dataset provided by
Fisheries and Oceans Canada
University of Parma
Memorial University of Newfoundland
Authors
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Description
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low and high-frequency sounds could not be classified using this approach. The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales. The datasets and scripts provided in this repository allow replicating the results presented in the publication.

Methods

Data acquisition and preparation

We collected all records available in the Watkins Marine Mammal Database website listed under the “all cuts” page. For each audio file in the WMD the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata.
We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).

The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada), in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction

The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4800 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.

UMAP ordination and visualization

UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots. The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish).
Each point in a UMAP plot in this paper represents an audio sample with duration of ~ 1 second (WMD dataset), ~ 5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm’s default parameters.
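The 30-minute averaging of the ~5 s features mentioned in the feature-extraction step can be sketched as a simple windowed mean. With 4800 ms frames, one 30-minute interval holds exactly 375 frames. The code below is an illustrative sketch, not the authors' script; the function name is hypothetical and the toy vectors use 3 features instead of 128 to stay readable.

```python
def average_windows(frames, frames_per_window):
    """Average consecutive feature vectors in non-overlapping windows.
    A trailing partial window is dropped."""
    averaged = []
    for start in range(0, len(frames) - frames_per_window + 1, frames_per_window):
        window = frames[start:start + frames_per_window]
        # zip(*window) iterates over feature columns across the window.
        averaged.append([sum(col) / len(window) for col in zip(*window)])
    return averaged

FRAME_MS = 4800                # duration of one ~5 s feature frame
WINDOW_MS = 30 * 60 * 1000     # 30 minutes in milliseconds
frames_per_window = WINDOW_MS // FRAME_MS

# Toy input: one hour of frames with 3 features each (assumption for brevity).
frames = [[float(i), 1.0, 0.0] for i in range(750)]
half_hour_features = average_windows(frames, frames_per_window)
```

Each averaged vector can then be paired with the environmental measurement recorded for the same half-hour interval.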
Labelling sound sources
The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata. For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed from an oceanographic buoy located in proximity of the recorder (Fig 1). We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to the presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings.
We reserved the recordings collected in August to test the precision of the final predictive model.
Label prediction performance
We used Balanced Random Forest (BRF) models provided in the imbalanced-learn Python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF because it is suited for datasets characterized by class imbalance. The BRF algorithm undersamples the majority class prior to prediction, which helps overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets. The training datasets were used to fine-tune the models through a nested k-fold cross-validation approach with ten folds in the outer loop and five folds in the inner loop. We selected nested cross-validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
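The idea of undersampling the majority class, which the text attributes to BRF, can be illustrated with a small standard-library sketch. It is not the imbalanced-learn implementation and does not reproduce how that package draws samples for each tree; the function name `balanced_undersample` and the "absent"/"present" labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def balanced_undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset of `labels`.

    Every class is randomly reduced (without replacement) to the size of
    the smallest class, so larger classes lose samples and the smallest
    class is kept whole.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n_min))
    return sorted(kept)


# Toy usage: 8 "absent" samples and 2 "present" samples.
labels = ["absent"] * 8 + ["present"] * 2
subset = balanced_undersample(labels)
```

Here `subset` holds four indices: both "present" samples (indices 8 and 9) and two randomly chosen "absent" samples. In a forest, a fresh balanced subset would typically be drawn for each tree so that different majority-class samples are seen across the ensemble.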
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models, and they can be categorized as data-level, algorithm-level, and hybrid methods. To the best of our knowledge, however, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed this shortcoming in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall (AUPR) curve, (b) the internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek. We generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, percentage improvements of up to 375, 33.33, and 450 can be achieved across all the data sets for F1 score, MCC, and balanced accuracy, respectively, which are suitable metrics for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, percentage improvements of up to 383.33, 37.25, and 533.33 can be achieved across all the data sets for F1 score, MCC, and balanced accuracy, respectively; (iv) the general pattern for balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of ML and AutoML; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
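To make the three metrics named above concrete, the following sketch computes F1 score, MCC, and balanced accuracy from confusion-matrix counts and then scans a grid of decision thresholds. The scan is a plain grid search chosen for illustration; it is not the GHOST procedure or the AUPR-based method used in the study, and the function names and toy scores are assumptions.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    return (tpr + tnr) / 2


def best_threshold(y_true, scores, metric, thresholds):
    """Grid-search the threshold that maximizes `metric` (first best wins)."""
    best_t, best_value = None, float("-inf")
    for t in thresholds:
        y_pred = [1 if s >= t else 0 for s in scores]
        value = metric(*confusion(y_true, y_pred))
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value


# Toy data with a class ratio of 0.2 (2 positives out of 10 samples).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.45]
grid = [0.1, 0.2, 0.3, 0.4, 0.5]
chosen = best_threshold(y_true, scores, balanced_accuracy, grid)
```

With these toy scores, the default threshold of 0.5 predicts every sample as negative, giving a balanced accuracy of 0.5 and an F1 score of 0. The grid search selects 0.3, where balanced accuracy rises to 0.9375. Because the scores themselves do not change, metrics that depend only on their ranking are unaffected by the choice of threshold, which is consistent with finding (i).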
The Datasets for training and testing algorithms.
figshare.com
xls
Updated Jun 8, 2023
Cite
Mingyang Deng; Yingshi Guo; Chang Wang; Fuwei Wu (2023). The Datasets for training and testing algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0259227.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0259227.t002
Dataset updated
Jun 8, 2023
Dataset provided by
PLOS ONE
Authors
Mingyang Deng; Yingshi Guo; Chang Wang; Fuwei Wu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Datasets for training and testing algorithms.
Synthetic data generation using the MC technique where the mean and...
figshare.com
plos.figshare.com
xls
Updated Jun 14, 2023
Cite
Sambandh Bhusan Dhal; Muthukumar Bagavathiannan; Ulisses Braga-Neto; Stavros Kalafatis (2023). Synthetic data generation using the MC technique where the mean and covariance matrices are not shared between the classes. [Dataset]. http://doi.org/10.1371/journal.pone.0269401.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0269401.t003
Dataset updated
Jun 14, 2023
Dataset provided by
PLOS ONE
Authors
Sambandh Bhusan Dhal; Muthukumar Bagavathiannan; Ulisses Braga-Neto; Stavros Kalafatis
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synthetic data generation using the MC technique where the mean and covariance matrices are not shared between the classes.
Performance of linear discriminant analysis (LDA) on the Fraud_Detection...
figshare.com
xls
Updated Jun 14, 2023
Cite
Misuk Kim; Kyu-Baek Hwang (2023). Performance of linear discriminant analysis (LDA) on the Fraud_Detection dataset with and without the four sampling methods. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0271260.t004
Dataset updated
Jun 14, 2023
Dataset provided by
PLOS ONE
Authors
Misuk Kim; Kyu-Baek Hwang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of linear discriminant analysis (LDA) on the Fraud_Detection dataset with and without the four sampling methods.
Performance of linear discriminant analysis (LDA) on the Letter_a dataset...
plos.figshare.com
xls
Updated Jun 16, 2023
Cite
Misuk Kim; Kyu-Baek Hwang (2023). Performance of linear discriminant analysis (LDA) on the Letter_a dataset with and without the four sampling methods. [Dataset]. http://doi.org/10.1371/journal.pone.0271260.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0271260.t003
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Misuk Kim; Kyu-Baek Hwang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance of linear discriminant analysis (LDA) on the Letter_a dataset with and without the four sampling methods.
Table 1_Impact of a multiple oversampling technique-based assessment...
frontiersin.figshare.com
docx
Updated Jan 20, 2025
Cite
Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang (2025). Table 1_Impact of a multiple oversampling technique-based assessment framework on shallow rockburst prediction models.docx [Dataset]. http://doi.org/10.3389/feart.2024.1514591.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feart.2024.1514591.s001
Dataset updated
Jan 20, 2025
Dataset provided by
Frontiers
Authors
Guozhu Rao; Yunzhang Rao; Yangjun Xie; Qiang Huang; Jiazheng Wan; Jiyong Zhang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Class-imbalanced datasets are common in natural science research, which makes it important to handle them effectively when constructing accurate models for rockburst prediction. First, genuine rockburst incidents within a burial depth of 500 m were sourced from the literature, revealing a small, imbalanced dataset. Using various mainstream oversampling techniques, the dataset was expanded to generate six new datasets, which were subsequently subjected to 12 classifiers across 84 classification processes. Combining the highest-scoring model from the original dataset with the top two models from the expanded datasets yielded a high-performance model. Findings indicate that the KMeansSMOTE oversampling technique exhibits the most substantial enhancement across the combined 12 classifiers, whereas individual classifiers favor ET+SVMSMOTE and RF+SMOTENC. Following multiple rounds of hyperparameter adjustment via random cross-validation, the ET+SVMSMOTE combination attained the highest accuracy rate of 93.75%, surpassing mainstream models for rockburst prediction. Moreover, the SVMSMOTE technique, which augments samples of the under-represented categories, demonstrated notable benefits in mitigating overfitting, enhancing generalization, and improving Recall and F1 score within RF classifiers. The approach was validated for its high generalization performance, accuracy, and reliability, and it also provides an efficient framework for model development.
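The oversampling variants named above (KMeansSMOTE, SVMSMOTE, SMOTENC) all build on the basic SMOTE idea of creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that basic interpolation step using the standard library; it is not any of the named variants and not the study's code, and the function name `smote_like` and the toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic points by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats), at least two.
    k:        number of nearest minority neighbours to choose from.
    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy usage: three minority points, five synthetic samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already spanned by the minority class. The variants differ mainly in how they choose which minority samples to interpolate from.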