74 datasets found

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
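
Random Undersampling (RUS), named in the abstract above, can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function name `random_undersample`, the fixed seed, and the toy data are assumptions made for illustration, and the sketch simply shrinks every class to the size of the rarest one.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the rarest class in size.

    Hypothetical helper for illustration; not taken from the study.
    """
    rng = random.Random(seed)
    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # The rarest class sets the target size for all classes.
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy data (made up): 8 majority samples (label 0) and 2 minority samples (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

After resampling, both classes have two samples each; every minority sample is kept and only majority samples are discarded.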

Data from: Imbalanced dataset for benchmarking

data.niaid.nih.gov
zenodo.org

Updated Jan 24, 2020

Cite

Lemaitre, Guillaume (2020). Imbalanced dataset for benchmarking [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_61452

Explore at:

Dataset updated

Jan 24, 2020

Dataset provided by

Oliveira, Dayvid V. R.
Aridas, Christos K.
Lemaitre, Guillaume
Nogueira, Fernando

License

Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically

Description

Imbalanced dataset for benchmarking

The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which are more or less balanced. This benchmark was proposed in [1]. The following section presents the main characteristics of this benchmark.

Characteristics

ID	Name	Repository & Target	Ratio	# samples	# features
1	Ecoli	UCI, target: imU	8.6:1	336	7
2	Optical Digits	UCI, target: 8	9.1:1	5,620	64
3	SatImage	UCI, target: 4	9.3:1	6,435	36
4	Pen Digits	UCI, target: 5	9.4:1	10,992	16
5	Abalone	UCI, target: 7	9.7:1	4,177	8
6	Sick Euthyroid	UCI, target: sick euthyroid	9.8:1	3,163	25
7	Spectrometer	UCI, target: >=44	11:1	531	93
8	Car_Eval_34	UCI, target: good, v good	12:1	1,728	6
9	ISOLET	UCI, target: A, B	12:1	7,797	617
10	US Crime	UCI, target: >0.65	12:1	1,994	122
11	Yeast_ML8	LIBSVM, target: 8	13:1	2,417	103
12	Scene	LIBSVM, target: >one label	13:1	2,407	294
13	Libras Move	UCI, target: 1	14:1	360	90
14	Thyroid Sick	UCI, target: sick	15:1	3,772	28
15	Coil_2000	KDD, CoIL, target: minority	16:1	9,822	85
16	Arrhythmia	UCI, target: 06	17:1	452	279
17	Solar Flare M0	UCI, target: M->0	19:1	1,389	10
18	OIL	UCI, target: minority	22:1	937	49
19	Car_Eval_4	UCI, target: vgood	26:1	1,728	6
20	Wine Quality	UCI, wine, target: <=4	26:1	4,898	11
21	Letter Img	UCI, target: Z	26:1	20,000	16
22	Yeast_ME2	UCI, target: ME2	28:1	1,484	8
23	Webpage	LIBSVM, w7a, target: minority	33:1	49,749	300
24	Ozone Level	UCI, ozone, data	34:1	2,536	72
25	Mammography	UCI, target: minority	42:1	11,183	6
26	Protein homo.	KDD CUP 2004, minority	111:1	145,751	74
27	Abalone_19	UCI, target: 19	130:1	4,177	8
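
The Ratio column above reads as majority-to-minority sample counts. A minimal sketch of how such a ratio could be computed from a label list follows; the helper names and the toy labels are assumptions for illustration and are not part of the imbalanced-learn toolbox.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return the size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def format_ratio(labels):
    """Format the ratio in the 'N:1' style used in the table (one decimal place)."""
    return f"{imbalance_ratio(labels):.1f}:1"

# Toy labels (made up): 43 samples of the rest class, 5 of the target class.
labels = ["other"] * 43 + ["target"] * 5
print(format_ratio(labels))  # 8.6:1
```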

References

[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

Data from: High impact bug report identification with imbalanced learning...
researchdata.smu.edu.sg
zip
Updated Jun 1, 2023
Cite
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.25440/smu.12062763.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies" and the full-text is available from: https://ink.library.smu.edu.sg/sis_research/3702

In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs are used to refer to the bugs which appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or break pre-existing functionalities and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damages they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.

Supplementary code and data available from GitHub:
imbalanced data
ieee-dataport.org
Updated Dec 14, 2022
Cite
ZHI WANG (2022). imbalanced data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data
Explore at:
Dataset updated
Dec 14, 2022
Authors
ZHI WANG
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset file is used for the study of imbalanced data and contains 6 imbalanced datasets.
Data from: Addressing Imbalanced Classification Problems in Drug Discovery...
acs.figshare.com
zip
Updated Apr 15, 2025
Cite
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das (2025). Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML [Dataset]. http://doi.org/10.1021/acs.jcim.5c00023.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.5c00023.s001
Dataset updated
Apr 15, 2025
Dataset provided by
ACS Publications
Authors
Ayush Garg; Narayanan Ramamurthi; Shyam Sundar Das
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek, and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.

The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques outperform the others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
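
The three metrics named above (F1 score, MCC, and balanced accuracy) can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `imbalance_metrics` and the toy counts are assumptions for illustration, not values from the study, and the sketch assumes no denominator is zero.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Compute accuracy, F1, balanced accuracy, and MCC from confusion-matrix counts.

    Assumes every denominator is non-zero (each class is present and predicted).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }

# Toy counts (made up): 10 positives and 90 negatives, half of the positives missed.
scores = imbalance_metrics(tp=5, fp=5, fn=5, tn=85)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case plain accuracy is 0.9 even though half of the minority class is misclassified, while F1 (0.5), balanced accuracy (about 0.722), and MCC (about 0.444) are all visibly lower.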
Industrial Benchmark Dataset for Customer Escalation Prediction
zenodo.org
opendatalab.com
bin
Updated Sep 6, 2021
Cite
An Nguyen; Stefan Foerstel; Thomas Kittler; Andrey Kurzyukov; Leo Schwinn; Dario Zanca; Tobias Hipp; Sun Da Jun; Michael Schrapp; Eva Rothgang; Bjoern Eskofier (2021). Industrial Benchmark Dataset for Customer Escalation Prediction [Dataset]. http://doi.org/10.5281/zenodo.4383145
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.4383145
Dataset updated
Sep 6, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
An Nguyen; Stefan Foerstel; Thomas Kittler; Andrey Kurzyukov; Leo Schwinn; Dario Zanca; Tobias Hipp; Sun Da Jun; Michael Schrapp; Eva Rothgang; Bjoern Eskofier
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.

The dataset accompanies the publication "System Design for a Data-driven and Explainable Customer Sentiment Monitor" (submitted). We provide an anonymized version of data collected over a period of two years.

The dataset should fuel the research and development of new machine learning algorithms to better cope with real-world data challenges including sparse and noisy labels, and concept drifts. An additional challenge is the optimal fusion of enterprise and log-based features for the prediction task. The interpretability of the designed prediction models should also be ensured so that they have practical relevance.

Supporting software

Kindly use the corresponding GitHub repository (https://github.com/annguy/customer-sentiment-monitor) to design and benchmark your algorithms.

Citation and Contact

If you use this dataset please cite the following publication:

@ARTICLE{9520354, author={Nguyen, An and Foerstel, Stefan and Kittler, Thomas and Kurzyukov, Andrey and Schwinn, Leo and Zanca, Dario and Hipp, Tobias and Jun, Sun Da and Schrapp, Michael and Rothgang, Eva and Eskofier, Bjoern}, journal={IEEE Access}, title={System Design for a Data-Driven and Explainable Customer Sentiment Monitor Using IoT and Enterprise Data}, year={2021}, volume={9}, number={}, pages={117140-117152}, doi={10.1109/ACCESS.2021.3106791}}

If you would like to get in touch, please contact an.nguyen@fau.de.
Imbalanced Data
ieee-dataport.org
Updated Aug 23, 2023
Cite
Blessa Binolin M (2023). Imbalanced Data [Dataset]. https://ieee-dataport.org/documents/imbalanced-data-0
Explore at:
Dataset updated
Aug 23, 2023
Authors
Blessa Binolin M
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification learning on non-stationary data must cope with changes in the data over time. The major problems are class imbalance and the high cost of labeling instances when drift occurs. Imbalance arises when the minority class has fewer samples than the majority class, and it leads to the misclassification of data points.
Lending Club Loan Data
kaggle.com
Updated Nov 8, 2020
Cite
Sweta Shetye (2020). Lending Club Loan Data [Dataset]. https://www.kaggle.com/swetashetye/lending-club-loan-data-imbalance-dataset/code
Explore at:
Dataset updated
Nov 8, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sweta Shetye
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I wanted a highly imbalanced dataset to share with others, and this one is perfect for that.

Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).

For example, in this dataset there are far more samples of fully paid borrowers than of not fully paid borrowers.

Full LendingClub data available from their site.

Content

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. This dataset contains historical data from 2007 to 2015, which you can use to build a deep learning model to predict the chance of default for future loans. As you will see, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.
The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
Cite
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
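
The SMOTE step mentioned above creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below shows only that plain interpolation idea, not CRN-SMOTE or its cluster-based noise reduction; the function name `smote_like`, the seed, and the toy points are assumptions for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors.

    Illustrative sketch of the basic SMOTE idea; not a library implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (made-up 2-D points).
minority = [
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
    [0.9, 0.95], [1.15, 1.05], [1.05, 1.2],
]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, it always stays inside the bounding box of the minority class.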
Data from: Arabic news credibility on Twitter using sentiment analysis and...
zenodo.org
data.niaid.nih.gov
csv, txt
Updated Jun 3, 2023
Cite
Duha Samdani; Mounira Taileb; Nada Almani (2023). Arabic news credibility on Twitter using sentiment analysis and ensemble learning [Dataset]. http://doi.org/10.5281/zenodo.8000717
Explore at:
Available download formats: csv, txt
Unique identifier
https://doi.org/10.5281/zenodo.8000717
Dataset updated
Jun 3, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Duha Samdani; Mounira Taileb; Nada Almani
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Arabic news credibility on Twitter using sentiment analysis and ensemble learning.

WHAT IS IT?

-----------

An Arabic news credibility model for Twitter using sentiment analysis and ensemble learning.

Here we include the collected dataset and the source code of the proposed model, written in Python using the Keras library with the TensorFlow backend.

Required Packages

------------------

Keras (https://keras.io/).

Scikit-learn (http://scikit-learn.org/)

Imblearn (imbalanced-learn, version 0.10.1)

To Run the model

---------------

One data file is required to run the model:

The data used is the collected dataset in the file; set the path of the required data file in the code.

The dataset

---------------

There is a dataset file with all features; you can choose the features that you need and apply them to the model.

There is a description file that describes each feature in the news credibility dataset.

The file Tweet_ID contains the list of tweet IDs in the dataset.

The annotated replies based on credibility are provided.

CONTACTS

--------

If you want to report bugs or have general queries email to
Predict students' dropout and academic success
zenodo.org
explore.openaire.eu
Updated Mar 14, 2023
Cite
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5777340
Dataset updated
Mar 14, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
Description
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública" under grant POCI-05-5762-FSE-000191, Portugal.
Youtube Videos Dataset (~3400 videos)
opendatabay.com
Updated Jun 10, 2025
Cite
Datasimple (2025). Youtube Videos Dataset (~3400 videos) [Dataset]. https://www.opendatabay.com/data/ai-ml/fef9b558-dda7-42c6-83e3-048d99e5135b
Explore at:
Dataset updated
Jun 10, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
YouTube, Social Media and Networking
Description
Context 📃 I wanted to practice text classification using NLP techniques, so I thought why not practice it by generating the data myself! This way, I brushed up on my scraping techniques using Selenium, collected the data, cleaned it, and then started working on it. You can take a peek at my work in the GitHub repository for this dataset and the trained models/results.

Content 📰 The total number of videos scraped was 3600. I scraped the following things from each video:

link title description category Video ID Category for which the video was scraped Description of the video Category for which the video was scraped. I queried the videos for 4 categories:

Travel Vlogs 🧳 Food 🥑 Art and Music 🎨 🎻 History 📜

Acknowledgements 🙏 I could have used a ready made API, but just for the fun of it, I scraped the data from Youtube using Selenium.

Inspiration 🦋 The data is not clean (for your enjoyment of cleaning the data!), has some missing values, and is imbalanced. To practice text classification on this dataset, you will have to learn different techniques, for example how to handle imbalanced classes. While working on this dataset, you will learn a lot of different things and also get an opportunity to apply them.

Original Data Source: Youtube Videos Dataset (~3400 videos)
Data from: S1 Dataset -
figshare.com
xlsx
Updated Dec 13, 2024
Cite
JiaMing Gong; MingGang Dong (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0311133.s001
Dataset updated
Dec 13, 2024
Dataset provided by
PLOS ONE
Authors
JiaMing Gong; MingGang Dong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance or concept drift individually, and only a few have considered these issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) to consider data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balanced strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in the information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method to improve the accuracy of classifying minority class samples into high quality samples and common samples via the density of similar samples. In this manner, high quality and common samples are randomly selected for training the classifier. Finally, to solve the issue of concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of the sub-classifier according to the performance on the arrived data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms considering four synthetic and one real-world data streams.
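
The description above relies on information entropy to judge how balanced a data chunk is. The sketch below is not the EDAC algorithm; it only computes the Shannon entropy of the class labels in a chunk, which is one common way to quantify balance. The function name `class_entropy` and the toy chunks are assumptions for illustration.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in one data chunk."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy chunks (made up): one perfectly balanced, one with a 7:1 imbalance.
balanced_chunk = [0, 1] * 4
imbalanced_chunk = [0] * 7 + [1]

print(class_entropy(balanced_chunk))    # 1.0 for two equally frequent classes
print(class_entropy(imbalanced_chunk))  # lower, because one class dominates
```

For two classes the entropy peaks at 1 bit when the classes are equally frequent and falls toward 0 as one class dominates, so a low value flags an imbalanced chunk.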
Focal Dual Contrastive Learning for an imbalanced Chinese anesthesia dataset...
zenodo.org
Updated Aug 27, 2024
Cite
Zenodo (2024). Focal Dual Contrastive Learning for an imbalanced Chinese anesthesia dataset [Dataset]. http://doi.org/10.5281/zenodo.13378270
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13378270
Dataset updated
Aug 27, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2024
Description
This is a Chinese anesthesia dataset specifically for anesthesia risk prediction and ASA grading, which contains more than 10,000 real records. If you need it, you can apply for it and tell us the purpose and reason for your use. We will provide it to you free of charge under reasonable circumstances.
Lending Club Loan Data Analysis - Deep Learning
kaggle.com
Updated Aug 9, 2023
Cite
Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning
Explore at:
Dataset updated
Aug 9, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Deependra Verma
Description
DESCRIPTION

Create a model that predicts whether or not a loan will default using the historical data.

Problem Statement:

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using the historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.

Domain: Finance

Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

Content:

Dataset columns and definition:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

installment: The monthly installments owed by the borrower if the loan is funded.

log.annual.inc: The natural log of the self-reported annual income of the borrower.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

fico: The FICO credit score of the borrower.

days.with.cr.line: The number of days the borrower has had a credit line.

revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Steps to perform:

Perform exploratory data analysis and then apply feature engineering. Follow up with a deep learning model to predict whether or not the loan will default, using the historical data.

Tasks:

Feature Transformation

Transform categorical values into numerical values (discrete)

Exploratory data analysis of different factors of the dataset.

Additional Feature Engineering

You will check the correlation between features and will drop those features which have a strong correlation

This will help reduce the number of features and will leave you with the most relevant features

Modeling

After applying EDA and feature engineering, you are now ready to build the predictive models

In this part, you will create a deep learning model using Keras with a TensorFlow backend.
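
The first two tasks above (turning the categorical `purpose` column into numbers, then dropping strongly correlated features) can be sketched in plain Python. This is only an illustration: the helper names, the one-hot encoding choice, and the 0.9 correlation threshold are assumptions and are not specified by the dataset description.

```python
from math import sqrt

# Category values as listed in the column definitions above.
PURPOSES = ["credit_card", "debt_consolidation", "educational",
            "major_purchase", "small_business", "all_other"]

def one_hot_purpose(value):
    """Map a `purpose` string to six 0/1 indicators, one per category."""
    if value not in PURPOSES:
        raise ValueError(f"unknown purpose: {value!r}")
    return [1 if value == p else 0 for p in PURPOSES]

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / sqrt(vx * vy)

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is below threshold.

    `columns` maps a feature name to its list of values; the threshold is an
    illustrative assumption.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

if __name__ == "__main__":
    print(one_hot_purpose("educational"))
    toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
    print(drop_correlated(toy))
```

The filter is order-dependent: of two strongly correlated columns, the one listed first is kept.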
Replication Data: Leveraging Researcher Domain Expertise to Annotate...
search.dataone.org
Updated Nov 8, 2023
Markus, Dror (2023). Replication Data: Leveraging Researcher Domain Expertise to Annotate Concepts within Imbalanced Data [Dataset]. http://doi.org/10.7910/DVN/IEX083
Unique identifier
https://doi.org/10.7910/DVN/IEX083
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Markus, Dror
Description
In this manuscript, we describe a method to utilize researcher domain expertise to annotate concepts efficiently and accurately within an imbalanced dataset. This folder contains two scripts that run two variations of the simulation referred to in our paper. Additionally, we included two separate datasets that were utilized in the simulations. For each, we shared the list of document embeddings used for classification, together with a corresponding CSV which holds the categorical labels for each embedding. We recommend first reading the "README" text file, before running the scripts.
Acoustic features as a tool to visualize and explore marine soundscapes:...
data.niaid.nih.gov
datadryad.org
zip
Updated Feb 15, 2024
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2024). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.3bk3j9kn8
Dataset updated
Feb 15, 2024
Dataset provided by
Fisheries and Oceans Canada
Memorial University of Newfoundland
University of Parma
Authors
Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
License
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Description
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low- and high-frequency sounds could not be classified using this approach. The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales. The datasets and scripts provided in this repository allow replicating the results presented in the publication.

Methods

Data acquisition and preparation

We collected all records available in the Watkins Marine Mammal Database (WMD) website listed under the "all cuts" page. For each audio file in the WMD, the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata.
We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae), and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).

The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada), in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction

The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the YouTube-8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4,800 ms); ~1 min (59,520 ms); ~5 min (299,520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.

UMAP ordination and visualization

UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots. The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish).
Each point in a UMAP plot in this paper represents an audio sample with a duration of ~1 second (WMD dataset), ~5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm's default parameters.
Labelling sound sources

The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata. For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed from an oceanographic buoy located in proximity to the recorder (Fig 1). We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2).

Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings.
We reserved the recordings collected in August to test the precision of the final predictive model.

Label prediction performance

We used Balanced Random Forest models (BRF) provided in the imbalanced-learn Python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF as the algorithm as it is suited for datasets characterized by class imbalance. The BRF algorithm performs undersampling of the majority class prior to prediction, which helps to overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets. The training datasets were used to fine-tune the models through a nested k-fold cross validation approach with ten folds in the outer loop, and five folds in the inner loop. We selected nested cross validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the 'n_estimators' hyperparameter, for which we tested
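
The splitting scheme described above (an 80/20 train/test split, then nested cross validation with ten outer and five inner folds on the training portion) can be sketched with the standard library alone. This only illustrates how the index sets nest; it is not the authors' script, it does not train a Balanced Random Forest, and the function names and the shuffling seed are assumptions.

```python
import random

def k_fold(indices, k, rng):
    """Yield (train, held_out) index lists for k roughly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield rest, folds[i]

def nested_cv_splits(n_samples, outer_k=10, inner_k=5, test_fraction=0.2, seed=0):
    """Return (test_indices, plan) for a hold-out split plus nested CV.

    Each plan entry is (outer_train, outer_val, inner_splits), where
    inner_splits holds the (train, val) pairs used for hyperparameter tuning.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    test, train = indices[:n_test], indices[n_test:]
    plan = []
    for outer_train, outer_val in k_fold(train, outer_k, rng):
        inner_splits = list(k_fold(outer_train, inner_k, rng))
        plan.append((outer_train, outer_val, inner_splits))
    return test, plan

if __name__ == "__main__":
    test, plan = nested_cv_splits(100)
    print(len(test), len(plan), len(plan[0][2]))
```

In this layout the inner folds are drawn only from each outer training set, so the outer validation fold and the held-out test set never influence hyperparameter selection.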
her2-binding-prediction
huggingface.co
Updated Aug 26, 2024
Alchemab Therapeutics (2024). her2-binding-prediction [Dataset]. https://huggingface.co/datasets/alchemab/her2-binding-prediction
Dataset updated
Aug 26, 2024
Dataset authored and provided by
Alchemab Therapeutics
Description
HER2 binding dataset

HER2-binding antibodies have been obtained from the GitHub repo for Mason et al. (2021). Labels for antibody sequences were generated using scripts in the above GitHub repo. The numbers of negatives and positives were balanced through random undersampling using imbalanced-learn, and sequences were deduplicated. The dataset has:

39108 antibodies in total
22779 antibodies after undersampling and deduplication
18223 in the training set, 2278 in the evaluation set…

See the full description on the dataset page: https://huggingface.co/datasets/alchemab/her2-binding-prediction.
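
As a rough illustration of the balancing step described above, the sketch below undersamples every class down to the size of the smallest one and then drops duplicate sequences. It is a plain-Python stand-in, not the imbalanced-learn API the dataset authors used; the function names, the toy sequences, and the order of the two steps are assumptions.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly keep as many items per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):
            balanced.append((sample, label))
    return balanced

def deduplicate(pairs):
    """Drop repeated sequences, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for sample, label in pairs:
        if sample not in seen:
            seen.add(sample)
            unique.append((sample, label))
    return unique

if __name__ == "__main__":
    # Toy sequences, purely illustrative.
    seqs = ["AAA", "AAC", "AAD", "AAE", "CCC", "CCD"]
    labels = [0, 0, 0, 0, 1, 1]
    print(deduplicate(random_undersample(seqs, labels, seed=1)))
```

Deduplicating after balancing can leave the classes slightly uneven again, which is one reason the order of the two steps matters in practice.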
mini-ImageNet-LT Dataset
paperswithcode.com
Updated Nov 9, 2021
Rahul Vigneswaran; Marc T. Law; Vineeth N. Balasubramanian; Makarand Tapaswi (2021). mini-ImageNet-LT Dataset [Dataset]. https://paperswithcode.com/dataset/mini-imagenet-lt
Dataset updated
Nov 9, 2021
Authors
Rahul Vigneswaran; Marc T. Law; Vineeth N. Balasubramanian; Makarand Tapaswi
Description
mini-ImageNet was proposed in "Matching Networks for One Shot Learning" for few-shot learning evaluation, in an attempt to have a dataset like ImageNet while requiring fewer resources. Similar to the statistics for CIFAR-100-LT with an imbalance factor of 100, we construct a long-tailed variant of mini-ImageNet that features all the 100 classes and an imbalanced training set with $N_1 = 500$ and $N_K = 5$ images. For evaluation, both the validation and test sets are balanced and contain 10K images, 100 samples for each of the 100 categories.
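
The description gives only the largest and smallest class sizes ($N_1 = 500$, $N_K = 5$, so an imbalance factor of 100). A minimal sketch of how the sizes in between could be generated is shown below, assuming the exponential decay profile commonly used for CIFAR-LT benchmarks; that profile, the rounding, and the function name are assumptions rather than something the description states.

```python
def long_tailed_counts(n_classes=100, n_max=500, imbalance_factor=100):
    """Per-class training set sizes decaying exponentially from n_max.

    Class k (0-based) gets n_max * (1 / imbalance_factor) ** (k / (n_classes - 1))
    images, rounded to the nearest integer.
    """
    if n_classes < 2:
        raise ValueError("need at least two classes")
    ratio = 1.0 / imbalance_factor
    return [round(n_max * ratio ** (k / (n_classes - 1))) for k in range(n_classes)]

if __name__ == "__main__":
    counts = long_tailed_counts()
    print(counts[0], counts[-1], sum(counts))
```

With the defaults, the first class keeps 500 images and the last keeps 5, matching the two sizes quoted in the description.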
‘Oil Spill’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 14, 2022
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Oil Spill’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-oil-spill-7d4d/latest
Dataset updated
Feb 14, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Analysis of ‘Oil Spill’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ashrafkhan94/oil-spill on 14 February 2022.

--- Dataset description provided by original source is as follows ---

In this project, we will use a standard imbalanced machine learning dataset referred to as the oil spill dataset, oil slicks dataset or simply oil. The dataset was introduced in the 1998 paper by Miroslav Kubat, et al. titled Machine Learning for the Detection of Oil Spills in Satellite Radar Images. The dataset is often credited to Robert Holte, a co-author of the paper. The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not. Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.

--- Original source retains full ownership of the source dataset ---

Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX

Available download formats: docx

Unique identifier

https://doi.org/10.3389/fninf.2021.715421.s002

Dataset updated

Jun 1, 2023

Dataset provided by

Frontiers

Authors

Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.

Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.

Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.

Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
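
Three of the metrics named in the Methods (F-measure, Geometric Mean, and Balanced Accuracy) can be computed directly from the counts of a binary confusion matrix. The sketch below uses the definitions most common in the imbalanced-learning literature (F1 as the F-measure, and the geometric mean of sensitivity and specificity); the abstract does not spell out its exact formulas, so these are assumptions, as is the function name.

```python
from math import sqrt

def imbalance_metrics(tp, fp, tn, fn):
    """Return F-measure, G-mean and balanced accuracy from confusion counts.

    tp/fn count minority (positive) samples, tn/fp count majority samples.
    """
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = 0.0
    return {
        "f_measure": f_measure,
        "g_mean": sqrt(sensitivity * specificity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

if __name__ == "__main__":
    # Toy counts: 10 positives and 90 negatives.
    print(imbalance_metrics(tp=8, fp=10, tn=80, fn=2))
```

Plain accuracy on the same toy counts would be 0.88, which hides the fact that more than half of the predicted positives are false alarms; that gap is the reason such metrics are preferred on imbalanced data.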
