Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification learning on non-stationary data must cope with dynamic changes over time. The major problems are class imbalance and the high cost of labeling instances in the presence of drift. Imbalance arises because the minority class contains far fewer samples than the majority class, and imbalanced data results in the misclassification of minority data points.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which exhibit varying degrees of class imbalance. This benchmark was proposed in [1]. The following table presents the main characteristics of the benchmark; a loading sketch follows the table.
ID | Name | Repository & Target | Ratio | # samples | # features |
---|---|---|---|---|---|
1 | Ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
2 | Optical Digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
3 | SatImage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
4 | Pen Digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
5 | Abalone | UCI, target: 7 | 9.7:1 | 4,177 | 8 |
6 | Sick Euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 25 |
7 | Spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
8 | Car_Eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 6 |
9 | ISOLET | UCI, target: A, B | 12:1 | 7,797 | 617 |
10 | US Crime | UCI, target: >0.65 | 12:1 | 1,994 | 122 |
11 | Yeast_ML8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
12 | Scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
13 | Libras Move | UCI, target: 1 | 14:1 | 360 | 90 |
14 | Thyroid Sick | UCI, target: sick | 15:1 | 3,772 | 28 |
15 | Coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
16 | Arrhythmia | UCI, target: 06 | 17:1 | 452 | 279 |
17 | Solar Flare M0 | UCI, target: M->0 | 19:1 | 1,389 | 10 |
18 | OIL | UCI, target: minority | 22:1 | 937 | 49 |
19 | Car_Eval_4 | UCI, target: vgood | 26:1 | 1,728 | 6 |
20 | Wine Quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
21 | Letter Img | UCI, target: Z | 26:1 | 20,000 | 16 |
22 | Yeast_ME2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
23 | Webpage | LIBSVM, w7a, target: minority | 33:1 | 49,749 | 300 |
24 | Ozone Level | UCI, ozone, data | 34:1 | 2,536 | 72 |
25 | Mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
26 | Protein homo. | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
27 | Abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 8 |
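For reference, the benchmark can be pulled programmatically with imbalanced-learn's `fetch_datasets` helper. A minimal sketch; the dataset keys are assumed to follow the toolbox's lowercase naming (e.g. "ecoli", "abalone_19"):

```python
from collections import Counter

from imblearn.datasets import fetch_datasets

# Download (and cache) the 27 benchmark datasets listed above.
datasets = fetch_datasets()
ecoli = datasets["ecoli"]
print(ecoli.data.shape)        # (336, 7)
print(Counter(ecoli.target))   # minority labeled 1, majority -1 (roughly 8.6:1)
```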
[1] Ding, Zejin. "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, the isometric stratified ensemble (ISE) method applies a recursive decorrelated variable selection and considers the cardinal ratio of the classes to balance training sets, thus avoiding bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance on classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88, when only its highest-confidence isometric stratum is retained.
In this manuscript, we describe a method to utilize researcher domain expertise to annotate concepts efficiently and accurately within an imbalanced dataset. This folder contains two scripts that run two variations of the simulation referred to in our paper. Additionally, we included two separate datasets that were utilized in the simulations. For each, we shared the list of document embeddings used for classification, together with a corresponding CSV which holds the categorical labels for each embedding. We recommend first reading the "README" text file, before running the scripts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper generated a challenging two-dimensional imbalanced dataset, M_DATA2, based on the normal distribution; its scatter plot is shown in Figure 1. The dataset has a sample size of 1,000 and 2 feature variables, and the ratio between the numbers of samples in the two classes is 1:4. The minority-class samples are divided into four parts by the majority-class samples, with a certain mixed area, which makes it difficult for the GNB classifier to classify effectively.
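The paper's exact generator is not given; the following is an illustrative reconstruction under the stated properties (1,000 samples, 2 Gaussian features, 1:4 class ratio, minority class split into four clusters that interleave with the majority class):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Majority class: 800 samples from one broad Gaussian blob.
X_maj = rng.normal(loc=0.0, scale=2.0, size=(800, 2))

# Minority class: 4 x 50 samples from four tight Gaussians whose centers
# sit inside the majority region, creating the mixed (overlapping) areas.
centers = np.array([[-2, -2], [-2, 2], [2, -2], [2, 2]])
X_min = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

X = np.vstack([X_maj, X_min])
y = np.array([0] * 800 + [1] * 200)

# GaussianNB struggles here: a single Gaussian per class cannot model a
# four-cluster minority class embedded in the majority class.
print(GaussianNB().fit(X, y).score(X, y))
```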
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies"; the full text is available from: https://ink.library.smu.edu.sg/sis_research/3702
In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs, so they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs refer to bugs that appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or that break pre-existing functionality and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs among thousands of bug reports in a bug tracking system is not an easy feat. An automated technique that can identify high-impact bug reports can therefore help developers become aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, identifying high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy with one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms, and conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and by Garcia and Shihab. Supplementary code and data are available from GitHub.
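The two best-performing variants named above can be composed with imbalanced-learn pipelines. A sketch, assuming bug-report texts are already vectorized (e.g. TF-IDF) into `X_train`/`y_train`; parameters are illustrative, not the paper's:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# Surprise bugs: SMOTE over-sampling + K-nearest neighbours.
surprise_clf = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Breakage bugs: random under-sampling + naive Bayes.
breakage_clf = Pipeline([
    ("rus", RandomUnderSampler(random_state=42)),
    ("nb", MultinomialNB()),
])

# The imblearn Pipeline resamples only during fit, never at predict time:
#   surprise_clf.fit(X_train, y_train); surprise_clf.predict(X_test)
```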
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek; and we generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers, and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our study are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern for balanced accuracy is that the percentage improvement increases as the class ratio is systematically decreased from 0.5 to 0.1; for F1 score and MCC, the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML tools; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data-balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
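Three of the compared techniques can be sketched in a few lines with scikit-learn and imbalanced-learn. This is illustrative only: toy data stands in for the drug-discovery sets, and the parameters are not those of the study:

```python
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# Toy imbalanced data (10% positives) standing in for the study's sets.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# (b) internal balancing via class weights.
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)

# (c) external balancing via SMOTETomek (over-sample, then clean Tomek links).
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
rf_resampled = RandomForestClassifier(random_state=0).fit(X_res, y_res)

# (a) threshold optimization on the precision-recall curve (done on the
# training data here for brevity; the study would use held-out data).
proba = rf_weighted.predict_proba(X)[:, 1]
prec, rec, thr = precision_recall_curve(y, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
print("best threshold:", thr[np.argmax(f1[:-1])])
```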
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance or concept drift in isolation; only a few have considered the two issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy: it divides the data chunks into multiple balanced sample pairs based on the differences in information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method that improves the accuracy of classifying minority-class samples by splitting them into high-quality samples and common samples according to the density of similar samples; high-quality and common samples are then randomly selected for training the classifier. Finally, to address concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier, adjusting the weights of the sub-classifiers according to their performance on the arriving data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic and one real-world data streams.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
A stroke, also known as a cerebrovascular accident (CVA), occurs when part of the brain loses its blood supply and the part of the body that the blood-deprived brain cells control stops working. This loss of blood supply can be ischemic, because of lack of blood flow, or hemorrhagic, because of bleeding into brain tissue. A stroke is a medical emergency because strokes can lead to death or permanent disability. There are opportunities to treat ischemic strokes, but that treatment needs to be started in the first few hours after the signs of a stroke begin.
The cerebral stroke dataset consists of 12 features, including the target column, which is imbalanced.
Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), "Data for A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets", Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1. The dataset is sourced from this record.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Predicting customer churn is critical for companies to be able to effectively retain customers. It is more costly to acquire new customers than to retain existing ones. For this reason, large corporations are seeking to develop models to predict which customers are more likely to change and take actions accordingly.
Each row represents a customer; each column contains a customer attribute, as described in the column metadata.
The data set includes information about:
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The State Grid Corporation of China (SGCC) dataset, of which 1000 records were used in the model, is a key resource in the field of power distribution and management, offering a large and varied set of data about electricity transport and grid operations. It contains many different kinds of information, such as historical and real-time data on energy use, grid infrastructure, the integration of green energy, and grid performance. It is a key part of making power distribution networks more reliable and efficient by supporting tasks such as demand forecasting, grid monitoring, and fault detection. Researchers, energy providers, and lawmakers can use this information to learn important things about electricity usage trends, the health of the grid, and the integration of green energy sources, helping the electric power industry develop new data-driven strategies and ideas.
The electricity theft detection dataset released by the State Grid Corporation of China (SGCC), data set.csv, contains 1,037 columns and 42,372 rows of electricity consumption from 1 January 2014 to 30 October 2016. The first column is an alphanumeric consumer ID; columns 2 through 1,036 give the daily electricity consumption; the last column, named flag, holds the labels as 0/1 values. The small version of the dataset, datasetsmall.csv, contains only the electricity consumption for January 2014.
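Given that layout, loading and splitting the file is straightforward. A minimal sketch, assuming the file and column names above ("data set.csv", consumer ID first, daily readings, final "flag" label) are used verbatim:

```python
import pandas as pd

df = pd.read_csv("data set.csv")

consumer_id = df.iloc[:, 0]            # alphanumeric consumer ID
X = df.iloc[:, 1:-1]                   # columns 2..1,036: daily consumption
y = df["flag"]                         # 0 = normal, 1 = theft

print(X.shape)                         # expected (42372, 1035)
print(y.value_counts(normalize=True))  # check the class imbalance
```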
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Faces Dataset: PubFig05
This is a subset of the "PubFig83" dataset [1], which provides 100 images each of the 5 celebrities that are most difficult to recognise (each celebrity is a class in the classification problem). For each celebrity, we took 100 images and separated them into training and testing sets of 90 and 10 images, respectively:
Person: Jenifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba
Feature Extraction
To extract features from the images, we applied the HT-L3 model as described in [2] and obtained 25,600 features.
Feature Selection
The feature selection procedure, in brief, was as follows:
Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset, discarding features according to the minimum description length (MDL) principle; only 4,878 features passed this entropy-based filtering.
Class-Distribution Balancing: Next, we converted the dataset into binary-class problems by separating it into 5 binary-class datasets using a one-vs-all setup; these datasets were therefore imbalanced at a ratio of 1:4. We then converted them into balanced binary-class datasets using a random sub-sampling method (see the sketch below). Further processing of the dataset is described in the paper.
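An illustrative sketch of this one-vs-all conversion plus random sub-sampling (not the authors' exact code): each of the 5 classes becomes one balanced binary dataset.

```python
import numpy as np

def one_vs_all_balanced(X, y, positive_class, rng):
    """Return a balanced binary dataset for `positive_class` vs. the rest."""
    pos_idx = np.flatnonzero(y == positive_class)
    neg_idx = np.flatnonzero(y != positive_class)
    # Randomly sub-sample the (4x larger) negative side to match the positives.
    neg_idx = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_idx])
    return X[idx], (y[idx] == positive_class).astype(int)

rng = np.random.default_rng(0)
# datasets = [one_vs_all_balanced(X, y, c, rng) for c in range(5)]
```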
(alpha,beta)-k Feature Selection: To obtain a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4], which selects a minimum subset of features that maximises both within-class similarity and between-class dissimilarity. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the Table below) after consolidating them into a binary-class dataset.
UAB: We applied the (alpha,beta)-k feature selection method on each of the balanced binary-class datasets and took the union of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method on each of the binary-class datasets to get a set of features.
IAB: We applied the (alpha,beta)-k feature selection method on each of the balanced binary-class datasets and took the intersection of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method on each of the binary-class datasets to get a set of features.
UEAB: We applied the (alpha,beta)-k feature selection method on each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature selection method on each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset to get a set of features.
All of these datasets are inside the compressed folder. It also contains the document describing the process detail.
References
[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).
[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).
[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).
[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian Conference on Computer Science - Volume 38 (pp. 361–370). Australian Computer Society, Inc.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a factor of 590.
The dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user with 12 features, a treatment indicator, and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser's website during the test period (2 weeks). The global treatment ratio is 84.6%; it is usual for advertisers to keep only a small control population, as it costs them potential revenue. A sketch of a basic uplift estimate on this data appears below.
Following is a detailed description of the features:
The data provided for paper: "A Large Scale Benchmark for Uplift Modeling"
https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf
For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.
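As a sketch of how such data supports uplift estimation, here is a basic two-model ("T-learner") approach. The column names (f0..f11, treatment, conversion) are assumptions about the released CSV, and the file path is hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("criteo-uplift.csv")        # hypothetical local path
features = [f"f{i}" for i in range(12)]      # assumed feature column names

treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0]

# Fit one outcome model per arm of the randomized trial.
m_t = GradientBoostingClassifier().fit(treated[features], treated["conversion"])
m_c = GradientBoostingClassifier().fit(control[features], control["conversion"])

# Per-user uplift = P(conversion | treated) - P(conversion | control).
uplift = (m_t.predict_proba(df[features])[:, 1]
          - m_c.predict_proba(df[features])[:, 1])
```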
We can foresee related usages such as but not limited to:
The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. The categories belong to 13 super-categories, including Plantae (Plant), Insecta (Insect), Aves (Bird), and Mammalia (Mammal). The iNat dataset is highly imbalanced, with dramatically different numbers of images per category. For example, the largest super-category, Plantae (Plant), has 196,613 images from 2,101 categories, whereas the smallest super-category, Protozoa, has only 381 images from 4 categories.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure–activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
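A conceptual sketch of such an ensemble feature selector for imbalanced data: repeatedly draw class-balanced samples and aggregate the features chosen in each round. Mutual information stands in here for the paper's FCBF and fast clustering-based selectors (not available in scikit-learn), and balanced bootstraps stand in for boosting:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ensemble_feature_selection(X, y, n_rounds=10, k=50, seed=0):
    """Rank features by how often they are selected across balanced rounds."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1])
    pos = np.flatnonzero(y == 1)   # actives
    neg = np.flatnonzero(y == 0)   # inactives
    n = min(len(pos), len(neg))
    for _ in range(n_rounds):
        # Balanced bootstrap: equal numbers of actives and inactives.
        idx = np.concatenate([rng.choice(pos, n), rng.choice(neg, n)])
        sel = SelectKBest(mutual_info_classif, k=k).fit(X[idx], y[idx])
        votes += sel.get_support()
    return np.argsort(votes)[::-1][:k]   # most frequently selected features
```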
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description: Insurance Claims Prediction
Introduction: In the insurance industry, accurately predicting the likelihood of claims is essential for risk assessment and policy pricing. However, insurance claims datasets frequently suffer from class imbalance, where the number of non-claims instances far exceeds that of actual claims. This class imbalance poses challenges for predictive modeling, often leading to biased models favoring the majority class, resulting in subpar performance for the minority class, which is typically of greater interest.
Dataset Overview: The dataset utilized in this project comprises historical data on insurance claims, encompassing a variety of information about the policyholders, their demographics, past claim history, and other pertinent features. The dataset is structured to facilitate predictive modeling tasks aimed at accurately identifying the likelihood of future insurance claims.
Key Features:
1. Policyholder Information: This includes demographic details such as age, gender, occupation, marital status, and geographical location.
2. Claim History: Information regarding past insurance claims, including claim amounts, types of claims (e.g., medical, automobile), frequency of claims, and claim durations.
3. Policy Details: Details about the insurance policies held by the policyholders, such as coverage type, policy duration, premium amount, and deductibles.
4. Risk Factors: Variables indicating potential risk factors associated with policyholders, such as credit score, driving record (for automobile insurance), health status (for medical insurance), and property characteristics (for home insurance).
5. External Factors: Factors external to the policyholders that may influence claim likelihood, such as economic indicators, weather conditions, and regulatory changes.
Objective: The primary objective of utilizing this dataset is to develop robust predictive models capable of accurately assessing the likelihood of insurance claims. By leveraging advanced machine learning techniques, such as classification algorithms and ensemble methods, the aim is to mitigate the effects of class imbalance and produce models that demonstrate high predictive performance across both majority and minority classes.
Application Areas:
1. Risk Assessment: Assessing the risk associated with insuring a particular policyholder based on their characteristics and historical claim behavior.
2. Policy Pricing: Determining appropriate premium amounts for insurance policies by estimating the expected claim frequency and severity.
3. Fraud Detection: Identifying fraudulent insurance claims by detecting anomalous patterns in claim submissions and policyholder behavior.
4. Customer Segmentation: Segmenting policyholders into distinct groups based on their risk profiles and insurance needs to tailor marketing strategies and policy offerings.
Conclusion: The insurance claims dataset serves as a valuable resource for developing predictive models aimed at enhancing risk management, policy pricing, and overall operational efficiency within the insurance industry. By addressing the challenges posed by class imbalance and leveraging the rich array of features available, organizations can gain valuable insights into insurance claim likelihood and make informed decisions to mitigate risk and optimize business outcomes.
Feature | Description |
---|---|
policy_id | Unique identifier for the insurance policy. |
subscription_length | The duration for which the insurance policy is active. |
customer_age | Age of the insurance policyholder, which can influence the likelihood of claims. |
vehicle_age | Age of the vehicle insured, which may affect the probability of claims due to factors like wear and tear. |
model | The model of the vehicle, which could impact the claim frequency due to model-specific characteristics. |
fuel_type | Type of fuel the vehicle uses (e.g., Petrol, Diesel, CNG), which might influence the risk profile and claim likelihood. |
max_torque, max_power | Engine performance characteristics that could relate to the vehicle’s mechanical condition and claim risks. |
engine_type | The type of engine, which might have implications for maintenance and claim rates. |
displacement, cylinder | Specifications related to the engine size and construction, affec... |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Below is a draft DMP-style description of a credit-card fraud detection experiment:
Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.
Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.
Data Sources
We used the publicly available credit-card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284,807 transactions, of which 492 are fraudulent.
Method of Dataset Preparation
Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo's requirements.
Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets (training 70%, validation 15%, test 15%) using range-based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from "Y"/"N" to 1/0 and dropped non-feature identifiers (actionnr, merchant_id).
Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held-out test set (a condensed sketch follows).
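A condensed sketch of the cleaning, splitting, and modeling steps, assuming a local copy of the CSV at a hypothetical path (the canonical data lives in DBRepo, where the splits were derived with range-based filters on actionnr rather than randomly):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/creditcardfraud.csv")   # hypothetical local path

# Cleaning: map Y/N flags to 1/0, drop non-feature identifiers.
flags = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
df[flags] = (df[flags] == "Y").astype(int)
X = df.drop(columns=["actionnr", "merchant_id", "isfradulent"])
y = df["isfradulent"]

# Splitting: 70/15/15 train/validation/test (stratified random here).
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print(clf.score(X_val, y_val))   # tune here, then evaluate once on X_te
```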
Dataset Structure
The raw data is a single CSV with columns:
actionnr (integer transaction ID)
merchant_id (string)
average_amount_transaction_day (float)
transaction_amount (float)
is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
Naming Conventions
All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, and creditcard_test in DBRepo.
Files in the code repo follow a clear structure:
├── data/ # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/ # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json
Required Software
Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo‐client.py (DBRepo API)
requests (TU WRD API)
Additional Resources
Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook's dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs
Data Limitations
Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized PCA features (V1–V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.
Time‐bounded: only covers two days of transactions, may not capture seasonal patterns.
Licensing and Attribution
Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.
Recommended Uses
Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model‐training pipelines, FAIR data practices.
Extension: adding time‐series or deep‐learning models.
Known Issues
Possible temporal leakage if date/time features not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification of imbalanced datasets of animal behavior has been one of the top challenges in the field of animal science. An imbalanced dataset leads many classification algorithms to be less effective and results in a higher misclassification rate for the minority classes. The aim of this study was to assess a method for addressing the problem of imbalanced datasets of pigs' behavior by using an over-sampling method, namely Borderline-SMOTE. The pigs' activity was measured using a triaxial accelerometer mounted on the back of the pigs. Wavelet filtering and Borderline-SMOTE were both applied as methods to pre-process the dataset. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that wavelet filtering and Borderline-SMOTE both led to improved performance. Furthermore, Borderline-SMOTE yielded greater improvements in classification performance than an alternative method for balancing the training data, namely random under-sampling, which is commonly used in animal science research. However, the overall performance was not adequate to satisfy the research needs in this field and to address the common but urgent problem of imbalanced behavior datasets.
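A minimal sketch of this balancing step using imbalanced-learn's BorderlineSMOTE, with synthetic data standing in for the 21 accelerometer features and the four activity classes; parameters are illustrative, not the study's:

```python
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic 4-class stand-in for the 21-feature accelerometer data.
X, y = make_classification(n_samples=2000, n_features=21, n_informative=10,
                           n_classes=4, weights=[0.6, 0.2, 0.15, 0.05],
                           random_state=0)

# Over-sample borderline minority examples, then train the network.
X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X_res, y_res)   # apply resampling to the training folds only
```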
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Oil Spill’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ashrafkhan94/oil-spill on 14 February 2022.
--- Dataset description provided by original source is as follows ---
In this project, we will use a standard imbalanced machine learning dataset referred to as the oil spill dataset, oil slicks dataset or simply oil. The dataset was introduced in the 1998 paper by Miroslav Kubat, et al. titled Machine Learning for the Detection of Oil Spills in Satellite Radar Images. The dataset is often credited to Robert Holte, a co-author of the paper. The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not. Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.
--- Original source retains full ownership of the source dataset ---