Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification learning on non-stationary data must cope with dynamic changes over time. The major problems are class imbalance and the high cost of labeling instances in the presence of drift. Imbalance arises because the minority class contains far fewer samples than the majority class, and imbalanced data results in the misclassification of minority data points.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which exhibit varying degrees of class imbalance. This benchmark was proposed in [1]. The following table presents the main characteristics of the benchmark; a loading sketch follows the table.
ID | Name | Repository & Target | Ratio | # samples | # features |
---|---|---|---|---|---|
1 | Ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
2 | Optical Digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
3 | SatImage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
4 | Pen Digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
5 | Abalone | UCI, target: 7 | 9.7:1 | 4,177 | 8 |
6 | Sick Euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 25 |
7 | Spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
8 | Car_Eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 6 |
9 | ISOLET | UCI, target: A, B | 12:1 | 7,797 | 617 |
10 | US Crime | UCI, target: >0.65 | 12:1 | 1,994 | 122 |
11 | Yeast_ML8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
12 | Scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
13 | Libras Move | UCI, target: 1 | 14:1 | 360 | 90 |
14 | Thyroid Sick | UCI, target: sick | 15:1 | 3,772 | 28 |
15 | Coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
16 | Arrhythmia | UCI, target: 06 | 17:1 | 452 | 279 |
17 | Solar Flare M0 | UCI, target: M->0 | 19:1 | 1,389 | 10 |
18 | OIL | UCI, target: minority | 22:1 | 937 | 49 |
19 | Car_Eval_4 | UCI, target: vgood | 26:1 | 1,728 | 6 |
20 | Wine Quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
21 | Letter Img | UCI, target: Z | 26:1 | 20,000 | 16 |
22 | Yeast_ME2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
23 | Webpage | LIBSVM, w7a, target: minority | 33:1 | 49,749 | 300 |
24 | Ozone Level | UCI, ozone, data | 34:1 | 2,536 | 72 |
25 | Mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
26 | Protein homo. | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
27 | Abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 8 |
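For reference, the benchmark can be pulled programmatically with imbalanced-learn's `fetch_datasets` helper. A minimal sketch; the dataset keys are assumed to follow the toolbox's lowercase naming (e.g. "ecoli", "abalone_19"):

```python
from collections import Counter

from imblearn.datasets import fetch_datasets

# Download (and cache) the 27 benchmark datasets listed above.
datasets = fetch_datasets()
ecoli = datasets["ecoli"]
print(ecoli.data.shape)        # (336, 7)
print(Counter(ecoli.target))   # minority labeled 1, majority -1 (roughly 8.6:1)
```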
[1] Ding, Zejin. "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, the isometric stratified ensemble (ISE) method applies a recursive decorrelated variable selection and considers the cardinal ratio of the classes to balance training sets, thus avoiding bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance on classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88, when only its highest-confidence isometric stratum is retained.
In this manuscript, we describe a method to utilize researcher domain expertise to annotate concepts efficiently and accurately within an imbalanced dataset. This folder contains two scripts that run two variations of the simulation referred to in our paper. Additionally, we included two separate datasets that were utilized in the simulations. For each, we shared the list of document embeddings used for classification, together with a corresponding CSV which holds the categorical labels for each embedding. We recommend first reading the "README" text file, before running the scripts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper generated a challenging two-dimensional imbalanced dataset, M_DATA2, based on the normal distribution; its scatter plot is shown in Figure 1. The dataset has a sample size of 1,000 and 2 feature variables, and the ratio between the numbers of samples in the two classes is 1:4. The minority-class samples are divided into four parts by the majority-class samples, with a certain mixed area, which makes it difficult for the GNB classifier to classify effectively.
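The paper's exact generator is not given; the following is an illustrative reconstruction under the stated properties (1,000 samples, 2 Gaussian features, 1:4 class ratio, minority class split into four clusters that interleave with the majority class):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Majority class: 800 samples from one broad Gaussian blob.
X_maj = rng.normal(loc=0.0, scale=2.0, size=(800, 2))

# Minority class: 4 x 50 samples from four tight Gaussians whose centers
# sit inside the majority region, creating the mixed (overlapping) areas.
centers = np.array([[-2, -2], [-2, 2], [2, -2], [2, 2]])
X_min = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

X = np.vstack([X_maj, X_min])
y = np.array([0] * 800 + [1] * 200)

# GaussianNB struggles here: a single Gaussian per class cannot model a
# four-cluster minority class embedded in the majority class.
print(GaussianNB().fit(X, y).score(X, y))
```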
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies"; the full text is available from: https://ink.library.smu.edu.sg/sis_research/3702
In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs, so they often concentrate on bugs that are highly impactful. In the literature, high-impact bugs refer to bugs that appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or that break pre-existing functionality and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs among thousands of bug reports in a bug tracking system is not an easy feat. An automated technique that can identify high-impact bug reports can therefore help developers become aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, identifying high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy with one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms, and conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants have different performances, and the best performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and by Garcia and Shihab. Supplementary code and data are available from GitHub.
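The two best-performing variants named above can be composed with imbalanced-learn pipelines. A sketch, assuming bug-report texts are already vectorized (e.g. TF-IDF) into `X_train`/`y_train`; parameters are illustrative, not the paper's:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# Surprise bugs: SMOTE over-sampling + K-nearest neighbours.
surprise_clf = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Breakage bugs: random under-sampling + naive Bayes.
breakage_clf = Pipeline([
    ("rus", RandomUnderSampler(random_state=42)),
    ("nb", MultinomialNB()),
])

# The imblearn Pipeline resamples only during fit, never at predict time:
#   surprise_clf.fit(X_train, y_train); surprise_clf.predict(X_test)
```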
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek; and we generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers, and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our study are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern for balanced accuracy is that the percentage improvement increases as the class ratio is systematically decreased from 0.5 to 0.1; for F1 score and MCC, the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML tools; (vii) AutoML tools perform as well as the ML models, and in some cases even better, for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data-balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
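Three of the compared techniques can be sketched in a few lines with scikit-learn and imbalanced-learn. This is illustrative only: toy data stands in for the drug-discovery sets, and the parameters are not those of the study:

```python
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# Toy imbalanced data (10% positives) standing in for the study's sets.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# (b) internal balancing via class weights.
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)

# (c) external balancing via SMOTETomek (over-sample, then clean Tomek links).
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
rf_resampled = RandomForestClassifier(random_state=0).fit(X_res, y_res)

# (a) threshold optimization on the precision-recall curve (done on the
# training data here for brevity; the study would use held-out data).
proba = rf_weighted.predict_proba(X)[:, 1]
prec, rec, thr = precision_recall_curve(y, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
print("best threshold:", thr[np.argmax(f1[:-1])])
```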
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance or concept drift in isolation; only a few have considered the two issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy: it divides the data chunks into multiple balanced sample pairs based on the differences in information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method that improves the accuracy of classifying minority-class samples by splitting them into high-quality samples and common samples according to the density of similar samples; high-quality and common samples are then randomly selected for training the classifier. Finally, to address concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier, adjusting the weights of the sub-classifiers according to their performance on the arriving data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic and one real-world data streams.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
A stroke, also known as a cerebrovascular accident (CVA), occurs when part of the brain loses its blood supply and the part of the body that the blood-deprived brain cells control stops working. This loss of blood supply can be ischemic, because of lack of blood flow, or hemorrhagic, because of bleeding into brain tissue. A stroke is a medical emergency because strokes can lead to death or permanent disability. There are opportunities to treat ischemic strokes, but that treatment needs to be started in the first few hours after the signs of a stroke begin.
The cerebral stroke dataset consists of 12 features, including the target column, which is imbalanced.
Liu, Tianyu; Fan, Wenhui; Wu, Cheng (2019), "Data for A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical-datasets", Mendeley Data, V1, doi: 10.17632/x8ygrw87jw.1. The dataset is sourced from this record.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Predicting customer churn is critical for companies to be able to effectively retain customers. It is more costly to acquire new customers than to retain existing ones. For this reason, large corporations are seeking to develop models to predict which customers are more likely to change and take actions accordingly.
Each row represents a customer; each column contains a customer attribute, as described in the column metadata.
The data set includes information about:
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The State Grid Corporation of China (SGCC) dataset, of which 1000 records were used in the model, is a key resource in the field of power distribution and management, offering a large and varied set of data about electricity transport and grid operations. It contains many different kinds of information, such as historical and real-time data on energy use, grid infrastructure, the integration of green energy, and grid performance. It is a key part of making power distribution networks more reliable and efficient by supporting tasks such as demand forecasting, grid monitoring, and fault detection. Researchers, energy providers, and lawmakers can use this information to learn important things about electricity usage trends, the health of the grid, and the integration of green energy sources, helping the electric power industry develop new data-driven strategies and ideas.
The electricity theft detection dataset released by the State Grid Corporation of China (SGCC), data set.csv, contains 1,037 columns and 42,372 rows of electricity consumption from 1 January 2014 to 30 October 2016. The first column is an alphanumeric consumer ID; columns 2 through 1,036 give the daily electricity consumption; the last column, named flag, holds the labels as 0/1 values. The small version of the dataset, datasetsmall.csv, contains only the electricity consumption for January 2014.
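Given that layout, loading and splitting the file is straightforward. A minimal sketch, assuming the file and column names above ("data set.csv", consumer ID first, daily readings, final "flag" label) are used verbatim:

```python
import pandas as pd

df = pd.read_csv("data set.csv")

consumer_id = df.iloc[:, 0]            # alphanumeric consumer ID
X = df.iloc[:, 1:-1]                   # columns 2..1,036: daily consumption
y = df["flag"]                         # 0 = normal, 1 = theft

print(X.shape)                         # expected (42372, 1035)
print(y.value_counts(normalize=True))  # check the class imbalance
```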
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Faces Dataset: PubFig05
This is a subset of the "PubFig83" dataset [1], which provides 100 images each of the 5 celebrities that are most difficult to recognise (each celebrity is a class in the classification problem). For each celebrity, we took 100 images and separated them into training and testing sets of 90 and 10 images, respectively:
Person: Jenifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba
Feature Extraction
To extract features from the images, we applied the HT-L3 model as described in [2] and obtained 25,600 features.
Feature Selection
The feature selection procedure, in brief, was as follows:
Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset, discarding features according to the minimum description length (MDL) principle; only 4,878 features passed this entropy-based filtering.
Class-Distribution Balancing: Next, we converted the dataset into binary-class problems by separating it into 5 binary-class datasets using a one-vs-all setup; these datasets were therefore imbalanced at a ratio of 1:4. We then converted them into balanced binary-class datasets using a random sub-sampling method (see the sketch below). Further processing of the dataset is described in the paper.
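An illustrative sketch of this one-vs-all conversion plus random sub-sampling (not the authors' exact code): each of the 5 classes becomes one balanced binary dataset.

```python
import numpy as np

def one_vs_all_balanced(X, y, positive_class, rng):
    """Return a balanced binary dataset for `positive_class` vs. the rest."""
    pos_idx = np.flatnonzero(y == positive_class)
    neg_idx = np.flatnonzero(y != positive_class)
    # Randomly sub-sample the (4x larger) negative side to match the positives.
    neg_idx = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_idx])
    return X[idx], (y[idx] == positive_class).astype(int)

rng = np.random.default_rng(0)
# datasets = [one_vs_all_balanced(X, y, c, rng) for c in range(5)]
```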
(alpha,beta)-k Feature Selection: To obtain a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4], which selects a minimum subset of features that maximises both within-class similarity and between-class dissimilarity. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the Table below) after consolidating them into a binary-class dataset.
UAB: We applied the (alpha,beta)-k feature selection method on each of the balanced binary-class datasets and took the union of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method on each of the binary-class datasets to get a set of features.
IAB: We applied the (alpha,beta)-k feature selection method on each of the balanced binary-class datasets and took the intersection of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method on each of the binary-class datasets to get a set of features.
UEAB: We applied the (alpha,beta)-k feature selection method on each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature selection method on each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset to get a set of features.
All of these datasets are inside the compressed folder. It also contains the document describing the process detail.
References
[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).
[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).
[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).
[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian Conference on Computer Science - Volume 38 (pp. 361–370). Australian Computer Society, Inc.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a factor of 590.
The dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user with 12 features, a treatment indicator, and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser's website during the test period (2 weeks). The global treatment ratio is 84.6%; it is usual for advertisers to keep only a small control population, as it costs them potential revenue. A sketch of a basic uplift estimate on this data appears below.
Following is a detailed description of the features:
The data provided for paper: "A Large Scale Benchmark for Uplift Modeling"
https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf
For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.
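As a sketch of how such data supports uplift estimation, here is a basic two-model ("T-learner") approach. The column names (f0..f11, treatment, conversion) are assumptions about the released CSV, and the file path is hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("criteo-uplift.csv")        # hypothetical local path
features = [f"f{i}" for i in range(12)]      # assumed feature column names

treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0]

# Fit one outcome model per arm of the randomized trial.
m_t = GradientBoostingClassifier().fit(treated[features], treated["conversion"])
m_c = GradientBoostingClassifier().fit(control[features], control["conversion"])

# Per-user uplift = P(conversion | treated) - P(conversion | control).
uplift = (m_t.predict_proba(df[features])[:, 1]
          - m_c.predict_proba(df[features])[:, 1])
```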
We can foresee related usages such as but not limited to:
The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. The categories belong to 13 super-categories, including Plantae (Plant), Insecta (Insect), Aves (Bird), and Mammalia (Mammal). The iNat dataset is highly imbalanced, with dramatically different numbers of images per category. For example, the largest super-category, Plantae (Plant), has 196,613 images from 2,101 categories, whereas the smallest super-category, Protozoa, has only 381 images from 4 categories.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure–activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
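A conceptual sketch of such an ensemble feature selector for imbalanced data: repeatedly draw class-balanced samples and aggregate the features chosen in each round. Mutual information stands in here for the paper's FCBF and fast clustering-based selectors (not available in scikit-learn), and balanced bootstraps stand in for boosting:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def ensemble_feature_selection(X, y, n_rounds=10, k=50, seed=0):
    """Rank features by how often they are selected across balanced rounds."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1])
    pos = np.flatnonzero(y == 1)   # actives
    neg = np.flatnonzero(y == 0)   # inactives
    n = min(len(pos), len(neg))
    for _ in range(n_rounds):
        # Balanced bootstrap: equal numbers of actives and inactives.
        idx = np.concatenate([rng.choice(pos, n), rng.choice(neg, n)])
        sel = SelectKBest(mutual_info_classif, k=k).fit(X[idx], y[idx])
        votes += sel.get_support()
    return np.argsort(votes)[::-1][:k]   # most frequently selected features
```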
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description: Insurance Claims Prediction
Introduction: In the insurance industry, accurately predicting the likelihood of claims is essential for risk assessment and policy pricing. However, insurance claims datasets frequently suffer from class imbalance, where the number of non-claims instances far exceeds that of actual claims. This class imbalance poses challenges for predictive modeling, often leading to biased models favoring the majority class, resulting in subpar performance for the minority class, which is typically of greater interest.
Dataset Overview: The dataset utilized in this project comprises historical data on insurance claims, encompassing a variety of information about the policyholders, their demographics, past claim history, and other pertinent features. The dataset is structured to facilitate predictive modeling tasks aimed at accurately identifying the likelihood of future insurance claims.
Key Features:
1. Policyholder Information: This includes demographic details such as age, gender, occupation, marital status, and geographical location.
2. Claim History: Information regarding past insurance claims, including claim amounts, types of claims (e.g., medical, automobile), frequency of claims, and claim durations.
3. Policy Details: Details about the insurance policies held by the policyholders, such as coverage type, policy duration, premium amount, and deductibles.
4. Risk Factors: Variables indicating potential risk factors associated with policyholders, such as credit score, driving record (for automobile insurance), health status (for medical insurance), and property characteristics (for home insurance).
5. External Factors: Factors external to the policyholders that may influence claim likelihood, such as economic indicators, weather conditions, and regulatory changes.
Objective: The primary objective of utilizing this dataset is to develop robust predictive models capable of accurately assessing the likelihood of insurance claims. By leveraging advanced machine learning techniques, such as classification algorithms and ensemble methods, the aim is to mitigate the effects of class imbalance and produce models that demonstrate high predictive performance across both majority and minority classes.
Application Areas:
1. Risk Assessment: Assessing the risk associated with insuring a particular policyholder based on their characteristics and historical claim behavior.
2. Policy Pricing: Determining appropriate premium amounts for insurance policies by estimating the expected claim frequency and severity.
3. Fraud Detection: Identifying fraudulent insurance claims by detecting anomalous patterns in claim submissions and policyholder behavior.
4. Customer Segmentation: Segmenting policyholders into distinct groups based on their risk profiles and insurance needs to tailor marketing strategies and policy offerings.
Conclusion: The insurance claims dataset serves as a valuable resource for developing predictive models aimed at enhancing risk management, policy pricing, and overall operational efficiency within the insurance industry. By addressing the challenges posed by class imbalance and leveraging the rich array of features available, organizations can gain valuable insights into insurance claim likelihood and make informed decisions to mitigate risk and optimize business outcomes.
Feature | Description |
---|---|
policy_id | Unique identifier for the insurance policy. |
subscription_length | The duration for which the insurance policy is active. |
customer_age | Age of the insurance policyholder, which can influence the likelihood of claims. |
vehicle_age | Age of the vehicle insured, which may affect the probability of claims due to factors like wear and tear. |
model | The model of the vehicle, which could impact the claim frequency due to model-specific characteristics. |
fuel_type | Type of fuel the vehicle uses (e.g., Petrol, Diesel, CNG), which might influence the risk profile and claim likelihood. |
max_torque, max_power | Engine performance characteristics that could relate to the vehicle’s mechanical condition and claim risks. |
engine_type | The type of engine, which might have implications for maintenance and claim rates. |
displacement, cylinder | Specifications related to the engine size and construction, affec... |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Below is a draft DMP-style description of a credit-card fraud detection experiment:
Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.
Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.
Data Sources
We used the publicly available credit-card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284,807 transactions, of which 492 are fraudulent.
Method of Dataset Preparation
Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo's requirements.
Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets (training 70%, validation 15%, test 15%) using range-based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from "Y"/"N" to 1/0 and dropped non-feature identifiers (actionnr, merchant_id).
Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held-out test set (a condensed sketch follows).
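A condensed sketch of the cleaning, splitting, and modeling steps, assuming a local copy of the CSV at a hypothetical path (the canonical data lives in DBRepo, where the splits were derived with range-based filters on actionnr rather than randomly):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/creditcardfraud.csv")   # hypothetical local path

# Cleaning: map Y/N flags to 1/0, drop non-feature identifiers.
flags = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
df[flags] = (df[flags] == "Y").astype(int)
X = df.drop(columns=["actionnr", "merchant_id", "isfradulent"])
y = df["isfradulent"]

# Splitting: 70/15/15 train/validation/test (stratified random here).
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print(clf.score(X_val, y_val))   # tune here, then evaluate once on X_te
```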
Dataset Structure
The raw data is a single CSV with columns:
actionnr (integer transaction ID)
merchant_id (string)
average_amount_transaction_day (float)
transaction_amount (float)
is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
Naming Conventions
All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, and creditcard_test in DBRepo.
Files in the code repo follow a clear structure:
├── data/ # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/ # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json
Required Software
Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo‐client.py (DBRepo API)
requests (TU WRD API)
Additional Resources
Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook's dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs
Data Limitations
Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized PCA features (V1–V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.
Time‐bounded: only covers two days of transactions, may not capture seasonal patterns.
Licensing and Attribution
Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.
Recommended Uses
Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model‐training pipelines, FAIR data practices.
Extension: adding time‐series or deep‐learning models.
Known Issues
Possible temporal leakage if date/time features not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification of imbalanced datasets of animal behavior has been one of the top challenges in the field of animal science. An imbalanced dataset leads many classification algorithms to be less effective and results in a higher misclassification rate for the minority classes. The aim of this study was to assess a method for addressing the problem of imbalanced datasets of pigs' behavior by using an over-sampling method, namely Borderline-SMOTE. The pigs' activity was measured using a triaxial accelerometer mounted on the back of the pigs. Wavelet filtering and Borderline-SMOTE were both applied as methods to pre-process the dataset. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that wavelet filtering and Borderline-SMOTE both led to improved performance. Furthermore, Borderline-SMOTE yielded greater improvements in classification performance than an alternative method for balancing the training data, namely random under-sampling, which is commonly used in animal science research. However, the overall performance was not adequate to satisfy the research needs in this field and to address the common but urgent problem of imbalanced behavior datasets.
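A minimal sketch of this balancing step using imbalanced-learn's BorderlineSMOTE, with synthetic data standing in for the 21 accelerometer features and the four activity classes; parameters are illustrative, not the study's:

```python
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic 4-class stand-in for the 21-feature accelerometer data.
X, y = make_classification(n_samples=2000, n_features=21, n_informative=10,
                           n_classes=4, weights=[0.6, 0.2, 0.15, 0.05],
                           random_state=0)

# Over-sample borderline minority examples, then train the network.
X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X_res, y_res)   # apply resampling to the training folds only
```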
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Oil Spill’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ashrafkhan94/oil-spill on 14 February 2022.
--- Dataset description provided by original source is as follows ---
In this project, we will use a standard imbalanced machine learning dataset referred to as the oil spill dataset, oil slicks dataset or simply oil. The dataset was introduced in the 1998 paper by Miroslav Kubat, et al. titled Machine Learning for the Detection of Oil Spills in Satellite Radar Images. The dataset is often credited to Robert Holte, a co-author of the paper. The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not. Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.
--- Original source retains full ownership of the source dataset ---