100+ datasets found
  1. Data from: Isometric Stratified Ensembles: A Partial and Incremental...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Jun 15, 2023
    Cite
    Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux (2023). Isometric Stratified Ensembles: A Partial and Incremental Adaptive Applicability Domain and Consensus-Based Classification Strategy for Highly Imbalanced Data Sets with Application to Colloidal Aggregation [Dataset]. http://doi.org/10.1021/acs.jcim.2c00293.s003
    Explore at:
    xlsx
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    ACS Publications
    Authors
    Christophe Molina; Lilia Ait-Ouarab; Hervé Minoux
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize the classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set is used to determine the statistical performance of the ensemble at every stratum of the 2D map and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, the isometric stratified ensembles (ISE) approach applies recursive decorrelated variable selection and uses the cardinal ratio of the classes to balance training sets, thereby avoiding bias due to possible class imbalance. To illustrate the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available on the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained with this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance in classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88 when only its highest-confidence isometric stratum is retained.
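
    The stratification idea above can be pictured with a small sketch: bin ensemble predictions by their degree of consensus and score each stratum separately. This only illustrates the consensus axis (the applicability-domain axis and the published ISE procedure are not reproduced), and the function and parameter names are ours.

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def consensus_stratum_auc(member_probs, y_true, n_strata=3):
        """Illustrative only: AUC per consensus stratum of an ensemble."""
        member_probs = np.asarray(member_probs)      # shape (n_members, n_samples)
        y_true = np.asarray(y_true)
        mean_prob = member_probs.mean(axis=0)        # consensus probability
        votes = (member_probs > 0.5).mean(axis=0)    # fraction of "positive" votes
        consensus = np.abs(votes - 0.5) * 2          # 0 = split vote, 1 = unanimous
        strata = np.minimum((consensus * n_strata).astype(int), n_strata - 1)
        per_stratum = {}
        for s in range(n_strata):
            mask = strata == s
            if mask.sum() > 1 and len(np.unique(y_true[mask])) == 2:
                per_stratum[s] = roc_auc_score(y_true[mask], mean_prob[mask])
        return per_stratum
    ```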

  2. Imbalanced class datasets.

    • plos.figshare.com
    xls
    Updated Apr 11, 2024
    Cite
    Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud (2024). Imbalanced class datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0299585.t001
    Explore at:
    xls
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The performance of defect prediction models built using balanced versus imbalanced datasets has a big impact on the discovery of future defects. Current resampling techniques only address the imbalance of datasets, without taking into consideration the redundancy and noise inherent to imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique aims to generate balanced datasets by increasing data diversity in order to reduce redundancy and noise. KCO first projects multidimensional features into two-dimensional features by employing Kernel Principal Component Analysis (KPCA). KCO then divides the plotted data distribution by deploying spectral clustering to select the best region for interpolation. Lastly, KCO generates the new defect data by interpolating different data templates within the selected data clusters. According to the prediction evaluation conducted, KCO consistently produced F-scores ranging from 21% to 63% across six datasets, on average. The experimental results presented in this study show that KCO provides more effective prediction performance than other baseline techniques, and that it consistently achieves higher F-scores in both within-project and cross-project predictions.
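
    As a rough picture of the pipeline described (KPCA projection, spectral clustering, crossover-style interpolation), a minimal sketch follows. It is not the authors' KCO implementation, and every parameter choice here is a placeholder.

    ```python
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.cluster import SpectralClustering

    def kco_like_oversample(X_minority, n_new, n_clusters=3, random_state=0):
        """Sketch of a KCO-style oversampler: project with KPCA, cluster the
        projection, then interpolate ("crossover") between pairs drawn from
        the same cluster. Not the published KCO implementation."""
        X_minority = np.asarray(X_minority, dtype=float)
        rng = np.random.default_rng(random_state)
        Z = KernelPCA(n_components=2, kernel="rbf").fit_transform(X_minority)
        clusters = SpectralClustering(n_clusters=n_clusters,
                                      random_state=random_state).fit_predict(Z)
        synthetic = []
        while len(synthetic) < n_new:
            c = rng.choice(np.unique(clusters))
            idx = np.flatnonzero(clusters == c)
            if len(idx) < 2:
                continue                      # need a pair to interpolate
            a, b = rng.choice(idx, size=2, replace=False)
            alpha = rng.random()
            synthetic.append(X_minority[a] + alpha * (X_minority[b] - X_minority[a]))
        return np.vstack(synthetic)
    ```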

  3. Dataset for Class Imbalance Classification Problem

    • kaggle.com
    Updated Jan 26, 2021
    Cite
    Akalya Subramanian (2021). Dataset for Class Imbalance Classification Problem [Dataset]. https://www.kaggle.com/akalyasubramanian/dataset-for-class-imbalance-classification-problem/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akalya Subramanian
    Description

    Dataset

    This dataset was created by Akalya Subramanian

  4. Dataset: The effects of class balance on the training energy consumption of...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 18, 2024
    Cite
    Maria Gutierrez; Maria Gutierrez; Coral Calero; Coral Calero; Félix García; Félix García; Mª Ángeles Moraga; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
    Explore at:
    csv
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Gutierrez; Maria Gutierrez; Coral Calero; Coral Calero; Félix García; Félix García; Mª Ángeles Moraga; Mª Ángeles Moraga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They have the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one has only 4.04% of its instances belonging to class c0. This pair of datasets is therefore primarily meant to study how class balance influences the behaviour of a machine learning model.
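
    To experiment with a similar balanced/unbalanced pair without WEKA, the setup can be approximated with scikit-learn's make_classification; this is an illustration only and does not reproduce the Random RBF generator.

    ```python
    from sklearn.datasets import make_classification

    # Illustrative only: make_classification stands in for WEKA's Random RBF
    # generator; shapes and class fractions mirror the description above.
    common = dict(n_samples=104_952, n_features=185, n_informative=30,
                  n_redundant=10, random_state=42)

    X_bal, y_bal = make_classification(weights=[0.5213, 0.4787], **common)
    X_imb, y_imb = make_classification(weights=[0.0404, 0.9596], **common)

    print("balanced   c0 fraction:", (y_bal == 0).mean())
    print("unbalanced c0 fraction:", (y_imb == 0).mean())
    ```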

  5. Data from: Less is More: An Empirical Study of Undersampling Techniques for...

    • figshare.com
    zip
    Updated May 20, 2024
    Cite
    Gichan Lee (2024). Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.22708036.v1
    Explore at:
    zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    figshare
    Authors
    Gichan Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical Debt (TD) prediction is crucial to preventing software quality degradation and maintenance cost increases. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but imbalanced TD datasets can have a negative impact on ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority-class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored due to concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction by utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with oversampling techniques widely used in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques leads to further performance improvement compared to applying either technique exclusively. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.

    This package is for the replication of 'Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction'. File list:

    X.csv, Y.csv: the datasets for the study, used in the notebook below.
    under_over_sampling_scripts.ipynb: scripts that reproduce all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
    Results_for_all_tables.csv: a csv file that summarizes all the results obtained from the study.
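
    A minimal sketch of the combined under- and oversampling idea studied in the paper, using imbalanced-learn; the chosen techniques, the ratios, and the assumption that Y.csv holds a single binary label column are ours, not the paper's exact configuration.

    ```python
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X = pd.read_csv("X.csv")
    y = pd.read_csv("Y.csv").squeeze("columns")   # assumed: one binary label column

    # Oversample the minority class part-way, then undersample the majority class
    # so the two end up balanced; evaluate with F-score as in the study.
    model = Pipeline(steps=[
        ("oversample", SMOTE(sampling_strategy=0.5, random_state=0)),
        ("undersample", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
        ("clf", RandomForestClassifier(random_state=0)),
    ])
    print(cross_val_score(model, X, y, scoring="f1", cv=5).mean())
    ```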

  6. DCASE-2023-TASK-5

    • kaggle.com
    zip
    Updated Jun 5, 2023
    Cite
    Víctor Aguado (2023). DCASE-2023-TASK-5 [Dataset]. https://www.kaggle.com/datasets/aguado/dcase-2023-task-5
    Explore at:
    zip (7,712,922,302 bytes)
    Dataset updated
    Jun 5, 2023
    Authors
    Víctor Aguado
    Description

    Introduction

    This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings.

    For more info please refer to the official website: https://dcase.community/challenge2023/task-few-shot-bioacoustic-event-detection

    Description

    Few-shot learning is a highly promising paradigm for sound event detection. It is also an extremely good fit to the needs of users in bioacoustics, in which increasingly large acoustic datasets commonly need to be labelled for events of an identified category (e.g. species or call-type), even though this category might not be known in other datasets or have any yet-known label. While satisfying user needs, this will also benchmark few-shot learning for the wider domain of sound event detection (SED).

    Few-shot learning describes tasks in which an algorithm must make predictions given only a few instances of each class, contrary to the standard supervised learning paradigm. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance and noisy/busy environments. Few-shot learning is usually studied using N-way-K-shot classification, where N denotes the number of classes and K the number of examples for each class.
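
    A small sketch of how N-way-K-shot episodes are typically drawn (illustrative only; this is not the task's evaluation code):

    ```python
    import random
    from collections import defaultdict

    def sample_episode(labels, n_way=5, k_shot=5, n_query=5, seed=None):
        """Pick N classes, then K support and n_query query examples per class.
        Returns dicts of dataset indices keyed by class."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            by_class[label].append(idx)
        eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
        classes = rng.sample(eligible, n_way)
        support, query = {}, {}
        for c in classes:
            chosen = rng.sample(by_class[c], k_shot + n_query)
            support[c], query[c] = chosen[:k_shot], chosen[k_shot:]
        return support, query
    ```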

    Some reasons why few-shot learning has been of increasing interest:

    Scarcity of supervised data can lead to unreliable generalisations of machine learning models.
    Explicitly labeling a huge dataset can be costly both in time and resources.
    Fixed ontologies or class labels used in SED and other DCASE tasks are often a poor fit to a given user's goal.

    Development Set

    The development set is pre-split into training and validation sets. The training set consists of five sub-folders, each deriving from a different source. Along with the audio files, multi-class annotations are provided for each. The validation set consists of two sub-folders, each deriving from a different source, with a single-class (class of interest) annotation file provided for each audio file.

    Training Set

    The training set contains five different sub-folders (BV, HT, JD, MT, WMW). Statistics are given overall and for each individual sub-folder.

    Overall statistics:
    Number of audio recordings: 174
    Total duration: 21 hours
    Total classes (excl. UNK): 47
    Total events (excl. UNK): 14229

    BV

    The BirdVox-DCASE-10h (BV for short) contains five audio files from four different autonomous recording units, each lasting two hours. These autonomous recording units are all located in Tompkins County, New York, United States. Furthermore, they follow the same hardware specification: the Recording and Observing Bird Identification Node (ROBIN) developed by the Cornell Lab of Ornithology. Andrew Farnsworth, an expert ornithologist, has annotated these recordings for the presence of flight calls from migratory passerines, namely: American sparrows, cardinals, thrushes, and warblers. In total, the annotator found 2,662 flight calls from 11 different species. We estimate these flight calls to have a duration of 150 milliseconds and a fundamental frequency between 2 kHz and 10 kHz.

    Statistics:
    Number of audio recordings: 5
    Total duration: 10 hours
    Total classes (excl. UNK): 11
    Total events (excl. UNK): 9026
    Ratio event/duration: 0.04
    Sampling rate: 24,000 Hz

    HT

    Spotted hyenas are a highly social species that live in "fission-fusion" groups where group members range alone or in smaller subgroups that split and merge over time. Hyenas use a variety of types of vocalizations to coordinate with one another over both short and long distances. Spotted hyena vocalization data were recorded on custom-developed audio tags designed by Mark Johnson and integrated into combined GPS / acoustic collars (Followit Sweden AB) by Frants Jensen and Mark Johnson. Collars were deployed on female hyenas of the Talek West hyena clan at the MSU-Mara Hyena Project (directed by Kay Holekamp) in the Masai Mara, Kenya as part of a multi-species study on communication and collective behavior. Field work was carried out by Kay Holekamp, Andrew Gersick, Frants Jensen, Ariana Strandburg-Peshkin, and Benson Pion; labeling was done by Kenna Lehmann and colleagues.

    Statistics:
    Number of audio recordings: 5
    Total duration: 5 hours
    Total classes (excl. UNK): 3
    Total events (excl. UNK): 611
    Ratio events/duration: 0.05
    Sampling rate: 6000 Hz

    JD

    Jackdaws are corvid songbirds which usually breed, forage and sleep in large groups, but form a pair bond with the same partner for life. They produce thousands of vocalisations per day, but many aspects of their vocal behaviour remained unexplored due to the difficulty in recording and assigning vocalisations to specific individuals, especia...

  7. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    image/*
    Updated 2020
    + more versions
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    image/*
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Area covered
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The data set was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal meta-data. Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08.

    This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeled.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The csv files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name).

    The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects, and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster).

    The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the Fluxes1 experiment with RV Sarmiento de Gamboa.

    The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss. In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. The labels used are operational and not yet matched to the respective EcoTaxa classes.
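
    A minimal loading sketch using the column names given above (file paths assume the archive has been extracted to the working directory):

    ```python
    import pandas as pd

    train = pd.read_csv("training.csv")      # columns: object_id, image_fn, label
    counts = train["label"].value_counts()

    # How much of the data the most populated 10% of classes hold.
    top10pct = counts.head(max(1, len(counts) // 10))
    print(f"{len(counts)} classes; top 10% of classes hold "
          f"{top10pct.sum() / counts.sum():.1%} of objects")
    ```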

  8. Dataset for Transient Stability Assessment of IEEE 39-Bus System

    • data.mendeley.com
    Updated Dec 20, 2024
    Cite
    Živko Sokolović (2024). Dataset for Transient Stability Assessment of IEEE 39-Bus System [Dataset]. http://doi.org/10.17632/p992nhb8ss.1
    Explore at:
    Dataset updated
    Dec 20, 2024
    Authors
    Živko Sokolović
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 50 features and was generated through 12,852 time-domain simulations performed on the IEEE New England 39 bus system test case using DIgSILENT PowerFactory and Python automation. The simulations span diverse operating conditions by varying the generation/load profile from 80% to 120% in 5% increments. For each condition, three-phase short-circuit faults were applied at seven distinct locations (0%, 10%, 20%, 50%, 80%, 90%, 100%) along all transmission lines, with fault clearing times ranging from 0.1s to 0.3s.

    Key features captured for each of the 10 generators (G02 is the reference machine) include:

    P in MW - Active Power
    ut in p.u. - Terminal Voltage
    ie in p.u. - Excitation Current
    xspeed in p.u. - Rotor Speed
    firel in deg - Rotor Angle (relative to G02)

    Simulations lasted 10 seconds to ensure accurate transient stability assessment. Post-fault data was sampled every 0.01s from fault clearance up to 0.6s afterward, labeling the stability state as 1 (stable) or 0 (unstable). The dataset generation process took 5,840 seconds. The dataset exhibits a class imbalance, with 42% of cases belonging to the unstable class. All simulation data were exported to .csv files and subsequently unified into a single pickle file (tsa_data.pkl).

    Helper scripts are provided:

    dataset_loader.py: includes the load_tsa_data function to load the dataset.
    usage.py: demonstrates how to use the loader module.

    This dataset serves as a comprehensive foundation for machine learning applications in transient stability assessment (TSA), offering insights into system behavior under dynamic conditions.
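
    A minimal sketch of inspecting the unified file. The provided dataset_loader.load_tsa_data helper should be preferred, but its return signature is not described here, so this reads the pickle directly, and the label column name is an assumption.

    ```python
    import pandas as pd

    data = pd.read_pickle("tsa_data.pkl")   # the unified simulation data file
    print(type(data))

    # If it loads as a DataFrame with a stability label column (assumption:
    # a column named "label" holding 1 = stable, 0 = unstable), the ~42%
    # unstable share mentioned above can be checked like this:
    if isinstance(data, pd.DataFrame) and "label" in data.columns:
        print(data["label"].value_counts(normalize=True))
    ```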

  9. resampled_IDS_datasets

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Le (2025). resampled_IDS_datasets [Dataset]. http://doi.org/10.57967/hf/4961
    Explore at:
    Dataset updated
    Mar 26, 2025
    Authors
    Le
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for resampled_IDS_datasets

    Intrusion Detection Systems (IDS) play a crucial role in securing computer networks against malicious activities. However, their efficacy is consistently hindered by the persistent challenge of class imbalance in real-world datasets. While various methods, such as resampling techniques, ensemble methods, cost-sensitive learning, data augmentation, and so on, have individually addressed imbalance classification issues, there exists a notable… See the full description on the dataset page: https://huggingface.co/datasets/Thi-Thu-Huong/resampled_IDS_datasets.
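
    A minimal loading sketch with the Hugging Face datasets library; the configuration and split names are assumptions, so check the dataset card.

    ```python
    from datasets import load_dataset

    # Assumption: the default configuration; the actual configurations, splits
    # and feature names are documented on the dataset card linked above.
    ds = load_dataset("Thi-Thu-Huong/resampled_IDS_datasets")
    print(ds)  # shows the available splits and their features
    ```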

  10. Shows parameters of the datasets used in this study.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    + more versions
    Cite
    Kevin Teh; Paul Armitage; Solomon Tesfaye; Dinesh Selvarajah; Iain D. Wilkinson (2023). Shows parameters of the datasets used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0243907.t001
    Explore at:
    xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kevin Teh; Paul Armitage; Solomon Tesfaye; Dinesh Selvarajah; Iain D. Wilkinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Shows parameters of the datasets used in this study.

  11. Wafer UCR Archive Dataset

    • zenodo.org
    bin
    Updated Jul 31, 2024
    Cite
    Zenodo (2024). Wafer UCR Archive Dataset [Dataset]. http://doi.org/10.5281/zenodo.11198387
    Explore at:
    bin
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.

    This dataset was formatted by R. Olszewski as part of his thesis Generalized feature extraction for structural pattern recognition in time-series data at Carnegie Mellon University, 2001. Wafer data relates to semiconductor microelectronics fabrication. A collection of inline process control measurements recorded from various sensors during the processing of silicon wafers for semiconductor fabrication constitutes the wafer database; each data set in the wafer database contains the measurements recorded by one sensor during the processing of one wafer by one tool. The two classes are normal and abnormal. There is a large class imbalance between normal and abnormal (10.7% of the train set are abnormal, 12.1% of the test set).

    Donator: R. Olszewski

  12. S5 Dataset -

    • plos.figshare.com
    xlsx
    Updated Dec 13, 2024
    + more versions
    Cite
    JiaMing Gong; MingGang Dong (2024). S5 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s005
    Explore at:
    xlsx
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    PLOS ONE
    Authors
    JiaMing Gong; MingGang Dong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance or concept drift in isolation, and only a few have considered these issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy: it divides each data chunk into multiple balanced sample pairs based on the differences in information entropy between the classes in the chunk. Additionally, we propose a density-based sampling method that improves the accuracy of classifying minority-class samples by splitting them into high-quality samples and common samples according to the density of similar samples; high-quality and common samples are then randomly selected for training the classifier. Finally, to handle concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of each classifier, adjusting the weight of each sub-classifier according to its performance on the arriving data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic and one real-world data streams.
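
    The "entropy-based" part of the strategy rests on the Shannon entropy of the class distribution within a data chunk; a small illustrative computation (not the EDAC implementation) follows.

    ```python
    import numpy as np

    def class_entropy(labels):
        """Shannon entropy (bits) of the class distribution in one data chunk."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    print(class_entropy([0] * 95 + [1] * 5))   # ~0.29 bits: highly imbalanced chunk
    print(class_entropy([0] * 50 + [1] * 50))  # 1.0 bit: perfectly balanced chunk
    ```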

  13. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Nagappan, Meiyappan
    Keshavarz, Hossein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
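
    As a quick illustration of how the CSV files listed above might be consumed (paths assume the archive's dataset directory; only the buggy column is documented here):

    ```python
    import pandas as pd

    train = pd.read_csv("dataset/apachejit_train.csv")
    print(train["buggy"].value_counts(normalize=True))  # balanced by construction

    test = pd.read_csv("dataset/apachejit_test_large.csv")
    print(test["buggy"].value_counts(normalize=True))   # left imbalanced on purpose
    ```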

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles GitHub API token that is necessary for requests to GitHub API. Script collector.py performs GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014. 313-324.

    2. PyDriller (https://pydriller.readthedocs.io/en/latest/)

    Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.

  14. Additional file 3 of Impact of random oversampling and random undersampling...

    • springernature.figshare.com
    • figshare.com
    xlsx
    Updated Aug 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cynthia Yang; Egill A. Fridgeirsson; Jan A. Kors; Jenna M. Reps; Peter R. Rijnbeek (2024). Additional file 3 of Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data [Dataset]. http://doi.org/10.6084/m9.figshare.26660464.v1
    Explore at:
    xlsx
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    figshare
    Authors
    Cynthia Yang; Egill A. Fridgeirsson; Jan A. Kors; Jenna M. Reps; Peter R. Rijnbeek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. Candidate predictors per database.

  15. Heart Disease Health Indicators Dataset

    • kaggle.com
    Updated Mar 10, 2022
    + more versions
    Cite
    Alex Teboul (2022). Heart Disease Health Indicators Dataset [Dataset]. https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alex Teboul
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Heart Disease is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. In the United States alone, heart disease claims roughly 647,000 lives each year — making it the leading cause of death. The buildup of plaques inside larger coronary arteries, molecular changes associated with aging, chronic inflammation, high blood pressure, and diabetes are all causes of and risk factors for heart disease.

    While there are different types of coronary heart disease, the majority of individuals only learn they have the disease following symptoms such as chest pain, a heart attack, or sudden cardiac arrest. This fact highlights the importance of preventative measures and tests that can accurately predict heart disease in the population prior to negative outcomes like myocardial infarctions (heart attacks) taking place.

    The Centers for Disease Control and Prevention has identified high blood pressure, high blood cholesterol, and smoking as three key risk factors for heart disease. Roughly half of Americans have at least one of these three risk factors. The National Heart, Lung, and Blood Institute highlights a wider array of factors such as Age, Environment and Occupation, Family History and Genetics, Lifestyle Habits, Other Medical Conditions, Race or Ethnicity, and Sex for clinicians to use in diagnosing coronary heart disease. Diagnosis tends to be driven by an initial survey of these common risk factors followed by bloodwork and other tests.

    Content

    The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is collected annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, I downloaded a csv of the dataset available on Kaggle for the year 2015. This original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions directly asked of participants, or calculated variables based on individual participant responses.

    This dataset contains 253,680 survey responses from the cleaned BRFSS 2015 dataset, to be used primarily for the binary classification of heart disease. Note that there is a strong class imbalance in this dataset: 229,787 respondents do not have/have not had heart disease, while 23,893 have had heart disease. The questions to be explored are:

    1. To what extent can survey responses from the BRFSS be used for predicting heart disease risk?

    and

    2. Can a subset of questions from the BRFSS be used for preventative health screening for diseases like heart disease?
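
    Given the imbalance quoted above, accuracy alone is a poor target for either question; a small sketch of the majority-class baseline and of "balanced" class weights (one common mitigation, not part of the dataset itself):

    ```python
    # Class counts from the description above.
    neg, pos = 229_787, 23_893
    total = neg + pos

    print(f"majority-class baseline accuracy: {neg / total:.3f}")  # ~0.906

    # "Balanced" weighting as scikit-learn computes it: n_samples / (n_classes * n_c);
    # e.g. pass as class_weight=weights to LogisticRegression or RandomForestClassifier.
    weights = {0: total / (2 * neg), 1: total / (2 * pos)}
    print(weights)
    ```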

    Acknowledgements

    It is important to reiterate that I did not create this dataset; it is just a cleaned and consolidated dataset created from the BRFSS 2015 dataset already on Kaggle. That dataset can be found here and the notebook I used for the data cleaning can be found here.

    Inspiration

    Let's build some predictive models for heart disease.

  16. Municipal accounts; balance sheet by region and size class

    • cbs.nl
    • ckan.mobidatalab.eu
    • +2more
    xml
    Updated Dec 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Centraal Bureau voor de Statistiek (2024). Municipal accounts; balance sheet by region and size class [Dataset]. https://www.cbs.nl/en-gb/figures/detail/71231ENG
    Explore at:
    xml
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Statistics Netherlands
    Authors
    Centraal Bureau voor de Statistiek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2004 - 2023
    Area covered
    The Netherlands
    Description

    This table contains figures on balance sheet items from municipalities by size class and region at year-end. The figures are stated in millions of euros as well as in euros per inhabitant. The figures presented in the table are based on balance sheet positions that are in line with the definitions and classifications used by the municipalities themselves in their administration. This data is supplied to Statistics Netherlands via the survey 'Informatie voor derden' (Iv3). The requirements for this survey are laid down in the 'Besluit Begroting en Verantwoording (BBV)'.

    Data available from: 2004

    Status of the figures: The figures in this table are provisional at the time of first publication. The figures become definitive when figures for the following year are added to the series.

    Changes as of 19 December 2024: The provisional figures for 2023 have been added. The figures for 2022 have been adjusted from provisional to definite.

    When will new figures be published? The new figures from the local intergovernmental organizations accounts are published no later than 12 months after the reporting period. The figures can be adjusted on the basis of the availability of new or updated source material. In general, the adjustments are small. The adjustments are made the moment a new annual figure is added to the series.

  17. CrackVision12K Dataset

    • paperswithcode.com
    Updated Sep 3, 2024
    + more versions
    Cite
    CrackVision12K Dataset [Dataset]. https://paperswithcode.com/dataset/crackvision12k
    Explore at:
    Dataset updated
    Sep 3, 2024
    Authors
    June Moh Goo; Xenios Milidonis; Alessandro Artusi; Jan Boehm; Carlo Ciliberto
    Description

    We present the CrackVision12k dataset, a collection of 12,000 crack images derived from 13 publicly available crack datasets. The individual datasets were too small to effectively train a deep learning model. Moreover, the masks in each dataset were annotated using different standards, so unifying the annotations was necessary. To achieve this, we applied various image processing techniques to each dataset to create masks that follow a consistent standard.

    Crack datasets inherently suffer from class imbalance. To mitigate this issue, we selected images containing more than 5,000 crack pixels and applied data augmentation techniques such as Gaussian noise and rotation. Finally, there is a corresponding refined ground truth for each crack image across the dataset to ensure uniformity and reliability.
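
    A small sketch of the selection and augmentation steps described above; the noise level, rotation range and helper names are illustrative, not the authors' exact processing.

    ```python
    import numpy as np
    from scipy.ndimage import rotate

    def keep_image(mask, min_crack_pixels=5000):
        """mask: binary array where nonzero marks crack pixels."""
        return int((mask > 0).sum()) > min_crack_pixels

    def augment(image, rng=None):
        """Return a Gaussian-noised copy and a randomly rotated copy of an image."""
        rng = rng or np.random.default_rng(0)
        noisy = np.clip(image + rng.normal(0, 10, size=image.shape), 0, 255)
        rotated = rotate(image, angle=float(rng.uniform(-30, 30)),
                         reshape=False, mode="reflect")
        return noisy, rotated
    ```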

    The 13 datasets we combined are as follows: Aigle-RN, ESAR, LCMS, CRACK500, CrackLS315, CRKWH100, CrackTree260, DeepCrack, GAPS384, Masonry, Stone331, CFD, and SDNet2018.

  18. Pl@ntNet-300K image dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Apr 29, 2021
    Cite
    Camille Garcin; Alexis Joly; Pierre Bonnet; Maximilien Servajean; Joseph Salmon (2021). Pl@ntNet-300K image dataset [Dataset]. http://doi.org/10.5281/zenodo.4726653
    Explore at:
    Dataset updated
    Apr 29, 2021
    Authors
    Camille Garcin; Alexis Joly; Pierre Bonnet; Maximilien Servajean; Joseph Salmon
    Description

    Pl@ntNet-300K is an image dataset aimed at evaluating set-valued classification. It was built from the database of the Pl@ntNet citizen observatory and consists of 306,146 images covering 1,081 species. We highlight two particular features of the dataset, inherent to the way the images are acquired and to the intrinsic diversity of plant morphology: i) the dataset exhibits a strong class imbalance, meaning that a few species represent most of the images; ii) many species are visually similar, making identification difficult even for the expert eye. These two characteristics make the present dataset a good candidate for the evaluation of set-valued classification methods and algorithms. Therefore, we recommend two set-valued evaluation metrics associated with the dataset (top-K and average-K) and we provide the results of a baseline approach based on a ResNet50 trained with a cross-entropy loss. The scientific publication (NEURIPS 2022) describing the dataset and providing baseline results can be found here: https://openreview.net/forum?id=eLYinD0TtIt. Utilities to load the data and train models with PyTorch can be found here: https://github.com/plantnet/PlantNet-300K/
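
    A minimal top-K accuracy sketch for the recommended set-valued evaluation (average-K, the adaptive variant, is not shown; this is not the repository's evaluation code):

    ```python
    import numpy as np

    def top_k_accuracy(probs, labels, k=5):
        """probs: (n_samples, n_classes) scores; labels: true class indices."""
        topk = np.argsort(probs, axis=1)[:, -k:]               # k highest-scoring classes
        hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
        return hits.mean()

    # Toy usage with random scores over 1,081 classes.
    probs = np.random.rand(4, 1081)
    print(top_k_accuracy(probs, labels=[3, 10, 500, 1080], k=5))
    ```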

  19. Brackish Underwater Object Detection Dataset - 1920x1080

    • public.roboflow.com
    zip
    Updated Aug 2, 2022
    + more versions
    Cite
    Aalborg University (2022). Brackish Underwater Object Detection Dataset - 1920x1080 [Dataset]. https://public.roboflow.com/object-detection/brackish-underwater/1
    Explore at:
    zip
    Dataset updated
    Aug 2, 2022
    Dataset authored and provided by
    Aalborg University
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of animals
    Description

    Example image from the dataset: https://i.imgur.com/3dtuNhv.png

    Dataset Information

    This dataset contains 14,674 images (12,444 of which contain objects of interest with bounding box annotations) of fish, crabs, and other marine animals. It was collected with a camera mounted 9 meters below the surface on the Limfjords bridge in northern Denmark by Aalborg University.

    Composition

    Roboflow has extracted and processed the frames from the source videos and converted the annotations for use with many popular computer vision models. We have maintained the same 80/10/10 train/valid/test split as the original dataset.

    The class balance in the annotations is shown in this chart: https://i.imgur.com/3MUk7D7.png

    Most of the identified objects are congregated towards the bottom of the frames.

    Annotation heatmap: https://i.imgur.com/jAbb2i4.png

    More Information

    For more information, see the Detection of Marine Animals in a New Underwater Dataset with Varying Visibility paper.

    If you find the dataset useful, the authors request that you please cite their paper:

    @InProceedings{pedersen2019brackish,
      title={Detection of Marine Animals in a New Underwater Dataset with Varying Visibility},
      author={Pedersen, Malte and Haurum, Joakim Bruslund and Gade, Rikke and Moeslund, Thomas B. and Madsen, Niels},
      booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
      month = {June},
      year = {2019}
    }
    
  20. Dataset for Insect Detection Remote Sensing

    • zenodo.org
    zip
    Updated Feb 2, 2024
    Cite
    Trevor Vannoy; Bradley Whitaker; Nathaniel Sweeney; Caroline Xu; Walden Marshall; Ryan Ficken; Trevor Vannoy; Bradley Whitaker; Nathaniel Sweeney; Caroline Xu; Walden Marshall; Ryan Ficken (2024). Dataset for Insect Detection Remote Sensing [Dataset]. http://doi.org/10.5281/zenodo.10055763
    Explore at:
    zip
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Trevor Vannoy; Bradley Whitaker; Nathaniel Sweeney; Caroline Xu; Walden Marshall; Ryan Ficken; Trevor Vannoy; Bradley Whitaker; Nathaniel Sweeney; Caroline Xu; Walden Marshall; Ryan Ficken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # MSU Horticulture farm beehive dataset

    Dataset associated with journal submission entitled "Comparison of Supervised Learning and Changepoint Detection for Insect Detection in Lidar Data" by authors T. C. Vannoy, N. B. Sweeney, J. A. Shaw, and B. M. Whitaker.

    The associated software is archived at https://zenodo.org/doi/10.5281/zenodo.10055809.

    The data were collected in June and July of 2022 at the horticulture farm at Montana State University - Bozeman. The data consists of 9977 images taken in front of the beehives.
    For the data collection process, the lidar was mounted in the back of a U-haul van and pointed in front of the beehives. The lidar was then run at a variety of pan and tilt angles. This created a diverse set of images with varying levels of activity depending on how far the beam was from the beehives, along with some of the sets of images containing stationary targets where the beam was hitting a beehive or plant in the distance.
    ## Organization
    At the top-level, data are split by the collection date. The next level down, the folders correspond to individual data collection runs; the timestamp at the end of the folder names indicates when the data collection started.
    Each top-level date folder contains a README file that describes the individual data collection runs.
    Each data collection folder contains the following files:
    - `adjusted_data_junecal_volts.mat`: The main data file, which contains all the data and metadata.
    - `
    - `labels.csv`: The class labels for all data in the folder
    - `labels.mat`: The class labels converted into label vectors, which are used for machine learning.
    ## Class labels
    After collecting the data, we manually labelled the bounding box of each insect in each image, then converted the bounding boxes into binary labels that indicate whether a row contains an insect. Each potential insect was labeled with a confidence rating because some bees were more obvious than others. During the labeling process, we found 4671 probable bees. Since we were not able to collect ground-truth data in the field, it is possible that our labels are imperfect: some insects might have been missed, and some non-insects might have been labeled as insects.
    There is a `labels.csv` file in the root directory, which contains all the labels. Each of the data-collection subdirectories has a `labels.csv` and `labels.mat` file that contain the labels for that run.
    ### Class imbalance
    Of the 9977 images, 3498 (35.14%) contain one or more bees. In total, the dataset has 1775906 rows, 11492 (0.647%) of which contain an insect measurement. Due to sampling jitter in the ADC, most insects span multiple range bins, leading to an increase in the number of rows that were labeled as containing insects. The dataset has a large class imbalance, particularly when looking at how many rows contain insects.
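
    A minimal sketch of reproducing the row-level imbalance figure from labels.csv; the binary column name used here (`label`) is an assumption, as the real column names are not listed above.

    ```python
    import pandas as pd

    labels = pd.read_csv("labels.csv")       # assumed: one row per lidar range bin
    insect_rows = labels["label"].sum()      # assumed: binary column, 1 = insect
    print(f"{insect_rows} of {len(labels)} rows "
          f"({insect_rows / len(labels):.3%}) contain an insect")
    ```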