Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whether a defect prediction model is trained on balanced or imbalanced datasets has a large impact on its ability to discover future defects. Current resampling techniques address only the class imbalance, without considering the redundancy and noise inherent to imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique generates balanced datasets by increasing data diversity, which in turn reduces redundancy and noise. KCO first maps multidimensional features into two-dimensional features using Kernel Principal Component Analysis (KPCA). KCO then partitions the resulting data distribution with spectral clustering to select the best region for interpolation. Lastly, KCO generates new defect data by interpolating different data templates within the selected data clusters. In the prediction evaluation conducted, KCO produced average F-scores ranging from 21% to 63% across six datasets. The experimental results presented in this study show that KCO provides more effective prediction performance than the baseline techniques, consistently achieving higher F-scores in both within-project and cross-project prediction.
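KCO's final step, crossover interpolation within a selected cluster, can be sketched in a few lines of numpy (a toy illustration only: the KPCA projection and spectral-clustering steps are omitted, and `crossover_interpolate` is a hypothetical name, not the authors' code):

```python
import numpy as np

def crossover_interpolate(cluster, n_new, rng=None):
    """Generate synthetic minority samples by interpolating between
    randomly paired 'template' samples within one cluster."""
    rng = np.random.default_rng(rng)
    n = len(cluster)
    # pick two parent templates for each synthetic sample
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    lam = rng.random((n_new, 1))              # interpolation weight in [0, 1)
    return cluster[i] + lam * (cluster[j] - cluster[i])

minority = np.random.default_rng(0).normal(size=(30, 2))  # toy 2-D KPCA output
synthetic = crossover_interpolate(minority, n_new=70, rng=1)
balanced = np.vstack([minority, synthetic])               # 100 samples total
```

Because each synthetic point is a convex combination of two real samples, it stays inside the per-dimension range of the cluster, which is what keeps the generated data from drifting into noisy regions.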
This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. It reports the results of ensemble models at thresholds from [0, 0.05, 0.1, ..., 0.95, 1.0], in terms of overall AP/AR as well as AP/AR specific to the discrete and continuous groups. This dataset serves as a reference for practitioners seeking the ensemble threshold that best fits their business needs.
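A threshold sweep of this kind can be illustrated with a minimal numpy sketch (hypothetical code, using plain per-threshold precision/recall as a simplified stand-in for the AP/AR reported in the dataset):

```python
import numpy as np

def precision_recall_sweep(scores, labels, thresholds):
    """Compute precision/recall at each ensemble score threshold."""
    out = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((labels == 1).sum(), 1)
        out.append((t, precision, recall))
    return out

thresholds = np.arange(0.0, 1.0001, 0.05)         # [0, 0.05, ..., 0.95, 1.0]
rng = np.random.default_rng(42)
labels = (rng.random(2000) < 0.1).astype(int)     # imbalanced: ~10% positive
scores = np.clip(labels * 0.6 + rng.random(2000) * 0.5, 0, 1)
results = precision_recall_sweep(scores, labels, thresholds)
```

Scanning the resulting (threshold, precision, recall) triples is how a practitioner would pick the operating point that matches their cost of false positives versus false negatives.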
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Technical Debt (TD) prediction is crucial to preventing software quality degradation and increased maintenance costs. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but imbalanced TD datasets can have a negative impact on ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority-class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored, owing to concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction, utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with oversampling techniques widely used in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques yields further performance improvement compared to applying either technique alone. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.
This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.
File list:
X.csv, Y.csv: the datasets for the study, used in the notebook below.
under_over_sampling_scripts.ipynb: scripts that reproduce all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
Results_for_all_tables.csv: a CSV file that summarizes all the results obtained from the study.
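The combined under-plus-oversampling idea can be sketched with a toy numpy function (the study evaluates established resampling techniques; this naive version, with a hypothetical name, merely random-samples each class toward a midpoint):

```python
import numpy as np

def hybrid_resample(X, y, rng=None):
    """Naive hybrid resampling: undersample the majority class and
    oversample the minority class so both meet at the midpoint size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.mean())               # midpoint between class sizes
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # undersample (without replacement) or oversample (with replacement)
        pick = rng.choice(members, size=target, replace=n < target)
        idx.append(pick)
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(40).reshape(20, 2).astype(float)
y = np.array([0] * 16 + [1] * 4)              # 4:1 imbalance
Xr, yr = hybrid_resample(X, y, rng=0)         # both classes resampled to 10
```

Meeting in the middle discards fewer majority instances than pure undersampling while duplicating fewer minority instances than pure oversampling, which mirrors the synergy the study reports.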
Instantaneous system imbalance and net regulation volume (and its components) in Elia’s control area. All published values are non-validated values and can only be used for information purposes. This dataset contains data until 21/05/2024 (before MARI local go-live).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance and concept drift in isolation, and only a few have considered the two issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy: it divides each data chunk into multiple balanced sample pairs based on the differences in information entropy between the classes in the chunk. Additionally, we propose a density-based sampling method that improves the accuracy of classifying minority-class samples by separating them into high-quality samples and common samples according to the density of similar samples; high-quality and common samples are then randomly selected for training the classifier. Finally, to address concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of each classifier, adjusting the weight of each sub-classifier according to its performance on the arriving data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic and one real-world data stream.
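The chunk-balancing step can be sketched as follows (a hypothetical numpy illustration, not the authors' implementation: class entropy is measured per binary chunk, and the majority class is split so that each resulting sample pair is balanced):

```python
import numpy as np

def class_entropy(y):
    """Shannon entropy (bits) of the class distribution in a data chunk."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def balanced_pairs(X, y, rng=None):
    """Split a binary chunk into multiple balanced (majority-subset,
    minority) training pairs instead of discarding majority samples."""
    rng = np.random.default_rng(rng)
    maj, mnr = (0, 1) if (y == 0).sum() >= (y == 1).sum() else (1, 0)
    maj_idx = rng.permutation(np.flatnonzero(y == maj))
    min_idx = np.flatnonzero(y == mnr)
    k = len(maj_idx) // len(min_idx)          # number of balanced pairs
    pairs = []
    for part in np.array_split(maj_idx[: k * len(min_idx)], k):
        sel = np.concatenate([part, min_idx])
        pairs.append((X[sel], y[sel]))
    return pairs

X = np.random.default_rng(0).normal(size=(120, 3))
y = np.array([0] * 100 + [1] * 20)            # 5:1 imbalanced chunk
pairs = balanced_pairs(X, y, rng=1)           # five balanced pairs of 40
```

Training one sub-classifier per balanced pair keeps all majority information in the ensemble without letting any single learner see an imbalanced view of the chunk.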
Imbalance prices applied for balance responsible parties (BRPs) settlement. One-minute imbalance prices are published as fast as possible and are never validated. The 1-min prices give an indication of the final imbalance price of the ISP (imbalance settlement period, which is 15 min). Contains the historical data and is refreshed daily. This dataset contains data from 22/05/2024 (MARI local go-live) on.
Imbalance prices used for balance responsible parties (BRPs) settlement. When imbalance prices are published on a quarter-hourly basis, the published prices have not yet been validated and can therefore only be used as an indication of the imbalance price. Only after the published prices have been validated can they be used for invoicing purposes. The records for month M are validated after the 15th of month M+1. Contains the historical data and is refreshed daily. This dataset contains data from 22/05/2024 (MARI local go-live) on.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed experimental results of the different Prototype Generation strategies for k-Nearest Neighbour classification in multilabel data attending to the particular issues of label-level imbalance and noise:
Corresponds to Section 5.1 in the manuscript.
Noisy scenarios
Study of the noise robustness capabilities of the proposed strategies.
Individual results provided for each corpus.
Statistical tests (Friedman and Bonferroni-Dunn with significance level p < 0.01) to assess the improvement compared to the base multilabel PG strategies
Corresponds to Section 5.2 in the manuscript.
Results ignoring the Editing stage
Assessment of the relevance of the Editing stage in the general pipeline.
Individual results provided for each corpus.
Corresponds to Section 5.3 in the manuscript.
Imbalance prices used for balance responsible parties (BRPs) settlement for every quarter hour. This report contains data for the current day and is refreshed every quarter-hour. Notice that in this report we only provide non-validated data. This dataset contains data from 22/05/2024 (MARI local go-live) on.
https://spdx.org/licenses/etalab-2.0.html
Data and models used in "Bayesian joint-regression analysis of unbalanced series of on-farm trials"
NYSE American Integrated is a proprietary data feed that provides full order book depth, including every quote and order at each price level, on the American market (formerly AMEX, the American Stock Exchange). It operates on NYSE's Pillar platform and disseminates all order book activity in an order-by-order view of events, including trade executions, order modifications, cancellations, and other book updates.
NYSE American specializes in listing growing companies and is the leading exchange for small-cap stocks, as well as offering mid-cap insights. As of January 2025, it represented approximately 0.23% of the average daily volume (ADV) across all exchange-listed securities.
With L3 granularity, NYSE American Integrated captures information beyond the L1, top-of-book data available through SIP feeds, enabling accurate modeling of book imbalances, trade directionality, quote lifetimes, and more. This data includes explicit trade aggressor side, odd lots, and auction imbalances. Auction imbalances offer valuable insights into NYSE American’s opening and closing auctions by providing details like imbalance quantity, paired quantity, imbalance reference price, and book clearing price.
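For example, a common depth-imbalance measure computed from the top levels of such a feed might look like this (illustrative only; the function name, inputs, and level count are assumptions, not part of the Databento schemas):

```python
def book_imbalance(bid_sizes, ask_sizes):
    """Signed depth imbalance in [-1, 1]: +1 means all resting size is on
    the bid, -1 all on the ask, 0 a perfectly balanced (or empty) book.

    bid_sizes / ask_sizes: resting quantities at the top N price levels,
    e.g. parsed from MBP-10 snapshots.
    """
    b, a = sum(bid_sizes), sum(ask_sizes)
    return (b - a) / (b + a) if b + a else 0.0

# Example: bid-heavy book over the top 3 levels -> imbalance of 0.5
imb = book_imbalance([400, 250, 100], [150, 70, 30])
```

Signals like this are typically smoothed over time or combined with trade-aggressor data before being used for directionality modeling.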
Historical data is available for usage-based rates or with any Databento US Equities subscription. Visit our pricing page for more details or to upgrade your plan.
Asset class: Equities
Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.
Supported data encodings: DBN, CSV, JSON
Supported market data schemas: MBO, MBP-1, MBP-10, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Imbalance
Resolution: Immediate publication, nanosecond-resolution timestamps
Instantaneous system imbalance (and its components) and the area control error (ACE) in Elia’s control area. All published values are non-validated values and can only be used for information purposes. This dataset contains data from 22/05/2024 (MARI local go-live) on.
The 1 min imbalance prices are published as fast as possible and give an indication for the final imbalance price of the ISP (imbalance settlement period, which is 15 min). This report contains data for the current hour and is refreshed every minute. Notice that in this report we only provide non-validated data. This dataset contains data from 22/05/2024 (MARI local go-live) on.
Breadth of coverage: 14,160 products
Asset class(es): Equities
Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.
Supported data encodings: DBN, CSV, JSON
Supported market data schemas: MBO, MBP-1, MBP-10, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Imbalance
Resolution: Immediate publication, nanosecond-resolution timestamps
This report contains a forecast of the average quarter-hourly system imbalance in the current quarter-hour, as well as an estimated probability distribution of that average. The data reflects Elia's own forecasts of the system imbalance. Note that these forecasts can have a significant error margin and are not binding for Elia; they are shared merely for informational purposes, and under no circumstances does the publication or use of this information imply a shift of responsibility or liability towards Elia.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Because of the inherent characteristics of cumulative sequences of unbalanced data, the mining results for this kind of data are often skewed by the majority categories, degrading mining performance. To solve this problem, the performance of data cumulative-sequence mining is optimized, and an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbors of the minority samples in the unbalanced cumulative sequence are determined, and the minority samples are clustered according to the natural-nearest-neighbor relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions; the new samples are then added to the original cumulative sequence to balance it. The probability matrix decomposition method generates two Gaussian-distributed random matrices for the balanced cumulative sequence, and a linear combination of low-dimensional eigenvectors is used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weights and optimize the probability matrix decomposition algorithm, optimizing the global error as well as the more efficient single-sample errors. Experimental results show that the algorithm can effectively generate new samples, reduce the imbalance of the data cumulative sequence, and obtain more accurate mining results. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm shows good classification performance on the balanced cumulative sequence, achieving the best average ranking on the F-value, G-mean, and AUC indices.
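The probability-matrix-decomposition step can be sketched with a minimal SGD-based factorization in numpy (illustrative only: Gaussian-initialized low-dimensional factors fitted to the observed entries; the AdaBoost-style weight adjustment described above is not reproduced here):

```python
import numpy as np

def pmf(R, mask, dim=5, steps=200, lr=0.01, reg=0.05, rng=0):
    """Probabilistic matrix factorization sketch: two Gaussian-initialized
    low-dimensional factor matrices fitted to observed entries by SGD."""
    rng = np.random.default_rng(rng)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, dim))
    V = rng.normal(scale=0.1, size=(m, dim))
    rows, cols = np.nonzero(mask)
    for _ in range(steps):
        for i, j in zip(rows, cols):
            Ui = U[i].copy()
            err = R[i, j] - Ui @ V[j]
            U[i] += lr * (err * V[j] - reg * Ui)
            V[j] += lr * (err * Ui - reg * V[j])
    pred = U @ V.T
    rmse = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
    return pred, rmse

rng = np.random.default_rng(0)
R = rng.random((20, 15))
mask = rng.random((20, 15)) < 0.6             # ~60% of entries observed
pred, rmse = pmf(R, mask, dim=5)              # dim=5 as in the study
```

The decomposition dimension of 5 matches the setting the study reports as giving the minimum RMSE, though on this toy matrix the value itself is arbitrary.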
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 3. Candidate predictors per database.
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, usin...
This data release contains the output from an ecological analysis modeling the exposure of 214 fish taxa across the conterminous US (CONUS) to an index of surface water supply and use imbalances (SUI), the proportion of monthly gross average water supply available after accounting for climate variation and consumptive use, during their spawning months, hereafter referred to as spawning exposure. SUI were calculated in Miller and others (2024) by combining the monthly water balance from water supply and human consumptive uses for CONUS from water years 2010-2020 at the HUC12 scale. Water supply inputs were generated from two physically-based hydrologic models, and consumptive water use was calculated from three separate national models for agricultural irrigation, thermoelectric power generation, and public supply. Water budgets were routed through the surface water flow network (to allow for upstream consumptive uses to affect downstream water availability) and used to determine potential water limitations for human populations and fish taxa. We overlaid water supply imbalances with the modeled ranges of 241 fish taxa, including Species of Greatest Conservation Need, recreationally important, and common native taxa. SUI were evaluated within each HUC12 and weighted by the mean probability of spawning in each month for each taxon. Our analyses indicated that multiple taxa have notable proportions of their habitats exposed to high or severe water imbalances during spawning, especially the federally-listed Arkansas River shiner. This analysis can be used to identify fish taxa particularly exposed to water availability issues, specifically from surface water supply and use imbalances, during the physiologically important spawning period. However, this analysis did not consider taxon-level differences in sensitivity to limited water supply.
This data release contains five tabular datasets in comma-separated values (.csv), covering a tabular data dictionary, input data supporting analysis, raw analysis output, and summarized versions at two spatial scales for convenience. They are: 1) data_dictionary.csv - A data dictionary containing entity and attribute information about variable names, descriptions, types, ranges, and unique values for easy access. 2) SpawningExposure_TaxaSpawningWeights.csv - Dataset used to weigh spawning months for each taxon in calculation of the spawning exposure. Derived from Frimpong and Angermeier, 2011. 3) SpawningExposure_SUI_HUC12.csv - CONUS level dataset of spawning exposure to SUI from 2010-2020 for each fish taxa reported for each HUC12 where they are present. 4) SpawningExposure_SUI_CONUS_Summary.csv - A summary of spawning exposure to SUI by fish taxa, for the entire habitat range in CONUS, the range-averaged SUI exposure and percentage of habitat in each SUI category class. 5) SpawningExposure_SUI_Regional_Summary.csv - Summaries of spawning exposure to SUI by fish taxa, for each Van Metre (2020) hydrologic region, the region range-average exposure and percentage of the region's habitat range in each SUI category class.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Multitask prediction of bioactivities is often faced with challenges relating to the sparsity of data and imbalance between different labels. We propose class-conditional (Mondrian) conformal predictors using underlying Macau models as a novel approach for large-scale bioactivity prediction. This approach handles both high degrees of missing data and label imbalances while still producing high-quality predictive models. When applied to ten assay end points from PubChem, the approach generated valid models with an efficiency of 74.0–80.1% at the 80% confidence level, with similar performance for both the minority and majority class. Also, when deleting progressively larger portions of the available data (0–80%), the performance of the models remained robust, with only minor deterioration (reduction in efficiency between 5 and 10%). Compared to using Macau without conformal prediction, the method presented here significantly improves performance on imbalanced data sets.
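The class-conditional (Mondrian) construction can be sketched as follows (a hypothetical numpy illustration, not the Macau-based models of the study: calibrating a separate nonconformity threshold per class is what gives validity separately for the minority and majority class):

```python
import numpy as np

def mondrian_thresholds(cal_scores, cal_labels, alpha=0.2):
    """Per-class nonconformity thresholds at confidence 1 - alpha.

    cal_scores: nonconformity score of each calibration example for its
    own class (higher = stranger). Thresholds are class-conditional, so
    the coverage guarantee holds within each class independently."""
    thresholds = {}
    for c in np.unique(cal_labels):
        s = np.sort(cal_scores[cal_labels == c])
        k = int(np.ceil((len(s) + 1) * (1 - alpha))) - 1
        thresholds[c] = s[min(k, len(s) - 1)]
    return thresholds

def predict_set(score_per_class, thresholds):
    """Include label c when its score passes class c's own threshold."""
    return {c for c, t in thresholds.items() if score_per_class[c] <= t}

rng = np.random.default_rng(0)
cal_labels = np.array([0] * 180 + [1] * 20)   # imbalanced calibration set
cal_scores = np.where(cal_labels == 0,
                      rng.random(200) * 0.5,  # class 0 (majority)
                      rng.random(200) * 0.8)  # class 1 (minority)
th = mondrian_thresholds(cal_scores, cal_labels, alpha=0.2)  # 80% confidence
labels = predict_set({0: 0.1, 1: 0.9}, th)
```

Because the minority class calibrates against its own scores only, it cannot be drowned out by the majority class, which is the property the abstract highlights for imbalanced data sets.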