Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Whether a defect prediction model is trained on balanced or imbalanced datasets has a large impact on its ability to discover future defects. Current resampling techniques address only the class imbalance, without considering the redundancy and noise inherent to imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique generates balanced datasets by increasing data diversity, which in turn reduces redundancy and noise. KCO first maps multidimensional features into two-dimensional features using Kernel Principal Component Analysis (KPCA). KCO then partitions the resulting data distribution with spectral clustering to select the best region for interpolation. Lastly, KCO generates new defect data by interpolating different data templates within the selected data clusters. In the prediction evaluation conducted, KCO produced average F-scores ranging from 21% to 63% across six datasets. The experimental results presented in this study show that KCO provides more effective prediction performance than the baseline techniques, consistently achieving higher F-scores in both within-project and cross-project prediction.
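KCO's final step, crossover interpolation within a selected cluster, can be sketched in a few lines of numpy (a toy illustration only: the KPCA projection and spectral-clustering steps are omitted, and `crossover_interpolate` is a hypothetical name, not the authors' code):

```python
import numpy as np

def crossover_interpolate(cluster, n_new, rng=None):
    """Generate synthetic minority samples by interpolating between
    randomly paired 'template' samples within one cluster."""
    rng = np.random.default_rng(rng)
    n = len(cluster)
    # pick two parent templates for each synthetic sample
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    lam = rng.random((n_new, 1))              # interpolation weight in [0, 1)
    return cluster[i] + lam * (cluster[j] - cluster[i])

minority = np.random.default_rng(0).normal(size=(30, 2))  # toy 2-D KPCA output
synthetic = crossover_interpolate(minority, n_new=70, rng=1)
balanced = np.vstack([minority, synthetic])               # 100 samples total
```

Because each synthetic point is a convex combination of two real samples, it stays inside the per-dimension range of the cluster, which is what keeps the generated data from drifting into noisy regions.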
This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. It reports the results of ensemble models at thresholds from [0, 0.05, 0.1, ..., 0.95, 1.0], in terms of overall AP/AR as well as AP/AR specific to the discrete and continuous groups. This dataset serves as a reference for practitioners seeking the ensemble threshold that best fits their business needs.
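A threshold sweep of this kind can be illustrated with a minimal numpy sketch (hypothetical code, using plain per-threshold precision/recall as a simplified stand-in for the AP/AR reported in the dataset):

```python
import numpy as np

def precision_recall_sweep(scores, labels, thresholds):
    """Compute precision/recall at each ensemble score threshold."""
    out = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((labels == 1).sum(), 1)
        out.append((t, precision, recall))
    return out

thresholds = np.arange(0.0, 1.0001, 0.05)         # [0, 0.05, ..., 0.95, 1.0]
rng = np.random.default_rng(42)
labels = (rng.random(2000) < 0.1).astype(int)     # imbalanced: ~10% positive
scores = np.clip(labels * 0.6 + rng.random(2000) * 0.5, 0, 1)
results = precision_recall_sweep(scores, labels, thresholds)
```

Scanning the resulting (threshold, precision, recall) triples is how a practitioner would pick the operating point that matches their cost of false positives versus false negatives.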
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Technical Debt (TD) prediction is crucial to preventing software quality degradation and increased maintenance costs. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but imbalanced TD datasets can have a negative impact on ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority-class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored, owing to concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction, utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with oversampling techniques widely used in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques yields further performance improvement compared to applying either technique alone. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.
This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.
File list:
X.csv, Y.csv: the datasets for the study, used in the notebook below.
under_over_sampling_scripts.ipynb: scripts that reproduce all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
Results_for_all_tables.csv: a CSV file that summarizes all the results obtained from the study.
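The combined under-plus-oversampling idea can be sketched with a toy numpy function (the study evaluates established resampling techniques; this naive version, with a hypothetical name, merely random-samples each class toward a midpoint):

```python
import numpy as np

def hybrid_resample(X, y, rng=None):
    """Naive hybrid resampling: undersample the majority class and
    oversample the minority class so both meet at the midpoint size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.mean())               # midpoint between class sizes
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # undersample (without replacement) or oversample (with replacement)
        pick = rng.choice(members, size=target, replace=n < target)
        idx.append(pick)
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(40).reshape(20, 2).astype(float)
y = np.array([0] * 16 + [1] * 4)              # 4:1 imbalance
Xr, yr = hybrid_resample(X, y, rng=0)         # both classes resampled to 10
```

Meeting in the middle discards fewer majority instances than pure undersampling while duplicating fewer minority instances than pure oversampling, which mirrors the synergy the study reports.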
Instantaneous system imbalance and net regulation volume (and its components) in Elia’s control area. All published values are non-validated values and can only be used for information purposes. This dataset contains data until 21/05/2024 (before MARI local go-live).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance and concept drift in isolation, and only a few have considered the two issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy: it divides each data chunk into multiple balanced sample pairs based on the differences in information entropy between the classes in the chunk. Additionally, we propose a density-based sampling method that improves the accuracy of classifying minority-class samples by separating them into high-quality samples and common samples according to the density of similar samples; high-quality and common samples are then randomly selected for training the classifier. Finally, to address concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of each classifier, adjusting the weight of each sub-classifier according to its performance on the arriving data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic and one real-world data stream.
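The chunk-balancing step can be sketched as follows (a hypothetical numpy illustration, not the authors' implementation: class entropy is measured per binary chunk, and the majority class is split so that each resulting sample pair is balanced):

```python
import numpy as np

def class_entropy(y):
    """Shannon entropy (bits) of the class distribution in a data chunk."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def balanced_pairs(X, y, rng=None):
    """Split a binary chunk into multiple balanced (majority-subset,
    minority) training pairs instead of discarding majority samples."""
    rng = np.random.default_rng(rng)
    maj, mnr = (0, 1) if (y == 0).sum() >= (y == 1).sum() else (1, 0)
    maj_idx = rng.permutation(np.flatnonzero(y == maj))
    min_idx = np.flatnonzero(y == mnr)
    k = len(maj_idx) // len(min_idx)          # number of balanced pairs
    pairs = []
    for part in np.array_split(maj_idx[: k * len(min_idx)], k):
        sel = np.concatenate([part, min_idx])
        pairs.append((X[sel], y[sel]))
    return pairs

X = np.random.default_rng(0).normal(size=(120, 3))
y = np.array([0] * 100 + [1] * 20)            # 5:1 imbalanced chunk
pairs = balanced_pairs(X, y, rng=1)           # five balanced pairs of 40
```

Training one sub-classifier per balanced pair keeps all majority information in the ensemble without letting any single learner see an imbalanced view of the chunk.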
Imbalance prices applied for balance responsible parties (BRPs) settlement. One-minute imbalance prices are published as fast as possible and are never validated. The 1-min prices give an indication of the final imbalance price of the ISP (imbalance settlement period, which is 15 min). Contains the historical data and is refreshed daily. This dataset contains data from 22/05/2024 (MARI local go-live) on.
Imbalance prices used for balance responsible parties (BRPs) settlement. When imbalance prices are published on a quarter-hourly basis, the published prices have not yet been validated and can therefore only be used as an indication of the imbalance price. Only after the published prices have been validated can they be used for invoicing purposes. The records for month M are validated after the 15th of month M+1. Contains the historical data and is refreshed daily. This dataset contains data from 22/05/2024 (MARI local go-live) on.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed experimental results of the different Prototype Generation strategies for k-Nearest Neighbour classification in multilabel data attending to the particular issues of label-level imbalance and noise:
Corresponds to Section 5.1 in the manuscript.
Noisy scenarios
Study of the noise robustness capabilities of the proposed strategies.
Individual results provided for each corpus.
Statistical tests (Friedman and Bonferroni-Dunn with significance level p < 0.01) to assess the improvement compared to the base multilabel PG strategies
Corresponds to Section 5.2 in the manuscript.
Results ignoring the Editing stage
Assessment of the relevance of the Editing stage in the general pipeline.
Individual results provided for each corpus.
Corresponds to Section 5.3 in the manuscript.
Imbalance prices used for balance responsible parties (BRPs) settlement for every quarter hour. This report contains data for the current day and is refreshed every quarter-hour. Notice that in this report we only provide non-validated data. This dataset contains data from 22/05/2024 (MARI local go-live) on.
https://spdx.org/licenses/etalab-2.0.html
Data and models used in "Bayesian joint-regression analysis of unbalanced series of on-farm trials"
NYSE American Integrated is a proprietary data feed that provides full order book depth, including every quote and order at each price level, on the American market (formerly AMEX, the American Stock Exchange). It operates on NYSE's Pillar platform and disseminates all order book activity in an order-by-order view of events, including trade executions, order modifications, cancellations, and other book updates.
NYSE American specializes in listing growing companies and is the leading exchange for small-cap stocks, as well as offering mid-cap insights. As of January 2025, it represented approximately 0.23% of the average daily volume (ADV) across all exchange-listed securities.
With L3 granularity, NYSE American Integrated captures information beyond the L1, top-of-book data available through SIP feeds, enabling accurate modeling of book imbalances, trade directionality, quote lifetimes, and more. This data includes explicit trade aggressor side, odd lots, and auction imbalances. Auction imbalances offer valuable insights into NYSE American’s opening and closing auctions by providing details like imbalance quantity, paired quantity, imbalance reference price, and book clearing price.
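For example, a common depth-imbalance measure computed from the top levels of such a feed might look like this (illustrative only; the function name, inputs, and level count are assumptions, not part of the Databento schemas):

```python
def book_imbalance(bid_sizes, ask_sizes):
    """Signed depth imbalance in [-1, 1]: +1 means all resting size is on
    the bid, -1 all on the ask, 0 a perfectly balanced (or empty) book.

    bid_sizes / ask_sizes: resting quantities at the top N price levels,
    e.g. parsed from MBP-10 snapshots.
    """
    b, a = sum(bid_sizes), sum(ask_sizes)
    return (b - a) / (b + a) if b + a else 0.0

# Example: bid-heavy book over the top 3 levels -> imbalance of 0.5
imb = book_imbalance([400, 250, 100], [150, 70, 30])
```

Signals like this are typically smoothed over time or combined with trade-aggressor data before being used for directionality modeling.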
Historical data is available for usage-based rates or with any Databento US Equities subscription. Visit our pricing page for more details or to upgrade your plan.
Asset class: Equities
Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.
Supported data encodings: DBN, CSV, JSON
Supported market data schemas: MBO, MBP-1, MBP-10, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Imbalance
Resolution: Immediate publication, nanosecond-resolution timestamps
Instantaneous system imbalance (and its components) and the area control error (ACE) in Elia’s control area. All published values are non-validated values and can only be used for information purposes. This dataset contains data from 22/05/2024 (MARI local go-live) on.
The 1 min imbalance prices are published as fast as possible and give an indication for the final imbalance price of the ISP (imbalance settlement period, which is 15 min). This report contains data for the current hour and is refreshed every minute. Notice that in this report we only provide non-validated data. This dataset contains data from 22/05/2024 (MARI local go-live) on.
Breadth of coverage: 14,160 products
Asset class(es): Equities
Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.
Supported data encodings: DBN, CSV, JSON
Supported market data schemas: MBO, MBP-1, MBP-10, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Imbalance
Resolution: Immediate publication, nanosecond-resolution timestamps
This report contains a forecast of the average quarter-hourly system imbalance in the current quarter-hour, as well as an estimated probability distribution of that average. The data reflects Elia's own forecasts of the system imbalance. Note that these forecasts can have a significant error margin and are not binding for Elia; they are shared merely for informational purposes, and under no circumstances does the publication or use of this information imply a shift of responsibility or liability towards Elia.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Because of the inherent characteristics of cumulative sequences of unbalanced data, the mining results for this kind of data are often skewed by the majority categories, degrading mining performance. To solve this problem, the performance of data cumulative-sequence mining is optimized, and an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbors of the minority samples in the unbalanced cumulative sequence are determined, and the minority samples are clustered according to the natural-nearest-neighbor relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions; the new samples are then added to the original cumulative sequence to balance it. The probability matrix decomposition method generates two Gaussian-distributed random matrices for the balanced cumulative sequence, and a linear combination of low-dimensional eigenvectors is used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weights and optimize the probability matrix decomposition algorithm, optimizing the global error as well as the more efficient single-sample errors. Experimental results show that the algorithm can effectively generate new samples, reduce the imbalance of the data cumulative sequence, and obtain more accurate mining results. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm shows good classification performance on the balanced cumulative sequence, achieving the best average ranking on the F-value, G-mean, and AUC indices.
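The probability-matrix-decomposition step can be sketched with a minimal SGD-based factorization in numpy (illustrative only: Gaussian-initialized low-dimensional factors fitted to the observed entries; the AdaBoost-style weight adjustment described above is not reproduced here):

```python
import numpy as np

def pmf(R, mask, dim=5, steps=200, lr=0.01, reg=0.05, rng=0):
    """Probabilistic matrix factorization sketch: two Gaussian-initialized
    low-dimensional factor matrices fitted to observed entries by SGD."""
    rng = np.random.default_rng(rng)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, dim))
    V = rng.normal(scale=0.1, size=(m, dim))
    rows, cols = np.nonzero(mask)
    for _ in range(steps):
        for i, j in zip(rows, cols):
            Ui = U[i].copy()
            err = R[i, j] - Ui @ V[j]
            U[i] += lr * (err * V[j] - reg * Ui)
            V[j] += lr * (err * Ui - reg * V[j])
    pred = U @ V.T
    rmse = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
    return pred, rmse

rng = np.random.default_rng(0)
R = rng.random((20, 15))
mask = rng.random((20, 15)) < 0.6             # ~60% of entries observed
pred, rmse = pmf(R, mask, dim=5)              # dim=5 as in the study
```

The decomposition dimension of 5 matches the setting the study reports as giving the minimum RMSE, though on this toy matrix the value itself is arbitrary.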
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 3. Candidate predictors per database.
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, usin...
This data release contains the output from an ecological analysis modeling the exposure of 214 fish taxa across the conterminous US (CONUS) to an index of surface water supply and use imbalances (SUI), the proportion of monthly gross average water supply available after accounting for climate variation and consumptive use, during their spawning months, hereafter referred to as spawning exposure. SUI were calculated in Miller and others (2024) by combining the monthly water balance from water supply and human consumptive uses for CONUS from water years 2010-2020 at the HUC12 scale. Water supply inputs were generated from two physically-based hydrologic models, and consumptive water use was calculated from three separate national models for agricultural irrigation, thermoelectric power generation, and public supply. Water budgets were routed through the surface water flow network (to allow for upstream consumptive uses to affect downstream water availability) and used to determine potential water limitations for human populations and fish taxa. We overlaid water supply imbalances with the modeled ranges of 241 fish taxa, including Species of Greatest Conservation Need, recreationally important, and common native taxa. SUI were evaluated within each HUC12 and weighted by the mean probability of spawning in each month for each taxon. Our analyses indicated that multiple taxa have notable proportions of their habitats exposed to high or severe water imbalances during spawning, especially the federally-listed Arkansas River shiner. This analysis can be used to identify fish taxa particularly exposed to water availability issues, specifically from surface water supply and use imbalances, during the physiologically important spawning period. However, this analysis did not consider taxon-level differences in sensitivity to limited water supply.
This data release contains five tabular datasets in comma-separated values (.csv), covering a tabular data dictionary, input data supporting analysis, raw analysis output, and summarized versions at two spatial scales for convenience. They are: 1) data_dictionary.csv - A data dictionary containing entity and attribute information about variable names, descriptions, types, ranges, and unique values for easy access. 2) SpawningExposure_TaxaSpawningWeights.csv - Dataset used to weigh spawning months for each taxon in calculation of the spawning exposure. Derived from Frimpong and Angermeier, 2011. 3) SpawningExposure_SUI_HUC12.csv - CONUS level dataset of spawning exposure to SUI from 2010-2020 for each fish taxa reported for each HUC12 where they are present. 4) SpawningExposure_SUI_CONUS_Summary.csv - A summary of spawning exposure to SUI by fish taxa, for the entire habitat range in CONUS, the range-averaged SUI exposure and percentage of habitat in each SUI category class. 5) SpawningExposure_SUI_Regional_Summary.csv - Summaries of spawning exposure to SUI by fish taxa, for each Van Metre (2020) hydrologic region, the region range-average exposure and percentage of the region's habitat range in each SUI category class.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Multitask prediction of bioactivities is often faced with challenges relating to the sparsity of data and imbalance between different labels. We propose class-conditional (Mondrian) conformal predictors using underlying Macau models as a novel approach for large-scale bioactivity prediction. This approach handles both high degrees of missing data and label imbalances while still producing high-quality predictive models. When applied to ten assay end points from PubChem, the approach generated valid models with an efficiency of 74.0–80.1% at the 80% confidence level, with similar performance for both the minority and majority class. Also, when deleting progressively larger portions of the available data (0–80%), the performance of the models remained robust, with only minor deterioration (reduction in efficiency between 5 and 10%). Compared to using Macau without conformal prediction, the method presented here significantly improves performance on imbalanced data sets.
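The class-conditional (Mondrian) construction can be sketched as follows (a hypothetical numpy illustration, not the Macau-based models of the study: calibrating a separate nonconformity threshold per class is what gives validity separately for the minority and majority class):

```python
import numpy as np

def mondrian_thresholds(cal_scores, cal_labels, alpha=0.2):
    """Per-class nonconformity thresholds at confidence 1 - alpha.

    cal_scores: nonconformity score of each calibration example for its
    own class (higher = stranger). Thresholds are class-conditional, so
    the coverage guarantee holds within each class independently."""
    thresholds = {}
    for c in np.unique(cal_labels):
        s = np.sort(cal_scores[cal_labels == c])
        k = int(np.ceil((len(s) + 1) * (1 - alpha))) - 1
        thresholds[c] = s[min(k, len(s) - 1)]
    return thresholds

def predict_set(score_per_class, thresholds):
    """Include label c when its score passes class c's own threshold."""
    return {c for c, t in thresholds.items() if score_per_class[c] <= t}

rng = np.random.default_rng(0)
cal_labels = np.array([0] * 180 + [1] * 20)   # imbalanced calibration set
cal_scores = np.where(cal_labels == 0,
                      rng.random(200) * 0.5,  # class 0 (majority)
                      rng.random(200) * 0.8)  # class 1 (minority)
th = mondrian_thresholds(cal_scores, cal_labels, alpha=0.2)  # 80% confidence
labels = predict_set({0: 0.1, 1: 0.9}, th)
```

Because the minority class calibrates against its own scores only, it cannot be drowned out by the majority class, which is the property the abstract highlights for imbalanced data sets.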