83 datasets found
  1. Imbalanced class datasets.

    • plos.figshare.com
    xls
    Updated Apr 11, 2024
    Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud (2024). Imbalanced class datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0299585.t001
    Available download formats: xls
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whether a defect prediction model is built on balanced or imbalanced datasets has a big impact on the discovery of future defects. Current resampling techniques address only the imbalance, without taking into consideration the redundancy and noise inherent in imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique aims to generate balanced datasets by increasing data diversity in order to reduce redundancy and noise. KCO first maps multidimensional features to two-dimensional features by employing Kernel Principal Component Analysis (KPCA). KCO then divides the plotted data distribution by deploying spectral clustering to select the best region for interpolation. Lastly, KCO generates the new defect data by interpolating different data templates within the selected data clusters. According to the prediction evaluation conducted, KCO consistently produced F-scores ranging from 21% to 63% across six datasets, on average. According to the experimental results presented in this study, KCO provides more effective prediction performance than other baseline techniques. The experimental results show that KCO consistently achieves higher F-scores, especially in within-project and cross-project prediction.

  2. Simulation Results on the Effect of Ensemble on Data Imbalance

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Yang, Yu (2023). Simulation Results on the Effect of Ensemble on Data Imbalance [Dataset]. https://search.dataone.org/view/sha256%3Ae6de30d2f7aa0db00837a402e7377acc36de959159760d2900103285cc862392
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Yang, Yu
    Description

    This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. It shows the result of ensemble models with threshold from [0, 0.05, 0.1, ..., 0.95, 1.0], in terms of the overall AP/AR and discrete (continuous) specific AP/AR. This dataset serves as a reference for practitioners to find the appropriate ensemble threshold that fits their business needs the best.
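
    The threshold sweep described above can be illustrated with a minimal sketch. The scores and labels below are hypothetical, `precision_recall_at` is an illustrative helper, and the dataset's own AP/AR definitions may differ from the plain precision and recall used here.

```python
def threshold_grid(step=0.05):
    """Thresholds 0, 0.05, ..., 1.0, built from integers to avoid float drift."""
    n = round(1 / step)
    return [round(i * step, 2) for i in range(n + 1)]


def precision_recall_at(scores, labels, threshold):
    """Precision and recall when score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ensemble scores, e.g. the fraction of member models voting positive.
scores = [0.9, 0.8, 0.35, 0.3, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0]
for t in threshold_grid():
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

    Raising the threshold usually trades recall for precision, which is the kind of trade-off a practitioner would inspect when choosing a threshold.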

  3. Data from: Less is More: An Empirical Study of Undersampling Techniques for...

    • figshare.com
    zip
    Updated May 20, 2024
    Gichan Lee (2024). Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.22708036.v1
    Available download formats: zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    figshare
    Authors
    Gichan Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical Debt (TD) prediction is crucial to preventing software quality degradation and maintenance cost increase. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but the imbalanced TD datasets can have a negative impact on ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored due to concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction by utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with widely used oversampling techniques in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques leads to a synergy of further performance improvement compared to applying each technique exclusively. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.

    This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.

    File list:
    - X.csv, Y.csv: the datasets for the study, used in the ipynb file below.
    - under_over_sampling_scripts.ipynb: scripts that can obtain all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
    - Results_for_all_tables.csv: a csv file that summarizes all the results obtained from the study.
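
    As a rough illustration of what undersampling does, the sketch below implements plain random undersampling in standard-library Python. It is not taken from the replication package: the study compares several undersampling techniques, and the toy data and the function name `random_undersample` are assumptions made for this example.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Toy data: 2 minority rows (label 1) and 8 majority rows (label 0).
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8
X_bal, y_bal = random_undersample(X, y)
print(Counter(y_bal))
```

    The information-loss concern mentioned in the description is visible here: six of the eight majority rows are discarded.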

  4. Current system imbalance (Historical data - up to 22/05/2024)

    • opendata.elia.be
    • external-elia.opendatasoft.com
    • +1more
    csv, excel, json
    Updated Aug 23, 2024
    + more versions
    (2024). Current system imbalance (Historical data - up to 22/05/2024) [Dataset]. https://opendata.elia.be/explore/dataset/ods045/
    Available download formats: excel, json, csv
    Dataset updated
    Aug 23, 2024
    Description

    Instantaneous system imbalance and net regulation volume (and its components) in Elia’s control area. All published values are non-validated values and can only be used for information purposes. This dataset contains data until 21/05/2024 (before MARI local go-live).

  5. S5 Dataset -

    • plos.figshare.com
    xlsx
    Updated Dec 13, 2024
    + more versions
    JiaMing Gong; MingGang Dong (2024). S5 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s005
    Available download formats: xlsx
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    PLOS ONE
    Authors
    JiaMing Gong; MingGang Dong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance and concept drift separately, and only a few consider both issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) to consider data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balanced strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in the information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method to improve the accuracy of classifying minority class samples into high-quality samples and common samples via the density of similar samples. In this manner, high-quality and common samples are randomly selected for training the classifier. Finally, to solve the issue of concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of the sub-classifier according to the performance on the arrived data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms considering four synthetic and one real-world data streams.
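
    The description does not spell out how EDAC computes its entropy-based balance measure, so the sketch below only shows the general idea it builds on: the Shannon entropy of the label distribution in one data chunk, which is highest when classes are balanced. The function name and the toy chunks are assumptions for illustration.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in one data chunk."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


balanced_chunk = [0, 1, 0, 1, 0, 1, 0, 1]
skewed_chunk = [0] * 7 + [1]
print(class_entropy(balanced_chunk))
print(class_entropy(skewed_chunk))
```

    A chunk whose entropy falls well below the maximum for its number of classes is a candidate for rebalancing before training.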

  6. Imbalance price per minute (Historical data as of 22/05/2024)

    • external-elia.opendatasoft.com
    • opendata.elia.be
    • +1more
    csv, excel, json
    Updated Jul 19, 2024
    + more versions
    (2024). Imbalance price per minute (Historical data as of 22/05/2024) [Dataset]. https://external-elia.opendatasoft.com/explore/dataset/ods133/api/
    Available download formats: csv, json, excel
    Dataset updated
    Jul 19, 2024
    Description

    Imbalance prices applied for balance responsible parties (BRPs) settlement. One-minute imbalance prices are published as fast as possible and are never validated. The 1 min prices give an indication of the final imbalance price of the ISP (imbalance settlement period, which is 15 min). Contains the historical data and is refreshed daily. This dataset contains data from 22/05/2024 (MARI local go-live) on.

  7. Imbalance prices per quarter-hour (Historical data as of 22/05/2024)

    • external-elia.opendatasoft.com
    • opendata.elia.be
    • +1more
    csv, excel, json
    Updated Nov 27, 2024
    + more versions
    (2024). Imbalance prices per quarter-hour (Historical data as of 22/05/2024) [Dataset]. https://external-elia.opendatasoft.com/explore/dataset/ods134/api/
    Available download formats: json, excel, csv
    Dataset updated
    Nov 27, 2024
    Description

    Imbalance prices used for balancing responsible parties (BRPs) settlement. When imbalance prices are published on a quarter-hourly basis, the published prices have not yet been validated and can therefore only be used as an indication of the imbalance price. Only after the published prices have been validated can they be used for invoicing purposes. The records for month M are validated after the 15th of month M+1. Contains the historical data and is refreshed daily. This dataset contains data from 22/05/2024 (MARI local go-live) on.
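
    The validation rule stated above (records for month M are validated after the 15th of month M+1) can be turned into a small date check. This is a sketch under that rule only; the function names are hypothetical, and the actual validation day may fall later than the cutoff computed here.

```python
from datetime import date


def validation_cutoff(record_date):
    """Return the 15th of the month that follows the record's month."""
    year, month = record_date.year, record_date.month
    if month == 12:
        return date(year + 1, 1, 15)
    return date(year, month + 1, 15)


def may_be_validated(record_date, today):
    """True once 'today' is past the cutoff; before that, prices are indicative only."""
    return today > validation_cutoff(record_date)


print(validation_cutoff(date(2024, 12, 3)))
print(may_be_validated(date(2024, 6, 1), date(2024, 7, 16)))
```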

  8. Data from: Detailed results of "Insights into imbalance-aware Multilabel...

    • data.mendeley.com
    • observatorio-cientifico.ua.es
    Updated Apr 2, 2024
    Jose J. Valero-Mas (2024). Detailed results of "Insights into imbalance-aware Multilabel Prototype Generation mechanisms for k-Nearest Neighbor classification in noisy scenarios" [Dataset]. http://doi.org/10.17632/p6ytjt5rfy.1
    Dataset updated
    Apr 2, 2024
    Authors
    Jose J. Valero-Mas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Detailed experimental results of the different Prototype Generation strategies for k-Nearest Neighbour classification in multilabel data attending to the particular issues of label-level imbalance and noise:

    1. Noise-free scenarios
       - Study of the considered strategies for addressing label-level imbalance in PG scenarios without induced noise.
       - Individual results provided for each corpus.
       - Statistical tests (Friedman and Bonferroni-Dunn with significance level of p < 0.01) to assess the improvement compared to the base multilabel PG strategies.
       - Corresponds to Section 5.1 in the manuscript.

    2. Noisy scenarios
       - Study of the noise robustness capabilities of the proposed strategies.
       - Individual results provided for each corpus.
       - Statistical tests (Friedman and Bonferroni-Dunn with significance level of p < 0.01) to assess the improvement compared to the base multilabel PG strategies.
       - Corresponds to Section 5.2 in the manuscript.

    3. Results ignoring the Editing stage
       - Assessment of the relevance of the Editing stage in the general pipeline.
       - Individual results provided for each corpus.
       - Corresponds to Section 5.3 in the manuscript.

  9. Imbalance prices per quarter-hour (Near real-time)

    • opendata.elia.be
    • external-elia.opendatasoft.com
    • +1more
    csv, excel, json
    Updated Jul 19, 2024
    + more versions
    (2024). Imbalance prices per quarter-hour (Near real-time) [Dataset]. https://opendata.elia.be/explore/dataset/ods162/
    Available download formats: excel, json, csv
    Dataset updated
    Jul 19, 2024
    Description

    Imbalance prices used for balance responsible parties (BRPs) settlement for every quarter hour. This report contains data for the current day and is refreshed every quarter-hour. Notice that in this report we only provide non-validated data. This dataset contains data from 22/05/2024 (MARI local go-live) on.

  10. Supplementary Data and models : Bayesian joint-regression analysis of...

    • entrepot.recherche.data.gouv.fr
    tsv, txt
    Updated Jan 11, 2024
    + more versions
    Supplementary Data and models : Bayesian joint-regression analysis of unbalanced series of on-farm trials [Dataset]. https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/SUTZ9U
    Available download formats: tsv(102713), txt(12602)
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Michel TURBET DELOF; Isabelle GOLDRINGER; Julie DAWSON; Pierre RIVIERE; Gaëlle VAN FRANCK; Olivier DAVID
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    INRAE BAP
    METABIO
    MOBIDIV
    Description

    Data and models used in "Bayesian joint-regression analysis of unbalanced series of on-farm trials"

  11. Get small and mid-cap market data with NYSE American Integrated

    • databento.com
    csv, dbn, json
    Updated Jan 15, 2025
    Get small and mid-cap market data with NYSE American Integrated [Dataset]. https://databento.com/datasets/XASE.PILLAR
    Available download formats: csv, dbn, json
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Databento
    Time period covered
    Mar 28, 2023 - Present
    Area covered
    United States
    Description

    NYSE American Integrated is a proprietary data feed that provides full order book depth, including every quote and order at each price level, on the American market (formerly AMEX, the American Stock Exchange). It operates on NYSE's Pillar platform and disseminates all order book activity in an order-by-order view of events, including trade executions, order modifications, cancellations, and other book updates.

    NYSE American specializes in listing growing companies and is the leading exchange for small-cap stocks, as well as offering mid-cap insights. As of January 2025, it represented approximately 0.23% of the average daily volume (ADV) across all exchange-listed securities.

    With L3 granularity, NYSE American Integrated captures information beyond the L1, top-of-book data available through SIP feeds, enabling accurate modeling of the book imbalances, trade directionality, quote lifetimes, and more. This data includes explicit trade aggressor side, odd lots, and auction imbalances. Auction imbalances offer valuable insights into NYSE American’s opening and closing auctions by providing details like imbalance quantity, paired quantity, imbalance reference price, and book clearing price.

    Historical data is available for usage-based rates or with any Databento US Equities subscription. Visit our pricing page for more details or to upgrade your plan.

    Asset class: Equities

    Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.

    Supported data encodings: DBN, CSV, JSON

    Supported market data schemas: MBO, MBP-1, MBP-10, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Imbalance

    Resolution: Immediate publication, nanosecond-resolution timestamps

  12. Current system imbalance (Historical data as of 22/05/2024)

    • external-elia.opendatasoft.com
    • opendata.elia.be
    csv, excel, json
    Updated Nov 4, 2024
    + more versions
    (2024). Current system imbalance (Historical data as of 22/05/2024) [Dataset]. https://external-elia.opendatasoft.com/explore/dataset/ods126/api/
    Available download formats: csv, json, excel
    Dataset updated
    Nov 4, 2024
    Description

    Instantaneous system imbalance (and its components) and the area control error (ACE) in Elia’s control area. All published values are non-validated values and can only be used for information purposes. This dataset contains data from 22/05/2024 (MARI local go-live) on.

  13. Imbalance prices per minute (Near real-time)

    • external-elia.aws-ec2-eu-central-1.opendatasoft.com
    • opendata.elia.be
    • +1more
    csv, excel, json
    Updated Aug 23, 2024
    (2024). Imbalance prices per minute (Near real-time) [Dataset]. https://external-elia.aws-ec2-eu-central-1.opendatasoft.com/explore/dataset/ods161/api/
    Available download formats: excel, csv, json
    Dataset updated
    Aug 23, 2024
    Description

    The 1 min imbalance prices are published as fast as possible and give an indication of the final imbalance price of the ISP (imbalance settlement period, which is 15 min). This report contains data for the current hour and is refreshed every minute. Notice that in this report we only provide non-validated data. This dataset contains data from 22/05/2024 (MARI local go-live) on.

  14. Databento US Equities Market Data and APIs

    • databento.com
    Updated Jan 15, 2025
    Databento US Equities Market Data and APIs [Dataset]. https://databento.com/datasets/XASE.PILLAR
    Dataset updated
    Jan 15, 2025
    Time period covered
    May 1, 2018 - Present
    Area covered
    United States
    Description

    Breadth of coverage: 14,160 products

    Asset class(es): Equities

    Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.

    Supported data encodings: DBN, CSV, JSON

    Supported market data schemas: MBO, MBP-1, MBP-10, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Imbalance

    Resolution: Immediate publication, nanosecond-resolution timestamps

  15. System imbalance forecast current quarter hour (near real-time)

    • external-elia.opendatasoft.com
    • opendata.elia.be
    • +1more
    csv, excel, json
    Updated Jan 11, 2023
    + more versions
    (2023). System imbalance forecast current quarter hour (near real-time) [Dataset]. https://external-elia.opendatasoft.com/explore/dataset/ods136/?flg=nl
    Available download formats: csv, excel, json
    Dataset updated
    Jan 11, 2023
    Description

    This report contains a forecast of the average quarter-hourly system imbalance in the current quarter-hour as well as an estimated probability distribution of the average quarter-hourly system imbalance in the current quarter hour. The data reflects Elia's own forecasts of the system imbalance. It must be noted that these forecasts can have a significant error margin, are not binding for Elia and are therefore merely shared for informational purposes and that under no circumstances the publication or the use of this information imply a shift in responsibility or liability towards Elia.

  16. Confusion matrix.

    • figshare.com
    xls
    Updated Jul 7, 2023
    + more versions
    Shaoxia Mou; Heming Zhang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t002
    Available download formats: xls
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shaoxia Mou; Heming Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the inherent characteristics of accumulation sequence of unbalanced data, the mining results of this kind of data are often affected by a large number of categories, resulting in the decline of mining performance. To solve the above problems, the performance of data cumulative sequence mining is optimized. The algorithm for mining cumulative sequence of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbor of a few samples in the unbalanced data cumulative sequence is determined, and the few samples in the unbalanced data cumulative sequence are clustered according to the natural nearest neighbor relationship. In the same cluster, new samples are generated from the core points of dense regions and non core points of sparse regions, and then new samples are added to the original data accumulation sequence to balance the data accumulation sequence. The probability matrix decomposition method is used to generate two random number matrices with Gaussian distribution in the cumulative sequence of balanced data, and the linear combination of low dimensional eigenvectors is used to explain the preference of specific users for the data sequence; At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weight and optimize the probability matrix decomposition algorithm. Experimental results show that the algorithm can effectively generate new samples, improve the imbalance of data accumulation sequence, and obtain more accurate mining results. Optimizing global errors as well as more efficient single-sample errors. When the decomposition dimension is 5, the minimum RMSE is obtained. The proposed algorithm has good classification performance for the cumulative sequence of balanced data, and the average ranking of index F value, G mean and AUC is the best.

  17. Additional file 3 of Impact of random oversampling and random undersampling...

    • springernature.figshare.com
    • figshare.com
    xlsx
    Updated Aug 18, 2024
    Cynthia Yang; Egill A. Fridgeirsson; Jan A. Kors; Jenna M. Reps; Peter R. Rijnbeek (2024). Additional file 3 of Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data [Dataset]. http://doi.org/10.6084/m9.figshare.26660464.v1
    Available download formats: xlsx
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    figshare
    Authors
    Cynthia Yang; Egill A. Fridgeirsson; Jan A. Kors; Jenna M. Reps; Peter R. Rijnbeek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. Candidate predictors per database.

  18. Data from: Exploring deep learning techniques for wild animal behaviour...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 23, 2024
    Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.2ngf1vhwk
    Available download formats: zip
    Dataset updated
    Jan 23, 2024
    Dataset provided by
    Dryad
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    Time period covered
    2023
    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, usin...

  19. Exposure to surface water supply and use imbalance during spawning months...

    • s.cnmilf.com
    • gimi9.com
    Updated Feb 22, 2025
    U.S. Geological Survey (2025). Exposure to surface water supply and use imbalance during spawning months for 214 fish taxa across the conterminous United States [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/exposure-to-surface-water-supply-and-use-imbalance-during-spawning-months-for-214-fish-tax
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This data release contains the output from an ecological analysis modeling the exposure of 214 fish taxa across the conterminous US (CONUS) to an index of surface water supply and use imbalances (SUI), the proportion of monthly gross average water supply available after accounting for climate variation and consumptive use, during their spawning months, hereafter referred to as spawning exposure. SUI were calculated in Miller and others (2024) by combining the monthly water balance from water supply and human consumptive uses for CONUS from water years 2010-2020 at the HUC12 scale. Water supply inputs were generated from two physically-based hydrologic models, and consumptive water use was calculated from three separate national models for agricultural irrigation, thermoelectric power generation, and public supply. Water budgets were routed through the surface water flow network (to allow for upstream consumptive uses to affect downstream water availability) and used to determine potential water limitations for human populations and fish taxa. We overlaid water supply imbalances with the modeled ranges of 241 fish taxa, including Species of Greatest Conservation Need, recreationally important, and common native taxa. SUI were evaluated within each HUC12 and specifically mean weighted based on the probability of spawning in each month for each taxa. Our analyses indicated multiple taxa having notable proportions of their habitats exposed to high or severe water imbalances during spawning, especially the federally-listed Arkansas River shiner. This analysis can be used to identify fish taxa particularly exposed to water availability issues, specifically from surface water supply and use imbalances, during the physiologically important spawning period. However, this analysis did not consider specific taxa-level differences as to the sensitivity of different taxa to limited water supply. 
    This data release contains five tabular datasets in comma-separated values (.csv), covering a tabular data dictionary, input data supporting analysis, raw analysis output, and summarized versions at two spatial scales for convenience. They are:
    1) data_dictionary.csv - A data dictionary containing entity and attribute information about variable names, descriptions, types, ranges, and unique values for easy access.
    2) SpawningExposure_TaxaSpawningWeights.csv - Dataset used to weigh spawning months for each taxon in calculation of the spawning exposure. Derived from Frimpong and Angermeier, 2011.
    3) SpawningExposure_SUI_HUC12.csv - CONUS level dataset of spawning exposure to SUI from 2010-2020 for each fish taxa reported for each HUC12 where they are present.
    4) SpawningExposure_SUI_CONUS_Summary.csv - A summary of spawning exposure to SUI by fish taxa, for the entire habitat range in CONUS, the range-averaged SUI exposure and percentage of habitat in each SUI category class.
    5) SpawningExposure_SUI_Regional_Summary.csv - Summaries of spawning exposure to SUI by fish taxa, for each Van Metre (2020) hydrologic region, the region range-average exposure and percentage of the region's habitat range in each SUI category class.
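
    The description says SUI values were mean weighted by the probability of spawning in each month. A minimal sketch of such a weighted mean is given below; the month keys, SUI values, and weights are invented for illustration, and the actual columns and computation in the data release may differ.

```python
def spawning_weighted_mean(monthly_sui, spawning_weights):
    """Mean of monthly SUI values, weighted by each month's spawning probability."""
    total = sum(spawning_weights.values())
    if total == 0:
        raise ValueError("taxon has no spawning months")
    weighted = sum(monthly_sui[m] * w for m, w in spawning_weights.items())
    return weighted / total


# Hypothetical values for one taxon in one HUC12 (months 4-6 = April-June).
monthly_sui = {4: 0.8, 5: 0.6, 6: 0.2}
spawning_weights = {4: 0.25, 5: 0.5, 6: 0.25}
print(spawning_weighted_mean(monthly_sui, spawning_weights))
```

    Months with a higher spawning probability pull the exposure value toward their SUI, so a shortage in the peak spawning month counts for more than one at the edge of the season.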

  20. Data from: Multitask Modeling with Confidence Using Matrix Factorization and...

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2023
    Ulf Norinder; Fredrik Svensson (2023). Multitask Modeling with Confidence Using Matrix Factorization and Conformal Prediction [Dataset]. http://doi.org/10.1021/acs.jcim.9b00027.s001
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Ulf Norinder; Fredrik Svensson
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Multitask prediction of bioactivities is often faced with challenges relating to the sparsity of data and imbalance between different labels. We propose class conditional (Mondrian) conformal predictors using underlying Macau models as a novel approach for large scale bioactivity prediction. This approach handles both high degrees of missing data and label imbalances while still producing high quality predictive models. When applied to ten assay end points from PubChem, the models generated valid models with an efficiency of 74.0–80.1% at the 80% confidence level with similar performance both for the minority and majority class. Also when deleting progressively larger portions of the available data (0–80%) the performance of the models remained robust with only minor deterioration (reduction in efficiency between 5 and 10%). Compared to using Macau without conformal prediction the method presented here significantly improves the performance on imbalanced data sets.
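
    The class-conditional (Mondrian) idea can be sketched in a few lines: each class is calibrated only against its own calibration examples, so a small minority class is not swamped by the majority. The sketch below is a generic illustration and not the authors' Macau-based pipeline; the nonconformity measure (one minus the predicted class probability), the calibration scores, and the class names are all assumptions.

```python
def mondrian_p_value(cal_scores, test_score):
    """p-value from the nonconformity scores of one class's calibration examples."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_by_class, test_probs, significance=0.2):
    """Labels whose class-conditional p-value exceeds the significance level."""
    labels = set()
    for label, cal_scores in cal_by_class.items():
        score = 1.0 - test_probs[label]  # assumed nonconformity measure
        if mondrian_p_value(cal_scores, score) > significance:
            labels.add(label)
    return labels


# Hypothetical calibration nonconformity scores, kept separate per class.
cal_by_class = {
    "active": [0.1, 0.3, 0.4, 0.6],  # minority class, few examples
    "inactive": [0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.7],
}
print(prediction_set(cal_by_class, {"active": 0.75, "inactive": 0.25}))
```

    A significance of 0.2 corresponds to the 80% confidence level quoted in the description. A prediction set with exactly one label counts as an efficient prediction in the usual conformal sense.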

Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud (2024). Imbalanced class datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0299585.t001

Imbalanced class datasets.

171 scholarly articles cite this dataset.