72 datasets found
  1. Imbalanced class datasets.

    • plos.figshare.com
    xls
    Updated Apr 11, 2024
    Cite
    Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud (2024). Imbalanced class datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0299585.t001
    Available download formats: xls
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The performance of a defect prediction model trained on balanced versus imbalanced datasets has a large impact on the discovery of future defects. Current resampling techniques address only the imbalance itself, without taking into consideration the redundancy and noise inherent in imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique aims to generate balanced datasets by increasing data diversity in order to reduce redundancy and noise. KCO first projects the multidimensional features onto two dimensions by employing Kernel Principal Component Analysis (KPCA). KCO then divides the plotted data distribution by deploying spectral clustering to select the best region for interpolation. Lastly, KCO generates new defect data by interpolating different data templates within the selected data clusters. According to the prediction evaluation conducted, KCO consistently produced average F-scores ranging from 21% to 63% across six datasets. According to the experimental results presented in this study, KCO provides more effective prediction performance than the baseline techniques. In particular, KCO consistently achieves higher F-scores in both within-project and cross-project prediction.
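    The last step of the description above (generating new minority samples by interpolating within a selected cluster) can be illustrated with a generic sketch. This is plain linear interpolation between pairs of minority samples, not the authors' crossover interpolation, and the KPCA projection and spectral clustering steps are omitted. The function name and the toy data are hypothetical.

```python
import random


def interpolate_minority(samples, n_new, seed=0):
    """Create n_new synthetic points by linear interpolation.

    Each new point lies on the segment between two distinct samples
    drawn at random from `samples` (assumed to be one minority cluster).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)   # two distinct parent samples
        t = rng.random()                # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Toy 2-D minority cluster (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = interpolate_minority(minority, n_new=5)
```

    Because every synthetic point is a convex combination of two existing samples, it stays inside the region spanned by the cluster, which is why the choice of cluster matters for the quality of the generated data.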

  2. Dataset: The effects of class balance on the training energy consumption of...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 18, 2024
    Cite
    Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
    Available download formats: csv
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Description

    Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They have the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one has only 4.04% of its instances belonging to class c0. This pair of datasets is therefore primarily meant for studying how class balance influences the behaviour of a machine learning model.
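    As a rough illustration of what these percentages mean in absolute terms, the sketch below converts them into approximate per-class counts. It assumes the instance count reads as 104,952 and that the two classes are c0 and one other class; the counts are approximate because the published percentages are rounded. The function name is hypothetical.

```python
TOTAL_INSTANCES = 104_952  # assumed reading of the stated instance count


def class_counts(total, share_c0):
    """Return approximate (c0, other) instance counts for a binary dataset."""
    n_c0 = round(total * share_c0)
    return n_c0, total - n_c0


balanced = class_counts(TOTAL_INSTANCES, 0.5213)
unbalanced = class_counts(TOTAL_INSTANCES, 0.0404)

# Imbalance ratio: size of the larger class divided by the smaller one.
ratio_balanced = max(balanced) / min(balanced)
ratio_unbalanced = max(unbalanced) / min(unbalanced)

print(balanced, unbalanced)
```

    Under these assumptions the "unbalanced" set has roughly 24 instances of the other class for every c0 instance, while the "balanced" set is close to 1:1.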

  3. Confusion matrix.

    • figshare.com
    xls
    Updated Jul 7, 2023
    + more versions
    Cite
    Shaoxia Mou; Heming Zhang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t002
    Available download formats: xls
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shaoxia Mou; Heming Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the inherent characteristics of cumulative sequences of unbalanced data, the mining results for this kind of data are often affected by the large number of categories, resulting in a decline in mining performance. To solve this problem, the performance of cumulative sequence mining is optimized, and an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbor of a few samples in the unbalanced data cumulative sequence is determined, and the few samples in the unbalanced data cumulative sequence are clustered according to the natural nearest neighbor relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions, and the new samples are then added to the original data accumulation sequence to balance it. The probability matrix decomposition method is used to generate two random number matrices with Gaussian distribution in the cumulative sequence of balanced data, and the linear combination of low-dimensional eigenvectors is used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weights and optimize the probability matrix decomposition algorithm. Experimental results show that the algorithm can effectively generate new samples, reduce the imbalance of the data accumulation sequence, and obtain more accurate mining results, optimizing global errors as well as single-sample errors more efficiently. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm has good classification performance for the cumulative sequence of balanced data, and its average ranking on F value, G-mean and AUC is the best.

  4. Data from: Imbalanced dataset for benchmarking

    • dataverse.csuc.cat
    application/gzip, txt
    Updated Jul 27, 2023
    Cite
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2023). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.34810/data656
    Available download formats: txt (1592), application/gzip (42530536)
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
    License

    https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.34810/data656

    Description

    The different algorithms of the "imbalanced-learn" toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks have been proposed in Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

  5. A dataset for comparing filtering methods used to separate balanced and...

    • zenodo.org
    • data.niaid.nih.gov
    tar
    Updated Jan 3, 2023
    Cite
    C Spencer Jones; Qiyu Xiao; Ryan P Abernathey; K Shafer Smith (2023). A dataset for comparing filtering methods used to separate balanced and unbalanced flow at the surface of the Agulhas region [Dataset]. http://doi.org/10.5281/zenodo.6561068
    Available download formats: tar
    Dataset updated
    Jan 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    C Spencer Jones; Qiyu Xiao; Ryan P Abernathey; K Shafer Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises sea surface height (SSH) and velocity data at the ocean surface in two small regions near the Agulhas retroflection. The unfiltered SSH and a horizontal velocity field are provided, along with the same fields after various kinds of filtering, as described in the accompanying manuscript, "Separating balanced and unbalanced flow at the surface of the Agulhas region using Lagrangian filtering". The code repository for this work is https://github.com/cspencerjones/separating-balanced

    Two time-resolutions are provided: two weeks of hourly data and 70 days of daily data. See the manuscript for more information.

    This work was supported by NASA award 80NSSC20K1142.

  6. Multisense

    • ieee-dataport.org
    Updated Oct 9, 2024
    + more versions
    Cite
    Bouabid Marwen (2024). Multisense [Dataset]. http://doi.org/10.21227/cxy4-1136
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    IEEE Dataport
    Authors
    Bouabid Marwen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset, named MultiSense, is designed to enhance disaster response by providing comprehensive data from multiple sources. It comes in two versions: balanced and unbalanced. The dataset consists of five distinct classes, each representing different types of events or conditions:

    - Syria Earthquake: This class includes imagery and video footage related to earthquake damage. The data captures the aftermath of seismic events, showcasing various degrees of destruction.
    - Gaza War: This class contains data depicting war-related damage. It includes imagery and videos from conflict zones, highlighting the impact of warfare on infrastructure and urban areas.
    - Hurricane Harvey: This class encompasses data related to hurricane damage. It includes imagery and footage showing the effects of strong winds, flooding, and storm surges associated with hurricanes.
    - Libya Flood: This class features imagery and videos of flood damage. It documents areas affected by flooding, capturing the extent of water damage to buildings, roads, and landscapes.
    - No Damage: This class provides imagery and footage of areas with no significant damage. It serves as a control group, representing normal conditions without the impact of natural disasters or conflicts.

    The balanced version of the dataset contains an equal number of samples from each class, ensuring that the model trained on this data does not favor any particular class due to data imbalance. On the other hand, the unbalanced version reflects the real-world distribution of such events, where some types of damage may be more prevalent than others.

    Both versions of the dataset include high-resolution satellite imagery and drone footage, offering a rich and diverse set of data for training and testing machine learning models aimed at disaster detection and response.

    The balanced dataset is ideal for training models that require equal representation of each class, while the unbalanced dataset provides a more realistic scenario for model evaluation.
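    The description does not say how the balanced version was derived from the unbalanced one. One common way to obtain equal class counts is random downsampling to the size of the rarest class; the sketch below illustrates that idea only. The function name, file names, and counts are hypothetical and are not taken from the dataset.

```python
import random
from collections import defaultdict


def downsample_to_balance(labeled_items, seed=0):
    """Return a subset with the same number of items per class.

    labeled_items is a list of (item, label) pairs. Every class is
    randomly cut down to the size of the rarest class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in labeled_items:
        by_label[label].append(item)

    target = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend((item, label) for item in rng.sample(items, target))
    return balanced


# Hypothetical file names and class sizes, for illustration only.
unbalanced = (
    [(f"quake_{i}.jpg", "Syria Earthquake") for i in range(50)]
    + [(f"flood_{i}.jpg", "Libya Flood") for i in range(20)]
    + [(f"none_{i}.jpg", "No Damage") for i in range(80)]
)
balanced = downsample_to_balance(unbalanced)
```

    Downsampling discards data from the larger classes, which is one reason an unbalanced version is still useful for evaluation under a realistic class distribution.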

  7. Data on the composition of four balanced and four unbalanced series of E12.5...

    • dtechtive.com
    • find.data.gov.scot
    docx, pdf, txt, xlsx
    Updated Jun 6, 2017
    Cite
    University of Edinburgh. Edinburgh Medical School (2017). Data on the composition of four balanced and four unbalanced series of E12.5 fetal mouse chimaeras [Dataset]. http://doi.org/10.7488/ds/2056
    Available download formats: txt (0.0166 MB), docx (0.1459 MB), xlsx (0.1168 MB), pdf (0.1047 MB)
    Dataset updated
    Jun 6, 2017
    Dataset provided by
    University of Edinburgh. Edinburgh Medical School
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is numerical data used to compare the composition of different series of fetal mouse chimaeras. Eight series of chimaeras were created as matched pairs in four studies and the composition of each chimaeric conceptus was evaluated by electrophoresis of glucose phosphate isomerase (GPI) markers. These data show that BALB/c embryos tend to contribute poorly to mouse chimaeras [references 1, 3, 4] and this appears to be mediated, in part, by a maternal effect [reference 2].

    1. West, J.D., Flockhart, J.H., 1994. Genotypically unbalanced diploid -diploid foetal mouse chimaeras: possible relevance to human confined mosaicism. Genet Res 63, 87-99. DOI: https://doi.org/10.1017/S0016672300032195
    2. West, J.D., Flockhart, J.H., Kissenpfennig, A., 1995. A maternal genetic effect on the composition of mouse aggregation chimaeras. Genet Res 65, 29-40. DOI: https://doi.org/10.1017/S0016672300032985
    3. Tang, P.-C. & West, J.D., 2001. Size regulation does not cause the composition of mouse chimaeras to become unbalanced. Int. J. Dev. Biol. 45, 583-590.
    4. MacKay, G.E., Keighren, M.A., Wilson, L., Pratt, T., Flockhart, J.H., Mason, J.O., Price, D.J., West, J.D., 2005. Evaluation of the mouse TgTP6.3 tauGFP transgene as a lineage marker in chimeras. J. Anat. 206, 79-92. DOI: 10.1111/j.0021-8782.2005.00370.x

  8. Unbalanced 2 x 2 Factorial Designs and the Interaction Effect: A Troublesome...

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Johannes A. Landsheer; Godfried van den Wittenboer (2023). Unbalanced 2 x 2 Factorial Designs and the Interaction Effect: A Troublesome Combination [Dataset]. http://doi.org/10.1371/journal.pone.0121412
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Johannes A. Landsheer; Godfried van den Wittenboer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this power study, ANOVAs of unbalanced and balanced 2 x 2 datasets are compared (N = 120). Datasets are created under the assumption that H1 of the effects is true. The effects are constructed in two ways, assuming: 1. contributions to the effects solely in the treatment groups; 2. contrasting contributions in treatment and control groups. The main question is whether the two ANOVA correction methods for imbalance (applying Sums of Squares Type II or III; SS II or SS III) offer satisfactory power in the presence of an interaction. Overall, SS II showed higher power, but results varied strongly. When compared to a balanced dataset, for some unbalanced datasets the rejection rate of H0 of main effects was undesirably higher. SS III showed consistently somewhat lower power. When the effects were constructed with equal contributions from control and treatment groups, the interaction could be re-estimated satisfactorily. When an interaction was present, SS III led consistently to somewhat lower rejection rates of H0 of main effects, compared to the rejection rates found in equivalent balanced datasets, while SS II produced strongly varying results. In data constructed with only effects in the treatment groups and no effects in the control groups, the H0 of moderate and strong interaction effects was often not rejected and SS II seemed applicable. Even then, SS III provided slightly better results when a true interaction was present. ANOVA did not always allow for a satisfactory re-estimation of the unique interaction effect. Yet, SS II worked better only when an interaction effect could be excluded, whereas SS III results were just marginally worse in that case. Overall, SS III provided consistently 1 to 5% lower rejection rates of H0 in comparison with analyses of balanced datasets, while results of SS II varied too widely for general application.

  9. christine

    • openml.org
    Updated Aug 15, 2018
    Cite
    http://automl.chalearn.org (2018). christine [Dataset]. https://openml.org/d/41142
    Available format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2018
    Authors
    http://automl.chalearn.org
    Description

    SOURCE: ChaLearn Automatic Machine Learning Challenge (AutoML), ChaLearn

    This is a "supervised learning" challenge in machine learning. We are making available 30 datasets, all pre-formatted in given feature representations (this means that each example consists of a fixed number of numerical coefficients). The challenge is to solve classification and regression problems, without any further human intervention.

    The difficulty is that there is a broad diversity of data types and distributions (including balanced or unbalanced classes, sparse or dense feature representations, with or without missing values or categorical variables, various metrics of evaluation, various proportions of number of features and number of examples). The problems are drawn from a wide variety of domains and include medical diagnosis from laboratory analyses, speech recognition, credit rating, prediction of drug toxicity or efficacy, classification of text, prediction of customer satisfaction, object recognition, protein structure prediction, action recognition in video data, etc. While there exist machine learning toolkits including methods that can solve all these problems, it still takes considerable human effort to find, for a given combination of dataset, task, metric of evaluation, and available computational time, the combination of methods and hyper-parameter settings that is best suited. Your challenge is to create the "perfect black box" eliminating the human in the loop.

    This is a challenge with code submission: your code will be executed automatically on our servers to train and test your learning machines with unknown datasets. However, there is NO OBLIGATION TO SUBMIT CODE. Half of the prizes can be won by just submitting prediction results. There are six rounds (Prep, Novice, Intermediate, Advanced, Expert, and Master) in which datasets of progressive difficulty are introduced (5 per round). There is NO PREREQUISITE TO PARTICIPATE IN PREVIOUS ROUNDS to enter a new round. The rounds alternate AutoML phases in which submitted code is "blind tested" in limited time on our platform, using datasets you have never seen before, and Tweakathon phases giving you time to improve your methods by tweaking them on those datasets and running them on your own systems (without computational resource limitation).

    NOTE: This dataset corresponds to one of the datasets of the challenge.

  10. Data from: QST FST comparisons with unbalanced half-sib designs

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 15, 2014
    + more versions
    Cite
    Kimberly J. Gilbert; Michael C. Whitlock (2014). QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5061/dryad.rm574
    Available download formats: zip
    Dataset updated
    Jul 15, 2014
    Dataset provided by
    Dryad
    Authors
    Kimberly J. Gilbert; Michael C. Whitlock
    Time period covered
    2014
    Description

    QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) explicitly allow for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
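    For orientation, the sketch below computes QST from variance components using the standard textbook definition, QST = V_between / (V_between + 2 * V_A), where V_A is the additive genetic variance within populations. In a paternal half-sib design, V_A is commonly estimated as four times the among-sire variance component. This is not the QstFstComp interface and does not perform the resampling test described above; the function names and numbers are hypothetical.

```python
def additive_variance_half_sib(var_among_sires):
    """Estimate within-population additive variance from a half-sib design.

    Half-sibs share a quarter of the additive variance, so V_A is taken
    as 4 times the among-sire variance component (assumes no dominance
    or maternal effects inflating that component).
    """
    return 4.0 * var_among_sires


def qst(var_between_pops, var_additive_within):
    """QST = V_between / (V_between + 2 * V_A)."""
    return var_between_pops / (var_between_pops + 2.0 * var_additive_within)


# Hypothetical variance components, for illustration only.
v_a = additive_variance_half_sib(var_among_sires=0.25)
q = qst(var_between_pops=0.5, var_additive_within=v_a)
print(round(q, 3))
```

    A QST value computed this way is what gets compared against the mean FST of neutral markers; the statistical comparison itself requires the simulation approach referred to in the description.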

  11. Data from: The influence of balanced and imbalanced resource supply on...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 11, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The influence of balanced and imbalanced resource supply on biodiversity-functioning relationship across ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_h50d9
    Available download formats: zip
    Dataset updated
    Mar 11, 2017
    Dataset provided by
    Netherlands Institute of Ecology
    German Centre for Integrative Biodiversity Research (iDiv) (https://www.idiv.de/)
    Tohoku University
    Plymouth Marine Laboratory
    University of Maryland, College Park
    Carl von Ossietzky Universität Oldenburg
    Institute of Natural Sciences
    KU Leuven
    University of Minnesota
    Vrije Universiteit Brussel
    Ghent University
    University of Nebraska–Lincoln
    Monash University
    University of Hildesheim
    University of KwaZulu-Natal
    GEOMAR Helmholtz Centre for Ocean Research Kiel
    Michigan State University
    University of Gothenburg
    Authors
    Aleksandra M. Lewandowska; Antje Biermann; Elizabeth T. Borer; Miguel A. Cebrian-Piqueras; Steven A. J. Declerck; Luc De Meester; Ellen van Donk; Lars Gamfeldt; Daniel S. Gruner; Nicole Hagenah; W. Stanley Harpole; Kevin P. Kirkman; Christopher A. Klausmeier; Michael Kleyer; Johannes M. H. Knops; Pieter Lemmens; Eric M. Lind; Elena Litchman; Jasmin Mantilla-Contreras; Koen Martens; Sandra Meier; Vanessa Minden; Joslin L. Moore; Harry olde Venterink; Eric W. Seabloom; Ulrich Sommer; Maren Striebel; Anastasia Trenkamp; Juliane Trinogga; Jotaro Urabe; Wim Vyverman; Dedmer B. Van de Waal; Claire E. Widdicombe; Helmut Hillebrand
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Numerous studies show that increasing species richness leads to higher ecosystem productivity. This effect is often attributed to more efficient partitioning of multiple resources in communities with higher numbers of competing species, indicating the role of resource supply and stoichiometry for biodiversity–ecosystem functioning relationships. Here, we merged theory on ecological stoichiometry with a framework of biodiversity–ecosystem functioning to understand how resource use transfers into primary production. We applied a structural equation model to define patterns of diversity–productivity relationships with respect to available resources. Meta-analysis was used to summarize the findings across ecosystem types ranging from aquatic ecosystems to grasslands and forests. As hypothesized, resource supply increased realized productivity and richness, but we found significant differences between ecosystems and study types. Increased richness was associated with increased productivity, although this effect was not seen in experiments. More even communities had lower productivity, indicating that biomass production is often maintained by a few dominant species, and reduced dominance generally reduced ecosystem productivity. This synthesis, which integrates observational and experimental studies in a variety of ecosystems and geographical regions, exposes common patterns and differences in biodiversity–functioning relationships, and increases the mechanistic understanding of changes in ecosystem productivity.

  12. Data from: QST FST comparisons with unbalanced half-sib designs

    • borealisdata.ca
    • open.library.ubc.ca
    Updated May 20, 2021
    Cite
    Kimberly J. Gilbert; Michael C. Whitlock (2021). Data from: QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5683/SP2/9PBQES
    Available format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 20, 2021
    Dataset provided by
    Borealis
    Authors
    Kimberly J. Gilbert; Michael C. Whitlock
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract

    QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) explicitly allow for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.

    Usage notes

    - SourceCode_DamModel: Source code used when doing type I error testing of the balanced or unbalanced half-sib dam model (file: DamModel_WorkingCopy.R).
    - SireModel_WorkingCopy: Source code used when doing type I error testing of the unbalanced half-sib sire model.
    - TypeI_ErrorTest_DamBalanced: R code to run the error testing of the balanced half-sib dam model over 1000 replicate datasets.
    - TypeI_ErrorTest_DamUnbalanced: R code to run the error testing of the unbalanced half-sib dam model over 1000 replicate datasets.
    - TypeI_ErrorTest_SireUnbalanced: R code to run the error testing of the unbalanced half-sib sire model over 1000 replicate datasets.
    - NemoReplicates: Zipped file containing the 1000 simulated replicate datasets from Nemo used for type I error testing.

  13. S5 Dataset -

    • plos.figshare.com
    xlsx
    Updated Dec 13, 2024
    + more versions
    Cite
    JiaMing Gong; MingGang Dong (2024). S5 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s005
    Available download formats: xlsx
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    PLOS ONE
    Authors
    JiaMing Gong; MingGang Dong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, current works mostly address class imbalance or concept drift on their own, and only a few consider both issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method that improves accuracy by classifying minority class samples into high-quality samples and common samples according to the density of similar samples. High-quality and common samples are then randomly selected for training the classifier. Finally, to address concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of each sub-classifier according to its performance on the data chunks that have arrived. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic data streams and one real-world data stream.

  14. Empirical data used in the application of the paper "Genuinely Unbalanced...

    • data.4tu.nl
    • 4tu.edu.hpc.n-helix.com
    zip
    Updated Sep 9, 2024
    Cite
    Xiaoyu Meng (2024). Empirical data used in the application of the paper "Genuinely Unbalanced Spatial Panel Data Models with Fixed Effects: M-Estimation and Inference with an Application to FDI" [Dataset]. http://doi.org/10.4121/2cdc714c-6c94-454c-8719-ee8f53e0ab27.v1
    Available download formats: zip
    Dataset updated
    Sep 9, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Xiaoyu Meng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the data used in the empirical analysis of spatial spillover effects on Foreign Direct Investment (FDI) inflows across Chinese administrative divisions. The analysis employs two different model specifications: a balanced panel model and a generalized unbalanced (GU) model. Additionally, a spatial weight matrix file is provided, which is essential for modeling spatial dependencies.

  15. Raw Data for: "Inorganic synthesis-structure maps in zeolites with machine...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Oct 10, 2023
    Cite
    Daniel Schwalbe-Koda (2023). Raw Data for: "Inorganic synthesis-structure maps in zeolites with machine learning and crystallographic distances" [Dataset]. http://doi.org/10.5281/zenodo.8422373
    Available download formats: bin, application/gzip
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Schwalbe-Koda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains all the raw data to reproduce the manuscript:

    D. Schwalbe-Koda et al. "Inorganic synthesis-structure maps in zeolites with machine learning and crystallographic distances". arXiv:2307.10935 (2023)

    The raw data should be used in combination with the code hosted on GitHub: https://github.com/dskoda/Zeolites-AMD.

    Description of the data

    The data at this link contains all the information necessary to reproduce the manuscript. In combination with the code hosted on GitHub, it can be visualized and analyzed accordingly. The full description of the columns and results is available in the GitHub repository.
    The data files in this repository are:

    - `hparams_rnd_*.json`: results of the hyperparameter optimization of all classifiers studied in this work. The data was produced by randomly sampling the train-validation-test sets. In some cases, the data was normalized (`_norm_`), and the train set was kept `balanced` or `unbalanced`.
    - `hyp_dm`: distance matrix of all hypothetical zeolites towards the known zeolites
    - `hyp_predictions`: predictions of the synthesis conditions for all hypothetical zeolites
    - `xgb_ensembles*`: pickle files containing the serialized ensemble models used in the evaluation of the data in this work. The models can be loaded with the `xgboost` Python package.
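As a rough illustration of the last item, the sketch below collects the files whose names start with `xgb_ensembles` and unpickles them. The function name `load_ensembles` and the assumption that the files sit directly in one directory are ours, not the repository's. Unpickling the real files would also require the `xgboost` package to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_ensembles(data_dir, prefix="xgb_ensembles"):
    """Unpickle every file in data_dir whose name starts with prefix.

    Returns a dict mapping file name to the unpickled object.
    The helper name and directory layout are hypothetical.
    """
    models = {}
    for path in sorted(Path(data_dir).glob(prefix + "*")):
        if path.is_file():
            with path.open("rb") as handle:
                models[path.name] = pickle.load(handle)
    return models
```

Calling `load_ensembles(".")` from the directory holding the downloaded files would return one entry per matching file; what each unpickled object looks like is described in the GitHub repository, not here.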

    License

    The data and all the content from this repository are distributed under the Creative Commons Attribution 4.0 (CC-BY 4.0) license.

    This work was produced under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

    Dataset released as: LLNL-MI-854709.

  16. Data from: The Unit Re-Balancing Problem

    • zenodo.org
    bin, txt, zip
    Updated Oct 20, 2021
    Cite
    Robin Dee; Armin Fügenschuh; George Kaimakamis (2021). The Unit Re-Balancing Problem [Dataset]. http://doi.org/10.5281/zenodo.5579319
    Explore at:
    Available download formats: txt, zip, bin
    Dataset updated
    Oct 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robin Dee; Armin Fügenschuh; George Kaimakamis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The unit re-balancing problem is about a number of defensive military units distributed over a geographic area. Each unit consists of a number of components (e.g., people, armor, or equipment). A value between 0 and 1 describes the current rating of each component. By a nonlinear function this value is converted into a nominal status assessment. This allows a comparison of different components of all units. The lowest of the statuses determines the efficiency of a unit, and the highest status its cost. An unbalanced unit has a gap between these two. When too many units are unbalanced, the entire system is costly and inefficient. To re-balance the units, people and material can be transferred. The goal is to have all units equally well equipped at the lowest possible cost. On a secondary level, the cost for the re-balancing should also be minimal. We present a mixed-integer nonlinear programming formulation for this problem, which describes the potential movement of components as a multi-commodity flow. Nonlinear constraints are needed to obtain the lowest and the highest status. Since we assume that these functions are piecewise linear, we reformulate them using inequalities and binary variables. This results in a mixed-integer linear program, and numerical standard solvers are able to compute proven optimal solutions for instances with up to 100 units. The dataset consists of the models and test instances that were presented at the (virtual) 6th IMA Conference on Mathematics in Defence and Security, March 30-31, 2021.
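The relationship between component ratings, statuses, efficiency, and cost can be sketched in a few lines. This is only an illustration of the description above, not the authors' model: the breakpoints in `BREAKPOINTS`, the numeric status scale, and the function names are all hypothetical, and the actual dataset formulates the problem as a mixed-integer program rather than a direct evaluation.

```python
from bisect import bisect_right

# Hypothetical (rating, status) breakpoints of a piecewise linear status function.
BREAKPOINTS = [(0.0, 0.0), (0.5, 1.0), (0.8, 3.0), (1.0, 4.0)]


def status(rating, breakpoints=BREAKPOINTS):
    """Linearly interpolate the status for a rating between 0 and 1."""
    xs = [x for x, _ in breakpoints]
    if not xs[0] <= rating <= xs[-1]:
        raise ValueError("rating outside the breakpoint range")
    # Index of the right-hand breakpoint of the segment containing rating.
    i = min(bisect_right(xs, rating), len(xs) - 1)
    (x0, y0), (x1, y1) = breakpoints[i - 1], breakpoints[i]
    return y0 + (y1 - y0) * (rating - x0) / (x1 - x0)


def unit_summary(ratings):
    """Efficiency is the lowest status, cost the highest; the gap measures imbalance."""
    statuses = [status(r) for r in ratings]
    efficiency = min(statuses)
    cost = max(statuses)
    return {"efficiency": efficiency, "cost": cost, "gap": cost - efficiency}
```

Under these assumed breakpoints, a unit with component ratings `[0.25, 0.5, 1.0]` has statuses 0.5, 1.0, and 4.0, so its gap is 3.5, while a unit whose components all share one rating has a gap of zero.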

  17. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Nagappan, Meiyappan
    Keshavarz, Hossein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
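To check how balanced each of these files is, one could count the values in the `buggy` column. The sketch below is a minimal example using only the standard library; the helper name `class_balance` is ours, and since the description does not say how `buggy` is encoded, the code simply counts whatever string values appear.

```python
import csv
from collections import Counter


def class_balance(csv_file, label_column="buggy"):
    """Count label values in an open CSV file.

    Returns (counts, fractions), where fractions maps each label
    to its share of all rows. The helper name is hypothetical.
    """
    counts = Counter(row[label_column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    fractions = {label: n / total for label, n in counts.items()} if total else {}
    return counts, fractions
```

For example, opening `apachejit_train.csv` with `open(path, newline="")` and passing the file object to `class_balance` should show roughly equal shares if the set is balanced as described, while the two test files should show a skew.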

    In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts consist of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting the GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which uses the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. Script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden, September 15-19, 2014. 313–324

    2. PyDriller
    • https://pydriller.readthedocs.io/en/latest/

    • Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911

  18. Table_1_Association Mapping for 24 Traits Related to Protein Content, Gluten...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 5, 2023
    + more versions
    Cite
    Marina Johnson; Ajay Kumar; Atena Oladzad-Abbasabadi; Evan Salsman; Meriem Aoun; Frank A. Manthey; Elias M. Elias (2023). Table_1_Association Mapping for 24 Traits Related to Protein Content, Gluten Strength, Color, Cooking, and Milling Quality Using Balanced and Unbalanced Data in Durum Wheat [Triticum turgidum L. var. durum (Desf).].xlsx [Dataset]. http://doi.org/10.3389/fgene.2019.00717.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Marina Johnson; Ajay Kumar; Atena Oladzad-Abbasabadi; Evan Salsman; Meriem Aoun; Frank A. Manthey; Elias M. Elias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Durum wheat [Triticum durum (Desf).] is mostly used to produce pasta, couscous, and bulgur. The quality of the grain and end-use products determine its market value. However, quality tests are highly resource intensive and almost impossible to conduct in the early generations in the breeding program. Modern genomics-based tools provide an excellent opportunity to genetically dissect complex quality traits to expedite cultivar development using molecular breeding approaches. This study used a panel of 243 cultivars and advanced breeding lines developed during the last 20 years to identify SNPs associated with 24 traits related to nutritional value and quality. Genome-wide association study (GWAS) identified a total of 179 marker–trait associations (MTAs), located in 95 genomic regions belonging to all 14 durum wheat chromosomes. Major and stable QTLs were identified for gluten strength on chromosomes 1A and 1B, and for PPO activity on chromosomes 1A, 2B, 3A, and 3B. As a large amount of unbalanced phenotypic data are generated every year on advanced lines in all the breeding programs, the applicability of such a dataset for identification of MTAs remains unclear. We observed that ∼84% of the MTAs identified using a historic unbalanced dataset (belonging to a total of 80 environments collected over a period of 16 years) were also identified in a balanced dataset. This suggests the suitability of historic unbalanced phenotypic data to identify beneficial MTAs to facilitate local-knowledge-based breeding. In addition to providing extensive knowledge about the genetics of quality traits, association mapping identified several candidate markers to assist durum wheat quality improvement through molecular breeding. The molecular markers associated with important traits could be extremely useful in the development of improved quality durum wheat cultivars using marker-assisted selection (MAS).

  19. app_reviews

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    Pavel Ghazaryan (2024). app_reviews [Dataset]. https://huggingface.co/datasets/PavelGh/app_reviews
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2024
    Authors
    Pavel Ghazaryan
    Description

    Dataset Card for App Reviews

      Dataset Details
    

    App reviews labeled into 4 categories: 'Bug Report', 'Feature Request', 'Rating', 'User Experience'. Note that the files with GPT in their names were labeled by ChatGPT through prompt fine-tuning, and approximately 3% of those labels were verified through random manual checking. The files that do not contain gpt in their names are manually labeled. I have separate datasets for Balanced and Unbalanced. Additionally the gpt… See the full description on the dataset page: https://huggingface.co/datasets/PavelGh/app_reviews.

  20. Experimental measurements and uncertainty analysis for validation of the...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Apr 5, 2023
    + more versions
    Cite
    James Cale (2023). Experimental measurements and uncertainty analysis for validation of the Building Electrical Efficiency Analysis Model (BEEAM) [Dataset]. http://doi.org/10.5061/dryad.m63xsj471
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 5, 2023
    Dataset provided by
    Colorado State University
    Authors
    James Cale
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    This dataset includes experimental measurements taken on a laboratory testbed at Colorado State University that was used for model validation of a software toolkit, the Building Electrical Efficiency Analysis Model (BEEAM). This toolkit was developed for comparing electrical efficiency of AC versus DC distribution systems in buildings. The testbed emulated loads found in a small office building and included laptop computer chargers, LED lighting systems, and miscellaneous DC and AC loads. Measurements were taken under AC and DC configurations in electrically balanced and unbalanced loading conditions. Also included in the dataset is an uncertainty analysis. A complete description of the testbed, hardware, measurements and uncertainty analysis is contained in the paper cited below.

    Avpreet Othee, James Cale, Arthur Santos, Stephen Frank, Daniel Zimmerle, Omkar Ghatpande, Gerald Duggan and Daniel Gerber, “A Modeling Toolkit for Comparing AC and DC Electrical Distribution Efficiency in Buildings,” Energies, 2023 (accepted, publication in progress).

    Methods

    Data was collected using a Keysight multifunction switch measuring unit (MU) model 34980A with Keysight 34921T multiplexer and a Keysight PA2203A power analyzer.

Imbalanced class datasets. (result 1 above): 171 scholarly articles cite this dataset.
