72 datasets found

Imbalanced class datasets.
plos.figshare.com
xls
Updated Apr 11, 2024
Cite
Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud (2024). Imbalanced class datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0299585.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0299585.t001
Dataset updated
Apr 11, 2024
Dataset provided by
PLOS ONE
Authors
Ahmad Muhaimin Ismail; Siti Hafizah Ab Hamid; Asmiza Abdul Sani; Nur Nasuha Mohd Daud
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Whether a defect prediction model is built on balanced or imbalanced datasets has a large impact on the discovery of future defects. Current resampling techniques address only the class imbalance, without considering the redundancy and noise inherent in imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique aims to generate balanced datasets by increasing data diversity in order to reduce redundancy and noise. KCO first projects multidimensional features onto two dimensions by employing Kernel Principal Component Analysis (KPCA). KCO then partitions the projected data distribution with spectral clustering to select the best region for interpolation. Lastly, KCO generates the new defect data by interpolating different data templates within the selected data clusters. According to the prediction evaluation conducted, KCO consistently produced average F-scores ranging from 21% to 63% across six datasets. According to the experimental results presented in this study, KCO provides more effective prediction performance than other baseline techniques. The experimental results show that KCO consistently achieves higher F-scores, especially in within-project and cross-project predictions.
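The final interpolation step can be illustrated with a minimal sketch. This is not the KCO algorithm: it skips the KPCA projection and the spectral clustering entirely, and it uses plain linear interpolation between two randomly chosen points of one cluster as a stand-in, since the description does not define "crossover interpolation". The function name and the toy cluster are hypothetical.

```python
import random

def interpolate_minority(cluster, n_new, seed=0):
    """Create n_new synthetic points on segments between random pairs of cluster points.

    Assumption: `cluster` is a list of equal-length numeric feature vectors that all
    belong to the minority (defect) class and to the same selected cluster.
    """
    if len(cluster) < 2:
        raise ValueError("need at least two points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(cluster, 2)   # two distinct "templates"
        t = rng.random()                # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical two-dimensional cluster (for example, points after a 2-D projection).
toy_cluster = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]
new_points = interpolate_minority(toy_cluster, n_new=5)
print(new_points)
```

Every synthetic point lies inside the bounding box of the cluster, so this kind of interpolation adds samples without leaving the region that was selected.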
Dataset: The effects of class balance on the training energy consumption of...
zenodo.org
data.niaid.nih.gov
csv
Updated Mar 18, 2024
Cite
Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga (2024). Dataset: The effects of class balance on the training energy consumption of logistic regression models [Dataset]. http://doi.org/10.5281/zenodo.10823624
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10823624
Dataset updated
Mar 18, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Maria Gutierrez; Coral Calero; Félix García; Mª Ángeles Moraga
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2024
Description
Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They have the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one has only 4.04% of its instances belonging to class c0. Therefore, this set of datasets is primarily meant to study how class balance influences the behaviour of a machine learning model.
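As a rough worked example, the quoted percentages can be turned into approximate per-class counts and a majority-to-minority ratio. The sketch reads the size as 104,952 instances and rounds to whole instances; the exact counts are not stated in the description, and the helper name is illustrative.

```python
# Size quoted in the description, read as 104,952 instances.
N_INSTANCES = 104_952

def class_counts(n_instances, share_c0):
    """Return (count_c0, count_other, majority/minority ratio) for a binary dataset.

    Counts are rounded estimates derived from the published percentage.
    """
    count_c0 = round(n_instances * share_c0)
    count_other = n_instances - count_c0
    ratio = max(count_c0, count_other) / min(count_c0, count_other)
    return count_c0, count_other, ratio

balanced = class_counts(N_INSTANCES, 0.5213)     # "balanced" dataset
unbalanced = class_counts(N_INSTANCES, 0.0404)   # "unbalanced" dataset

for name, (c0, other, ratio) in [("balanced", balanced), ("unbalanced", unbalanced)]:
    print(f"{name}: c0={c0}, other={other}, ratio={ratio:.2f}")
```

Under these assumptions the "unbalanced" dataset has roughly 24 majority instances for every instance of class c0, while the "balanced" one is close to 1:1.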
Confusion matrix.
figshare.com
xls
Updated Jul 7, 2023
Cite
Shaoxia Mou; Heming Zhang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0288140.t002
Dataset updated
Jul 7, 2023
Dataset provided by
PLOS ONE
Authors
Shaoxia Mou; Heming Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Due to the inherent characteristics of accumulation sequence of unbalanced data, the mining results of this kind of data are often affected by a large number of categories, resulting in the decline of mining performance. To solve the above problems, the performance of data cumulative sequence mining is optimized. The algorithm for mining cumulative sequence of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbor of a few samples in the unbalanced data cumulative sequence is determined, and the few samples in the unbalanced data cumulative sequence are clustered according to the natural nearest neighbor relationship. In the same cluster, new samples are generated from the core points of dense regions and non core points of sparse regions, and then new samples are added to the original data accumulation sequence to balance the data accumulation sequence. The probability matrix decomposition method is used to generate two random number matrices with Gaussian distribution in the cumulative sequence of balanced data, and the linear combination of low dimensional eigenvectors is used to explain the preference of specific users for the data sequence; At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weight and optimize the probability matrix decomposition algorithm. Experimental results show that the algorithm can effectively generate new samples, improve the imbalance of data accumulation sequence, and obtain more accurate mining results. Optimizing global errors as well as more efficient single-sample errors. When the decomposition dimension is 5, the minimum RMSE is obtained. The proposed algorithm has good classification performance for the cumulative sequence of balanced data, and the average ranking of index F value, G mean and AUC is the best.
Data from: Imbalanced dataset for benchmarking
dataverse.csuc.cat
application/gzip, txt
Updated Jul 27, 2023
Cite
Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2023). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.34810/data656
Available download formats: txt (1592), application/gzip (42530536)
Unique identifier
https://doi.org/10.34810/data656
Dataset updated
Jul 27, 2023
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
License
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.34810/data656
Description
The different algorithms of the "imbalanced-learn" toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks were proposed in Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University (2011).
A dataset for comparing filtering methods used to separate balanced and...
zenodo.org
data.niaid.nih.gov
tar
Updated Jan 3, 2023
Cite
C Spencer Jones; Qiyu Xiao; Ryan P Abernathey; K Shafer Smith (2023). A dataset for comparing filtering methods used to separate balanced and unbalanced flow at the surface of the Agulhas region [Dataset]. http://doi.org/10.5281/zenodo.6561068
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.6561068
Dataset updated
Jan 3, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
C Spencer Jones; Qiyu Xiao; Ryan P Abernathey; K Shafer Smith
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises sea surface height (SSH) and velocity data at the ocean surface in two small regions near the Agulhas retroflection. The unfiltered SSH and a horizontal velocity field are provided, along with the same fields after various kinds of filtering, as described in the accompanying manuscript, Separating balanced and unbalanced flow at the surface of the Agulhas region using Lagrangian filtering. The code repository for this work is https://github.com/cspencerjones/separating-balanced .

Two time-resolutions are provided: two weeks of hourly data and 70 days of daily data. See the manuscript for more information.

This work was supported by NASA award 80NSSC20K1142.
Multisense
ieee-dataport.org
Updated Oct 9, 2024
Cite
Bouabid Marwen (2024). Multisense [Dataset]. http://doi.org/10.21227/cxy4-1136
Unique identifier
https://doi.org/10.21227/cxy4-1136
Dataset updated
Oct 9, 2024
Dataset provided by
IEEE Dataport
Authors
Bouabid Marwen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

This dataset, named MultiSense, is designed to enhance disaster response by providing comprehensive data from multiple sources. It comes in two versions: balanced and unbalanced. The dataset consists of five distinct classes, each representing different types of events or conditions:

- Syria Earthquake: This class includes imagery and video footage related to earthquake damage. The data captures the aftermath of seismic events, showcasing various degrees of destruction.
- Gaza War: This class contains data depicting war-related damage. It includes imagery and videos from conflict zones, highlighting the impact of warfare on infrastructure and urban areas.
- Hurricane Harvey: This class encompasses data related to hurricane damage. It includes imagery and footage showing the effects of strong winds, flooding, and storm surges associated with hurricanes.
- Libya Flood: This class features imagery and videos of flood damage. It documents areas affected by flooding, capturing the extent of water damage to buildings, roads, and landscapes.
- No Damage: This class provides imagery and footage of areas with no significant damage. It serves as a control group, representing normal conditions without the impact of natural disasters or conflicts.

The balanced version of the dataset contains an equal number of samples from each class, ensuring that the model trained on this data does not favor any particular class due to data imbalance. On the other hand, the unbalanced version reflects the real-world distribution of such events, where some types of damage may be more prevalent than others.

Both versions of the dataset include high-resolution satellite imagery and drone footage, offering a rich and diverse set of data for training and testing machine learning models aimed at disaster detection and response. The balanced dataset is ideal for training models that require equal representation of each class, while the unbalanced dataset provides a more realistic scenario for model evaluation.
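One common way to derive a balanced set from an unbalanced one is to downsample every class to the size of the smallest class. The sketch below illustrates that idea only; the description does not say how the balanced MultiSense version was actually produced, and the per-class counts and sample names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance_by_downsampling(samples, seed=0):
    """Return a subset of (item, label) pairs with the same number of items per label."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    target = min(len(group) for group in by_label.values())  # size of the rarest class
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))  # sample without replacement
    return balanced

# Hypothetical class sizes; the real distribution is not given in the description.
toy_counts = {
    "Syria Earthquake": 40,
    "Gaza War": 25,
    "Hurricane Harvey": 60,
    "Libya Flood": 10,
    "No Damage": 100,
}
unbalanced = [(f"{label}_{i}", label) for label, n in toy_counts.items() for i in range(n)]
balanced = balance_by_downsampling(unbalanced)
print(Counter(label for _, label in balanced))
```

Downsampling discards data from the larger classes, which is why an unbalanced copy is still useful for evaluation under a realistic class distribution.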
Data on the composition of four balanced and four unbalanced series of E12.5...
dtechtive.com
find.data.gov.scot
docx, pdf, txt, xlsx
Updated Jun 6, 2017
Cite
University of Edinburgh. Edinburgh Medical School (2017). Data on the composition of four balanced and four unbalanced series of E12.5 fetal mouse chimaeras [Dataset]. http://doi.org/10.7488/ds/2056
Available download formats: txt (0.0166 MB), docx (0.1459 MB), xlsx (0.1168 MB), pdf (0.1047 MB)
Unique identifier
https://doi.org/10.7488/ds/2056
Dataset updated
Jun 6, 2017
Dataset provided by
University of Edinburgh. Edinburgh Medical School
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is numerical data used to compare the composition of different series of fetal mouse chimaeras. Eight series of chimaeras were created as matched pairs in four studies and the composition of each chimaeric conceptus was evaluated by electrophoresis of glucose phosphate isomerase (GPI) markers. These data show that BALB/c embryos tend to contribute poorly to mouse chimaeras [references 1, 3, 4] and this appears to be mediated, in part, by a maternal effect [reference 2].

1. West, J.D., Flockhart, J.H., 1994. Genotypically unbalanced diploid -diploid foetal mouse chimaeras: possible relevance to human confined mosaicism. Genet Res 63, 87-99. DOI: https://doi.org/10.1017/S0016672300032195
2. West, J.D., Flockhart, J.H., Kissenpfennig, A., 1995. A maternal genetic effect on the composition of mouse aggregation chimaeras. Genet Res 65, 29-40. DOI: https://doi.org/10.1017/S0016672300032985
3. Tang, P.-C. & West, J.D., 2001. Size regulation does not cause the composition of mouse chimaeras to become unbalanced. Int. J. Dev. Biol. 45, 583-590.
4. MacKay, G.E., Keighren, M.A., Wilson, L., Pratt, T., Flockhart, J.H., Mason, J.O., Price, D.J., West, J.D., 2005. Evaluation of the mouse TgTP6.3 tauGFP transgene as a lineage marker in chimeras. J. Anat. 206, 79-92. DOI: 10.1111/j.0021-8782.2005.00370.x
Unbalanced 2 x 2 Factorial Designs and the Interaction Effect: A Troublesome...
figshare.com
txt
Updated Jun 2, 2023
Cite
Johannes A. Landsheer; Godfried van den Wittenboer (2023). Unbalanced 2 x 2 Factorial Designs and the Interaction Effect: A Troublesome Combination [Dataset]. http://doi.org/10.1371/journal.pone.0121412
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pone.0121412
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Johannes A. Landsheer; Godfried van den Wittenboer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this power study, ANOVAs of unbalanced and balanced 2 x 2 datasets are compared (N = 120). Datasets are created under the assumption that H1 of the effects is true. The effects are constructed in two ways, assuming: 1. contributions to the effects solely in the treatment groups; 2. contrasting contributions in treatment and control groups. The main question is whether the two ANOVA correction methods for imbalance (applying Sums of Squares Type II or III; SS II or SS III) offer satisfactory power in the presence of an interaction. Overall, SS II showed higher power, but results varied strongly. When compared to a balanced dataset, for some unbalanced datasets the rejection rate of H0 of main effects was undesirably higher. SS III showed consistently somewhat lower power. When the effects were constructed with equal contributions from control and treatment groups, the interaction could be re-estimated satisfactorily. When an interaction was present, SS III led consistently to somewhat lower rejection rates of H0 of main effects, compared to the rejection rates found in equivalent balanced datasets, while SS II produced strongly varying results. In data constructed with only effects in the treatment groups and no effects in the control groups, the H0 of moderate and strong interaction effects was often not rejected and SS II seemed applicable. Even then, SS III provided slightly better results when a true interaction was present. ANOVA did not always allow for a satisfactory re-estimation of the unique interaction effect. Yet, SS II worked better only when an interaction effect could be excluded, whereas SS III results were just marginally worse in that case. Overall, SS III provided consistently 1 to 5% lower rejection rates of H0 in comparison with analyses of balanced datasets, while results of SS II varied too widely for general application.
christine
openml.org
Updated Aug 15, 2018
Cite
http://automl.chalearn.org (2018). christine [Dataset]. https://openml.org/d/41142
Dataset updated
Aug 15, 2018
Authors
http://automl.chalearn.org
Description
SOURCE: ChaLearn Automatic Machine Learning Challenge (AutoML), ChaLearn

This is a "supervised learning" challenge in machine learning. We are making available 30 datasets, all pre-formatted in given feature representations (this means that each example consists of a fixed number of numerical coefficients). The challenge is to solve classification and regression problems, without any further human intervention.

The difficulty is that there is a broad diversity of data types and distributions (including balanced or unbalanced classes, sparse or dense feature representations, with or without missing values or categorical variables, various metrics of evaluation, various proportions of number of features and number of examples). The problems are drawn from a wide variety of domains and include medical diagnosis from laboratory analyses, speech recognition, credit rating, prediction of drug toxicity or efficacy, classification of text, prediction of customer satisfaction, object recognition, protein structure prediction, action recognition in video data, etc. While there exist machine learning toolkits including methods that can solve all these problems, it still takes considerable human effort to find, for a given combination of dataset, task, metric of evaluation, and available computational time, the combination of methods and hyper-parameter setting that is best suited. Your challenge is to create the "perfect black box" eliminating the human in the loop.

This is a challenge with code submission: your code will be executed automatically on our servers to train and test your learning machines with unknown datasets. However, there is NO OBLIGATION TO SUBMIT CODE. Half of the prizes can be won by just submitting prediction results. There are six rounds (Prep, Novice, Intermediate, Advanced, Expert, and Master) in which datasets of progressive difficulty are introduced (5 per round). There is NO PREREQUISITE TO PARTICIPATE IN PREVIOUS ROUNDS to enter a new round. The rounds alternate AutoML phases in which submitted code is "blind tested" in limited time on our platform, using datasets you have never seen before, and Tweakathon phases giving you time to improve your methods by tweaking them on those datasets and running them on your own systems (without computational resource limitation).

NOTE: This dataset corresponds to one of the datasets of the challenge.
Data from: QST FST comparisons with unbalanced half-sib designs
datadryad.org
data.niaid.nih.gov
zip
Updated Jul 15, 2014
Cite
Kimberly J. Gilbert; Michael C. Whitlock (2014). QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5061/dryad.rm574
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.rm574
Dataset updated
Jul 15, 2014
Dataset provided by
Dryad
Authors
Kimberly J. Gilbert; Michael C. Whitlock
Time period covered
2014
Description
QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.
Data from: The influence of balanced and imbalanced resource supply on...
data.niaid.nih.gov
datadryad.org
zip
Updated Mar 11, 2017
Cite
The influence of balanced and imbalanced resource supply on biodiversity-functioning relationship across ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_h50d9
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.h50d9
Dataset updated
Mar 11, 2017
Dataset provided by
Netherlands Institute of Ecology
German Centre for Integrative Biodiversity Research (iDiv) (https://www.idiv.de/)
Tohoku University
Plymouth Marine Laboratory
University of Maryland, College Park
Carl von Ossietzky Universität Oldenburg
Institute of Natural Sciences
KU Leuven
University of Minnesota
Vrije Universiteit Brussel
Ghent University
University of Nebraska–Lincoln
Monash University
University of Hildesheim
University of KwaZulu-Natal
GEOMAR Helmholtz Centre for Ocean Research Kiel
Michigan State University
University of Gothenburg
Authors
Aleksandra M. Lewandowska; Antje Biermann; Elizabeth T. Borer; Miguel A. Cebrian-Piqueras; Steven A. J. Declerck; Luc De Meester; Ellen van Donk; Lars Gamfeldt; Daniel S. Gruner; Nicole Hagenah; W. Stanley Harpole; Kevin P. Kirkman; Christopher A. Klausmeier; Michael Kleyer; Johannes M. H. Knops; Pieter Lemmens; Eric M. Lind; Elena Litchman; Jasmin Mantilla-Contreras; Koen Martens; Sandra Meier; Vanessa Minden; Joslin L. Moore; Harry olde Venterink; Eric W. Seabloom; Ulrich Sommer; Maren Striebel; Anastasia Trenkamp; Juliane Trinogga; Jotaro Urabe; Wim Vyverman; Dedmer B. Van de Waal; Claire E. Widdicombe; Helmut Hillebrand
License
https://spdx.org/licenses/CC0-1.0.html
Description
Numerous studies show that increasing species richness leads to higher ecosystem productivity. This effect is often attributed to more efficient portioning of multiple resources in communities with higher numbers of competing species, indicating the role of resource supply and stoichiometry for biodiversity–ecosystem functioning relationships. Here, we merged theory on ecological stoichiometry with a framework of biodiversity–ecosystem functioning to understand how resource use transfers into primary production. We applied a structural equation model to define patterns of diversity–productivity relationships with respect to available resources. Meta-analysis was used to summarize the findings across ecosystem types ranging from aquatic ecosystems to grasslands and forests. As hypothesized, resource supply increased realized productivity and richness, but we found significant differences between ecosystems and study types. Increased richness was associated with increased productivity, although this effect was not seen in experiments. More even communities had lower productivity, indicating that biomass production is often maintained by a few dominant species, and reduced dominance generally reduced ecosystem productivity. This synthesis, which integrates observational and experimental studies in a variety of ecosystems and geographical regions, exposes common patterns and differences in biodiversity–functioning relationships, and increases the mechanistic understanding of changes in ecosystems productivity.
Data from: QST FST comparisons with unbalanced half-sib designs
borealisdata.ca
open.library.ubc.ca
Updated May 20, 2021
Cite
Kimberly J. Gilbert; Michael C. Whitlock (2021). Data from: QST FST comparisons with unbalanced half-sib designs [Dataset]. http://doi.org/10.5683/SP2/9PBQES
Unique identifier
https://doi.org/10.5683/SP2/9PBQES
Dataset updated
May 20, 2021
Dataset provided by
Borealis
Authors
Kimberly J. Gilbert; Michael C. Whitlock
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Abstract

QST, a measure of quantitative genetic differentiation among populations, is an index that can suggest local adaptation if QST for a trait is sufficiently larger than the mean FST of neutral genetic markers. A previous method by Whitlock and Guillaume derived a simulation resampling approach to statistically test for a difference between QST and FST, but that method is limited to balanced data sets with offspring related as half-sibs through shared fathers. We extend this approach to (1) allow for a model more suitable for some plant populations or breeding designs in which offspring are related through mothers (assuming independent fathers for each offspring; half-sibs by dam), and (2) by explicitly allowing for unbalanced data sets. The resulting approach is made available through the R package QstFstComp.

Usage notes

- SourceCode_DamModel: Source code used when doing type I error testing of balanced or unbalanced half-sib dam model (DamModel_WorkingCopy.R).
- SireModel_WorkingCopy: Source code used when doing type I error testing of unbalanced half-sib sire model.
- TypeI_ErrorTest_DamBalanced: R code to run the error testing of the balanced half-sib dam model over 1000 replicate datasets.
- TypeI_ErrorTest_DamUnbalanced: R code to run the error testing of the unbalanced half-sib dam model over 1000 replicate datasets.
- TypeI_ErrorTest_SireUnbalanced: R code to run the error testing of the unbalanced half-sib sire model over 1000 replicate datasets.
- NemoReplicates: Zipped file containing the 1000 simulated replicate datasets from Nemo used for type I error testing.
S5 Dataset -
plos.figshare.com
xlsx
Updated Dec 13, 2024
Cite
JiaMing Gong; MingGang Dong (2024). S5 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s005
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0311133.s005
Dataset updated
Dec 13, 2024
Dataset provided by
PLOS ONE
Authors
JiaMing Gong; MingGang Dong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance and concept drift separately, and only a few consider these issues simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) to handle data streams with class imbalance and concept drift simultaneously. First, to address the problem of imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy. It divides the data chunks into multiple balanced sample pairs based on the differences in the information entropy between classes in the sample data chunk. Additionally, we propose a density-based sampling method to improve the accuracy of classifying minority class samples into high-quality samples and common samples via the density of similar samples. In this manner, high-quality and common samples are randomly selected for training the classifier. Finally, to solve the issue of concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier by adjusting the weight of the sub-classifier according to the performance on the arrived data chunks. The experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic data streams and one real-world data stream.
Empirical data used in the application of the paper "Genuinely Unbalanced...
data.4tu.nl
4tu.edu.hpc.n-helix.com
zip
Updated Sep 9, 2024
Cite
Xiaoyu Meng (2024). Empirical data used in the application of the paper "Genuinely Unbalanced Spatial Panel Data Models with Fixed Effects: M-Estimation and Inference with an Application to FDI" [Dataset]. http://doi.org/10.4121/2cdc714c-6c94-454c-8719-ee8f53e0ab27.v1
Available download formats: zip
Unique identifier
https://doi.org/10.4121/2cdc714c-6c94-454c-8719-ee8f53e0ab27.v1
Dataset updated
Sep 9, 2024
Dataset provided by
4TU.ResearchData
Authors
Xiaoyu Meng
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the data used in the empirical analysis of spatial spillover effects on Foreign Direct Investment (FDI) inflows across Chinese administrative divisions. The analysis employs two different model specifications: a balanced panel model and a generalized unbalanced (GU) model. Additionally, a spatial weight matrix file is provided, which is essential for modeling spatial dependencies.
Raw Data for: "Inorganic synthesis-structure maps in zeolites with machine...
zenodo.org
data.niaid.nih.gov
application/gzip, bin
Updated Oct 10, 2023
Cite
Daniel Schwalbe-Koda (2023). Raw Data for: "Inorganic synthesis-structure maps in zeolites with machine learning and crystallographic distances" [Dataset]. http://doi.org/10.5281/zenodo.8422373
Available download formats: bin, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.8422373
Dataset updated
Oct 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Daniel Schwalbe-Koda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains all the raw data to reproduce the manuscript:

D. Schwalbe-Koda et al. "Inorganic synthesis-structure maps in zeolites with machine learning and crystallographic distances". arXiv:2307.10935 (2023)

The raw data should be used in combination with the code hosted on GitHub: https://github.com/dskoda/Zeolites-AMD.

Description of the data

The data at this link contain all the information necessary to reproduce the manuscript. In combination with the code hosted on GitHub, they can be visualized and analyzed accordingly. The full description of the columns and results is available in the GitHub repository.
The data files in this repository are:

- `hparams_rnd_*.json`: results of the hyperparameter optimization of all classifiers studied in this work. The data was produced by randomly sampling the train-validation-test sets. In some cases, the data was normalized (`_norm_`), and the train set was kept `balanced` or `unbalanced`.
- `hyp_dm`: distance matrix of all hypothetical zeolites towards the known zeolites
- `hyp_predictions`: predictions of the synthesis conditions for all hypothetical zeolites
- `xgb_ensembles*`: pickle files containing the serialized ensemble models used in the evaluation of the data in this work. The models can be loaded with the `xgboost` Python package.
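
The naming convention in the list above can be read programmatically. The following is a minimal sketch that assumes the name parts are separated by underscores, as the `_norm_` fragment suggests; the concrete file names used here are hypothetical, and the authoritative description of the files is the one in the GitHub repository.

```python
from fnmatch import fnmatch

def describe_hparams_file(name):
    """Classify a file name matching `hparams_rnd_*.json` by the flags in its name.

    Returns None for names that do not match the pattern. Assumes that parts of
    the name are separated by underscores (an assumption, not documented here).
    """
    if not fnmatch(name, "hparams_rnd_*.json"):
        return None
    tokens = name[: -len(".json")].split("_")
    if "unbalanced" in tokens:
        train_set = "unbalanced"
    elif "balanced" in tokens:
        train_set = "balanced"
    else:
        train_set = None
    return {"normalized": "norm" in tokens, "train_set": train_set}

# Hypothetical file names, used only to exercise the function.
for example in ["hparams_rnd_norm_balanced.json", "hparams_rnd_unbalanced.json", "hyp_dm"]:
    print(example, describe_hparams_file(example))
```

Matching whole tokens rather than substrings matters here, because the string "unbalanced" contains "balanced".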

License

The data and all the content from this repository are distributed under the Creative Commons Attribution 4.0 (CC-BY 4.0) license.

This work was produced under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Dataset released as: LLNL-MI-854709.
Data from: The Unit Re-Balancing Problem
zenodo.org
bin, txt, zip
Updated Oct 20, 2021
Cite
Robin Dee; Armin Fügenschuh; George Kaimakamis (2021). The Unit Re-Balancing Problem [Dataset]. http://doi.org/10.5281/zenodo.5579319
Explore at:
Available download formats: txt, zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.5579319
Dataset updated
Oct 20, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Robin Dee; Armin Fügenschuh; George Kaimakamis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The unit re-balancing problem is about a number of defensive military units distributed over a geographic area. Each unit consists of a number of components (e.g., people, armor, or equipment). A value between 0 and 1 describes the current rating of each component. By a nonlinear function this value is converted into a nominal status assessment. This allows a comparison of different components of all units. The lowest of the statuses determines the efficiency of a unit, and the highest status its cost. An unbalanced unit has a gap between these two. When too many units are unbalanced, the entire system is costly and inefficient. To re-balance the units, people and material can be transferred. The goal is to have all units equally well equipped at the lowest possible cost. On a secondary level, the cost for the re-balancing should also be minimal. We present a mixed-integer nonlinear programming formulation for this problem, which describes the potential movement of components as a multi-commodity flow. Nonlinear constraints are needed to obtain the lowest and the highest status. Since we assume that these functions are piecewise linear, we reformulate them using inequalities and binary variables. This results in a mixed-integer linear program, and numerical standard solvers are able to compute proven optimal solutions for instances with up to 100 units. The dataset consists of the models and test instances that were presented at the (virtual) 6th IMA Conference on Mathematics in Defence and Security, March 30-31, 2021.
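
To illustrate the kind of reformulation mentioned above, here is a standard big-M linearization of a minimum, offered as a textbook sketch rather than the authors' exact model. All symbols are assumptions introduced for illustration: $s_{u,c}$ is the status of component $c$ of unit $u$, bounded by $L \le s_{u,c} \le U$; $e_u$ is the lowest status of unit $u$ (its efficiency); and $y_{u,c}$ is a binary variable marking the component that attains it.

```latex
\begin{align}
e_u &\le s_{u,c} && \forall c, \\
e_u &\ge s_{u,c} - (U - L)\,(1 - y_{u,c}) && \forall c, \\
\sum_{c} y_{u,c} &= 1, \qquad y_{u,c} \in \{0, 1\} && \forall c.
\end{align}
```

The highest status can be handled symmetrically by reversing both inequalities and the sign of the $(U - L)$ term.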
Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
data.niaid.nih.gov
zenodo.org
Updated Jan 27, 2022
Cite
Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
Explore at:
Dataset updated
Jan 27, 2022
Dataset provided by
Nagappan, Meiyappan
Keshavarz, Hossein
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper has been submitted to the MSR 2022 Data Showcase Track.

The datasets are available in the dataset directory. There are 4 datasets in this directory.

apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.

apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).

apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.

apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
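
As a quick way to see whether a given file is balanced, the share of each value in the buggy column can be computed. The following is a minimal sketch using only the Python standard library; the `commit_id` column name and the `True`/`False` encoding in the sample are assumptions made for illustration, and `class_balance` is a hypothetical helper, not part of the replication package.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file, label="buggy"):
    """Return the share of rows per value of the label column."""
    counts = Counter(row[label] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Tiny in-memory stand-in for a file such as apachejit_train.csv.
# The column name commit_id and the True/False encoding are assumptions.
sample = io.StringIO("commit_id,buggy\na1,True\nb2,False\nc3,False\nd4,False\n")
print(class_balance(sample))
```

With the real archive, one would presumably pass an open file handle instead, for example `open("dataset/apachejit_train.csv", newline="")`, and expect roughly equal shares for the balanced training set.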

In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

The scripts consist of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting the GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. Script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:

[1] GumTree, https://github.com/GumTreeDiff/gumtree. Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden, September 15-19, 2014. 313–324.

[2] PyDriller, https://pydriller.readthedocs.io/en/latest/. Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.
Table_1_Association Mapping for 24 Traits Related to Protein Content, Gluten...
figshare.com
frontiersin.figshare.com
xlsx
Updated Jun 5, 2023
+ more versions
Cite
Marina Johnson; Ajay Kumar; Atena Oladzad-Abbasabadi; Evan Salsman; Meriem Aoun; Frank A. Manthey; Elias M. Elias (2023). Table_1_Association Mapping for 24 Traits Related to Protein Content, Gluten Strength, Color, Cooking, and Milling Quality Using Balanced and Unbalanced Data in Durum Wheat [Triticum turgidum L. var. durum (Desf).].xlsx [Dataset]. http://doi.org/10.3389/fgene.2019.00717.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2019.00717.s001
Dataset updated
Jun 5, 2023
Dataset provided by
Frontiers
Authors
Marina Johnson; Ajay Kumar; Atena Oladzad-Abbasabadi; Evan Salsman; Meriem Aoun; Frank A. Manthey; Elias M. Elias
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Durum wheat [Triticum durum (Desf).] is mostly used to produce pasta, couscous, and bulgur. The quality of the grain and end-use products determines its market value. However, quality tests are highly resource intensive and almost impossible to conduct in the early generations in the breeding program. Modern genomics-based tools provide an excellent opportunity to genetically dissect complex quality traits to expedite cultivar development using molecular breeding approaches. This study used a panel of 243 cultivars and advanced breeding lines developed during the last 20 years to identify SNPs associated with 24 traits related to nutritional value and quality. Genome-wide association study (GWAS) identified a total of 179 marker–trait associations (MTAs), located in 95 genomic regions belonging to all 14 durum wheat chromosomes. Major and stable QTLs were identified for gluten strength on chromosomes 1A and 1B, and for PPO activity on chromosomes 1A, 2B, 3A, and 3B. Although a large amount of unbalanced phenotypic data is generated every year on advanced lines in all the breeding programs, the applicability of such datasets for identification of MTAs remains unclear. We observed that ∼84% of the MTAs identified using a historic unbalanced dataset (belonging to a total of 80 environments collected over a period of 16 years) were also identified in a balanced dataset. This suggests the suitability of historic unbalanced phenotypic data to identify beneficial MTAs to facilitate local-knowledge-based breeding. In addition to providing extensive knowledge about the genetics of quality traits, association mapping identified several candidate markers to assist durum wheat quality improvement through molecular breeding. The molecular markers associated with important traits could be extremely useful in the development of improved quality durum wheat cultivars using marker-assisted selection (MAS).
app_reviews
huggingface.co
Updated Jun 21, 2024
Cite
Pavel Ghazaryan (2024). app_reviews [Dataset]. https://huggingface.co/datasets/PavelGh/app_reviews
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 21, 2024
Authors
Pavel Ghazaryan
Description
Dataset Card for App Reviews

Dataset Details

App reviews labeled into 4 categories: 'Bug Report', 'Feature Request', 'Rating', 'User Experience'. Note that the files with GPT in their names were labeled by ChatGPT through prompt fine-tuning, and approximately 3% of those labels were verified through random manual checking. The files that do not contain gpt in their names are manually labeled. I have separate datasets for Balanced and Unbalanced. Additionally the gpt… See the full description on the dataset page: https://huggingface.co/datasets/PavelGh/app_reviews.
Experimental measurements and uncertainty analysis for validation of the...
data.niaid.nih.gov
datadryad.org
zip
Updated Apr 5, 2023
+ more versions
Cite
James Cale (2023). Experimental measurements and uncertainty analysis for validation of the Building Electrical Efficiency Analysis Model (BEEAM) [Dataset]. http://doi.org/10.5061/dryad.m63xsj471
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.m63xsj471
Dataset updated
Apr 5, 2023
Dataset provided by
Colorado State University
Authors
James Cale
License
https://spdx.org/licenses/CC0-1.0.html
Description
This dataset includes experimental measurements taken on a laboratory testbed at Colorado State University that was used for model validation of a software toolkit, the Building Electrical Efficiency Analysis Model (BEEAM). This toolkit was developed for comparing electrical efficiency of AC versus DC distribution systems in buildings. The testbed emulated loads found in a small office building and included laptop computer chargers, LED lighting systems, and miscellaneous DC and AC loads. Measurements were taken under AC and DC configurations in electrically balanced and unbalanced loading conditions. Also included in the dataset is an uncertainty analysis. A complete description of the testbed, hardware, measurements and uncertainty analysis is contained in the paper cited below.

Avpreet Othee, James Cale, Arthur Santos, Stephen Frank, Daniel Zimmerle, Omkar Ghatpande, Gerald Duggan and Daniel Gerber, “A Modeling Toolkit for Comparing AC and DC Electrical Distribution Efficiency in Buildings,” Energies, 2023 (accepted, publication in progress).

Methods: Data was collected using a Keysight multifunction switch measuring unit (MU) model 34980A with a Keysight 34921T multiplexer and a Keysight PA2203A power analyzer.


Imbalanced class datasets.


171 scholarly articles cite this dataset (View in Google Scholar)

