54 datasets found

Simulation Results on the Effect of Ensemble on Data Imbalance
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Yang, Yu (2023). Simulation Results on the Effect of Ensemble on Data Imbalance [Dataset]. https://search.dataone.org/view/sha256%3Ae6de30d2f7aa0db00837a402e7377acc36de959159760d2900103285cc862392
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Yang, Yu
Description
This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. The results cover ensemble models with thresholds in [0, 0.05, 0.1, ..., 0.95, 1.0], reported in terms of the overall AP/AR and discrete (continuous) specific AP/AR. This dataset serves as a reference for practitioners choosing the ensemble threshold that best fits their business needs.
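As a rough illustration of the threshold grid mentioned above, the sketch below builds the 21 thresholds from 0 to 1.0 in steps of 0.05. The dataset description does not say how a threshold is applied, so the `ensemble_predict` rule (predict positive when the share of positive votes reaches the threshold) and the example `votes` are assumptions made only for this example.

```python
# Threshold grid from the description: 0, 0.05, 0.1, ..., 0.95, 1.0
thresholds = [round(0.05 * i, 2) for i in range(21)]


def ensemble_predict(votes, threshold):
    """Hypothetical rule: positive when the share of positive votes >= threshold."""
    return sum(votes) / len(votes) >= threshold


# Five hypothetical ensemble members, two of which vote positive (share = 0.4).
votes = [1, 0, 0, 1, 0]
decisions = {t: ensemble_predict(votes, t) for t in thresholds}
```

Under this assumed rule, a lower threshold labels more cases as positive, which is the kind of trade-off a practitioner would weigh when picking a threshold.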
Data from: Less is More: An Empirical Study of Undersampling Techniques for...
figshare.com
zip
Updated May 20, 2024
Cite
Gichan Lee (2024). Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.22708036.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.22708036.v1
Dataset updated
May 20, 2024
Dataset provided by
figshare
Authors
Gichan Lee
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Technical Debt (TD) prediction is crucial to preventing software quality degradation and maintenance cost increase. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but the imbalanced TD datasets can have a negative impact on ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored due to concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction by utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with widely used oversampling techniques in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques leads to a synergy of further performance improvement compared to applying each technique exclusively. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.

This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.

File list:
- X.csv, Y.csv: The datasets for the study, used in the ipynb file below.
- under_over_sampling_scripts.ipynb: Scripts that reproduce all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
- Results_for_all_tables.csv: A CSV file that summarizes all the results obtained from the study.
Is this a good customer?
kaggle.com
Updated Apr 16, 2020
Cite
podsyp (2020). Is this a good customer? [Dataset]. https://www.kaggle.com/podsyp/is-this-a-good-customer/tasks
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 16, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
podsyp
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

Content

Standard accuracy no longer reliably measures performance, which makes model training much trickier. Imbalanced classes appear in many domains, including:
- Antifraud
- Antispam
- ...

Inspiration

5 tactics for handling imbalanced classes in machine learning:
- Up-sample the minority class
- Down-sample the majority class
- Change your performance metric
- Penalize algorithms (cost-sensitive training)
- Use tree-based algorithms
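The first two tactics can be sketched with the Python standard library alone. This is a minimal illustration, not the dataset author's code: the helper names `upsample_minority` and `downsample_majority`, and the toy data with a 10:90 class ratio, are assumptions chosen for the example.

```python
import random
from collections import Counter


def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def downsample_majority(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


# Toy data: 10 positive rows and 90 negative rows.
rows = list(range(100))
labels = [1] * 10 + [0] * 90
up_rows, up_labels = upsample_minority(rows, labels)
down_rows, down_labels = downsample_majority(rows, labels)
```

Resampling of this kind is usually applied to the training split only, so that evaluation still reflects the original class ratio.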
Predict students' dropout and academic success
zenodo.org
data.niaid.nih.gov
Updated Mar 14, 2023
Cite
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
Unique identifier
https://doi.org/10.5281/zenodo.5777340
Dataset updated
Mar 14, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
Description
A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters.

The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

Funding
We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública" under grant POCI-05-5762-FSE-000191, Portugal.
Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for...
osf.io
Updated Jan 11, 2019
Cite
Joko Suntoro (2019). Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for Handling High Dimensional Data and Class Imbalanced Problem [Dataset]. http://doi.org/10.17605/OSF.IO/9JNE2
Unique identifier
https://doi.org/10.17605/OSF.IO/9JNE2
Dataset updated
Jan 11, 2019
Dataset provided by
Center for Open Science (https://cos.io/)
Authors
Joko Suntoro
Description
No description was included in this Dataset collected from the OSF
Over-sampled dataset.
figshare.com
xls
Updated Dec 31, 2024
+ more versions
Cite
Seongil Han; Haemin Jung (2024). Over-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t004
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Comparative analysis over various datasets.
plos.figshare.com
xls
Updated Jan 10, 2025
+ more versions
Cite
Tao Yu; Wei Huang; Xin Tang; Duosi Zheng (2025). Comparative analysis over various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316557.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316557.t003
Dataset updated
Jan 10, 2025
Dataset provided by
PLOS ONE
Authors
Tao Yu; Wei Huang; Xin Tang; Duosi Zheng
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In credit risk assessment, unsupervised classification techniques can be introduced to reduce human resource expenses and expedite decision-making. Despite the efficacy of unsupervised learning methods in handling unlabeled datasets, their performance remains limited owing to challenges such as imbalanced data, local optima, and parameter adjustment complexities. Thus, this paper introduces a novel hybrid unsupervised classification method, named the two-stage hybrid system with spectral clustering and semi-supervised support vector machine (TSC-SVM), which effectively addresses the unsupervised imbalance problem in credit risk assessment by targeting global optimal solutions. Furthermore, a multi-view combined unsupervised method is designed to thoroughly mine data and enhance the robustness of label predictions. This method mitigates discrepancies in prediction outcomes from three distinct perspectives. The effectiveness, efficiency, and robustness of the proposed TSC-SVM model are demonstrated through various real-world applications. The proposed algorithm is anticipated to expand the customer base for financial institutions while reducing economic losses.
Hyperparameter settings of classification model.
plos.figshare.com
xls
Updated Oct 18, 2024
+ more versions
Cite
Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings of classification model. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0305095.t006
Dataset updated
Oct 18, 2024
Dataset provided by
PLOS ONE
Authors
Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Currently, the related research on text classification tends to focus on the application in fields such as information filtering, information retrieval, public opinion monitoring, and library and information, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. Subsequently, the proposed algorithm is combined with each of seven commonly used classifiers (DT, SVM, LR, NB, MLP, RF, and KNN), known for their good performance, to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing the overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry. Acc, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness. 
In addition, the conclusions obtained by the predicted value and the true value are consistent, indicating that this algorithm is practical. The professional domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories), and a high degree of similarity between categories. However, this proposed algorithm can efficiently implement the classification of multiple subcategories of this type of text set, which is a beneficial exploration of the application research of complex Chinese text datasets in specific fields, and provides a useful reference for the vector expression and classification of text datasets with similar content.
Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search...
zenodo.org
application/gzip, bin
Updated Jan 24, 2020
Cite
Mohammad Nazmul Haque; Nasimul Noman; Regina Berratta; Pablo Moscato (2020). Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search using Genetic Algorithm for Class Imbalanced Data Classification" [Dataset]. http://doi.org/10.5281/zenodo.33539
Available download formats: application/gzip, bin
Unique identifier
https://doi.org/10.5281/zenodo.33539
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mohammad Nazmul Haque; Nasimul Noman; Regina Berratta; Pablo Moscato
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Faces Dataset: PubFig05

This is a subset of the "PubFig83" dataset [1] that provides 100 images for each of the 5 celebrities who are most difficult to recognise (each celebrity is a class in the classification problem). For each celebrity, we took 100 images and separated them into training and testing sets of 90 and 10 images, respectively:

Person: Jenifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba

Feature Extraction

To extract features from images, we have applied the HT-L3-model as described in [2] and obtained 25600 features.

Feature Selection

Feature selection is briefly described as follows:

Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset and discarded features using the minimum description length (MDL) principle; only 4878 features passed this entropy-based filtering.

Class-Distribution Balancing: Next, we converted the dataset into 5 binary-class datasets using a one-vs-all setup. These datasets were therefore imbalanced at a ratio of 1:4. We then converted them into balanced binary-class datasets using random sub-sampling. Further processing of the dataset is described in the paper.

(alpha,beta)-k Feature selection: To get a good feature set for training the classifier, we selected the features using an approach based on the (alpha,beta)-k feature selection [4] problem. It selects a minimum subset of features that maximises both within-class similarity and dissimilarity between different classes. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the Table below) after consolidating them into a binary-class dataset.

UAB: We applied the (alpha,beta)-k feature set method on each of the balanced binary-class datasets and took the union of the selected features across the binary-class datasets. Finally, we applied the (alpha,beta)-k feature set selection method on each of the binary-class datasets and obtained a set of features.

IAB: We applied the (alpha,beta)-k feature set method on each of the balanced binary-class datasets and took the intersection of the selected features across the binary-class datasets. Finally, we applied the (alpha,beta)-k feature set selection method on each of the binary-class datasets and obtained a set of features.

UEAB: We applied the (alpha,beta)-k feature set method on each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature set selection method on each of the balanced binary-class datasets. Finally, we took the union of the selected features across the balanced binary-class datasets and obtained a set of features.

All of these datasets are inside the compressed folder. It also contains the document describing the process detail.
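The class-distribution balancing step described above (one-vs-all conversion followed by random sub-sampling) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the class names are placeholders, 100 items per class are used for simplicity, and the helper names `one_vs_all` and `random_subsample_indices` are hypothetical.

```python
import random

# Placeholder class names; the real dataset uses five celebrity identities.
CLASSES = ["A", "B", "C", "D", "E"]
labels = [c for c in CLASSES for _ in range(100)]  # 100 items per class


def one_vs_all(labels, positive):
    """Relabel a multi-class problem as binary: 1 for `positive`, 0 for the rest."""
    return [1 if y == positive else 0 for y in labels]


def random_subsample_indices(binary, seed=0):
    """Return indices of a balanced subset by randomly sub-sampling the larger class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary) if y == 1]
    neg = [i for i, y in enumerate(binary) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


binary = one_vs_all(labels, "A")                  # 100 positives vs 400 negatives (1:4)
balanced_idx = random_subsample_indices(binary)   # 100 positives and 100 negatives
```

Repeating this for each of the five classes would yield five balanced binary-class datasets, matching the setup the description outlines.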

References

[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).

[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).

[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).

[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). 1082201: Australian Computer Society, Inc.
Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
data.niaid.nih.gov
zenodo.org
Updated Jan 27, 2022
Cite
Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
Dataset updated
Jan 27, 2022
Dataset provided by
Nagappan, Meiyappan
Keshavarz, Hossein
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

The datasets are available under directory dataset. There are 4 datasets in this directory.

apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.

apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).

apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.

apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
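The split described above (a balanced training set from the earlier years, an unbalanced test set from the later years) can be sketched on synthetic records. Only the `buggy` column is named in the description; the `commit_id` and `year` fields, the synthetic data, and the use of random down-sampling to balance the training set are assumptions made for illustration, since the description does not say how the authors balanced it.

```python
import random


def temporal_split(commits, last_train_year):
    """Train on commits up to `last_train_year`, test on everything after it."""
    train = [c for c in commits if c["year"] <= last_train_year]
    test = [c for c in commits if c["year"] > last_train_year]
    return train, test


def balance_by_downsampling(commits, seed=0):
    """Assumed balancing method: keep equal numbers of buggy and clean commits."""
    rng = random.Random(seed)
    buggy = [c for c in commits if c["buggy"]]
    clean = [c for c in commits if not c["buggy"]]
    n = min(len(buggy), len(clean))
    return rng.sample(buggy, n) + rng.sample(clean, n)


# Synthetic commits spread over 2003..2019, with one in five marked buggy.
commits = [
    {"commit_id": f"c{i}", "year": 2003 + i % 17, "buggy": i % 5 == 0}
    for i in range(1700)
]

train, test = temporal_split(commits, last_train_year=2016)
balanced_train = balance_by_downsampling(train)
```

Keeping the test set unbalanced, as the description recommends, means evaluation metrics reflect the class ratio a deployed model would actually face.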

In addition to the dataset, we also provide the scripts we used to build the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

The scripts consist of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. Script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:

[1] GumTree. https://github.com/GumTreeDiff/gumtree

Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324

[2] PyDriller. https://pydriller.readthedocs.io/en/latest/

Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911
Data from: The influence of balanced and imbalanced resource supply on...
data.niaid.nih.gov
datadryad.org
zip
Updated Mar 11, 2017
+ more versions
Cite
The influence of balanced and imbalanced resource supply on biodiversity-functioning relationship across ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_h50d9
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.h50d9
Dataset updated
Mar 11, 2017
Dataset provided by
German Centre for Integrative Biodiversity Research (iDiv) (https://www.idiv.de/)
Netherlands Institute of Ecology
Michigan State University
Plymouth Marine Laboratory
GEOMAR Helmholtz Centre for Ocean Research Kiel
University of Maryland, College Park
KU Leuven
Vrije Universiteit Brussel
University of Hildesheim
University of KwaZulu-Natal
Tohoku University
University of Gothenburg
Institute of Natural Sciences
Carl von Ossietzky Universität Oldenburg
Ghent University
University of Nebraska–Lincoln
Monash University
University of Minnesota
Authors
Aleksandra M. Lewandowska; Antje Biermann; Elizabeth T. Borer; Miguel A. Cebrian-Piqueras; Steven A. J. Declerck; Luc De Meester; Ellen van Donk; Lars Gamfeldt; Daniel S. Gruner; Nicole Hagenah; W. Stanley Harpole; Kevin P. Kirkman; Christopher A. Klausmeier; Michael Kleyer; Johannes M. H. Knops; Pieter Lemmens; Eric M. Lind; Elena Litchman; Jasmin Mantilla-Contreras; Koen Martens; Sandra Meier; Vanessa Minden; Joslin L. Moore; Harry olde Venterink; Eric W. Seabloom; Ulrich Sommer; Maren Striebel; Anastasia Trenkamp; Juliane Trinogga; Jotaro Urabe; Wim Vyverman; Dedmer B. Van de Waal; Claire E. Widdicombe; Helmut Hillebrand
License
https://spdx.org/licenses/CC0-1.0.html
Description
Numerous studies show that increasing species richness leads to higher ecosystem productivity. This effect is often attributed to more efficient portioning of multiple resources in communities with higher numbers of competing species, indicating the role of resource supply and stoichiometry for biodiversity–ecosystem functioning relationships. Here, we merged theory on ecological stoichiometry with a framework of biodiversity–ecosystem functioning to understand how resource use transfers into primary production. We applied a structural equation model to define patterns of diversity–productivity relationships with respect to available resources. Meta-analysis was used to summarize the findings across ecosystem types ranging from aquatic ecosystems to grasslands and forests. As hypothesized, resource supply increased realized productivity and richness, but we found significant differences between ecosystems and study types. Increased richness was associated with increased productivity, although this effect was not seen in experiments. More even communities had lower productivity, indicating that biomass production is often maintained by a few dominant species, and reduced dominance generally reduced ecosystem productivity. This synthesis, which integrates observational and experimental studies in a variety of ecosystems and geographical regions, exposes common patterns and differences in biodiversity–functioning relationships, and increases the mechanistic understanding of changes in ecosystems productivity.
Data from: Interactive Heavily Unbalanced Power System Analysis
portalinvestigacion.uniovi.es
ieee-dataport.org
Updated 2021
Cite
Arboleya, Pablo (2021). Interactive Heavily Unbalanced Power System Analysis [Dataset]. https://portalinvestigacion.uniovi.es/documentos/668fc416b9e7c03b01bd41e1
Dataset updated
2021
Authors
Arboleya, Pablo
Description
To see and interact with the case of study, open the case_of_study.html file with an internet browser. It should work in all browsers, but we tested it in Google Chrome, Firefox and Safari.

Once the file is open, we will see something like the attached figure. In the next steps, the different variables available and the way of interacting with them will be described.

A heavily unbalanced three-phase system of 11 nodes is represented in the left part, where each phase of a line is represented with a color (red, green, blue) for (a, b, c) respectively; the same convention extends to the nodes. The white number in the center of each node is the node number. The movement of the yellow ball on the lines represents the convention of positive power flow: the power flow in a line is positive if it flows in the same direction as the movement of the ball. Voltages are expressed in V, currents in A and power in kW, kVAr or kVA.

The power flow in this system has been solved and all the results, including the power dividers, are available. Nodal variables, like voltage, current injections, and active, reactive and apparent power, can be represented in the nodes by selecting them using the drop-down menu labeled “Node Variable”. Next to this drop-down menu we can see a checkbox labeled “Angle”: if it is checked, complex nodal variables are shown with their angle; if not, only the magnitude is shown. The size of the colored circles inside the node is proportional to the magnitude of the variable represented. Positive active power means consumption; negative active power means injection. When clicking a node of the scheme, a phasor diagram with the injected currents and the voltages in the nodes appears on the right side of the screen. The power triangles are also represented.

The “Line variable” drop-down menu allows us to represent variables in the lines: basically, currents, powers and losses. I12t is the total current (considering also the shunt current, if it exists) flowing from the starting point to the ending point, while I21t is the total current flowing from the ending point to the starting point. Note that the yellow ball always moves from the starting point to the ending point. Shunt currents are represented by I1s and I2s, and the currents flowing through the lines without considering shunt currents are expressed as I12 and I21. The same criteria apply to active, reactive and apparent power through the lines. The line losses can be represented by selecting “Line_losses” in this same drop-down menu. The width of the colored lines representing the phases is proportional to the magnitude of the variable represented. When clicking on a specific line, all the currents through that line are represented in the right part of the screen in the so-called “Line details” diagram.

The power dividers of the expression (21) of the paper P(m,n) and P(m,n) as well as the P(m,n)loss can be analyzed using the drop-down menu labeled “Pdivs”. There we can select Pdiv12t and Pdiv21t, representing respectively P(m,n) and P(m,n). Pdiv_losses represents P(m,n)loss.

We recommend the following setup for analysis:
- Node variable drop-down menu: P
- Line variable drop-down menu: Line_losses
- Pdivs drop-down menu: Pdiv_losses
The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1)
data.niaid.nih.gov
zenodo.org
Updated Jul 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Asadi, Mehdi (2024). The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10714822
Explore at:
Dataset updated
Jul 7, 2024
Dataset provided by
Turku University of Applied Sciences
Asadi, Mehdi
Majd, Amin
Auranen, Jani
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Turku
Description
The Turku UAS DeepSeaSalama-GAN dataset 1 (TDSS-G1) is a comprehensive image dataset obtained from a maritime environment. This dataset was assembled in the southwest Finnish archipelago area at Taalintehdas, using two stationary RGB fisheye cameras, in August 2022. The technical setup is described in the section “Sensor Platform design” of the report “Development of Applied Research Platforms for Autonomous and Remotely Operated Systems” (https://www.theseus.fi/handle/10024/815628).

The data collection and annotation process was carried out in the Autonomous and Intelligent Systems laboratory at Turku University of Applied Sciences. The dataset is a blend of original images captured by our cameras and synthetic data generated by a Generative Adversarial Network (GAN), simulating 18 distinct weather conditions.

The TDSS-G1 dataset comprises 199 original images and a substantial addition of 3582 synthetic images, culminating in a total of 3781 annotated images. These images provide a diverse representation of various maritime objects, including motorboats, sailing boats, and seamarks.

The creation of TDSS-G1 involved extracting images from videos recorded in MPEG format, with a resolution of 720p at 30 frames per second (FPS). An image was extracted every 100 milliseconds.
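
As a rough illustration of the sampling described above: at 30 FPS, one image every 100 milliseconds corresponds to every third frame. The sketch below only computes which frame indices would be kept. The function name and the 30-frame example are hypothetical, and the actual extraction tooling used by the authors is not specified in the description.

```python
def sampled_frame_indices(total_frames, fps=30.0, interval_ms=100.0):
    """Return the indices of the frames kept when sampling one image every `interval_ms`."""
    step = fps * interval_ms / 1000.0  # frames per sampling interval (3.0 at 30 FPS, 100 ms)
    if step <= 0:
        raise ValueError("fps and interval_ms must be positive")
    indices = []
    k = 0
    while True:
        idx = round(k * step)  # nearest frame to the k-th sampling instant
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices


# One second of video at 30 FPS (hypothetical clip length).
one_second = sampled_frame_indices(30)
```

Under these assumptions, one second of video yields ten extracted images.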

The distribution of labels within TDSS-G1 is as follows: motorboats (62.1%), sailing boats (16.8%), and seamarks (21.1%).

This distribution highlights a class imbalance, with motorboats being the most represented class and sailing boats the least. This imbalance is an important factor to consider during model training, as it could impair the model’s ability to recognize underrepresented classes accurately. In future synthetic datasets, vision Transformers will be used to tackle this problem.
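
One common way to account for such an imbalance during training is inverse-frequency class weighting. This is a generic heuristic offered here as an illustration, not the approach the dataset authors describe. The sketch below derives weights from the label shares quoted above; the variable and function names are hypothetical.

```python
# Label shares quoted in the dataset description.
LABEL_SHARES = {
    "motor_boat": 0.621,
    "sailing_boat": 0.168,
    "seamark": 0.211,
}


def inverse_frequency_weights(shares):
    """Return weights w_c = 1 / (K * share_c), where K is the number of classes.

    A perfectly balanced dataset would give every class a weight of 1.0.
    """
    k = len(shares)
    return {name: 1.0 / (k * share) for name, share in shares.items()}


weights = inverse_frequency_weights(LABEL_SHARES)
```

The rarest class (sailing boats) receives the largest weight and the most frequent class (motorboats) the smallest, so errors on underrepresented classes count more in a weighted loss.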

The TDSS-G1 dataset is organized into three distinct subsets for the purpose of training and evaluating machine learning models. These subsets are as follows:

Training Set: Located in dataset/train/images, this set is used to train the model. It learns to recognize the different classes of maritime objects from this data.

Validation Set: Stored in dataset/valid/images, this set is used to tune the model parameters and to prevent overfitting during the training process.

Test Set: Found in dataset/test/images, this set is used to evaluate the final performance of the model. It provides an unbiased assessment of how the model will perform on unseen data.

The dataset comprises three classes (nc: 3), each representing a different type of maritime object. The classes are as follows:

Motor Boat (motor_boat)

Sailing Boat (sailing_boat)

Seamark (seamark)

These labels correspond to the annotated objects in the images. The model trained on this dataset will be capable of identifying these three types of maritime objects. As mentioned earlier, the distribution of these classes is imbalanced, which is an important factor to consider during the training process.
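
The layout described above can be summarized in a small configuration sketch. The split paths, the class count (nc: 3) and the class names come from the description. The index order of the classes and the dictionary structure are assumptions made for illustration, not a file shipped with the dataset.

```python
# Split directories as given in the dataset description.
SPLITS = {
    "train": "dataset/train/images",
    "valid": "dataset/valid/images",
    "test": "dataset/test/images",
}

# Class names as listed in the description; the index order is an assumption.
CLASS_NAMES = ["motor_boat", "sailing_boat", "seamark"]


def build_config(root="."):
    """Return a dict with the split paths under `root`, the class count and the class names."""
    return {
        "splits": {name: f"{root}/{path}" for name, path in SPLITS.items()},
        "nc": len(CLASS_NAMES),
        "names": dict(enumerate(CLASS_NAMES)),
    }


config = build_config()
```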
Data from: The limits of the constant-rate birth-death prior for...
zenodo.org
bin, sh
Updated Jul 28, 2023
Cite
Mark Khurana; Neil Scheidwasser-Clow; Matthew J. Penn; Samir Bhatt; David Duchêne (2023). The limits of the constant-rate birth-death prior for phylogenetic tree topology inference [Dataset]. http://doi.org/10.5281/zenodo.8187005
Available download formats: bin, sh
Unique identifier
https://doi.org/10.5281/zenodo.8187005
Dataset updated
Jul 28, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mark Khurana; Neil Scheidwasser-Clow; Matthew J. Penn; Samir Bhatt; David Duchêne
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa, and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically from empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios, but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.
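
To make the notion of phylogenetic imbalance concrete, the sketch below computes the Colless index, a widely used tree-shape statistic: the sum, over all internal nodes of a rooted binary tree, of the absolute difference between the numbers of leaves on the two sides. The description does not say which topology indices the study uses, so this choice, the nested-tuple tree encoding and the example trees are all assumptions made for illustration.

```python
def colless_index(tree):
    """Return (number of leaves, Colless index) for a rooted binary tree.

    A leaf is any non-tuple value; an internal node is a 2-tuple (left, right).
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless_index(left)
    n_right, c_right = colless_index(right)
    # Add this node's own imbalance to the totals of both subtrees.
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


# Two hypothetical four-taxon topologies.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

_, balanced_score = colless_index(balanced)
_, caterpillar_score = colless_index(caterpillar)
```

The fully balanced tree scores 0, while the caterpillar (maximally imbalanced) tree scores 3, the maximum for four leaves.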
Data from: Species Selection Regime and Phylogenetic Tree Shape
data.niaid.nih.gov
zenodo.org
zip
Updated Nov 26, 2019
Cite
George Verboom; Florian Boucher; David Ackerly; Lara Wootton; William Freyman (2019). Species Selection Regime and Phylogenetic Tree Shape [Dataset]. http://doi.org/10.5061/dryad.1sf007b
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.1sf007b
Dataset updated
Nov 26, 2019
Dataset provided by
Université Grenoble Alpes
University of Cape Town
University of California, Berkeley
University of Minnesota
Authors
George Verboom; Florian Boucher; David Ackerly; Lara Wootton; William Freyman
License
https://spdx.org/licenses/CC0-1.0.html
Description
Species selection, the effect of heritable traits in generating between-lineage diversification rate differences, provides a valuable conceptual framework for understanding the relationship between traits, diversification and phylogenetic tree shape. An important challenge, however, is that the nature of real diversification landscapes – curves or surfaces which describe the propensity of species-level lineages to diversify as a function of one or more traits – remains poorly understood. Here we present a novel, time-stratified extension of the QuaSSE model in which speciation/extinction rate is specified as a static or temporally-shifting Gaussian or skewed-Gaussian function of the diversification trait. We then use simulations to show that the generally imbalanced nature of real phylogenetic trees, as well as their generally greater-than-expected frequency of deep branching events, are typical outcomes when diversification is treated as a dynamic, trait-dependent process. Focusing on four basic models (Gaussian-speciation with and without background extinction; skewed-speciation; Gaussian-extinction), we also show that particular features of the species selection regime produce distinct tree shape signatures and that, consequently, a combination of tree shape metrics has the potential to reveal the species selection regime under which a particular lineage diversified. We evaluate this idea empirically by comparing the phylogenetic trees of plant lineages diversifying within climatically- and geologically-stable environments of the Greater Cape Floristic Region, with those of lineages diversifying in environments that have experienced major change through the Late Miocene-Pliocene. Consistent with our expectations, the trees of lineages diversifying in a dynamic context are less balanced, show a greater concentration of branching events close to the present, and display stronger diversification rate-trait correlations. 
We suggest that species selection plays an important role in shaping phylogenetic trees but recognize the need for an explicit probabilistic framework within which to assess the likelihoods of alternative diversification scenarios as explanations of a particular tree shape.
Data from: Unbalanced species losses and gains lead to non-linear...
data.niaid.nih.gov
zenodo.org
Updated Jan 21, 2020
Cite
James M. Bullock (2020). Unbalanced species losses and gains lead to non-linear trajectories as grasslands become forests [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3406338
Explore at:
Dataset updated
Jan 21, 2020
Dataset provided by
James M. Bullock
Adam Kimberley
Sara A.O. Cousins
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for the article "Unbalanced species losses and gains lead to non-linear trajectories as grasslands become forests" in Journal of Vegetation Science. Plot data contains information on grassland sites in the archipelago, including how long they have been abandoned for and the surrounding landscape composition. Species occurrence matrix contains data on the plant communities found in vegetation sampling plots at respective sites.
MOESM2 of Quality control of imbalanced mass spectra from isotopic labeling...
springernature.figshare.com
xls
Updated Feb 16, 2024
Cite
Tianjun Li; Long Chen; Min Gan (2024). MOESM2 of Quality control of imbalanced mass spectra from isotopic labeling experiments [Dataset]. http://doi.org/10.6084/m9.figshare.10264199.v1
Available download formats: xls
Unique identifier
https://doi.org/10.6084/m9.figshare.10264199.v1
Dataset updated
Feb 16, 2024
Dataset provided by
figshare
Authors
Tianjun Li; Long Chen; Min Gan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 2. This is a four-sheet xls file (7180 KB) containing the TPP analysis results and extracted features. Each sheet corresponds to one dataset, with its specific ratio given in the sheet name; e.g., 1_1 is the sample data for the ratio 1:1.
Data from: Age-dependent and lineage-dependent speciation and extinction in...
data.niaid.nih.gov
search.dataone.org
zip
Updated Jan 24, 2017
Cite
Eric W. Holman (2017). Age-dependent and lineage-dependent speciation and extinction in the imbalance of phylogenetic trees [Dataset]. http://doi.org/10.5061/dryad.2q9r7
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2q9r7
Dataset updated
Jan 24, 2017
Authors
Eric W. Holman
License
https://spdx.org/licenses/CC0-1.0.html
Description
It is known that phylogenetic trees are more imbalanced than expected from a birth–death model with constant rates of speciation and extinction, and also that imbalance can be better fit by allowing the rate of speciation to decrease as the age of the parent species increases. If imbalance is measured in more detail, at nodes within trees as a function of the number of species descended from the nodes, age-dependent models predict levels of imbalance comparable to real trees for small numbers of descendent species, but predicted imbalance approaches an asymptote not found in real trees as the number of descendent species becomes large. Age-dependence must therefore be complemented by another process such as inheritance of different rates along different lineages, which is known to predict insufficient imbalance at nodes with few descendent species, but can predict increasing imbalance with increasing numbers of descendent species.
Increase in AUC, MCC, and F1 between oversampling and undersampling.
plos.figshare.com
xls
Updated Dec 31, 2024
Cite
Seongil Han; Haemin Jung (2024). Increase in AUC, MCC, and F1 between oversampling and undersampling. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t009
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316454.t009
Dataset updated
Dec 31, 2024
Dataset provided by
PLOS ONE
Authors
Seongil Han; Haemin Jung
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Increase in AUC, MCC, and F1 between oversampling and undersampling.
Dataset on dynamic analysis of an unbalanced hollow cylinder rolling over a...
b2find.dkrz.de
Updated Jan 4, 2024
Cite
(2024). Dataset on dynamic analysis of an unbalanced hollow cylinder rolling over a horizontal plane - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/7fce84b1-f5ec-5810-a776-f12eae179b1f
Explore at:
Dataset updated
Jan 4, 2024
License
Licence Ouverte / Open Licence 2.0: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
License information was derived automatically
Description
Rigid solid body dynamics is a key element of the undergraduate mechanical engineering curriculum. In a context of reverse engineering and/or sustainable development, being able to analyze the mechanical and material properties of a system without damaging it is a required skill. In this dataset, an unbalanced hollow cylinder rolling over a horizontal path without sliding is studied. Four cohorts of final-year bachelor students in mechanical engineering, about a hundred students per year, each followed a total of 12 hours of practical sessions working on such systems. This work aims to show how computer tools can support and improve a rigid solid body dynamics course.
