54 datasets found
  1. Simulation Results on the Effect of Ensemble on Data Imbalance

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Yang, Yu (2023). Simulation Results on the Effect of Ensemble on Data Imbalance [Dataset]. https://search.dataone.org/view/sha256%3Ae6de30d2f7aa0db00837a402e7377acc36de959159760d2900103285cc862392
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Yang, Yu
    Description

    This dataset contains all the simulation results on the effect of ensemble models in dealing with data imbalance. The simulations are performed with sample size n=2000, number of variables p=200, and number of groups k=20 under six imbalanced scenarios. It reports the results of ensemble models with thresholds from [0, 0.05, 0.1, ..., 0.95, 1.0], in terms of the overall AP/AR and the discrete (continuous) specific AP/AR. This dataset serves as a reference for practitioners looking for the ensemble threshold that best fits their business needs.
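
    A threshold sweep of this kind can be sketched as follows. This is a minimal illustration, not the author's simulation code: the scores and labels are made up, and it assumes that AP/AR can be read as precision and recall, which the description does not define.

```python
def threshold_grid(step=0.05):
    """Thresholds 0, 0.05, ..., 1.0 (rounded to avoid float drift)."""
    n = round(1 / step)
    return [round(i * step, 2) for i in range(n + 1)]


def precision_recall(scores, labels, threshold):
    """Precision and recall when score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ensemble scores (e.g. fraction of member models voting positive).
scores = [0.9, 0.8, 0.35, 0.3, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0]

sweep = {t: precision_recall(scores, labels, t) for t in threshold_grid()}
for t, (p, r) in sweep.items():
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

    Raising the threshold generally trades recall for precision, which is why the appropriate value depends on the use case.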

  2. Data from: Less is More: An Empirical Study of Undersampling Techniques for...

    • figshare.com
    zip
    Updated May 20, 2024
    Cite
    Gichan Lee (2024). Less is More: An Empirical Study of Undersampling Techniques for Technical Debt Prediction [Dataset]. http://doi.org/10.6084/m9.figshare.22708036.v1
    Available download formats: zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    figshare
    Authors
    Gichan Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical Debt (TD) prediction is crucial to preventing software quality degradation and maintenance cost increases. Recent Machine Learning (ML) approaches have shown promising results in TD prediction, but imbalanced TD datasets can degrade ML model performance. Although previous TD studies have investigated various oversampling techniques that generate minority class instances to mitigate the imbalance, the potential of undersampling techniques has not yet been thoroughly explored due to concerns about information loss. To address this gap, we investigate the impact of undersampling on ML model performance for TD prediction by utilizing 17,797 classes from 25 Java open-source projects. We compare the performance of ML models with different undersampling techniques and evaluate the impact of combining them with oversampling techniques widely used in TD studies. Our findings reveal that (i) undersampling can significantly improve ML model performance compared to oversampling and no resampling; (ii) the combined application of undersampling and oversampling techniques yields a further performance improvement compared to applying each technique exclusively. Based on these results, we recommend that practitioners explore various undersampling techniques and their combinations with oversampling techniques for more effective TD prediction.

    This package is for the replication of 'Less is More: an Empirical Study of Undersampling Techniques for Technical Debt Prediction'.

    File list:

    • X.csv, Y.csv: The datasets for the study, used in the ipynb file below.
    • under_over_sampling_scripts.ipynb: These scripts can reproduce all the experimental results from the study. They can be run through Jupyter Notebook or Google Colab. The required packages are listed at the top of the file, so installation via pip or conda is necessary before running.
    • Results_for_all_tables.csv: A csv file that summarizes all the results obtained from the study.
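
    The idea of combining undersampling with oversampling can be sketched with plain random resampling. This is only an illustration under stated assumptions: it is not the notebook from the replication package, the function name `resample_to_target` is hypothetical, and the study evaluates specific undersampling and oversampling techniques rather than the random variants used here.

```python
import random
from collections import Counter


def resample_to_target(X, y, target, seed=0):
    """Bring every class to `target` samples.

    Classes larger than `target` are randomly undersampled (without
    replacement); smaller classes are randomly oversampled (with replacement).
    """
    rng = random.Random(seed)
    new_X, new_y = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        if len(idx) > target:
            chosen = rng.sample(idx, target)  # undersample the larger class
        else:
            extra = [rng.choice(idx) for _ in range(target - len(idx))]
            chosen = idx + extra  # oversample the smaller class
        new_X.extend(X[i] for i in chosen)
        new_y.extend(y[i] for i in chosen)
    return new_X, new_y


# Toy data: 90 majority samples (label 0) and 10 minority samples (label 1).
X = list(range(100))
y = [0] * 90 + [1] * 10

X_res, y_res = resample_to_target(X, y, target=30)
print(Counter(y_res))
```

    Meeting in the middle (here 30 per class) discards less majority data than pure undersampling and duplicates fewer minority samples than pure oversampling.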

  3. Is this a good customer?

    • kaggle.com
    Updated Apr 16, 2020
    Cite
    podsyp (2020). Is this a good customer? [Dataset]. https://www.kaggle.com/podsyp/is-this-a-good-customer/tasks
    Dataset updated
    Apr 16, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    podsyp
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

    Content

    Standard accuracy no longer reliably measures performance, which makes model training much trickier. Imbalanced classes appear in many domains, including: - Antifraud - Antispam - ...
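
    A small made-up example shows why accuracy misleads here. The 95/5 class split below is an assumption chosen for illustration, not a property of this dataset.

```python
# 95 negatives and 5 positives; a "model" that always predicts the majority class.
labels = [0] * 95 + [1] * 5
preds = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Recall per class: the share of each class that is predicted correctly.
recall_neg = sum(p == 0 for p, y in zip(preds, labels) if y == 0) / labels.count(0)
recall_pos = sum(p == 1 for p, y in zip(preds, labels) if y == 1) / labels.count(1)

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_neg + recall_pos) / 2

print(accuracy, recall_pos, balanced_accuracy)
```

    The classifier scores 95% accuracy while finding none of the positive cases, and balanced accuracy exposes this by dropping to chance level.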

    Inspiration

    5 tactics for handling imbalanced classes in machine learning:

    • Up-sample the minority class
    • Down-sample the majority class
    • Change your performance metric
    • Penalize algorithms (cost-sensitive training)
    • Use tree-based algorithms
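
    As one way to make the cost-sensitive tactic concrete, the sketch below computes inverse-frequency class weights, n_samples / (n_classes * class_count). The helper name `balanced_class_weights` and the 90/10 toy labels are assumptions for illustration; the dataset description does not prescribe a weighting scheme.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: 90 "good" customers (0) and 10 "bad" customers (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

    Errors on the rare class then cost more during training, so the model is penalized for ignoring it.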

  4. Predict students' dropout and academic success

    • zenodo.org
    • data.niaid.nih.gov
    Updated Mar 14, 2023
    Cite
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins (2023). Predict students' dropout and academic success [Dataset]. http://doi.org/10.5281/zenodo.5777340
    Dataset updated
    Mar 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valentim Realinho; Jorge Machado; Luís Baptista; Mónica V. Martins
    Description

    A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

    The dataset includes information known at the time of student enrollment (academic path, demographics, and socioeconomic factors) and the students' academic performance at the end of the first and second semesters.

    The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three-category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course.

    Funding
    We acknowledge support of this work by the program "SATDAP - Capacitação da Administração Pública" under grant POCI-05-5762-FSE-000191, Portugal.

  5. Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for...

    • osf.io
    Updated Jan 11, 2019
    Cite
    Joko Suntoro (2019). Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for Handling High Dimensional Data and Class Imbalanced Problem [Dataset]. http://doi.org/10.17605/OSF.IO/9JNE2
    Dataset updated
    Jan 11, 2019
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Joko Suntoro
    Description

    No description was included in this Dataset collected from the OSF

  6. Over-sampled dataset.

    • figshare.com
    xls
    Updated Dec 31, 2024
    + more versions
    Cite
    Seongil Han; Haemin Jung (2024). Over-sampled dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t004
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model’s decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE’s capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
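
    For readers less familiar with the metrics quoted above, the following sketch computes F1 and MCC from a binary confusion matrix. The counts are invented for illustration and are unrelated to the NATE results.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced test set: 60 positives, 940 negatives.
tp, fp, fn, tn = 40, 10, 20, 930

print(f"F1  = {f1_score(tp, fp, fn):.4f}")
print(f"MCC = {mcc(tp, fp, fn, tn):.4f}")
```

    Unlike accuracy, both metrics stay informative when one class dominates, and MCC additionally takes true negatives into account.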

  7. Comparative analysis over various datasets.

    • plos.figshare.com
    xls
    Updated Jan 10, 2025
    + more versions
    Cite
    Tao Yu; Wei Huang; Xin Tang; Duosi Zheng (2025). Comparative analysis over various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316557.t003
    Available download formats: xls
    Dataset updated
    Jan 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Tao Yu; Wei Huang; Xin Tang; Duosi Zheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In credit risk assessment, unsupervised classification techniques can be introduced to reduce human resource expenses and expedite decision-making. Despite the efficacy of unsupervised learning methods in handling unlabeled datasets, their performance remains limited owing to challenges such as imbalanced data, local optima, and parameter adjustment complexities. Thus, this paper introduces a novel hybrid unsupervised classification method, named the two-stage hybrid system with spectral clustering and semi-supervised support vector machine (TSC-SVM), which effectively addresses the unsupervised imbalance problem in credit risk assessment by targeting global optimal solutions. Furthermore, a multi-view combined unsupervised method is designed to thoroughly mine data and enhance the robustness of label predictions. This method mitigates discrepancies in prediction outcomes from three distinct perspectives. The effectiveness, efficiency, and robustness of the proposed TSC-SVM model are demonstrated through various real-world applications. The proposed algorithm is anticipated to expand the customer base for financial institutions while reducing economic losses.

  8. Hyperparameter settings of classification model.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    + more versions
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings of classification model. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t006
    Available download formats: xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications in fields such as information filtering, information retrieval, public opinion monitoring, and library and information, with few studies applying text classification methods to the field of tourist attractions. In light of this, this paper constructs a corpus of tourist attraction description texts using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed algorithm is then combined with each of seven commonly used classifiers (DT, SVM, LR, NB, MLP, RF, and KNN), known for their good performance, to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing the overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry. The Acc, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness. In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that this algorithm is practical. The professional domain text dataset used in this paper poses greater challenges due to its complexity (uneven text length, relatively imbalanced categories) and a high degree of similarity between categories. However, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a beneficial exploration of applied research on complex Chinese text datasets in specific fields, and provides a useful reference for the vector representation and classification of text datasets with similar content.
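
    The difference between the macro-F1 and micro-F1 figures reported above can be sketched as follows. The labels are toy values, not data from the paper, and the helper names are hypothetical.

```python
def per_class_counts(y_true, y_pred, cls):
    """True positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, c) for c in classes]
    # Macro: average the per-class F1 scores, so rare classes weigh as much as common ones.
    macro = sum(f1(*c) for c in counts) / len(classes)
    # Micro: pool the counts over all classes first, so frequent classes dominate.
    micro = f1(*(sum(col) for col in zip(*counts)))
    return macro, micro


# Toy imbalanced labels: class "b" is rare and half of it is misclassified.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1={macro:.4f} micro-F1={micro:.4f}")
```

    On imbalanced data the macro average falls further when a rare category is missed, which is why it is the more sensitive of the two for rare categories.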

  9. Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search...

    • zenodo.org
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Mohammad Nazmul Haque; Nasimul Noman; Regina Berratta; Pablo Moscato (2020). Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search using Genetic Algorithm for Class Imbalanced Data Classification" [Dataset]. http://doi.org/10.5281/zenodo.33539
    Available download formats: application/gzip, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mohammad Nazmul Haque; Nasimul Noman; Regina Berratta; Pablo Moscato
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Faces Dataset: PubFig05

    This is a subset of the "PubFig83" dataset [1] that provides 100 images for each of the 5 celebrities that are most difficult to recognise (each celebrity is a class in the classification problem). For each celebrity, we took the 100 images and separated them into training and testing sets of 90 and 10 images, respectively:

    Person: Jenifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba

    Feature Extraction

    To extract features from images, we have applied the HT-L3-model as described in [2] and obtained 25600 features.

    Feature Selection

    The feature selection steps, in brief, are as follows:

    1. Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset and discarded features using the minimum description length (MDL) principle; only 4878 features passed this entropy-based filter.

    2. Class-Distribution Balancing: Next, we converted the dataset into 5 binary-class datasets using a one-vs-all setup. As a result, these datasets were imbalanced at a ratio of 1:4. We then converted them into balanced binary-class datasets using random sub-sampling. Further processing of the dataset is described in the paper.

    3. (alpha,beta)-k Feature Selection: To get a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4]. It selects a minimum subset of features that maximises both within-class similarity and between-class dissimilarity. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the Table below) after consolidating them into a binary-class dataset.

    • UAB: We applied the (alpha,beta)-k feature set method to each of the balanced binary-class datasets and took the union of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature set selection method to each of the binary-class datasets to get a set of features.

    • IAB: We applied the (alpha,beta)-k feature set method to each of the balanced binary-class datasets and took the intersection of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature set selection method to each of the binary-class datasets to get a set of features.

    • UEAB: We applied the (alpha,beta)-k feature set method to each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature set selection method to each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset to get a set of features.

    All of these datasets are inside the compressed folder, which also contains a document describing the process in detail.
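
    The class-distribution balancing step (one-vs-all conversion followed by random sub-sampling) can be sketched as below. This is not the authors' code: the class names `person_0` to `person_4` and both helper functions are hypothetical, and only the 5 classes of 100 images each are taken from the description.

```python
import random
from collections import Counter


def one_vs_all(labels, positive):
    """Relabel a multi-class problem as binary: 1 for `positive`, 0 for the rest."""
    return [1 if lab == positive else 0 for lab in labels]


def subsample_to_balance(binary, seed=0):
    """Indices of a balanced subset: all minority samples plus an equally
    sized random sample (without replacement) of the majority class."""
    rng = random.Random(seed)
    pos = [i for i, b in enumerate(binary) if b == 1]
    neg = [i for i, b in enumerate(binary) if b == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


# 5 classes with 100 images each, as in the description.
labels = [f"person_{k}" for k in range(5) for _ in range(100)]

binary = one_vs_all(labels, "person_0")
counts = Counter(binary)  # 100 positives vs 400 negatives, a 1:4 ratio

kept = subsample_to_balance(binary)
balanced_counts = Counter(binary[i] for i in kept)
print(counts, balanced_counts)
```

    Repeating this for each of the five classes gives the five balanced binary-class datasets to which the feature selection variants are applied.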

    References

    [1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).

    [2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).

    [3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).

    [4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). 1082201: Australian Computer Society, Inc.

  10. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Nagappan, Meiyappan
    Keshavarz, Hossein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
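
    The time-based split behind the train and test files can be sketched as follows. Only the `buggy` column and the 2003 to 2016 training window come from the description; the `commit_id` and `year` fields, the sample records, and the function name are assumptions for illustration, and the balancing of the training set is not reproduced here.

```python
def temporal_split(commits, last_train_year=2016):
    """Older commits go to training, newer commits to testing (no shuffling)."""
    train = [c for c in commits if c["year"] <= last_train_year]
    test = [c for c in commits if c["year"] > last_train_year]
    return train, test


# Hypothetical records; field names other than "buggy" are assumed.
commits = [
    {"commit_id": "a1", "year": 2005, "buggy": True},
    {"commit_id": "b2", "year": 2012, "buggy": False},
    {"commit_id": "c3", "year": 2016, "buggy": False},
    {"commit_id": "d4", "year": 2017, "buggy": True},
    {"commit_id": "e5", "year": 2019, "buggy": False},
]

train, test = temporal_split(commits)
print(len(train), len(test))
```

    Splitting by time rather than at random keeps future commits out of training, which matches the real-life JIT scenario the description refers to.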

    In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts consist of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting the GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. Script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324

    2. PyDriller
    • https://pydriller.readthedocs.io/en/latest/

    • Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911

  11. Data from: The influence of balanced and imbalanced resource supply on...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 11, 2017
    + more versions
    Cite
    The influence of balanced and imbalanced resource supply on biodiversity-functioning relationship across ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_h50d9
    Available download formats: zip
    Dataset updated
    Mar 11, 2017
    Dataset provided by
    German Centre for Integrative Biodiversity Research (iDiv) (https://www.idiv.de/)
    Netherlands Institute of Ecology
    Michigan State University
    Plymouth Marine Laboratory
    GEOMAR Helmholtz Centre for Ocean Research Kiel
    University of Maryland, College Park
    KU Leuven
    Vrije Universiteit Brussel
    University of Hildesheim
    University of KwaZulu-Natal
    Tohoku University
    University of Gothenburg
    Institute of Natural Sciences
    Carl von Ossietzky Universität Oldenburg
    Ghent University
    University of Nebraska–Lincoln
    Monash University
    University of Minnesota
    Authors
    Aleksandra M. Lewandowska; Antje Biermann; Elizabeth T. Borer; Miguel A. Cebrian-Piqueras; Steven A. J. Declerck; Luc De Meester; Ellen van Donk; Lars Gamfeldt; Daniel S. Gruner; Nicole Hagenah; W. Stanley Harpole; Kevin P. Kirkman; Christopher A. Klausmeier; Michael Kleyer; Johannes M. H. Knops; Pieter Lemmens; Eric M. Lind; Elena Litchman; Jasmin Mantilla-Contreras; Koen Martens; Sandra Meier; Vanessa Minden; Joslin L. Moore; Harry olde Venterink; Eric W. Seabloom; Ulrich Sommer; Maren Striebel; Anastasia Trenkamp; Juliane Trinogga; Jotaro Urabe; Wim Vyverman; Dedmer B. Van de Waal; Claire E. Widdicombe; Helmut Hillebrand
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Numerous studies show that increasing species richness leads to higher ecosystem productivity. This effect is often attributed to more efficient portioning of multiple resources in communities with higher numbers of competing species, indicating the role of resource supply and stoichiometry for biodiversity–ecosystem functioning relationships. Here, we merged theory on ecological stoichiometry with a framework of biodiversity–ecosystem functioning to understand how resource use transfers into primary production. We applied a structural equation model to define patterns of diversity–productivity relationships with respect to available resources. Meta-analysis was used to summarize the findings across ecosystem types ranging from aquatic ecosystems to grasslands and forests. As hypothesized, resource supply increased realized productivity and richness, but we found significant differences between ecosystems and study types. Increased richness was associated with increased productivity, although this effect was not seen in experiments. More even communities had lower productivity, indicating that biomass production is often maintained by a few dominant species, and reduced dominance generally reduced ecosystem productivity. This synthesis, which integrates observational and experimental studies in a variety of ecosystems and geographical regions, exposes common patterns and differences in biodiversity–functioning relationships, and increases the mechanistic understanding of changes in ecosystems productivity.

  12. Data from: Interactive Heavily Unbalanced Power System Analysis

    • portalinvestigacion.uniovi.es
    • ieee-dataport.org
    Updated 2021
    Cite
    Arboleya, Pablo (2021). Interactive Heavily Unbalanced Power System Analysis [Dataset]. https://portalinvestigacion.uniovi.es/documentos/668fc416b9e7c03b01bd41e1
    Dataset updated
    2021
    Authors
    Arboleya, Pablo
    Description

    To see and interact with the case study, open the case_of_study.html file in a web browser. It should work in all browsers, but we tested it in Google Chrome, Firefox and Safari.

    Once the file is open, we will see something like the attached figure. The next steps describe the available variables and how to interact with them.

    A heavily unbalanced three-phase system of 11 nodes is represented on the left, where each phase of a line is drawn in a color (red, green, blue) for phases (a, b, c) respectively; the same convention applies to the nodes. The white number in the center of each node is the node number. The movement of the yellow ball along the lines represents the convention for positive power flow: the power flow in a line is positive if it flows in the same direction as the ball moves. Voltages are expressed in V, currents in A and power in kW, kVAr or kVA.

    The power flow in this system has been solved and all the results, including the power dividers, are available. Nodal variables, such as voltage, current injections, and active, reactive and apparent power, can be displayed at the nodes by selecting them in the drop-down menu labeled “Node Variable”. Next to this drop-down menu is a checkbox labeled “Angle”: if it is checked, complex nodal variables are shown with their angle; if not, only the modulus is shown. The size of the colored circles inside a node is proportional to the modulus of the variable represented. Positive active power means consumption; negative active power means injection. Clicking a node of the scheme shows a phasor diagram with the injected currents and the node voltages on the right side of the screen. The power triangles are also represented.

    The “Line variable” drop-down menu lets us display variables on the lines: currents, powers and losses. I12t is the total current (including the shunt current, if it exists) flowing from the starting point to the ending point, while I21t is the total current flowing from the ending point to the starting point. Note that the yellow ball always moves from the starting point to the ending point. Shunt currents are represented by I1s and I2s, and the currents flowing through the lines without considering shunt currents are expressed as I12 and I21. The same criteria apply to active, reactive and apparent power through the lines. The line losses can be displayed by selecting “Line_losses” in this same drop-down menu. The width of the colored lines representing the phases is proportional to the modulus of the variable represented. Clicking a specific line shows all the currents through that line on the right part of the screen, in the so-called “Line details” diagram.

    The power dividers of expression (21) of the paper, P(m,n) and P(m,n), as well as P(m,n)loss, can be analyzed using the drop-down menu labeled “Pdivs”, where we can select Pdiv12t and Pdiv21t, representing respectively P(m,n) and P(m,n). Pdiv_losses represents P(m,n)loss.

    We recommend the following setup for analysis:

    • Node variable drop-down menu: P
    • Line variable drop-down menu: Line_losses
    • Pdivs drop-down menu: Pdiv_losses
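
    The relation between the voltage and current phasors and the active, reactive and apparent power shown in the tool can be sketched for a single phase as S = V · I*. The values below are made up, the helper names are hypothetical, and the sign of P depends on the chosen reference direction for the current, so this is only an illustration and not the solver used to produce the dataset.

```python
import cmath
import math


def phasor(magnitude, angle_deg):
    """Complex phasor from a magnitude and an angle in degrees."""
    return cmath.rect(magnitude, math.radians(angle_deg))


def complex_power(voltage, current):
    """Single-phase complex power S = V * conj(I), in VA for V in volts and I in amperes."""
    return voltage * current.conjugate()


# Hypothetical phase-a values: 230 V at 0 degrees, 10 A lagging by 30 degrees.
v_a = phasor(230.0, 0.0)
i_a = phasor(10.0, -30.0)

s_a = complex_power(v_a, i_a)
p_kw = s_a.real / 1000    # active power, kW
q_kvar = s_a.imag / 1000  # reactive power, kVAr
s_kva = abs(s_a) / 1000   # apparent power, kVA

print(f"P={p_kw:.3f} kW  Q={q_kvar:.3f} kVAr  |S|={s_kva:.3f} kVA")
```

    P, Q and |S| form the sides of the power triangle, so |S|² = P² + Q² holds for each phase.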

  13. The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 7, 2024
    Cite
    Asadi, Mehdi (2024). The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10714822
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Turku University of Applied Sciences
    Asadi, Mehdi
    Majd, Amin
    Auranen, Jani
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Turku
    Description

    The Turku UAS DeepSeaSalama-GAN dataset 1 (TDSS-G1) is a comprehensive image dataset obtained from a maritime environment. This dataset was assembled in the southwest Finnish archipelago area at Taalintehdas, using two stationary RGB fisheye cameras, in August 2022. The technical setup is described in the section “Sensor Platform design” of the report “Development of Applied Research Platforms for Autonomous and Remotely Operated Systems” (https://www.theseus.fi/handle/10024/815628).

    The data collection and annotation process was carried out in the Autonomous and Intelligent Systems laboratory at Turku University of Applied Sciences. The dataset is a blend of original images captured by our cameras and synthetic data generated by a Generative Adversarial Network (GAN), simulating 18 distinct weather conditions.

    The TDSS-G1 dataset comprises 199 original images and a substantial addition of 3582 synthetic images, culminating in a total of 3781 annotated images. These images provide a diverse representation of various maritime objects, including motorboats, sailing boats, and seamarks.
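    The counts quoted above can be checked with a little arithmetic. The sketch below uses only the numbers stated in this description; reading the result as "one synthetic image per original image and weather condition" is an assumption, since the description does not say how the synthetic images were distributed.

```python
ORIGINAL_IMAGES = 199
SYNTHETIC_IMAGES = 3582
WEATHER_CONDITIONS = 18

# Total annotated images: originals plus synthetic ones.
total = ORIGINAL_IMAGES + SYNTHETIC_IMAGES

# Synthetic images per original image, with any leftover.
per_original, remainder = divmod(SYNTHETIC_IMAGES, ORIGINAL_IMAGES)

print(total)                    # total annotated images
print(per_original, remainder)  # synthetic images per original, leftover
```

    The quotient equals the number of simulated weather conditions with no remainder, which is consistent with (but does not prove) each original image being rendered once per condition.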

    The creation of TDSS-G1 involved extracting images from videos recorded in MPEG format, with a resolution of 720p at 30 frames per second (FPS). An image was extracted every 100 milliseconds.
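    As a rough illustration of the sampling rate described above, the sketch below converts the stated 100 ms interval into a frame step for 30 FPS video. The constant names are illustrative; only the two numbers come from the description.

```python
FPS = 30            # frames per second of the source video
INTERVAL_MS = 100   # one image extracted every 100 milliseconds

# The interval must correspond to a whole number of frames.
assert (FPS * INTERVAL_MS) % 1000 == 0

frame_step = FPS * INTERVAL_MS // 1000    # keep every frame_step-th frame
images_per_second = 1000 // INTERVAL_MS   # extracted images per second of video

print(frame_step, images_per_second)
```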

    The distribution of labels within TDSS-G1 is as follows: motorboats (62.1%), sailing boats (16.8%), and seamarks (21.1%).

    This distribution highlights a class imbalance, with motorboats being the most represented class and sailing boats the least. This imbalance is an important factor to consider during model training, as it could reduce the model’s ability to recognize underrepresented classes accurately. In future synthetic datasets, vision Transformers will be used to tackle this problem.
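    One common way to account for such an imbalance during training is inverse-frequency class weighting. This is a generic technique, not something the dataset authors describe using; the sketch below simply applies it to the label shares stated above, with illustrative variable names.

```python
# Label shares as stated in the dataset description.
label_share = {
    "motor_boat": 0.621,
    "sailing_boat": 0.168,
    "seamark": 0.211,
}

n_classes = len(label_share)

# Inverse-frequency weights: a class with share 1/n_classes gets weight 1.0,
# rarer classes get more, more common classes get less.
class_weight = {
    name: 1.0 / (n_classes * share) for name, share in label_share.items()
}

for name, weight in class_weight.items():
    print(f"{name}: {weight:.3f}")
```

    With this scheme the share-weighted average of the weights is 1, so the overall loss scale stays roughly unchanged while rare classes count more.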

    The TDSS-G1 dataset is organized into three distinct subsets for the purpose of training and evaluating machine learning models. These subsets are as follows:

    Training Set: Located in dataset/train/images, this set is used to train the model. It learns to recognize the different classes of maritime objects from this data.

    Validation Set: Stored in dataset/valid/images, this set is used to tune the model parameters and to prevent overfitting during the training process.

    Test Set: Found in dataset/test/images, this set is used to evaluate the final performance of the model. It provides an unbiased assessment of how the model will perform on unseen data.
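    The three-subset layout above can be inspected with a short script. This is a minimal sketch: the directory names come from the description, while the function name `count_images` and the set of image file extensions are assumptions, since the description does not state the image format.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")
# Assumed image extensions; the dataset description does not specify them.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Count image files in <root>/<split>/images for each split."""
    counts = {}
    for split in SPLITS:
        image_dir = Path(root) / split / "images"
        if not image_dir.is_dir():
            counts[split] = 0  # a missing split directory counts as empty
            continue
        counts[split] = sum(
            1
            for path in image_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


# Prints zeros for every split if no "dataset" directory is present.
print(count_images("dataset"))
```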

    The dataset comprises three classes (nc: 3), each representing a different type of maritime object. The classes are as follows:

    Motor Boat (motor_boat)

    Sailing Boat (sailing_boat)

    Seamark (seamark)

    These labels correspond to the annotated objects in the images. The model trained on this dataset will be capable of identifying these three types of maritime objects. As mentioned earlier, the distribution of these classes is imbalanced, which is an important factor to consider during the training process.

  14. Data from: The limits of the constant-rate birth-death prior for...

    • zenodo.org
    bin, sh
    Updated Jul 28, 2023
    Cite
    Mark Khurana; Neil Scheidwasser-Clow; Matthew J. Penn; Samir Bhatt; David Duchêne (2023). The limits of the constant-rate birth-death prior for phylogenetic tree topology inference [Dataset]. http://doi.org/10.5281/zenodo.8187005
    Available download formats: bin, sh
    Dataset updated
    Jul 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mark Khurana; Neil Scheidwasser-Clow; Matthew J. Penn; Samir Bhatt; David Duchêne
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Birth-death models are stochastic processes describing speciation and extinction through time and across taxa, and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically from empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios, but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.
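    To make "phylogenetic imbalance" concrete, the sketch below computes the Colless index, one widely used tree imbalance index. The description does not say which topology indices the authors used, so this choice is an assumption, as are the nested-tuple tree representation and the function name `colless`.

```python
def colless(tree):
    """Return (leaf_count, colless_index) for a binary tree.

    A tree is either a leaf (any non-tuple value) or a 2-tuple of subtrees.
    The Colless index sums, over all internal nodes, the absolute difference
    between the leaf counts of the two child subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


# Two hypothetical four-taxon topologies.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

print(colless(balanced)[1])     # fully balanced tree
print(colless(caterpillar)[1])  # maximally imbalanced four-leaf tree
```

    Higher values indicate a more imbalanced topology; the balanced tree scores 0 and the caterpillar tree scores 3.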

  15. Data from: Species Selection Regime and Phylogenetic Tree Shape

    • data.niaid.nih.gov
    • zenodo.org
    zip
    Updated Nov 26, 2019
    Cite
    George Verboom; Florian Boucher; David Ackerly; Lara Wootton; William Freyman (2019). Species Selection Regime and Phylogenetic Tree Shape [Dataset]. http://doi.org/10.5061/dryad.1sf007b
    Available download formats: zip
    Dataset updated
    Nov 26, 2019
    Dataset provided by
    Université Grenoble Alpes
    University of Cape Town
    University of California, Berkeley
    University of Minnesota
    Authors
    George Verboom; Florian Boucher; David Ackerly; Lara Wootton; William Freyman
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Species selection, the effect of heritable traits in generating between-lineage diversification rate differences, provides a valuable conceptual framework for understanding the relationship between traits, diversification and phylogenetic tree shape. An important challenge, however, is that the nature of real diversification landscapes – curves or surfaces which describe the propensity of species-level lineages to diversify as a function of one or more traits – remains poorly understood. Here we present a novel, time-stratified extension of the QuaSSE model in which speciation/extinction rate is specified as a static or temporally-shifting Gaussian or skewed-Gaussian function of the diversification trait. We then use simulations to show that the generally imbalanced nature of real phylogenetic trees, as well as their generally greater-than-expected frequency of deep branching events, are typical outcomes when diversification is treated as a dynamic, trait-dependent process. Focusing on four basic models (Gaussian-speciation with and without background extinction; skewed-speciation; Gaussian-extinction), we also show that particular features of the species selection regime produce distinct tree shape signatures and that, consequently, a combination of tree shape metrics has the potential to reveal the species selection regime under which a particular lineage diversified. We evaluate this idea empirically by comparing the phylogenetic trees of plant lineages diversifying within climatically- and geologically-stable environments of the Greater Cape Floristic Region, with those of lineages diversifying in environments that have experienced major change through the Late Miocene-Pliocene. Consistent with our expectations, the trees of lineages diversifying in a dynamic context are less balanced, show a greater concentration of branching events close to the present, and display stronger diversification rate-trait correlations. 
    We suggest that species selection plays an important role in shaping phylogenetic trees but recognize the need for an explicit probabilistic framework within which to assess the likelihoods of alternative diversification scenarios as explanations of a particular tree shape.

  16. Data from: Unbalanced species losses and gains lead to non-linear...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 21, 2020
    Cite
    James M. Bullock (2020). Unbalanced species losses and gains lead to non-linear trajectories as grasslands become forests [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3406338
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    James M. Bullock
    Adam Kimberley
    Sara A.O. Cousins
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for the article "Unbalanced species losses and gains lead to non-linear trajectories as grasslands become forests" in Journal of Vegetation Science. Plot data contains information on grassland sites in the archipelago, including how long they have been abandoned for and the surrounding landscape composition. Species occurrence matrix contains data on the plant communities found in vegetation sampling plots at respective sites.

  17. MOESM2 of Quality control of imbalanced mass spectra from isotopic labeling...

    • springernature.figshare.com
    xls
    Updated Feb 16, 2024
    Cite
    Tianjun Li; Long Chen; Min Gan (2024). MOESM2 of Quality control of imbalanced mass spectra from isotopic labeling experiments [Dataset]. http://doi.org/10.6084/m9.figshare.10264199.v1
    Available download formats: xls
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    figshare
    Authors
    Tianjun Li; Long Chen; Min Gan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. This is a four-sheet xls file (7180 KB) containing the TPP analysis results and extracted features. Each sheet corresponds to one dataset, with the sample ratio given in the sheet name; e.g., sheet 1_1 holds the 1:1 ratio sample data.

  18. Data from: Age-dependent and lineage-dependent speciation and extinction in...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jan 24, 2017
    Cite
    Eric W. Holman (2017). Age-dependent and lineage-dependent speciation and extinction in the imbalance of phylogenetic trees [Dataset]. http://doi.org/10.5061/dryad.2q9r7
    Available download formats: zip
    Dataset updated
    Jan 24, 2017
    Authors
    Eric W. Holman
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    It is known that phylogenetic trees are more imbalanced than expected from a birth–death model with constant rates of speciation and extinction, and also that imbalance can be better fit by allowing the rate of speciation to decrease as the age of the parent species increases. If imbalance is measured in more detail, at nodes within trees as a function of the number of species descended from the nodes, age-dependent models predict levels of imbalance comparable to real trees for small numbers of descendent species, but predicted imbalance approaches an asymptote not found in real trees as the number of descendent species becomes large. Age-dependence must therefore be complemented by another process such as inheritance of different rates along different lineages, which is known to predict insufficient imbalance at nodes with few descendent species, but can predict increasing imbalance with increasing numbers of descendent species.

  19. Increase in AUC, MCC, and F1 between oversampling and undersampling.

    • plos.figshare.com
    xls
    Updated Dec 31, 2024
    Cite
    Seongil Han; Haemin Jung (2024). Increase in AUC, MCC, and F1 between oversampling and undersampling. [Dataset]. http://doi.org/10.1371/journal.pone.0316454.t009
    Available download formats: xls
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Seongil Han; Haemin Jung
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Increase in AUC, MCC, and F1 between oversampling and undersampling.

  20. Dataset on dynamic analysis of an unbalanced hollow cylinder rolling over a...

    • b2find.dkrz.de
    Updated Jan 4, 2024
    Cite
    (2024). Dataset on dynamic analysis of an unbalanced hollow cylinder rolling over a horizontal plane - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/7fce84b1-f5ec-5810-a776-f12eae179b1f
    Dataset updated
    Jan 4, 2024
    License

    Licence Ouverte / Open Licence 2.0, https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
    License information was derived automatically

    Description

    Rigid solid body dynamics is a key element of the undergraduate mechanical engineering curriculum. In a context of reverse engineering and/or sustainable development, being able to analyze the mechanical and material properties of a system without damaging it is a required skill. In this dataset, an unbalanced hollow cylinder rolling over a horizontal path without sliding is studied. Four cohorts of final-year bachelor students in mechanical engineering, a hundred students a year, followed a total of 12 hours of practical sessions working on such systems. This work aims to show how computer tools can support and improve a rigid solid body dynamics course.
