100+ datasets found
  1. Data from: Exploring deep learning techniques for wild animal behaviour...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Osaka University
    Nagoya University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
    Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

    This abstract is quoted from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
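The random augmentation described in the abstract (one technique chosen at random for each window during mini-batch training) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the default parameters, and the `(n_windows, n_timesteps, 3)` batch layout are assumptions, and time-warping is left out to keep the sketch short.

```python
import numpy as np

def scaling(x, rng, sigma=0.1):
    # Multiply each axis by a random factor close to 1 (sigma is an assumed default).
    return x * rng.normal(1.0, sigma, size=(1, x.shape[1]))

def jittering(x, rng, sigma=0.05):
    # Add independent Gaussian noise to every reading.
    return x + rng.normal(0.0, sigma, size=x.shape)

def permutation(x, rng, n_segments=4):
    # Cut the window into segments along time and shuffle their order.
    segments = np.array_split(x, n_segments, axis=0)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order], axis=0)

def rotation(x, rng):
    # Apply a random 3-D rotation, built from the QR factorisation of a Gaussian matrix.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one column so the matrix is a proper rotation
    return x @ q.T

def identity(x, rng):
    # The "none" option: leave the window unchanged.
    return x

AUGMENTATIONS = [identity, scaling, jittering, permutation, rotation]

def augment_batch(batch, rng):
    # batch has shape (n_windows, n_timesteps, 3); one technique is drawn per window.
    out = np.empty_like(batch)
    for i, window in enumerate(batch):
        augment = AUGMENTATIONS[rng.integers(len(AUGMENTATIONS))]
        out[i] = augment(window, rng)
    return out

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 50, 3))  # synthetic stand-in for tri-axial acceleration windows
augmented = augment_batch(batch, rng)
```

Drawing a new technique for each window on every pass means the model rarely sees exactly the same input twice, which is the usual motivation for applying augmentation inside the training loop rather than once up front.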

  2. Data from: Data augmentation for disruption prediction via robust surrogate...

    • dataverse.harvard.edu
    • osti.gov
    Updated Aug 31, 2024
    Cite
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The goal of this work is to generate large, statistically representative datasets for training machine learning models for disruption prediction from the data of only a few existing discharges. Such a comprehensive training database is important for achieving satisfactory and reliable prediction results with artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
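The coloring step mentioned in the description (sample each dimension independently, then impose correlations) can be illustrated with a small sketch. It assumes Gaussian white samples and a hypothetical 2x2 target covariance purely for illustration; the work itself uses Student-t process regression, which is not reproduced here.

```python
import numpy as np

def color(samples, target_cov):
    # samples: (n, d) array whose columns are uncorrelated with unit variance.
    # Multiplying by the Cholesky factor L (target_cov = L @ L.T) yields
    # samples whose covariance is target_cov.
    chol = np.linalg.cholesky(target_cov)
    return samples @ chol.T

rng = np.random.default_rng(0)
white = rng.standard_normal(size=(100_000, 2))   # independent dimensions
target_cov = np.array([[1.0, 0.8],
                       [0.8, 2.0]])              # hypothetical desired covariance
colored = color(white, target_cov)
empirical_cov = np.cov(colored, rowvar=False)    # close to target_cov for large n
```

Generating the dimensions independently keeps each per-dimension model cheap, and the linear transform afterwards restores the cross-dimension structure.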

  3. Data from: Explainable Graph Neural Networks with Data Augmentation for...

    • acs.figshare.com
    zip
    Updated Sep 14, 2023
    Cite
    Hongle An; Xuyang Liu; Wensheng Cai; Xueguang Shao (2023). Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C–H Acids [Dataset]. http://doi.org/10.1021/acs.jcim.3c00958.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    ACS Publications
    Authors
    Hongle An; Xuyang Liu; Wensheng Cai; Xueguang Shao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The pKa of C–H acids is an important parameter in the fields of organic synthesis, drug discovery, and materials science. However, the prediction of pKa is still a great challenge due to the limit of experimental data and the lack of chemical insight. Here, a new model for predicting the pKa values of C–H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract the topological and target-related information from the molecular graph data, and a readout layer was utilized to retrieve the information on the ionization site C atom. The retrieved information then was adopted to predict pKa by a fully connected network. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established by replacing the H atoms in a molecule with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining the change of pKa values when a specific atom was masked. This explainability was used to identify the key substituents for pKa. The model was evaluated on two data sets from the iBonD database. Dataset1 includes the experimental pKa values of C–H acids measured in DMSO, while dataset2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.

  4. Variable Message Signal annotated images for object detection

    • zenodo.org
    zip
    Updated Oct 2, 2022
    Cite
    Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

    This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.

    The folder structure of the dataset is as follows:

    • vms_dataset/
      • data.csv
      • real_images/
        • imgs/
        • annotations/
      • data-augmentation/
        • imgs/
        • annotations/

    In which:

    • data.csv: Each row contains the following information separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text.
    • real_images: Images extracted directly from the videos.
    • data-augmentation: Images created using data augmentation.
    • imgs: Image files in .jpg format.
    • annotations: Annotation files in .xml format.
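As a rough sketch of how one of the PASCAL VOC annotation files might be read, the snippet below parses a small inline sample with the standard library. The sample XML, the file name `example.jpg`, and the class label `vms` are hypothetical stand-ins; the real files in `annotations/` may carry additional fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout (not taken from the dataset).
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>vms</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    # Return the image file name and one dict per annotated object,
    # using the same field names as the columns listed for data.csv.
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append({
            "class_name": obj.findtext("name"),
            "x_min": int(bndbox.findtext("xmin")),
            "y_min": int(bndbox.findtext("ymin")),
            "x_max": int(bndbox.findtext("xmax")),
            "y_max": int(bndbox.findtext("ymax")),
        })
    return root.findtext("filename"), boxes

image_name, boxes = parse_voc(SAMPLE_XML)
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`; the rest of the function stays the same.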

  5. EDA augmentation parameters.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    + more versions
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). EDA augmentation parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
    Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
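To give a feel for the text transformations that Easy Data Augmentation (EDA) applies, here is a minimal sketch of two of its operations, random swap and random deletion. The function names, parameters, and example sentence are assumptions for illustration; synonym replacement and random insertion are omitted because they need a thesaurus, and the parameter values used in the study are not reproduced here.

```python
import random

def random_swap(words, n_swaps, rng):
    # Swap the positions of two randomly chosen words, n_swaps times.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    # Drop each word independently with probability p, keeping at least one word.
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = "la película fue realmente muy buena".split()
swapped = random_swap(sentence, n_swaps=1, rng=rng)
shortened = random_deletion(sentence, p=0.2, rng=rng)
```

Both operations keep the label of the original sentence, so each augmented copy can be added to the training set under the same class.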

  6. Optimizing Object Detection in Challenging Environments with Deep...

    • data.mendeley.com
    Updated Oct 24, 2024
    Cite
    Asad Ali (2024). Optimizing Object Detection in Challenging Environments with Deep Convolutional Neural Networks [Dataset]. http://doi.org/10.17632/gfpg6hxrvz.1
    Explore at:
    Dataset updated
    Oct 24, 2024
    Authors
    Asad Ali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Object detection in challenging environments, such as low-light, cluttered, or dynamic conditions, remains a critical issue in computer vision. Deep Convolutional Neural Networks (DCNNs) have emerged as powerful tools for addressing these challenges due to their ability to learn hierarchical feature representations. This paper explores the optimization of object detection in such environments by leveraging advanced DCNN architectures, data augmentation techniques, and domain-specific pre-training. We propose an enhanced detection framework that integrates multi-scale feature extraction, transfer learning, and regularization methods to improve robustness against noise, occlusion, and lighting variations. Experimental results demonstrate significant improvements in detection accuracy across various challenging datasets, outperforming traditional methods. This study highlights the potential of DCNNs in real-world applications, such as autonomous driving, surveillance, and robotics, where object detection in difficult conditions is crucial.

  7. Data from: Phenotype Driven Data Augmentation Methods for Transcriptomic...

    • data.niaid.nih.gov
    Updated Mar 6, 2025
    Cite
    María Rodríguez Martínez (2025). Phenotype Driven Data Augmentation Methods for Transcriptomic Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8383202
    Explore at:
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    Nikita Janakarajan
    María Rodríguez Martínez
    Mara Graziani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the data and associated results of all experiments conducted in our work "Phenotype Driven Data Augmentation Methods for Transcriptomic Data". In this work, we introduce two classes of phenotype driven data augmentation approaches – signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. We benchmark our proposed methods against random oversampling, SMOTE, unmodified versions of Gamma-Poisson and Poisson sampling, and unaugmented data.
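For orientation, a generic (unmodified) Gamma-Poisson sampler for count data might look like the sketch below. It is not the authors' modified method: the moment-matching fit, the function name, and the synthetic input are all assumptions made for illustration.

```python
import numpy as np

def gamma_poisson_sample(counts, n_new, rng):
    # counts: (n_samples, n_genes) non-negative counts from one class (n_samples >= 2).
    # Per gene, fit a Gamma distribution over Poisson rates by moment matching,
    # then draw rates and counts. Genes without overdispersion use plain Poisson.
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    over = var > mean                          # overdispersed genes
    excess = np.where(over, var - mean, 1.0)   # placeholder 1.0 avoids division by zero
    safe_mean = np.where(mean > 0, mean, 1.0)
    shape = np.where(over, safe_mean**2 / excess, 1.0)
    scale = np.where(over, excess / safe_mean, 1.0)
    rates = rng.gamma(shape, scale, size=(n_new, counts.shape[1]))
    rates = np.where(over, rates, mean)        # fall back to the sample mean as the rate
    return rng.poisson(rates)

rng = np.random.default_rng(0)
real = rng.negative_binomial(5, 0.1, size=(200, 4))   # synthetic stand-in for expression counts
synthetic = gamma_poisson_sample(real, n_new=20_000, rng=rng)
```

With this parameterisation the Gamma mean equals the observed gene mean, so the synthetic counts reproduce the per-gene mean and approximately the per-gene variance of the input class.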

    This repository contains data used for all our experiments. This includes the original data on which augmentation was performed, the cross-validation split indices as a JSON file, the training and validation data augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples) and an external test set standardised accordingly with respect to each augmentation method and training data per CV split.

    The compressed files 5x5stratified_{x}percent.zip contain data that were augmented on x% of the available real data. brca_public.zip contains data used for the breast cancer experiments. distribution_size_effect.zip contains data used for hyperparameter tuning of the reference set size for the modified Poisson and Gamma-Poisson methods.

    The compressed file results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics (balanced accuracy and auc-roc) computed including p-values, as well as the latent space of train, validation and test (for the (N)VAE) for all 25 (5x5) CV splits.

    PLEASE NOTE: If any part of this repository is used in any form for your work, please attribute the following, in addition to attributing the original data source - TCGA, CPTAC, GSE20713 and METABRIC, accordingly:

    @article{janakarajan2023signature,
    title={Phenotype Driven Data Augmentation Methods for Transcriptomic Data},
    author={Janakarajan, Nikita and Graziani, Mara and Martinez, Maria Rodriguez},
    journal={bioRxiv},
    pages={2023--10},
    year={2023},
    publisher={Cold Spring Harbor Laboratory}
    }

  8. Datasets GO ID/attribute p-value q-value.

    • figshare.com
    xls
    Updated Jul 22, 2024
    Cite
    Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu (2024). Datasets GO ID/attribute p-value q-value. [Dataset]. http://doi.org/10.1371/journal.pone.0305857.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traditional models for identifying differentially expressed genes (DEGs) have limitations on small-sample-size datasets because they require distribution assumptions to be met; otherwise, sample variation leads to high false positive/negative rates. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution types and sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size compared to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose a model, TabDEG, to predict DEGs and their up-regulation/down-regulation directions from gene expression data obtained from the Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional small-sample-size datasets, and validate that TabDEG-predicted DEGs are mapped to important gene ontology terms and pathways associated with cancer.

  9. Data from: New Deep Learning Methods for Medical Image Analysis and...

    • curate.nd.edu
    pdf
    Updated Nov 11, 2024
    Cite
    Pengfei Gu (2024). New Deep Learning Methods for Medical Image Analysis and Scientific Data Generation and Compression [Dataset]. http://doi.org/10.7274/26156719.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    Pengfei Gu
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    Medical image analysis is critical to biological studies, health research, computer-aided diagnoses, and clinical applications. Recently, deep learning (DL) techniques have achieved remarkable successes in medical image analysis applications. However, these techniques typically require large amounts of annotations to achieve satisfactory performance. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for medical image analysis while reducing annotation efforts? To address this problem, we have outlined two specific aims: (A1) Utilize existing annotations effectively from advanced models; (A2) extract generic knowledge directly from unannotated images.

    To achieve the aim (A1): First, we introduce a new data representation called TopoImages, which encodes the local topology of all the image pixels. TopoImages can be complemented with the original images to improve medical image analysis tasks. Second, we propose a new augmentation method, SAMAug-C, that leverages the Segment Anything Model (SAM) to augment raw image input and enhance medical image classification. Third, we propose two advanced DL architectures, kCBAC-Net and ConvFormer, to enhance the performance of 2D and 3D medical image segmentation. We also present a gate-regularized network training (GrNT) approach to improve multi-scale fusion in medical image segmentation. To achieve the aim (A2), we propose a novel extension of known Masked Autoencoders (MAEs) for self pre-training, i.e., models pre-trained on the same target dataset, specifically for 3D medical image segmentation.

    Scientific visualization is a powerful approach for understanding and analyzing various physical or natural phenomena, such as climate change or chemical reactions. However, the cost of scientific simulations is high when factors like time, ensemble, and multivariate analyses are involved. Additionally, scientists can only afford to sparsely store the simulation outputs (e.g., scalar field data) or visual representations (e.g., streamlines) or visualization images due to limited I/O bandwidths and storage space. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for scientific data generation and compression while reducing simulation and storage costs?

    To tackle this problem: First, we propose a DL framework that generates unsteady vector fields data from a set of streamlines. Based on this method, domain scientists only need to store representative streamlines at simulation time and reconstruct vector fields during post-processing. Second, we design a novel DL method that translates scalar fields to vector fields. Using this approach, domain scientists only need to store scalar field data at simulation time and generate vector fields from their scalar field counterparts afterward. Third, we present a new DL approach that compresses a large collection of visualization images generated from time-varying data for communicating volume visualization results.

  10. Data from: Phenotype Driven Data Augmentation Methods for Transcriptomic...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    + more versions
    Cite
    Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez (2025). Phenotype Driven Data Augmentation Methods for Transcriptomic Data [Dataset]. http://doi.org/10.5281/zenodo.14983178
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the data and associated results of all experiments conducted in our work "Phenotype Driven Data Augmentation Methods for Transcriptomic Data". In this work, we introduce two classes of phenotype driven data augmentation approaches – signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. We benchmark our proposed methods against random oversampling, SMOTE, unmodified versions of Gamma-Poisson and Poisson sampling, and unaugmented data.

    This repository contains data used for all our experiments. This includes the original data on which augmentation was performed, the cross-validation split indices as a JSON file, the training and validation data augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples) and an external test set standardised accordingly with respect to each augmentation method and training data per CV split.

    The compressed files 5x5stratified_{x}percent.zip contain data that were augmented on x% of the available real data. brca_public.zip contains data used for the breast cancer experiments. distribution_size_effect.zip contains data used for hyperparameter tuning of the reference set size for the modified Poisson and Gamma-Poisson methods.

    The compressed file results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics (balanced accuracy and auc-roc) computed including p-values, as well as the latent space of train, validation and test (for the (N)VAE) for all 25 (5x5) CV splits.
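Balanced accuracy, one of the two metrics named above, is the mean of the per-class recalls. A minimal standard-library sketch, with made-up labels that are not taken from results.zip, could be:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    # Average the recall of each class so that rare classes count as much as common ones.
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical labels: class 0 is the majority class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = balanced_accuracy(y_true, y_pred)   # (3/4 + 1/2) / 2 = 0.625
```

Plain accuracy on the same labels would be 4/6, so the balanced figure gives a less flattering view when the minority class is predicted poorly.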

    PLEASE NOTE: If any part of this repository is used in any form for your work, please attribute the following, in addition to attributing the original data source - TCGA, CPTAC, GSE20713 and METABRIC, accordingly:

    @article{janakarajan2025phenotype,
    title={Phenotype driven data augmentation methods for transcriptomic data},
    author={Janakarajan, Nikita and Graziani, Mara and Rodr{\'\i}guez Mart{\'\i}nez, Mar{\'\i}a},
    journal={Bioinformatics Advances},
    volume={5},
    number={1},
    pages={vbaf124},
    year={2025},
    publisher={Oxford University Press}
    }

  11. Replication Package of Deep Learning and Data Augmentation for Detecting...

    • zenodo.org
    zip
    Updated Apr 24, 2024
    Cite
    Anonymous Anonymous (2024). Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [Dataset]. http://doi.org/10.5281/zenodo.10521909
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 17, 2024
    Description

    Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying instances as SATD or not) or categorizing SATD (labeling SATD instances as pertaining to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly when dealing with specific types of SATD, such as test and requirement debt. This is mostly because the datasets used are extremely imbalanced.

    In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.

    Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:

    1. Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [Github]
    2. eXtreme Gradient Boosting+Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
    3. eXtreme Gradient Boosting+Easy Data Augmentation (XGBoost+EDA) [Github]
    4. MT-Text-CNN [Github]

    Structure of the Replication Package:

    In accordance with the original dataset, the dataset comprises four distinct CSV files delineated by the artifacts under consideration in this study. Each CSV file encompasses a text column and a class, which indicate classifications denoting specific types of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), and requirement debt (REQ) or Not-SATD.
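A quick way to inspect the class balance of one of these CSV files, and to derive the labels for the two-step approach, is sketched below. The column headers `text` and `class`, the exact label strings, and the sample rows are assumptions based on the description above, not verified against the files.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the described layout (a text column and a class column).
SAMPLE_CSV = """text,class
"TODO: refactor this hack later",C/D
"add missing javadoc",DOC
"fix typo in readme",Not-SATD
"need tests for this path",TES
"workaround until upstream fix",C/D
"""

def class_counts(csv_file):
    # Count how many rows carry each class label.
    reader = csv.DictReader(csv_file)
    return Counter(row["class"] for row in reader)

def two_step_label(label):
    # Step 1: binary identification (SATD or not).
    # Step 2: the debt category, only meaningful for SATD rows.
    is_satd = label != "Not-SATD"
    return is_satd, (label if is_satd else None)

counts = class_counts(io.StringIO(SAMPLE_CSV))
```

For a real file, `open("data-augmentation-code_comments.csv", newline="", encoding="utf-8")` would replace the `io.StringIO` wrapper.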

    ├── SATD Keywords
    │ ├── Keywords based on Source of Artifacts
    │ │ ├── Code comment.txt
    │ │ ├── Commit message.txt
    │ │ ├── Issue section.txt
    │ │ └── Pull section.txt
    │ ├── Keywords based on Types of SATD
    │ │ ├── code-design debt.txt
    │ │ ├── documentation debt.txt
    │ │ ├── requirement debt.txt
    │ │ └── test debt.txt
    ├── src
    │ ├── bert.py
    │ ├── bilstm.py
    │ └── preprocessing.py
    ├── data-augmentation-code_comments.csv
    ├── data-augmentation-commit_messages.csv
    ├── data-augmentation-issues.csv
    ├── data-augmentation-pull_requests.csv
    └── Supplementary Material.docx
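
    As a rough illustration of how one of the four CSV files could be inspected for class balance, the sketch below counts the rows per label. The column names `text` and `class` and the sample rows are assumptions made for this example (taken from the description, not verified against the files), and an in-memory sample stands in for a real file.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for one of the four CSV files. The column names
# "text" and "class" are assumptions based on the dataset description.
SAMPLE_CSV = """text,class
"TODO: refactor this method, it is too long",C/D
"Need to add unit tests for the parser",TES
"Fix typo in log message",Not-SATD
"This workaround should be removed later",C/D
"""


def count_classes(csv_file, label_column="class"):
    """Return a Counter mapping each label to its number of rows."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


counts = count_classes(io.StringIO(SAMPLE_CSV))
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

    To run this against a real file, the `io.StringIO` object could be replaced by `open(path, newline="", encoding="utf-8")`, assuming the files are UTF-8 encoded.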

    Requirements:

    nltk
    transformers
    torch
    tensorflow
    keras
    langdetect
    inflect
    inflection

    Project sources for each artifact are as follows:

    Source code comment: ant, argouml, columba, emf, hibernate, jedit, jfreechart, jmeter, jruby, squirrel

    Issue section: camel, chromium, gerrit, hadoop, hbase, impala, thrift

    Pull section: accumulo, activemq, activemq-artemis, airflow, ambari, apisix, apisix-dashboard, arrow, attic-apex-core, attic-apex-malhar, attic-stratos, avro, beam, bigtop, bookkeeper, brooklyn-server, calcite, camel, camel-k, camel-quarkus, camel-website, carbondata, cassandra, cloudstack, commons-lang, couchdb, cxf, daffodil, drill, druid, dubbo, echarts, fineract, flink, fluo, geode, geode-native, gobblin, griffin, groovy, guacamole-client, hadoop, hawq, hbase, helix, hive, hudi, iceberg, ignite, incubator-brooklyn, incubator-dolphinscheduler, incubator-doris, incubator-heron, incubator-hop, incubator-mxnet, incubator-pagespeed-ngx, incubator-pinot, incubator-weex, infrastructure-puppet, jena, jmeter, kafka, karaf, kylin, lucene-solr, madlib, myfaces-tobago, netbeans, netbeans-website, nifi, nifi-minifi-cpp, nutch, openwhisk, openwhisk-wskdeploy, orc, ozone, parquet-mr, phoenix, pulsar, qpid-dispatch, reef, rocketmq, samza, servicecomb-java-chassis, shardingsphere, shardingsphere-elasticjob, skywalking, spark, storm, streams, superset, systemds, tajo, thrift, tinkerpop, tomee, trafficcontrol, trafficserver, trafodion, tvm, usergrid, zeppelin, zookeeper

    Commit message: the same projects as listed for the Pull section.

    This dataset has undergone a data augmentation process using the AugGPT technique. Meanwhile, the original dataset can be downloaded via the following link: https://github.com/yikun-li/satd-different-sources-data

  12. Synthetic Data Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 9, 2025
    Data Insights Market (2025). Synthetic Data Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/synthetic-data-platform-1939818
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Synthetic Data Platform market is experiencing robust growth, driven by the increasing need for data privacy, escalating data security concerns, and the rising demand for high-quality training data for AI and machine learning models. The market's expansion is fueled by several key factors: the growing adoption of AI across various industries, the limitations of real-world data availability due to privacy regulations like GDPR and CCPA, and the cost-effectiveness and efficiency of synthetic data generation. We project a market size of approximately $2 billion in 2025, with a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033). This rapid expansion is expected to continue, reaching an estimated market value of over $10 billion by 2033. The market is segmented based on deployment models (cloud, on-premise), data types (image, text, tabular), and industry verticals (healthcare, finance, automotive). Major players are actively investing in research and development, fostering innovation in synthetic data generation techniques and expanding their product offerings to cater to diverse industry needs. Competition is intense, with companies like AI.Reverie, Deep Vision Data, and Synthesis AI leading the charge with innovative solutions. However, several challenges remain, including ensuring the quality and fidelity of synthetic data, addressing the ethical concerns surrounding its use, and the need for standardization across platforms. Despite these challenges, the market is poised for significant growth, driven by the ever-increasing need for large, high-quality datasets to fuel advancements in artificial intelligence and machine learning. The strategic partnerships and acquisitions in the market further accelerate the innovation and adoption of synthetic data platforms. 
    The ability to generate synthetic data tailored to specific business problems, combined with the increasing awareness of data privacy issues, is firmly establishing synthetic data as a key component of the future of data management and AI development.
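
    The projection quoted above can be checked with simple compound-growth arithmetic. The sketch below assumes the 25% CAGR is compounded once per year over the eight years from 2025 to 2033, starting from the stated figure of roughly $2 billion; the function name is illustrative only.

```python
def project_market_size(base_value, cagr, years):
    """Compound base_value at the annual rate cagr for the given number of years."""
    return base_value * (1.0 + cagr) ** years


BASE_2025_BILLION_USD = 2.0  # approximate 2025 market size from the report
CAGR = 0.25                  # 25% compound annual growth rate
YEARS = 2033 - 2025          # assumed number of compounding periods

projected_2033 = project_market_size(BASE_2025_BILLION_USD, CAGR, YEARS)
print(f"Projected 2033 market size: ${projected_2033:.2f} billion")
```

    Under these assumptions the result is about $11.9 billion, which is consistent with the report's "over $10 billion by 2033".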

  13. Data from: Class-specific data augmentation for plant stress classification

    • zenodo.org
    zip
    Updated Sep 21, 2024
    Nasla Saleem; Baskar Ganapathysubramanian; Zaki Jubery; Aditya Balu; Soumik Sarkar; Arti Singh; Asheesh Singh (2024). Class-specific data augmentation for plant stress classification [Dataset]. http://doi.org/10.5281/zenodo.13823148
    Available download formats: zip
    Dataset updated
    Sep 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nasla Saleem; Baskar Ganapathysubramanian; Zaki Jubery; Aditya Balu; Soumik Sarkar; Arti Singh; Asheesh Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a companion dataset for the paper titled "Class-specific data augmentation for plant stress classification" by Nasla Saleem, Aditya Balu, Talukder Zaki Jubery, Arti Singh, Asheesh K. Singh, Soumik Sarkar, and Baskar Ganapathysubramanian published in The Plant Phenome Journal, https://doi.org/10.1002/ppj2.20112


    Abstract:

    Data augmentation is a powerful tool for improving deep learning-based image classifiers for plant stress identification and classification. However, selecting an effective set of augmentations from a large pool of candidates remains a key challenge, particularly in imbalanced and confounding datasets. We propose an approach for automated class-specific data augmentation using a genetic algorithm. We demonstrate the utility of our approach on soybean [Glycine max (L.) Merr] stress classification where symptoms are observed on leaves; a particularly challenging problem due to confounding classes in the dataset. Our approach yields substantial performance, achieving a mean-per-class accuracy of 97.61% and an overall accuracy of 98% on the soybean leaf stress dataset. Our method significantly improves the accuracy of the most challenging classes, with notable enhancements from 83.01% to 88.89% and from 85.71% to 94.05%, respectively. A key observation we make in this study is that high-performing augmentation strategies can be identified in a computationally efficient manner. We fine-tune only the linear layer of the baseline model with different augmentations, thereby reducing the computational burden associated with training classifiers from scratch for each augmentation policy while achieving exceptional performance. This research represents an advancement in automated data augmentation strategies for plant stress classification, particularly in the context of confounding datasets. Our findings contribute to the growing body of research in tailored augmentation techniques and their potential impact on disease management strategies, crop yields, and global food security. The proposed approach holds the potential to enhance the accuracy and efficiency of deep learning-based tools for managing plant stresses in agriculture.
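
    The abstract reports both a mean-per-class accuracy and an overall accuracy. The two can differ noticeably on imbalanced data, which the following minimal sketch illustrates. The labels and predictions are toy values invented for the example and are not taken from the soybean dataset.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, weighting every class equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)


# Toy, imbalanced example: 8 samples of one class, 2 of another.
y_true = ["healthy"] * 8 + ["stressed"] * 2
y_pred = ["healthy"] * 8 + ["healthy", "stressed"]

print(overall_accuracy(y_true, y_pred))         # 9 of 10 correct
print(mean_per_class_accuracy(y_true, y_pred))  # mean of 1.0 and 0.5
```

    Here the overall accuracy is 0.9 while the mean-per-class accuracy is 0.75, because the single error falls in the minority class.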

  14. Data from: Fast and accurate estimation of species-specific diversification...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Nov 3, 2020
    Odile Maliet; Hélène Morlon (2020). Fast and accurate estimation of species-specific diversification rates using data augmentation [Dataset]. http://doi.org/10.5061/dryad.tb2rbnzzh
    Available download formats: zip
    Dataset updated
    Nov 3, 2020
    Dataset provided by
    École Normale Supérieure - PSL
    Authors
    Odile Maliet; Hélène Morlon
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Diversification rates vary across species as a response to various factors, including environmental conditions and species-specific features. Phylogenetic models that allow accounting for and quantifying this heterogeneity in diversification rates have proven particularly useful for understanding clades diversification. Recently, we introduced the cladogenetic diversification rate shift model (ClaDS), which allows inferring subtle rate variations across lineages. Here we present a new inference technique for this model that considerably reduces computation time through the use of data augmentation and provide an implementation of this method in Julia. In addition to drastically reducing computation time, this new inference approach provides a posterior distribution of the augmented data, that is the tree with extinct and unsampled lineages as well as associated diversification rates. In particular, this allows extracting the distribution through time of both the mean rate and the number of lineages. We assess the statistical performances of our approach using simulations and illustrate its application on the entire bird radiation.

    Methods: These additional data contain supplementary figures supporting the paper, as well as a tutorial for the use of the Julia package.

    The .jld2 file is the result of the run of ClaDS on the complete bird phylogeny computed with molecular data from Jetz (2012) with the Hackett backbone, containing 6670 species. We use TreeAnnotator from the software Beast with the Common Ancestor option for node height (Bouckaert 2019) to obtain a Maximum Clade Credibility (MCC) tree computed from a sample of 1000 trees from the posterior distribution. We fix the sampling fractions for each of the subtrees of the tree from Jetz (2012) as the ratio between the number of species in the molecular phylogeny over that in the phylogeny including all bird species. We attach the results of this analysis as a supplementary material to this paper.

    Bouckaert, R., T. G. Vaughan, J. Barido-Sottani, S. Duchêne, M. Fourment, A. Gavryushkina, J. Heled, G. Jones, D. Kühnert, N. De Maio, et al. 2019. Beast 2.5: An advanced software platform for bayesian evolutionary analysis. PLoS computational biology 15:e1006650.

    Jetz, W., G. Thomas, J. Joy, K. Hartmann, and A. Mooers. 2012. The global diversity of birds in space and time. Nature 491:444.

  15. Leveraging QA Datasets to Improve Generative Data Augmentation - Dataset -...

    • service.tib.eu
    Updated Dec 16, 2024
    (2024). Leveraging QA Datasets to Improve Generative Data Augmentation - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/leveraging-qa-datasets-to-improve-generative-data-augmentation
    Dataset updated
    Dec 16, 2024
    Description

    The paper proposes a method to leverage QA datasets for training generative language models to be context generators for a given question and answer.

  16. Comparative results for pattern mixing-based data augmentation methods.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Brian Kenji Iwana; Seiichi Uchida (2023). Comparative results for pattern mixing-based data augmentation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t003
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Brian Kenji Iwana; Seiichi Uchida
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparative results for pattern mixing-based data augmentation methods.

  17. Data from: MedMNIST-C: Comprehensive benchmark and improved classifier...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 31, 2024
    Francesco Di Salvo; Sebastian Doerrich; Christian Ledig (2024). MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions [Dataset]. http://doi.org/10.5281/zenodo.11471504
    Available download formats: zip
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Di Salvo; Sebastian Doerrich; Christian Ledig
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.

    This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].

    Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.

    Usage: We recommend using the demo code and tutorials available on our GitHub repository.

    Citation: If you find this work useful, please consider citing us:

    @article{disalvo2024medmnist,
     title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
     author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
     journal={arXiv preprint arXiv:2406.17536},
     year={2024}
    }

    Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.

  18. MultiPatient Elderly Respiration dataset in Digital Twin Technology

    • data.mendeley.com
    Updated Dec 7, 2023
    SAGHEER KHAN (2023). MultiPatient Elderly Respiration dataset in Digital Twin Technology [Dataset]. http://doi.org/10.17632/vm8j5dvrxy.1
    Dataset updated
    Dec 7, 2023
    Authors
    SAGHEER KHAN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The research focus for this study is to generate a larger respiration dataset for the creation of elderly respiration Digital Twin (DT) model. Initial experimental data is collected with an unobtrusive Wi-Fi sensor with Channel State Information (CSI) characteristics to collect the subject's respiration rate.

    The generation of a DT model requires extensive and diverse data. Due to limited resources and the need for extensive experimentation, the data is generated by implementing a novel statistical time series data augmentation method on single-subject respiration data. The larger synthetic respiration datasets will allow for testing the signal processing methodologies for noise removal, Breaths Per Minute (BPM) estimation, and extensive Artificial Intelligence (AI) implementation.

    The sensor data covers respiration rates from 12 BPM to 25 BPM for a single subject. Normal respiration rate ranges from 12 BPM to 16 BPM; rates beyond this range are considered abnormal. A total of 14 files are present in the dataset, each labeled according to its BPM. Data for all 30 patients are present for each BPM. Patients are numbered P1, P2, P3, ... through P30.

    This data can be utilized by researchers and scientists toward the development of novel signal processing methodologies in the respiration DT model. These larger respiration datasets can be utilized for Machine Learning (ML) and Deep Learning (DL) in providing predictive analysis and classification of multi-patient respiration in the DT model for an elderly respiration rate.

  19. Data augmentation for Multi-Classification of Non-Functional Requirements -...

    • zenodo.org
    • investigacion.usc.es
    csv
    Updated Mar 19, 2024
    María-Isabel Limaylla-Lunarejo; Nelly Condori-Fernandez; Miguel R. Luaces (2024). Data augmentation for Multi-Classification of Non-Functional Requirements - Dataset [Dataset]. http://doi.org/10.5281/zenodo.10805331
    Available download formats: csv
    Dataset updated
    Mar 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    María-Isabel Limaylla-Lunarejo; Nelly Condori-Fernandez; Miguel R. Luaces
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There are four datasets:

    1. Dataset_structure indicates the structure of the datasets, such as column name, type, and value.

    2. Spanish_promise_exp_nfr_train and Spanish_promise_exp_nfr_test are the non-functional requirements of the Promise_exp[1] dataset translated into the Spanish language.

    3. Blanced_promise_exp_nfr_train is the new balanced version of Spanish_promise_exp_nfr_train, in which data augmentation with ChatGPT was applied to add requirements to the classes with little data, and random undersampling was used to remove requirements from the larger classes.
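
    For readers unfamiliar with random undersampling, the balancing step mentioned in item 3 can be sketched roughly as follows. This is a generic illustration, not the authors' procedure: the function name, the per-class cap, the seed, and the example class names are all assumptions.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, label_of, max_per_class, seed=0):
    """Randomly drop rows so that no class keeps more than max_per_class rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    balanced = []
    for members in by_class.values():
        if len(members) > max_per_class:
            members = rng.sample(members, max_per_class)
        balanced.extend(members)
    return balanced


# Toy data: (requirement text, class) pairs with hypothetical class names.
rows = ([("requirement %d" % i, "security") for i in range(6)]
        + [("requirement %d" % i, "usability") for i in range(2)])

balanced = random_undersample(rows, label_of=lambda r: r[1], max_per_class=3)
counts = Counter(label for _, label in balanced)
print(counts)
```

    Classes that are already at or below the cap are left untouched, so in practice undersampling of this kind is combined with augmentation of the small classes, as the description indicates.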

  20. Data from: Signature Informed Sampling for Transcriptomic Data

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Dec 4, 2023
    Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez (2023). Signature Informed Sampling for Transcriptomic Data [Dataset]. http://doi.org/10.5281/zenodo.8383203
    Available download formats: zip
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the data and associated results of all experiments conducted in our work "Signature Informed Sampling for Transcriptomic Data". In this work we propose a simple, novel, non-parametric method for augmenting data inspired by the concept of chromosomal crossover. We benchmark our proposed methods against random oversampling, SMOTE, modified versions of gamma-Poisson and Poisson sampling, and the unbalanced data.

    The compressed file data_5x5stratified.zip contains all the data used for our experiments. This includes the original count data based off of which augmentation was performed, the cross validation split indices as a json file, the training and validation data (TCGA) augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples from TCGA) and an external test set (CPTAC) standardised accordingly with respect to each augmentation method and training data per cv split.

    The compressed file 5x5_Results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics computed, the latent space of train, validation and test (if the model is a VAE), and the trained model itself for all 25 (5x5) splits.

Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk

Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers

Available download formats: zip
Dataset updated
Feb 22, 2024
Dataset provided by
Osaka University
Nagoya University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License

https://spdx.org/licenses/CC0-1.0.html

Description

Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
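
The random augmentation scheme described in the abstract, in which one technique is picked at random for each sample during mini-batch training, might look roughly like the sketch below. It is not the authors' implementation: it covers only four of the listed options (none, scaling, jittering, permutation), omits time-warping and rotation, and every parameter value (noise levels, number of segments) is an assumption chosen for illustration. A window is represented as a list of (x, y, z) acceleration tuples.

```python
import random


def identity(window, rng):
    """'None' option: return the window unchanged."""
    return list(window)


def scaling(window, rng, sigma=0.1):
    """Multiply the whole window by one random factor drawn around 1.0."""
    factor = rng.gauss(1.0, sigma)
    return [tuple(v * factor for v in sample) for sample in window]


def jittering(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in sample) for sample in window]


def permutation(window, rng, n_segments=4):
    """Cut the window into segments and shuffle the segment order."""
    seg_len = max(1, len(window) // n_segments)
    segments = [window[i:i + seg_len] for i in range(0, len(window), seg_len)]
    rng.shuffle(segments)
    return [sample for seg in segments for sample in seg]


AUGMENTATIONS = [identity, scaling, jittering, permutation]


def augment_batch(batch, rng):
    """Apply one randomly chosen augmentation to each window in the batch."""
    return [rng.choice(AUGMENTATIONS)(window, rng) for window in batch]


rng = random.Random(0)
window = [(float(i), float(i) + 0.5, -float(i)) for i in range(8)]
augmented = augment_batch([window, window, window], rng)
print(len(augmented), len(augmented[0]))
```

Each augmentation keeps the window length and the three axes intact, so the augmented batch can be fed to the same model as the original data.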
