https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024).Please see README for the details of the datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4. A Microsoft® Excel® workbook that details the raw data for the 8 experiments in which either the test set was augmented alone (after its allocation) or augmentation of the whole dataset was done before test-set allocation. All of the image-classification output probabilities are included.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041
This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.
The folder structure of the dataset is as follows:
In which:
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 3. A Microsoft® Excel® workbook that details the raw data for the 20 experiments in which no test-set augmentation was done, including all of the image-classification output probabilities.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This image dataset contains synthetic structure images used for training the deep-learning based nanowire segmentation model presented in our work "A deep learned nanowire segmentation model using synthetic data augmentation" to be published in npj Computational materials. Detailed information can be found in the corresponding article.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The popularity of Deep Learning (DL) methods used in business process management research and practice is constantly increasing. One important factor that hinders the adoption of DL in certain areas is the availability of sufficiently large training datasets, particularly affecting domains where process models are mainly defined manually with a high knowledge-acquisition effort. In this paper, we examine process model augmentation in combination with semi-supervised transfer learning to enlarge existing datasets and train DL models effectively. The use case of similarity learning between manufacturing process models is discussed. Based on a literature study of existing augmentation techniques, a concept is presented with different categories of augmentation from knowledge-light approaches to knowledge-intensive ones, e. g. based on automated planning. Specifically, the impacts of augmentation approaches on the syntactic and semantic correctness of the augmented process models are considered. The concept also proposes a semi-supervised transfer learning approach to integrate augmented and non-augmented process model datasets in a two-phased training procedure. The experimental evaluation investigates augmented process model datasets regarding their quality for model training in the context of similarity learning between manufacturing process models. The results indicate a large potential with a reduction of the prediction error of up to 53%.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying SATD instances as SATD or not) or categorizing SATD (labeling instances as SATD that pertain to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly when dealing with specific types of SATD, such as test and requirement debt. This is mostly because the used datasets are extremely imbalanced.
In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.
Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:
In accordance with the original dataset, the dataset comprises four distinct CSV files delineated by the artifacts under consideration in this study. Each CSV file encompasses a text column and a class, which indicate classifications denoting specific types of SATD, namely code/design debt (C/D), documentation debt (DOC), test debt (TES), and requirement debt (REQ) or Not-SATD.
├── SATD Keywords
│ ├── Keywords based on Source of Artifacts
│ │ ├── Code comment.txt
│ │ ├── Commit message.txt
│ │ ├── Issue section.txt
│ │ └── Pull section.txt
│ ├── Keywords based on Types of SATD
│ │ ├── code-design debt.txt
│ │ ├── documentation debt.txt
│ │ ├── requirement debt.txt
│ │ └── test debt.txt
├── src
│ ├── bert.py
│ ├── bilstm.py
│ └── preprocessing.py
├── data-augmentation-code_comments.csv
├── data-augmentation-commit_messages.csv
├── data-augmentation-issues.csv
├── data-augmentation-pull_requests.csv
└── Supplementary Material.docx
Requirements:
Source code comment | Issue section | Pull section | Commit message |
ant argouml columba emf hibernate jedit jfreechart jmeter jruby squirrel |
camel
chromium
gerrit
hadoop
hbase
impala
thrift
|
accumulo
activemq
activemq-artemis
airflow
ambari
apisix
apisix-dashboard
arrow
attic-apex-core
attic-apex-malhar
attic-stratos
avro
beam
bigtop
bookkeeper
brooklyn-server
calcite
camel
camel-k
camel-quarkus
camel-website
carbondata
cassandra
cloudstack
commons-lang
couchdb
cxf
daffodil
drill
druid
dubbo
echarts
fineract
flink
fluo
geode
geode-native
gobblin
griffin
groovy
guacamole-client
hadoop
hawq
hbase
helix
hive
hudi
iceberg
ignite
incubator-brooklyn
incubator-dolphinscheduler
incubator-doris
incubator-heron
incubator-hop
incubator-mxnet
incubator-pagespeed-ngx
incubator-pinot
incubator-weex
infrastructure-puppet
jena
jmeter
kafka
karaf
kylin
lucene-solr
madlib
myfaces-tobago
netbeans
netbeans-website
nifi
nifi-minifi-cpp
nutch
openwhisk
openwhisk-wskdeploy
orc
ozone
parquet-mr
phoenix
pulsar
qpid-dispatch
reef
rocketmq
samza
servicecomb-java-chassis
shardingsphere
shardingsphere-elasticjob
skywalking
spark
storm
streams
superset
systemds
tajo
thrift
tinkerpop
tomee
trafficcontrol
trafficserver
trafodion
tvm
usergrid
zeppelin
zookeeper
|
accumulo
activemq
activemq-artemis
airflow
ambari
apisix
apisix-dashboard
arrow
attic-apex-core
attic-apex-malhar
attic-stratos
avro
beam
bigtop
bookkeeper
brooklyn-server
calcite
camel
camel-k
camel-quarkus
camel-website
carbondata
cassandra
cloudstack
commons-lang
couchdb
cxf
daffodil
drill
druid
dubbo
echarts
fineract
flink
fluo
geode
geode-native
gobblin
griffin
groovy
guacamole-client
hadoop
hawq
hbase
helix
hive
hudi
iceberg
ignite
incubator-brooklyn
incubator-dolphinscheduler
incubator-doris
incubator-heron
incubator-hop
incubator-mxnet
incubator-pagespeed-ngx
incubator-pinot
incubator-weex
infrastructure-puppet
jena
jmeter
kafka
karaf
kylin
lucene-solr
madlib
myfaces-tobago
netbeans
netbeans-website
nifi
nifi-minifi-cpp
nutch
openwhisk
openwhisk-wskdeploy
orc
ozone
parquet-mr
phoenix
pulsar
qpid-dispatch
reef
rocketmq
samza
servicecomb-java-chassis
shardingsphere
shardingsphere-elasticjob
skywalking
spark
storm
streams
superset
systemds
tajo
thrift
tinkerpop
tomee
trafficcontrol
trafficserver
trafodion
tvm
usergrid
zeppelin
zookeeper
|
This dataset has undergone a data augmentation process using the AugGPT technique. Meanwhile, the original dataset can be downloaded via the following link: https://github.com/yikun-li/satd-different-sources-data
Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
many methods exist to augment a 3D object
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"modified" is light field image with exactly same background and an object on it.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Augmentation Of VL is a dataset for classification tasks - it contains Skin Cancer VL annotations for 243 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains the reference data used for direct force training of Artificial Neural Network (ANN) interatomic potentials using the atomic energy network (ænet) and ænet-PyTorch packages (https://github.com/atomisticnet/aenet-PyTorch). It also includes the GPR-augmented data used for indirect force training via Gaussian Process Regression (GPR) surrogate models using the ænet-GPR package (https://github.com/atomisticnet/aenet-gpr). Each data file contains atomic structures, energies, and atomic forces in XCrySDen Structure Format (XSF). The dataset includes all reference training/test data and corresponding GPR-augmented data used in the four benchmark examples presented in the reference paper, "Scalable Training of Neural Network Potentials for Complex Interfaces Through Data Augmentation". A hierarchy of the dataset is described in the README.txt file, and an overview of the dataset is also summarized in supplementary Table S1 of the reference paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code for the paper "Assessing the Reliability of Point Mutation as Data Augmentation for Deep Learning with Genomic Data".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purified dataset for data augmentation for DAISM-DNNXMBD can be downloaded from this repository.
The pbmc8k dataset downloaded from 10X Genomics were processed and uesd for data augmentation to create training datasets for training DAISM-DNN models. pbmc8k.h5ad contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells), and pbmc8k_fine.h5ad cantains 7 cell types (naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells,naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells).
For RNA-seq dataset, it contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells). Raw FASTQ reads were downloaded from the NCBI website, and transcription and gene-level expression quantification were performed using Salmon (version 0.11.3) with Gencode v29 after quality control of FASTQ reads using fastp. All tools were used with default parameters.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Augmentation Of AK is a dataset for classification tasks - it contains Skin Cancer AK annotations for 790 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
All Cell Stages For Excel Gen Augmentation is a dataset for instance segmentation tasks - it contains Cell PRid 6ycW annotations for 394 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Object detection in challenging environments, such as low-light, cluttered, or dynamic conditions, remains a critical issue in computer vision. Deep Convolutional Neural Networks (DCNNs) have emerged as powerful tools for addressing these challenges due to their ability to learn hierarchical feature representations. This paper explores the optimization of object detection in such environments by leveraging advanced DCNN architectures, data augmentation techniques, and domain-specific pre-training. We propose an enhanced detection framework that integrates multi-scale feature extraction, transfer learning, and regularization methods to improve robustness against noise, occlusion, and lighting variations. Experimental results demonstrate significant improvements in detection accuracy across various challenging datasets, outperforming traditional methods. This study highlights the potential of DCNNs in real-world applications, such as autonomous driving, surveillance, and robotics, where object detection in difficult conditions is crucial.
it often causes leakage
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024).Please see README for the details of the datasets.