100+ datasets found

Data from: Explainable Graph Neural Networks with Data Augmentation for...
acs.figshare.com
zip
Updated Sep 14, 2023
Cite
Hongle An; Xuyang Liu; Wensheng Cai; Xueguang Shao (2023). Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C–H Acids [Dataset]. http://doi.org/10.1021/acs.jcim.3c00958.s002
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.3c00958.s002
Dataset updated
Sep 14, 2023
Dataset provided by
ACS Publications
Authors
Hongle An; Xuyang Liu; Wensheng Cai; Xueguang Shao
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The pKa of C–H acids is an important parameter in the fields of organic synthesis, drug discovery, and materials science. However, the prediction of pKa is still a great challenge due to the limit of experimental data and the lack of chemical insight. Here, a new model for predicting the pKa values of C–H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract the topological and target-related information from the molecular graph data, and a readout layer was utilized to retrieve the information on the ionization site C atom. The retrieved information then was adopted to predict pKa by a fully connected network. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established by replacing the H atoms in a molecule with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining the change of pKa values when a specific atom was masked. This explainability was used to identify the key substituents for pKa. The model was evaluated on two data sets from the iBonD database. Dataset1 includes the experimental pKa values of C–H acids measured in DMSO, while dataset2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.
Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
search.dataone.org
zip
Updated Feb 22, 2024
Cite
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2ngf1vhwk
Dataset updated
Feb 22, 2024
Dataset provided by
Osaka University
Nagoya University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Description
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
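The random augmentation scheme described in the abstract (one technique chosen at random per sample during mini-batch training) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: it covers only "none", scaling, jittering and permutation, omits time-warping and rotation, and all function names, parameter values and the synthetic window are assumptions.

```python
import random

def identity(window, rng):
    # "none": return the window unchanged.
    return list(window)

def scaling(window, rng, sigma=0.1):
    # Multiply each axis by one random factor drawn around 1.0 (sigma is assumed).
    factors = [rng.gauss(1.0, sigma) for _ in range(3)]
    return [tuple(v * f for v, f in zip(sample, factors)) for sample in window]

def jittering(window, rng, sigma=0.05):
    # Add independent Gaussian noise to every value (sigma is assumed).
    return [tuple(v + rng.gauss(0.0, sigma) for v in sample) for sample in window]

def permutation(window, rng, n_segments=4):
    # Split the window into contiguous segments and shuffle their order.
    size = len(window) // n_segments
    segments = [window[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(window[(n_segments - 1) * size:])
    rng.shuffle(segments)
    return [sample for segment in segments for sample in segment]

AUGMENTATIONS = [identity, scaling, jittering, permutation]

def random_augment(window, rng):
    # Pick one technique uniformly at random for this window.
    return rng.choice(AUGMENTATIONS)(window, rng)

rng = random.Random(0)
# Hypothetical tri-axial acceleration window of 50 samples.
window = [(0.01 * t, 1.0, -0.5) for t in range(50)]
augmented = [random_augment(window, rng) for _ in range(8)]
```

Each call returns a window of the same length, so augmented samples can be batched together with unmodified ones.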
Variable Message Signal annotated images for object detection
zenodo.org
zip
Updated Oct 2, 2022
Cite
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5904211
Dataset updated
Oct 2, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.

The folder structure of the dataset is as follows:

vms_dataset/
    data.csv
    real_images/
        imgs/
        annotations/
    data-augmentation/
        imgs/
        annotations/

In which:

data.csv: Each row contains the following information separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text.

real_images: Images extracted directly from the videos.

data-augmentation: Images created using data-augmentation

imgs: Image files in .jpg format.

annotations: Annotation files in .xml format.
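As a rough illustration of how the two annotation sources described above might be read, the sketch below parses one PASCAL VOC style XML annotation and one data.csv row using only the Python standard library. The sample XML, the sample CSV row, the class name "vms" and the assumption that data.csv has no header row are all hypothetical; check the actual files before relying on any of them.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout; all values are made up.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>vms</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>"""

# Column order as listed in the dataset description.
CSV_COLUMNS = ["image_name", "x_min", "y_min", "x_max", "y_max",
               "class_name", "lat", "long", "folder", "text"]
# Hypothetical row; a header row is assumed to be absent.
SAMPLE_CSV = "example.jpg,100,50,300,150,vms,40.0,-3.7,real_images,EJEMPLO\n"

def parse_voc(xml_text):
    # Return the image file name and a list of bounding boxes.
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append({
            "class_name": obj.findtext("name"),
            "x_min": int(bb.findtext("xmin")),
            "y_min": int(bb.findtext("ymin")),
            "x_max": int(bb.findtext("xmax")),
            "y_max": int(bb.findtext("ymax")),
        })
    return root.findtext("filename"), boxes

filename, boxes = parse_voc(SAMPLE_XML)
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV), fieldnames=CSV_COLUMNS))
```

In practice the same box should appear in both sources, so comparing the CSV coordinates against the XML ones is a cheap consistency check.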
Detailed characterization of the dataset.
figshare.com
xls
Updated Sep 26, 2024
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t006
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS ONE
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Combo1 4 Augmentation Methods Dataset
universe.roboflow.com
zip
Updated Jul 7, 2025
Cite
thesis (2025). Combo1 4 Augmentation Methods Dataset [Dataset]. https://universe.roboflow.com/thesis-uqqt7/combo1-4-augmentation-methods
Available download formats: zip
Dataset updated
Jul 7, 2025
Dataset authored and provided by
thesis
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Lyme Bounding Boxes
Description
Combo1 4 Augmentation Methods

## Overview

Combo1 4 Augmentation Methods is a dataset for object detection tasks - it contains Lyme annotations for 2,783 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Data augmentation recommendations for data type and model type.
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Data augmentation recommendations for data type and model type. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t005
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS ONE
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data augmentation recommendations for data type and model type.
Data from: Data augmentation for disruption prediction via robust surrogate...
dataverse.harvard.edu
osti.gov
Updated Aug 31, 2024
Cite
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/FMJCAD
Dataset updated
Aug 31, 2024
Dataset provided by
Harvard Dataverse
Authors
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
Optimizing Object Detection in Challenging Environments with Deep...
data.mendeley.com
Updated Oct 24, 2024
Cite
Asad Ali (2024). Optimizing Object Detection in Challenging Environments with Deep Convolutional Neural Networks [Dataset]. http://doi.org/10.17632/gfpg6hxrvz.1
Unique identifier
https://doi.org/10.17632/gfpg6hxrvz.1
Dataset updated
Oct 24, 2024
Authors
Asad Ali
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Object detection in challenging environments, such as low-light, cluttered, or dynamic conditions, remains a critical issue in computer vision. Deep Convolutional Neural Networks (DCNNs) have emerged as powerful tools for addressing these challenges due to their ability to learn hierarchical feature representations. This paper explores the optimization of object detection in such environments by leveraging advanced DCNN architectures, data augmentation techniques, and domain-specific pre-training. We propose an enhanced detection framework that integrates multi-scale feature extraction, transfer learning, and regularization methods to improve robustness against noise, occlusion, and lighting variations. Experimental results demonstrate significant improvements in detection accuracy across various challenging datasets, outperforming traditional methods. This study highlights the potential of DCNNs in real-world applications, such as autonomous driving, surveillance, and robotics, where object detection in difficult conditions is crucial.
Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi...
data.niaid.nih.gov
zenodo.org
Updated Apr 4, 2025
Cite
Kampel, Martin (2025). Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8188998
Dataset updated
Apr 4, 2025
Dataset provided by
Strohmayer, Julian
Kampel, Martin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the Wallhack1.8k dataset for WiFi-based long-range activity recognition in Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS)/Through-Wall scenarios, as proposed in [1,2], as well as the CAD models (of 3D-printable parts) of the WiFi systems proposed in [2].

PyTorch Dataloader

A minimal PyTorch dataloader for the Wallhack1.8k dataset is provided at: https://github.com/StrohmayerJ/wallhack1.8k

Dataset Description

The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).

To assess cross-scenario and cross-system generalization, WiFi packet sequences were collected in LoS and through-wall (NLoS) scenarios, utilizing two different WiFi systems (BQ: biquad antenna and PIFA: printed inverted-F antenna). The dataset is structured accordingly:

LOS/BQ/ <- WiFi packets collected in the LoS scenario using the BQ system

LOS/PIFA/ <- WiFi packets collected in the LoS scenario using the PIFA system

NLOS/BQ/ <- WiFi packets collected in the NLoS scenario using the BQ system

NLOS/PIFA/ <- WiFi packets collected in the NLoS scenario using the PIFA system

These directories contain the raw WiFi packet time series (see Table 1). Each row represents a single WiFi packet with the complex CSI vector H being stored in the "data" field and the class label being stored in the "class" field. H is of the form [I, R, I, R, ..., I, R], where two consecutive entries represent imaginary and real parts of complex numbers (the Channel Frequency Responses of subcarriers). Taking the absolute value of H (e.g., via numpy.abs(H)) yields the subcarrier amplitudes A.
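A minimal sketch of the amplitude computation described above, assuming the "data" field has already been parsed into a flat list of numbers (the parsing step is not specified here and is not shown). Because H is stored interleaved, the imaginary and real parts are paired into complex values before the absolute value is taken; applying an absolute value directly to the flat list would instead return |I| and |R| separately. The function name `csi_amplitudes` and the sample values are hypothetical.

```python
def csi_amplitudes(h):
    # h is interleaved as [I, R, I, R, ...]: imaginary part first, then real part.
    if len(h) % 2:
        raise ValueError("expected an even number of entries")
    return [abs(complex(real, imag)) for imag, real in zip(h[0::2], h[1::2])]

# Hypothetical interleaved vector describing three subcarriers.
h = [3, 4, 0, -5, 12, 5]
amplitudes = csi_amplitudes(h)
```

With NumPy, the equivalent would be to build a complex array from the two strided slices and then call `numpy.abs` on it.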

To extract the 52 L-LTF subcarriers used in [1], the following indices of A are to be selected:

# 52 L-LTF subcarriers
csi_valid_subcarrier_index = []
csi_valid_subcarrier_index += [i for i in range(6, 32)]
csi_valid_subcarrier_index += [i for i in range(33, 59)]

Additional 56 HT-LTF subcarriers can be selected via:

# 56 HT-LTF subcarriers
csi_valid_subcarrier_index += [i for i in range(66, 94)]
csi_valid_subcarrier_index += [i for i in range(95, 123)]

For more details on subcarrier selection, see ESP-IDF (Section Wi-Fi Channel State Information) and esp-csi.
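The index ranges above can be checked and applied with a few lines of Python. This is only a sketch: the helper `select_subcarriers` and the amplitude vector of length 128 are assumptions made for illustration, since the actual length of A is not stated here.

```python
# Index ranges taken from the dataset description.
l_ltf = list(range(6, 32)) + list(range(33, 59))
ht_ltf = list(range(66, 94)) + list(range(95, 123))
csi_valid_subcarrier_index = l_ltf + ht_ltf

def select_subcarriers(amplitudes, indices):
    # Keep only the amplitudes at the requested subcarrier positions.
    return [amplitudes[i] for i in indices]

# Hypothetical amplitude vector; its length (128) is an assumption.
packet = [float(i) for i in range(128)]
selected = select_subcarriers(packet, l_ltf)
```

The two L-LTF ranges skip index 32 and the two HT-LTF ranges skip index 94, which gives 26 + 26 = 52 and 28 + 28 = 56 indices respectively.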

Extracted amplitude spectrograms with the corresponding label files of the train/validation/test split: "trainLabels.csv," "validationLabels.csv," and "testLabels.csv," can be found in the spectrograms/ directory.

The columns in the label files correspond to the following: [Spectrogram index, Class label, Room label]

Spectrogram index: [0, ..., n]

Class label: [0,1,2], where 0 = "no presence", 1 = "walking", and 2 = "walking + arm-waving."

Room label: [0,1,2,3,4,5], where labels 1-5 correspond to the room number in the NLoS scenario (see Fig. 3 in [1]). The label 0 corresponds to no room and is used for the "no presence" class.
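The label-file layout above could be read along these lines. The sketch uses an in-memory sample rather than the real files, and it assumes comma-separated rows without a header line; both are assumptions, as are the function and variable names.

```python
import csv
import io
from collections import Counter

# Class names as given in the dataset description.
CLASS_NAMES = {0: "no presence", 1: "walking", 2: "walking + arm-waving"}

# Hypothetical label rows: spectrogram index, class label, room label.
SAMPLE = "0,0,0\n1,1,3\n2,2,5\n3,1,1\n"

def read_labels(handle):
    # Parse each row into a tuple of three integers.
    records = []
    for index, class_label, room_label in csv.reader(handle):
        records.append((int(index), int(class_label), int(room_label)))
    return records

records = read_labels(io.StringIO(SAMPLE))
per_class = Counter(CLASS_NAMES[c] for _, c, _ in records)
```

Counting labels per class in this way gives a quick check against the per-class totals reported for the dataset.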

Dataset Overview:

Table 1: Raw WiFi packet sequences.

Scenario | System | "no presence" / label 0 | "walking" / label 1 | "walking + arm-waving" / label 2 | Total
LoS | BQ | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
LoS | PIFA | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
NLoS | BQ | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
NLoS | PIFA | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
Total | | 4 | 20 | 20 | 44

Table 2: Sample/Spectrogram distribution across activity classes in Wallhack1.8k.

Scenario | System | "no presence" / label 0 | "walking" / label 1 | "walking + arm-waving" / label 2 | Total
LoS | BQ | 149 | 154 | 155 |
LoS | PIFA | 149 | 160 | 152 |
NLoS | BQ | 148 | 150 | 152 |
NLoS | PIFA | 143 | 147 | 147 |
Total | | 589 | 611 | 606 | 1,806

Download and Use

This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to one of our papers [1,2].

[1] Strohmayer, Julian, and Martin Kampel. (2024). “Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition”, In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 42-56). Cham: Springer Nature Switzerland, doi: https://doi.org/10.1007/978-3-031-63211-2_4.

[2] Strohmayer, Julian, and Martin Kampel. (2024). “Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition”, In 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates (pp. 3594-3599), doi: https://doi.org/10.1109/ICIP51287.2024.10647666.

BibTeX citations:

@inproceedings{strohmayer2024data,
title={Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition},
author={Strohmayer, Julian and Kampel, Martin},
booktitle={IFIP International Conference on Artificial Intelligence Applications and Innovations},
pages={42--56},
year={2024},
organization={Springer}
}

@INPROCEEDINGS{10647666,
author={Strohmayer, Julian and Kampel, Martin},
booktitle={2024 IEEE International Conference on Image Processing (ICIP)},
title={Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition},
year={2024},
volume={},
number={},
pages={3594-3599},
keywords={Visualization;Accuracy;System performance;Directional antennas;Directive antennas;Reflector antennas;Sensors;Human Activity Recognition;WiFi;Channel State Information;Through-Wall Sensing;ESP32},
doi={10.1109/ICIP51287.2024.10647666}
}
Comparative results for magnitude domain transformation-based data...
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Comparative results for magnitude domain transformation-based data augmentation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t001
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS ONE
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparative results for magnitude domain transformation-based data augmentation methods.
Data from: Class-specific data augmentation for plant stress classification
zenodo.org
zip
Updated Sep 21, 2024
Cite
Nasla Saleem; Baskar Ganapathysubramanian; Zaki Jubery; Aditya Balu; Soumik Sarkar; Arti Singh; Asheesh Singh (2024). Class-specific data augmentation for plant stress classification [Dataset]. http://doi.org/10.5281/zenodo.13823148
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13823148
Dataset updated
Sep 21, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nasla Saleem; Baskar Ganapathysubramanian; Zaki Jubery; Aditya Balu; Soumik Sarkar; Arti Singh; Asheesh Singh
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a companion dataset for the paper titled "Class-specific data augmentation for plant stress classification" by Nasla Saleem, Aditya Balu, Talukder Zaki Jubery, Arti Singh, Asheesh K. Singh, Soumik Sarkar, and Baskar Ganapathysubramanian published in The Plant Phenome Journal, https://doi.org/10.1002/ppj2.20112

Abstract:

Data augmentation is a powerful tool for improving deep learning-based image classifiers for plant stress identification and classification. However, selecting an effective set of augmentations from a large pool of candidates remains a key challenge, particularly in imbalanced and confounding datasets. We propose an approach for automated class-specific data augmentation using a genetic algorithm. We demonstrate the utility of our approach on soybean [Glycine max (L.) Merr] stress classification where symptoms are observed on leaves; a particularly challenging problem due to confounding classes in the dataset. Our approach yields substantial performance, achieving a mean-per-class accuracy of 97.61% and an overall accuracy of 98% on the soybean leaf stress dataset. Our method significantly improves the accuracy of the most challenging classes, with notable enhancements from 83.01% to 88.89% and from 85.71% to 94.05%, respectively. A key observation we make in this study is that high-performing augmentation strategies can be identified in a computationally efficient manner. We fine-tune only the linear layer of the baseline model with different augmentations, thereby reducing the computational burden associated with training classifiers from scratch for each augmentation policy while achieving exceptional performance. This research represents an advancement in automated data augmentation strategies for plant stress classification, particularly in the context of confounding datasets. Our findings contribute to the growing body of research in tailored augmentation techniques and their potential impact on disease management strategies, crop yields, and global food security. The proposed approach holds the potential to enhance the accuracy and efficiency of deep learning-based tools for managing plant stresses in agriculture.
Comparative results for time domain transformation-based data augmentation...
plos.figshare.com
xls
Updated Jun 7, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Comparative results for time domain transformation-based data augmentation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t002
Dataset updated
Jun 7, 2023
Dataset provided by
PLOS ONE
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparative results for time domain transformation-based data augmentation methods.
Data from: Phenotype Driven Data Augmentation Methods for Transcriptomic...
zenodo.org
zip
Updated Jun 11, 2025
Cite
Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez (2025). Phenotype Driven Data Augmentation Methods for Transcriptomic Data [Dataset]. http://doi.org/10.5281/zenodo.14983178
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14983178
Dataset updated
Jun 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the data and associated results of all experiments conducted in our work "Phenotype Driven Data Augmentation Methods for Transcriptomic Data". In this work, we introduce two classes of phenotype driven data augmentation approaches – signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. We benchmark our proposed methods against random oversampling, SMOTE, unmodified versions of Gamma-Poisson and Poisson sampling, and unaugmented data.

This repository contains the data used for all our experiments. This includes the original data on which augmentation was performed, the cross-validation split indices as a JSON file, the training and validation data augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples) and an external test set standardised accordingly with respect to each augmentation method and training data per CV split.

The compressed files 5x5stratified_{x}percent.zip contain data that were augmented on x% of the available real data. brca_public.zip contains data used for the breast cancer experiments. distribution_size_effect.zip contains data used for tuning the reference set size, a hyperparameter of the modified Poisson and Gamma-Poisson methods.

The compressed file results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics (balanced accuracy and auc-roc) computed including p-values, as well as the latent space of train, validation and test (for the (N)VAE) for all 25 (5x5) CV splits.

PLEASE NOTE: If any part of this repository is used in any form for your work, please attribute the following, in addition to attributing the original data source - TCGA, CPTAC, GSE20713 and METABRIC, accordingly:

@article{janakarajan2025phenotype,
title={Phenotype driven data augmentation methods for transcriptomic data},
author={Janakarajan, Nikita and Graziani, Mara and Rodr{\'\i}guez Mart{\'\i}nez, Mar{\'\i}a},
journal={Bioinformatics Advances},
volume={5},
number={1},
pages={vbaf124},
year={2025},
publisher={Oxford University Press}
}
Augmented Hand-Drawn Data for Parkinson’s Disease
kaggle.com
Updated Sep 29, 2024
Cite
Abdulkhalek Mugahed (2024). Augmented Hand-Drawn Data for Parkinson’s Disease [Dataset]. https://www.kaggle.com/datasets/abdulkhalekmugahed/augmented-hand-drawn-data-for-parkinsons-disease/versions/1
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
Sep 29, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Abdulkhalek Mugahed
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
K. Scott Mader created the original dataset of 204 hand-drawn images for Parkinson’s disease diagnosis, consisting of two classes: Healthy and Parkinson. The dataset includes spiral and wave drawings. For my thesis, the original 204 images were expanded to 3,264 across the same two classes. This increase was achieved through data augmentation techniques, including rotations of 90°, 180°, and 270°, vertical flipping at 180°, and conversion to color images. The augmented data gives the model more opportunities to generalize, enhancing training and testing processes.
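
Going from 204 to 3,264 images corresponds to 16 variants per original image (3,264 / 204 = 16). The sketch below only illustrates the geometric operations named above (90°, 180° and 270° rotations plus a vertical flip) on a tiny grid of pixel values. It yields 8 variants per image and is not a reconstruction of the author's pipeline; the function names and the toy image are assumptions for this example.

```python
def rotate90(img):
    """Rotate a 2-D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_vertical(img):
    """Flip a 2-D grid top-to-bottom."""
    return [list(row) for row in img[::-1]]


def geometric_variants(img):
    """Return the original, its three rotations, and a flipped copy of each."""
    rotations = [img]
    for _ in range(3):
        rotations.append(rotate90(rotations[-1]))
    return rotations + [flip_vertical(r) for r in rotations]


toy = [[1, 2],
       [3, 4]]  # hypothetical 2x2 "image"
variants = geometric_variants(toy)
print(len(variants))
print(3264 // 204)
```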
Augmented Olivetti Faces Dataset
gts.ai
json
Updated May 4, 2024
Cite
GTS (2024). Augmented Olivetti Faces Dataset [Dataset]. https://gts.ai/dataset-download/page/35/
Explore at:
Available download formats: json
Dataset updated
May 4, 2024
Dataset provided by
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
Authors
GTS
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Explore the Augmented Olivetti Faces Dataset with 2000 facial images enhanced by data augmentation techniques.
Lao Character Image Dataset for Classification
kaggle.com
Updated May 23, 2025
+ more versions
Cite
silamany (2025). Lao Character Image Dataset for Classification [Dataset]. https://www.kaggle.com/datasets/silamany/lao-characters/code
Explore at:
Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
Dataset updated
May 23, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
silamany
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Description:

This dataset contains a collection of images featuring individual Lao characters, specifically designed for image classification tasks. The dataset is organized into folders, where each folder is named directly with the Lao character it represents (e.g., a folder named "ກ", a folder named "ຂ", and so on) and contains 100 images of that character.

Content:

The dataset comprises images of 44 distinct Lao characters, including consonants, vowels, and tone marks.

Image Characteristics:

- Resolution: 128x128 pixels
- Format: JPEG (.jpg)
- Appearance: Each image features a white drawn line representing the Lao character against a black background.

Structure:

- The dataset is divided into 44 folders.
- Each folder is named with the actual Lao character it contains.
- Each folder contains 100 images of the corresponding Lao character.
- This results in a total of 4400 images in the dataset.

Potential Use Cases:

- Training and evaluating image classification models for Lao character recognition.
- Developing Optical Character Recognition (OCR) systems for the Lao language.
- Research in computer vision and pattern recognition for Southeast Asian scripts.

Usage Notes / Data Augmentation:

The nature of these images (white characters on a black background) lends itself well to various data augmentation techniques to improve model robustness and performance. Consider applying augmentations such as:

- Geometric Transformations:
  - Zoom (in/out)
  - Height and width shifts
  - Rotation
  - Perspective transforms
- Blurring Effects:
  - Standard blur
  - Motion blur
- Noise Injection:
  - Gaussian noise

Applying these augmentations can help create a more diverse training set and potentially lead to better generalization on unseen data.
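
A minimal sketch of three of the suggested augmentations (a width/height shift, a standard blur and Gaussian noise) is given below for a white-on-black grayscale image stored as nested lists. It uses only the Python standard library so it stays self-contained; the function names, the 3x3 box blur as the "standard blur", and the 5x5 toy stroke are assumptions for illustration, not part of the dataset.

```python
import random


def shift(img, dy, dx, fill=0):
    """Shift the image by (dy, dx) pixels, filling exposed areas with black."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def box_blur(img):
    """Replace each pixel by the integer mean of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(window) // len(window)
    return out


def gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clamp the result to 0..255."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]


# Hypothetical 5x5 image: a white vertical stroke on a black background.
stroke = [[255 if x == 2 else 0 for x in range(5)] for _ in range(5)]

shifted = shift(stroke, dy=0, dx=1)
blurred = box_blur(stroke)
noisy = gaussian_noise(stroke, sigma=10.0, rng=random.Random(0))
```

Because the background is black, filling the exposed border with 0 after a shift keeps the augmented image consistent with the originals.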
Data augmentation for Multi-Classification of Non-Functional Requirements -...
zenodo.org
investigacion.usc.gal
+2 more
csv
Updated Mar 19, 2024
Cite
María-Isabel Limaylla-Lunarejo; Nelly Condori-Fernandez; Miguel R. Luaces (2024). Data augmentation for Multi-Classification of Non-Functional Requirements - Dataset [Dataset]. http://doi.org/10.5281/zenodo.10805331
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10805331
Dataset updated
Mar 19, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
María-Isabel Limaylla-Lunarejo; Nelly Condori-Fernandez; Miguel R. Luaces
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
There are four datasets:

1. Dataset_structure indicates the structure of the datasets, such as column name, type, and value.

2. Spanish_promise_exp_nfr_train and Spanish_promise_exp_nfr_test are the non-functional requirements of the Promise_exp[1] dataset translated into the Spanish language.

3. Blanced_promise_exp_nfr_train is the new balanced version of Spanish_promise_exp_nfr_train, in which data augmentation with ChatGPT was applied to add requirements to the classes with little data and random undersampling was used to remove requirements from the larger classes.
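
The random undersampling half of that balancing step can be sketched as follows. This is only an illustration under stated assumptions: the row format, the `label` key, the class names and the target count are hypothetical, and the ChatGPT-based augmentation of small classes is not shown.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key, target, seed=0):
    """Randomly keep at most `target` rows per class; smaller classes are untouched."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    balanced = []
    for label, group in by_label.items():
        if len(group) > target:
            group = rng.sample(group, target)
        balanced.extend(group)
    return balanced


# Hypothetical requirements: one large class and one small class.
rows = (
    [{"text": f"security requirement {i}", "label": "Security"} for i in range(6)]
    + [{"text": f"usability requirement {i}", "label": "Usability"} for i in range(2)]
)

balanced = undersample(rows, label_key="label", target=3)
counts = Counter(row["label"] for row in balanced)
print(counts["Security"], counts["Usability"])
```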
Algorithm comparison with average augmentation time per dataset and tunable...
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Algorithm comparison with average augmentation time per dataset and tunable parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t004
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Algorithm comparison with average augmentation time per dataset and tunable parameters.
SDFVD2.0: Extension of Small Scale Deep Fake Video Dataset
data.mendeley.com
Updated Jan 27, 2025
Cite
Shilpa Kaman (2025). SDFVD2.0: Extension of Small Scale Deep Fake Video Dataset [Dataset]. http://doi.org/10.17632/zzb7jyy8w8.1
Explore at:
Unique identifier
https://doi.org/10.17632/zzb7jyy8w8.1
Dataset updated
Jan 27, 2025
Authors
Shilpa Kaman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SDFVD 2.0 is an augmented extension of the original SDFVD dataset, which contained 53 real and 53 fake videos. This new version was created to enhance the diversity and robustness of the dataset by applying augmentation techniques to the original videos: horizontal flip, rotation, shear, brightness and contrast adjustment, additive Gaussian noise, and downscaling and upscaling. These augmentations help simulate a wider range of conditions and variations, making the dataset more suitable for training and evaluating deep learning models for deepfake detection. The process expanded the dataset to 461 real and 461 forged videos, providing a richer and more varied collection of video data for deepfake detection research and development.

Dataset Structure

The dataset is organized into two main directories: real and fake, each containing the original and augmented videos. Each augmented video file is named following the pattern: ‘
Tree detected
zenodo.org
bin
Updated Aug 1, 2024
Cite
Zenodo (2024). Tree detected [Dataset]. http://doi.org/10.5281/zenodo.13149720
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.13149720
Dataset updated
Aug 1, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
May 25, 2024
Description
The repository contains a dataset of labeled trees and the related trained YOLO models 'best.pt' and 'last.pt'. The training images are generated from the original images using geometric transformations from the Albumentations library. The folder "train_original" contains the original images without any augmentations.

In the file names, the presence of the suffix "aug" indicates that the image has been modified from its original version. For example, "DJI_20240525115520_0002_aug_0.jpg" is an augmented image, meaning that the original image "DJI_20240525115520_0002.jpg" has been altered using various augmentation techniques to create this new version.
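
The naming convention above can be used to map an augmented image back to its source. The sketch below assumes, from the single example given, that augmented names follow `<original stem>_aug_<number><extension>`; the regular expression and the function name are illustrative and not part of the repository.

```python
import re

# Assumed pattern, inferred from the one example file name in the description.
AUG_PATTERN = re.compile(r"^(?P<stem>.+)_aug_(?P<index>\d+)(?P<ext>\.[^.]+)$")


def original_name(filename):
    """Return the original file name for an augmented image, or None otherwise."""
    match = AUG_PATTERN.match(filename)
    if match is None:
        return None  # no "_aug_<number>" suffix, so treat it as an original
    return match.group("stem") + match.group("ext")


augmented = "DJI_20240525115520_0002_aug_0.jpg"
print(original_name(augmented))
print(original_name("DJI_20240525115520_0002.jpg"))
```

Grouping files by the value returned here is one way to keep an original and all of its augmented copies on the same side of a train/validation split.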

Data from: Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C–H Acids

24 scholarly articles cite this dataset (View in Google Scholar)

