100+ datasets found

Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
search.dataone.org
+2 more
zip
Updated Feb 22, 2024
Cite
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2ngf1vhwk
Dataset updated
Feb 22, 2024
Dataset provided by
Osaka University
Nagoya University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License
https://spdx.org/licenses/CC0-1.0.html
Description
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
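
The random augmentation scheme described in the abstract (one technique picked at random for each sample during mini-batch training) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, noise scales, segment count and window shape are all assumptions, and time-warping is omitted for brevity.

```python
import numpy as np

def scaling(x, rng, sigma=0.1):
    # Multiply each axis by a random factor close to 1 (sigma is an assumed value).
    factors = rng.normal(1.0, sigma, size=(1, x.shape[1]))
    return x * factors

def jittering(x, rng, sigma=0.05):
    # Add independent Gaussian noise to every reading (sigma is an assumed value).
    return x + rng.normal(0.0, sigma, size=x.shape)

def permutation(x, rng, n_segments=4):
    # Split the window into segments along time and shuffle their order.
    segments = np.array_split(x, n_segments, axis=0)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order], axis=0)

def rotation(x, rng):
    # Apply a random orthogonal 3x3 matrix (QR factor of a Gaussian matrix) to the axes.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return x @ q

# "none" is represented by the identity function.
AUGMENTATIONS = [lambda x, rng: x, scaling, jittering, permutation, rotation]

def random_augment(batch, rng):
    # Apply one randomly chosen augmentation to each window in the mini-batch.
    out = np.empty_like(batch)
    for i, window in enumerate(batch):
        augment = AUGMENTATIONS[rng.integers(len(AUGMENTATIONS))]
        out[i] = augment(window, rng)
    return out

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 50, 3))  # 8 windows, 50 time steps, 3 axes (hypothetical shape)
augmented = random_augment(batch, rng)
```

Every transformation here preserves the window shape, so the augmented batch can be fed to the same model as the raw one.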
Data from: Variable Message Signal annotated images for object detection
zenodo.org
portalcientifico.universidadeuropea.com
zip
Updated Oct 2, 2022
Cite
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5904211
Dataset updated
Oct 2, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.

The folder structure of the dataset is as follows:

vms_dataset/
    data.csv
    real_images/
        imgs/
        annotations/
    data-augmentation/
        imgs/
        annotations/

In which:

data.csv: Each row contains the following information separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text.

real_images: Images extracted directly from the videos.

data-augmentation: Images created using data augmentation.

imgs: Image files in .jpg format.

annotations: Annotation files in .xml format.
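
As a rough illustration of how the PASCAL VOC annotation files described above might be read, the sketch below parses one annotation with Python's standard library and maps it to the column names listed for data.csv. The sample XML, the file name and the class name `vms` are hypothetical stand-ins; only the PASCAL VOC element names (`object`, `name`, `bndbox`, `xmin`, ...) are standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout (not taken from the dataset).
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>vms</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    # Return one dict per annotated object, keyed like the data.csv columns.
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        boxes.append({
            "image_name": root.findtext("filename"),
            "x_min": int(bndbox.findtext("xmin")),
            "y_min": int(bndbox.findtext("ymin")),
            "x_max": int(bndbox.findtext("xmax")),
            "y_max": int(bndbox.findtext("ymax")),
            "class_name": obj.findtext("name"),
        })
    return boxes

boxes = parse_voc(SAMPLE_XML)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.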
Data augmentation recommendations for data type and model type.
figshare.com
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Data augmentation recommendations for data type and model type. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t005
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data augmentation recommendations for data type and model type.
Data from: Data augmentation for disruption prediction via robust surrogate...
dataverse.harvard.edu
osti.gov
Updated Aug 31, 2024
Cite
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
Unique identifier
https://doi.org/10.7910/DVN/FMJCAD
Dataset updated
Aug 31, 2024
Dataset provided by
Harvard Dataverse
Authors
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
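
The step of sampling each dimension independently and imposing correlations afterwards can be illustrated with a standard coloring transformation. The sketch below is a generic NumPy example under stated assumptions: it uses Gaussian samples and a Cholesky factor, whereas the work described above uses Student-t process regression, and the target covariance values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target covariance between three signals (hypothetical, positive definite).
target_cov = np.array([[1.0, 0.6, 0.2],
                       [0.6, 2.0, 0.5],
                       [0.2, 0.5, 1.5]])

# Uncorrelated, unit-variance samples: one row per draw, one column per dimension.
white = rng.standard_normal(size=(100_000, 3))

# Coloring: with target_cov = chol @ chol.T, the rows of white @ chol.T
# have covariance target_cov.
chol = np.linalg.cholesky(target_cov)
colored = white @ chol.T

# Empirical covariance of the colored samples, for comparison with the target.
empirical_cov = np.cov(colored, rowvar=False)
```

Any matrix square root of the target covariance would work in place of the Cholesky factor; Cholesky is used here only because it is cheap and readily available.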
Data Augmentation Tools Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). Data Augmentation Tools Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-augmentation-tools-market
Available download formats: pdf, pptx, csv
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Augmentation Tools Market Outlook

According to our latest research, the global Data Augmentation Tools market size reached USD 1.62 billion in 2024, with a robust year-on-year growth trajectory. The market is poised for accelerated expansion, projected to achieve a CAGR of 26.4% from 2025 to 2033. By the end of 2033, the market is forecasted to reach approximately USD 12.34 billion. This dynamic growth is primarily driven by the rising demand for artificial intelligence (AI) and machine learning (ML) applications across diverse industry verticals, which necessitate vast quantities of high-quality training data. The proliferation of data-centric AI models and the increasing complexity of real-world datasets are compelling enterprises to invest in advanced data augmentation tools to enhance data diversity and model robustness, as per the latest research insights.

One of the principal growth factors fueling the Data Augmentation Tools market is the intensifying adoption of AI-driven solutions across industries such as healthcare, automotive, retail, and finance. Organizations are increasingly leveraging data augmentation to overcome the challenges posed by limited or imbalanced datasets, which are often a bottleneck in developing accurate and reliable AI models. By synthetically expanding training datasets through augmentation techniques, enterprises can significantly improve the generalization capabilities of their models, leading to enhanced performance and reduced risk of overfitting. Furthermore, the surge in computer vision, natural language processing, and speech recognition applications is creating a fertile environment for the adoption of specialized augmentation tools tailored to image, text, and audio data.

Another significant factor contributing to market growth is the rapid evolution of augmentation technologies themselves. Innovations such as Generative Adversarial Networks (GANs), automated data labeling, and domain-specific augmentation pipelines are making it easier for organizations to deploy and scale data augmentation strategies. These advancements are not only reducing the manual effort and expertise required but also enabling the generation of highly realistic synthetic data that closely mimics real-world scenarios. As a result, businesses across sectors are able to accelerate their AI/ML development cycles, reduce costs associated with data collection and labeling, and maintain compliance with stringent data privacy regulations by minimizing the need to use sensitive real-world data.

The growing integration of data augmentation tools within cloud-based AI development platforms is also acting as a major catalyst for market expansion. Cloud deployment offers unparalleled scalability, accessibility, and collaboration capabilities, allowing organizations of all sizes to harness the power of data augmentation without significant upfront infrastructure investments. This democratization of advanced data engineering tools is especially beneficial for small and medium enterprises (SMEs) and academic research institutes, which often face resource constraints. The proliferation of cloud-native augmentation solutions is further supported by strategic partnerships between technology vendors and cloud service providers, driving broader market penetration and innovation.

From a regional perspective, North America continues to dominate the Data Augmentation Tools market, driven by the presence of leading AI technology companies, a mature digital infrastructure, and substantial investments in research and development. However, the Asia Pacific region is emerging as the fastest-growing market, fueled by rapid digital transformation initiatives, a burgeoning startup ecosystem, and increasing government support for AI innovation. Europe also holds a significant share, underpinned by strong regulatory frameworks and a focus on ethical AI development. Meanwhile, Latin America and the Middle East & Africa are witnessing steady adoption, particularly in sectors such as BFSI and healthcare, where data-driven insights are becoming increasingly critical.

Component Analysis

The Data Augmentation Tools market by component is bifurcated into Software and Services. The software segment currently accounts for the largest share of the market, owing to the widespread deployment of standalone and integrated augmentation solutions across enterprises and research institutions. These software plat
ECG Augmented Dataset
kaggle.com
zip
Updated Oct 7, 2025
Cite
sidali Khelil cherfi (2025). ECG Augmented Dataset [Dataset]. https://www.kaggle.com/datasets/sidalikhelilcherfi/ecg-augmented
Available download formats: zip (5174909523 bytes)
Dataset updated
Oct 7, 2025
Authors
sidali Khelil cherfi
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
🩺 Dataset Description

This dataset is an augmented version of an ECG image dataset created to balance and enrich the original classes for deep learning–based cardiovascular disease classification.

The original dataset consisted of unbalanced image counts per class in the training set:
- ABH: 233 images
- MI: 239 images
- HMI: 172 images
- NORM: 284 images

To improve class balance and model generalization, each class in the training set was expanded to 500 images using a combination of morphological, noise-based, and geometric data augmentation techniques. Additionally, the test set includes 112 images per class.

⚖️ Final Dataset Composition

Training set: 4 classes × 500 images each → 2,000 images total

Test set: 4 classes × 112 images each → 448 images total

🔬 Data Augmentation Techniques

1. Morphological Alterations
- Erosion
- Dilation
- None (original preserved)

2. Noise Introduction
- augment_noise_black_rain: simulates black streaks
- augment_noise_pixel_dropout_black: random black pixel dropout
- augment_noise_white_rain: simulates white streaks
- augment_noise_pixel_dropout_white: random white pixel dropout

3. Geometric Transformations
- Shift: small translations in all directions
- Scale: random zoom-in/zoom-out between 0.9× and 1.1×
- Rotate: small random rotation between -5° and +5°

These transformations were applied with balanced proportions to ensure diversity and realism while preserving diagnostic features of ECG signals.
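
To make the balancing arithmetic concrete, the sketch below computes how many augmented images each class needs to reach 500, and shows one possible form of a black pixel-dropout augmentation. The function body is a hypothetical stand-in for the dataset's `augment_noise_pixel_dropout_black`, whose actual implementation is not given here, and the dropout probability is an assumed value.

```python
import numpy as np

ORIGINAL_COUNTS = {"ABH": 233, "MI": 239, "HMI": 172, "NORM": 284}
TARGET_PER_CLASS = 500

# Number of augmented images needed per class to reach the target.
needed = {name: TARGET_PER_CLASS - count for name, count in ORIGINAL_COUNTS.items()}

def pixel_dropout_black(image, rng, drop_prob=0.02):
    # Set a random subset of pixels to 0 (black); drop_prob is an assumed value.
    mask = rng.random(image.shape[:2]) < drop_prob
    out = image.copy()
    out[mask] = 0
    return out

rng = np.random.default_rng(0)
image = np.full((64, 64), 255, dtype=np.uint8)  # blank white stand-in for an ECG image
noisy = pixel_dropout_black(image, rng)
```

Under these counts, HMI needs the most synthetic images (328) and NORM the fewest (216), so the augmented share differs noticeably between classes.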

💡 Intended Use

This dataset is designed for:
- Training and evaluating deep learning models (CNNs, ViTs) for ECG image classification
- Research in medical image augmentation, imbalanced data learning, and cardiovascular disease prediction

📘 License

This dataset is released under the CC0 1.0 License, allowing free use and distribution for research and educational purposes.
Paired-Embedding (PE) method for data augmentation and Act2Act network -...
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Paired-Embedding (PE) method for data augmentation and Act2Act network - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/paired-embedding--pe--method-for-data-augmentation-and-act2act-network
Dataset updated
Dec 16, 2024
Description
Paired-Embedding (PE) method for effective and reliable data augmentation; the Act2Act network learns from the augmented data.
Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi...
data-staging.niaid.nih.gov
nde-dev.biothings.io
+2 more
Updated Apr 4, 2025
Cite
Strohmayer, Julian; Kampel, Martin (2025). Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8188998
Dataset updated
Apr 4, 2025
Dataset provided by
Computer Vision Lab, TU Wien
Authors
Strohmayer, Julian; Kampel, Martin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the Wallhack1.8k dataset for WiFi-based long-range activity recognition in Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS)/Through-Wall scenarios, as proposed in [1,2], as well as the CAD models (of 3D-printable parts) of the WiFi systems proposed in [2].

PyTorch Dataloader

A minimal PyTorch dataloader for the Wallhack1.8k dataset is provided at: https://github.com/StrohmayerJ/wallhack1.8k

Dataset Description

The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).

To assess cross-scenario and cross-system generalization, WiFi packet sequences were collected in LoS and through-wall (NLoS) scenarios, utilizing two different WiFi systems (BQ: biquad antenna and PIFA: printed inverted-F antenna). The dataset is structured accordingly:

LOS/BQ/ <- WiFi packets collected in the LoS scenario using the BQ system

LOS/PIFA/ <- WiFi packets collected in the LoS scenario using the PIFA system

NLOS/BQ/ <- WiFi packets collected in the NLoS scenario using the BQ system

NLOS/PIFA/ <- WiFi packets collected in the NLoS scenario using the PIFA system

These directories contain the raw WiFi packet time series (see Table 1). Each row represents a single WiFi packet with the complex CSI vector H being stored in the "data" field and the class label being stored in the "class" field. H is of the form [I, R, I, R, ..., I, R], where two consecutive entries represent imaginary and real parts of complex numbers (the Channel Frequency Responses of subcarriers). Taking the absolute value of H (e.g., via numpy.abs(H)) yields the subcarrier amplitudes A.

To extract the 52 L-LTF subcarriers used in [1], the following indices of A are to be selected:

# 52 L-LTF subcarriers
csi_valid_subcarrier_index = []
csi_valid_subcarrier_index += [i for i in range(6, 32)]
csi_valid_subcarrier_index += [i for i in range(33, 59)]

Additional 56 HT-LTF subcarriers can be selected via:

# 56 HT-LTF subcarriers
csi_valid_subcarrier_index += [i for i in range(66, 94)]
csi_valid_subcarrier_index += [i for i in range(95, 123)]

For more details on subcarrier selection, see ESP-IDF (Section Wi-Fi Channel State Information) and esp-csi.
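
The amplitude extraction and subcarrier selection described above can be sketched as follows. This assumes the interleaved [I, R, I, R, ...] entries are first paired into complex numbers before taking the absolute value, which is one reading of the description rather than code from the dataset repository. The function name, the synthetic packet and its length of 192 complex values are assumptions made for illustration.

```python
import numpy as np

def csi_amplitudes(h_interleaved):
    # Entries alternate [imag, real, imag, real, ...]; pair them into complex numbers.
    h = np.asarray(h_interleaved, dtype=float)
    imag, real = h[0::2], h[1::2]
    return np.abs(real + 1j * imag)

# 52 L-LTF subcarrier indices, as listed above.
l_ltf_index = list(range(6, 32)) + list(range(33, 59))
# 56 HT-LTF subcarrier indices, as listed above.
ht_ltf_index = list(range(66, 94)) + list(range(95, 123))

# Synthetic packet standing in for one row's "data" field (assumed length).
rng = np.random.default_rng(0)
packet = rng.integers(-128, 128, size=2 * 192)

A = csi_amplitudes(packet)               # one amplitude per subcarrier
A_l_ltf = A[l_ltf_index]                 # 52 L-LTF amplitudes
A_all = A[l_ltf_index + ht_ltf_index]    # 52 + 56 = 108 amplitudes
```

Stacking 400 consecutive amplitude vectors (about 4 seconds at 100 Hz) would give one spectrogram-sized array of the kind described in the dataset overview.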

Extracted amplitude spectrograms with the corresponding label files of the train/validation/test split: "trainLabels.csv," "validationLabels.csv," and "testLabels.csv," can be found in the spectrograms/ directory.

The columns in the label files correspond to the following: [Spectrogram index, Class label, Room label]

Spectrogram index: [0, ..., n]

Class label: [0,1,2], where 0 = "no presence", 1 = "walking", and 2 = "walking + arm-waving."

Room label: [0,1,2,3,4,5], where labels 1-5 correspond to the room number in the NLoS scenario (see Fig. 3 in [1]). The label 0 corresponds to no room and is used for the "no presence" class.

Dataset Overview:

Table 1: Raw WiFi packet sequences.

Scenario | System | "no presence" / label 0 | "walking" / label 1 | "walking + arm-waving" / label 2 | Total
LoS | BQ | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
LoS | PIFA | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
NLoS | BQ | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
NLoS | PIFA | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
Total | | 4 | 20 | 20 | 44

Table 2: Sample/Spectrogram distribution across activity classes in Wallhack1.8k.

Scenario | System | "no presence" / label 0 | "walking" / label 1 | "walking + arm-waving" / label 2 | Total
LoS | BQ | 149 | 154 | 155 |
LoS | PIFA | 149 | 160 | 152 |
NLoS | BQ | 148 | 150 | 152 |
NLoS | PIFA | 143 | 147 | 147 |
Total | | 589 | 611 | 606 | 1,806

Download and Use

This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to one of our papers [1,2].

[1] Strohmayer, Julian, and Martin Kampel. (2024). “Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition”, In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 42-56). Cham: Springer Nature Switzerland, doi: https://doi.org/10.1007/978-3-031-63211-2_4.

[2] Strohmayer, Julian, and Martin Kampel. (2024). “Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition”, In 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates (pp. 3594-3599), doi: https://doi.org/10.1109/ICIP51287.2024.10647666.

BibTeX citations:

@inproceedings{strohmayer2024data,
  title={Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition},
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={IFIP International Conference on Artificial Intelligence Applications and Innovations},
  pages={42--56},
  year={2024},
  organization={Springer}
}

@INPROCEEDINGS{10647666,
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={2024 IEEE International Conference on Image Processing (ICIP)},
  title={Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition},
  year={2024},
  volume={},
  number={},
  pages={3594-3599},
  keywords={Visualization;Accuracy;System performance;Directional antennas;Directive antennas;Reflector antennas;Sensors;Human Activity Recognition;WiFi;Channel State Information;Through-Wall Sensing;ESP32},
  doi={10.1109/ICIP51287.2024.10647666}
}
Tied-Augment: Controlling Representation Similarity Improves Data...
service.tib.eu
resodate.org
Updated Dec 3, 2024
Cite
(2024). Tied-Augment: Controlling Representation Similarity Improves Data Augmentation - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/tied-augment--controlling-representation-similarity-improves-data-augmentation
Dataset updated
Dec 3, 2024
Description
Data augmentation methods have played an important role in the recent advance of deep learning models, and have become an indispensable component of state-of-the-art models in semi-supervised, self-supervised, and supervised training for vision.
Data from: New Deep Learning Methods for Medical Image Analysis and...
curate.nd.edu
pdf
Updated Nov 11, 2024
Cite
Pengfei Gu (2024). New Deep Learning Methods for Medical Image Analysis and Scientific Data Generation and Compression [Dataset]. http://doi.org/10.7274/26156719.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.7274/26156719.v1
Dataset updated
Nov 11, 2024
Dataset provided by
University of Notre Dame
Authors
Pengfei Gu
License
https://www.law.cornell.edu/uscode/text/17/106
Description
Medical image analysis is critical to biological studies, health research, computer-aided diagnoses, and clinical applications. Recently, deep learning (DL) techniques have achieved remarkable successes in medical image analysis applications. However, these techniques typically require large amounts of annotations to achieve satisfactory performance. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for medical image analysis while reducing annotation efforts? To address this problem, we have outlined two specific aims: (A1) Utilize existing annotations effectively from advanced models; (A2) extract generic knowledge directly from unannotated images.

To achieve the aim (A1): First, we introduce a new data representation called TopoImages, which encodes the local topology of all the image pixels. TopoImages can be complemented with the original images to improve medical image analysis tasks. Second, we propose a new augmentation method, SAMAug-C, that leverages the Segment Anything Model (SAM) to augment raw image input and enhance medical image classification. Third, we propose two advanced DL architectures, kCBAC-Net and ConvFormer, to enhance the performance of 2D and 3D medical image segmentation. We also present a gate-regularized network training (GrNT) approach to improve multi-scale fusion in medical image segmentation. To achieve the aim (A2), we propose a novel extension of known Masked Autoencoders (MAEs) for self pre-training, i.e., models pre-trained on the same target dataset, specifically for 3D medical image segmentation.

Scientific visualization is a powerful approach for understanding and analyzing various physical or natural phenomena, such as climate change or chemical reactions. However, the cost of scientific simulations is high when factors like time, ensemble, and multivariate analyses are involved. Additionally, scientists can only afford to sparsely store the simulation outputs (e.g., scalar field data) or visual representations (e.g., streamlines) or visualization images due to limited I/O bandwidths and storage space. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for scientific data generation and compression while reducing simulation and storage costs?

To tackle this problem: First, we propose a DL framework that generates unsteady vector fields data from a set of streamlines. Based on this method, domain scientists only need to store representative streamlines at simulation time and reconstruct vector fields during post-processing. Second, we design a novel DL method that translates scalar fields to vector fields. Using this approach, domain scientists only need to store scalar field data at simulation time and generate vector fields from their scalar field counterparts afterward. Third, we present a new DL approach that compresses a large collection of visualization images generated from time-varying data for communicating volume visualization results.
Comparative results for time domain transformation-based data augmentation...
plos.figshare.com
xls
Updated Jun 7, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Comparative results for time domain transformation-based data augmentation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t002
Dataset updated
Jun 7, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparative results for time domain transformation-based data augmentation methods.
Data from: Class-specific data augmentation for plant stress classification
zenodo.org
zip
Updated Sep 21, 2024
Cite
Nasla Saleem; Baskar Ganapathysubramanian; Zaki Jubery; Aditya Balu; Soumik Sarkar; Arti Singh; Asheesh Singh (2024). Class-specific data augmentation for plant stress classification [Dataset]. http://doi.org/10.5281/zenodo.13823148
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13823148
Dataset updated
Sep 21, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nasla Saleem; Baskar Ganapathysubramanian; Zaki Jubery; Aditya Balu; Soumik Sarkar; Arti Singh; Asheesh Singh
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a companion dataset for the paper titled "Class-specific data augmentation for plant stress classification" by Nasla Saleem, Aditya Balu, Talukder Zaki Jubery, Arti Singh, Asheesh K. Singh, Soumik Sarkar, and Baskar Ganapathysubramanian published in The Plant Phenome Journal, https://doi.org/10.1002/ppj2.20112

Abstract:

Data augmentation is a powerful tool for improving deep learning-based image classifiers for plant stress identification and classification. However, selecting an effective set of augmentations from a large pool of candidates remains a key challenge, particularly in imbalanced and confounding datasets. We propose an approach for automated class-specific data augmentation using a genetic algorithm. We demonstrate the utility of our approach on soybean [Glycine max (L.) Merr] stress classification where symptoms are observed on leaves; a particularly challenging problem due to confounding classes in the dataset. Our approach yields substantial performance, achieving a mean-per-class accuracy of 97.61% and an overall accuracy of 98% on the soybean leaf stress dataset. Our method significantly improves the accuracy of the most challenging classes, with notable enhancements from 83.01% to 88.89% and from 85.71% to 94.05%, respectively. A key observation we make in this study is that high-performing augmentation strategies can be identified in a computationally efficient manner. We fine-tune only the linear layer of the baseline model with different augmentations, thereby reducing the computational burden associated with training classifiers from scratch for each augmentation policy while achieving exceptional performance. This research represents an advancement in automated data augmentation strategies for plant stress classification, particularly in the context of confounding datasets. Our findings contribute to the growing body of research in tailored augmentation techniques and their potential impact on disease management strategies, crop yields, and global food security. The proposed approach holds the potential to enhance the accuracy and efficiency of deep learning-based tools for managing plant stresses in agriculture.
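
The abstract reports both a mean-per-class accuracy and an overall accuracy; on imbalanced data the two can differ substantially. The sketch below shows the distinction on toy labels. The function names and the label arrays are hypothetical and unrelated to the soybean dataset.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    # Fraction of all samples that are classified correctly.
    return float(np.mean(y_true == y_pred))

def mean_per_class_accuracy(y_true, y_pred):
    # Average of the per-class accuracies, so each class counts equally.
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

# Toy imbalanced example: 8 samples of class 0, 2 samples of class 1.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.array([0] * 8 + [0, 1])

overall = overall_accuracy(y_true, y_pred)
per_class_mean = mean_per_class_accuracy(y_true, y_pred)
```

In this toy case the overall accuracy is 0.9 while the mean-per-class accuracy is 0.75, because the minority class is only half correct.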
SVD-Generated Video Dataset
kaggle.com
zip
Updated May 11, 2025
Cite
Afnan Algharbi (2025). SVD-Generated Video Dataset [Dataset]. https://www.kaggle.com/datasets/afnanalgarby/svd-generated-video-dataset
Explore at:
zip (102546508 bytes)
Dataset updated
May 11, 2025
Authors
Afnan Algharbi
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains synthetic video samples generated from a 10-class subset of Tiny ImageNet using Stable Video Diffusion (SVD). It is designed to evaluate the impact of generative temporal augmentation on image classification performance.

Each training and validation video corresponds to a single image augmented into a sequence of frames.

Videos are stored in .mp4 format and labeled via train.csv and val.csv.

Sources:

Tiny ImageNet: Stanford CS231n

SVD model: Stable Video Diffusion

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Augmented Hand-Drawn Data for Parkinson’s Disease
kaggle.com
zip
Updated Jun 11, 2025
Cite
Abdulkhalek Mugahed (2025). Augmented Hand-Drawn Data for Parkinson’s Disease [Dataset]. https://www.kaggle.com/datasets/abdulkhalekmugahed/augmented-hand-drawn-data-for-parkinsons-disease
Explore at:
zip (331533755 bytes)
Dataset updated
Jun 11, 2025
Authors
Abdulkhalek Mugahed
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
K. Scott Mader created the original dataset of 204 hand-drawn images for Parkinson’s disease diagnosis, consisting of two classes: Healthy and Parkinson. The dataset includes spiral and wave drawings. For my thesis, the original 204 images were expanded to 3,264 across the same two classes. This increase was achieved through data augmentation techniques, including rotations of 90°, 180°, and 270°, vertical flipping at 180°, and conversion to color images. The augmented data gives the model more opportunities to generalize, enhancing training and testing processes.
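Going from 204 to 3,264 images is a 16-fold expansion (3,264 / 204 = 16). The description does not say exactly how rotations, flipping and color conversion were combined to reach that factor, so the following is only a sketch of the geometric part: it produces the eight rotation and flip variants of a single image stored as a list of rows. The function names are hypothetical, and the color-conversion step is left out.

```python
def rotate90(img):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_vertical(img):
    """Flip a 2D image top to bottom."""
    return [list(row) for row in img[::-1]]

def geometric_variants(img):
    """Return the image at 0, 90, 180 and 270 degrees, each with and without a vertical flip."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_vertical(current))
        current = rotate90(current)
    return variants

# Toy 2x2 "image" with four distinct pixel values.
img = [[1, 2], [3, 4]]
variants = geometric_variants(img)
print(len(variants))  # number of geometric variants per input image
```

Whether the thesis pipeline used these exact eight variants is an assumption; the sketch only shows that rotations and flips alone give at most eight distinct versions of an image.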
COVID-19 Chest CT image Augmentation GAN Dataset
kaggle.com
zip
Updated Jan 31, 2021
Cite
Mohamed Loey (2021). COVID-19 Chest CT image Augmentation GAN Dataset [Dataset]. https://www.kaggle.com/mloey1/covid19-chest-ct-image-augmentation-gan-dataset
Explore at:
zip (1914822990 bytes)
Dataset updated
Jan 31, 2021
Authors
Mohamed Loey
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Note: please do not claim diagnostic performance of a model without a clinical study! This is not a kaggle competition dataset. Please read our paper: Loey, M., Manogaran, G. & Khalifa, N.E.M. A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images. Neural Comput & Applic (2020). https://doi.org/10.1007/s00521-020-05437-x

Khalifa, N.E.M., Smarandache, F., Manogaran, G. et al. A Study of the Neutrosophic Set Significance on Deep Transfer Learning Models: an Experimental Case on a Limited COVID-19 Chest X-ray Dataset. Cogn Comput (2021). https://doi.org/10.1007/s12559-020-09802-9

Abstract

The Coronavirus disease 2019 (COVID-19) is the fastest transmittable virus caused by severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2). The detection of COVID-19 using artificial intelligence techniques and especially deep learning will help to detect this virus in early stages which will reflect in increasing the opportunities of fast recovery of patients worldwide. This will lead to release the pressure off the healthcare system around the world. In this research, classical data augmentation techniques along with Conditional Generative Adversarial Nets (CGAN) based on a deep transfer learning model for COVID-19 detection in chest CT scan images will be presented. The limited benchmark datasets for COVID-19 especially in chest CT images are the main motivation of this research. The main idea is to collect all the possible images for COVID-19 that exists until the very writing of this research and use the classical data augmentations along with CGAN to generate more images to help in the detection of the COVID-19. In this study, five different deep convolutional neural network-based models (AlexNet, VGGNet16, VGGNet19, GoogleNet, and ResNet50) have been selected for the investigation to detect the Coronavirus-infected patient using chest CT radiographs digital images. The classical data augmentations along with CGAN improve the performance of classification in all selected deep transfer models. The outcomes show that ResNet50 is the most appropriate deep learning model to detect the COVID-19 from limited chest CT dataset using the classical data augmentation with testing accuracy of 82.91%, sensitivity 77.66%, and specificity of 87.62%.

Context

In this dataset, we introduce DTL models to classify limited COVID-19 chest CT scan digital images. To prepare chest CT images as input to the DCNN, we enriched the medical chest CT images using classical data augmentation and CGAN to generate more CT images. After that, a classifier is used to ensemble the class (COVID/NonCOVID) outputs of the classification outcomes. The proposed DTL models were evaluated on the COVID-19 CT scan images dataset. The novelty of this research is as follows: (1) The introduced DTL models have an end-to-end structure without classical feature extraction and selection methods. (2) We show that data augmentation and conditional generative adversarial networks (CGAN) are effective techniques to generate CT images. (3) Chest CT images are one of the best tools for the classification of COVID-19. (4) The DTL models have been shown to yield very high accuracy in the limited COVID-19 dataset.

Content

There are 742 CT images and 2 categories (COVID/NonCOVID).

| Dataset | Train COVID | Train NonCOVID | Validation COVID | Validation NonCOVID | Test COVID | Test NonCOVID |
|---|---|---|---|---|---|---|
| COVID-19 | 191 | 234 | 60 | 58 | 94 | 105 |
| COVID-19 + Aug | 2292 | 2808 | 720 | 696 | 94 | 105 |
| COVID-19 + CGAN | 2191 | 2234 | 210 | 208 | 94 | 105 |
| COVID-19 + Aug + CGAN | 4292 | 4808 | 870 | 846 | 94 | 105 |
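The abstract quotes a testing accuracy of 82.91%, sensitivity of 77.66% and specificity of 87.62%, and the test split holds 94 COVID and 105 NonCOVID images. The sketch below shows how those three metrics follow from a binary confusion matrix. The counts of 73 true positives and 92 true negatives are an assumption: they are back-calculated here because they reproduce the reported percentages, and they are not stated in the dataset description.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of COVID images detected
    specificity = tn / (tn + fp)   # share of NonCOVID images correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Test split sizes from the dataset description.
n_covid, n_noncovid = 94, 105

# ASSUMPTION: back-calculated counts, not reported by the source.
tp, tn = 73, 92

acc, sens, spec = binary_metrics(tp=tp, fn=n_covid - tp, tn=tn, fp=n_noncovid - tn)
print(f"{acc:.2%} {sens:.2%} {spec:.2%}")  # prints 82.91% 77.66% 87.62%
```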

Acknowledgements

Cite our papers:

Loey, M., Manogaran, G. & Khalifa, N.E.M. A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images. Neural Comput & Applic (2020). https://doi.org/10.1007/s00521-020-05437-x

Loey, Mohamed; Smarandache, Florentin; M. Khalifa, Nour E. 2020. "Within the Lack of Chest COVID-19 X-ray Dataset: A Novel Detection Model Based on GAN and Deep Transfer Learning" Symmetry 12, no. 4: 651. https://doi.org/10.3390/sym12040651

Khalifa, N.E.M., Smarandache, F., Manogaran, G. et al. A Study of the Neutrosophic Set Significance on Deep Transfer Learning Models: an Experimental Case on a Limited COVID-19 Chest X-ray Dataset. Cogn Comput (2021). https://doi.org/10.1007/s12559-020-09802-9

Inspiration

Original Dataset: https://github.com/UCSD-AI4H/COVID-CT

Creating the proposed database present...
TokenMixup: Efficient Attention-guided Token-level Data Augmentation for...
service.tib.eu
resodate.org
Updated Dec 2, 2024
Cite
(2024). TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/tokenmixup--efficient-attention-guided-token-level-data-augmentation-for-transformers
Explore at:
Dataset updated
Dec 2, 2024
Description
Mixup is a commonly adopted data augmentation technique for image classification. Recent advances in mixup methods primarily focus on mixing based on saliency.
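For readers unfamiliar with the baseline technique, the following is a minimal sketch of plain mixup on flat feature vectors and one-hot labels. It is not TokenMixup and does not use saliency or attention; the function name, the `alpha` default and the toy inputs are assumptions made for illustration.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Toy inputs: two 2-pixel "images" with one-hot labels for two classes.
rng = random.Random(0)
x, y, lam = mixup([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], rng=rng)
print(x, y, lam)
```

The mixed label stays a valid probability vector because the two weights, `lam` and `1 - lam`, sum to one.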
3D Data Augmentation for Driving Scenes on Camera - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). 3D Data Augmentation for Driving Scenes on Camera - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/3d-data-augmentation-for-driving-scenes-on-camera
Explore at:
Dataset updated
Dec 2, 2024
Description
Driving scenes are so diverse and complicated that it is impossible to collect all cases with human effort alone. While data augmentation is an effective technique to enrich the training data, existing methods for camera data in autonomous driving applications are confined to the 2D image plane, which may not optimally increase data diversity in 3D real-world scenarios.
Comparative results for pattern mixing-based data augmentation methods.
plos.figshare.com
xls
Updated Jun 10, 2023
Cite
Brian Kenji Iwana; Seiichi Uchida (2023). Comparative results for pattern mixing-based data augmentation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0254841.t003
Explore at:
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0254841.t003
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Brian Kenji Iwana; Seiichi Uchida
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparative results for pattern mixing-based data augmentation methods.
Data from: Phenotype Driven Data Augmentation Methods for Transcriptomic...
data.niaid.nih.gov
Updated Mar 6, 2025
Cite
Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez (2025). Phenotype Driven Data Augmentation Methods for Transcriptomic Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8383202
Explore at:
Dataset updated
Mar 6, 2025
Dataset provided by
IBM Research Europe
ETH Zürich, IBM Research Europe
Authors
Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the data and associated results of all experiments conducted in our work "Phenotype Driven Data Augmentation Methods for Transcriptomic Data". In this work, we introduce two classes of phenotype driven data augmentation approaches – signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. We benchmark our proposed methods against random oversampling, SMOTE, unmodified versions of Gamma-Poisson and Poisson sampling, and unaugmented data.
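To give a feel for the baseline samplers named above, here is a minimal sketch of unmodified Poisson and Gamma-Poisson sampling applied to one vector of gene counts. It is not the authors' modified method: the function names, the `dispersion` parameter and the toy counts are assumptions, and the simple Poisson sampler (Knuth's method) is only suitable for modest rates.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's method; fine for modest rates, not for very large ones."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_augment(counts, rng):
    """Draw a synthetic sample by treating each observed count as a Poisson rate."""
    return [sample_poisson(c, rng) for c in counts]

def gamma_poisson_augment(counts, dispersion, rng):
    """Draw each rate from a Gamma whose mean is the observed count, then a Poisson count."""
    shape = 1.0 / dispersion
    out = []
    for c in counts:
        # Gamma(shape, scale) has mean shape * scale = c; a zero count stays zero.
        rate = rng.gammavariate(shape, c * dispersion) if c > 0 else 0.0
        out.append(sample_poisson(rate, rng))
    return out

# Toy, made-up expression counts for four genes.
rng = random.Random(42)
counts = [0, 3, 10, 25]
print(poisson_augment(counts, rng))
print(gamma_poisson_augment(counts, 0.5, rng))
```

The Gamma-Poisson variant adds extra variance on top of the Poisson noise, which is the usual reason it is preferred for overdispersed count data.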

This repository contains data used for all our experiments. This includes the original data based off which augmentation was performed, the cross validation split indices as a json file, the training and validation data augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples) and an external test set standardised accordingly with respect to each augmentation method and training data per CV split.

The compressed files 5x5stratified_{x}percent.zip contains data that were augmented on x% of the available real data. brca_public.zip contains data used for the breast cancer experiments. distribution_size_effect.zip contains data used for hyperparameter tuning the reference set size for the modified Poisson and Gamma-Poisson methods.

The compressed file results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics (balanced accuracy and auc-roc) computed including p-values, as well as the latent space of train, validation and test (for the (N)VAE) for all 25 (5x5) CV splits.

PLEASE NOTE: If any part of this repository is used in any form for your work, please attribute the following, in addition to attributing the original data source - TCGA, CPTAC, GSE20713 and METABRIC, accordingly:

@article{janakarajan2023signature,
  title={Phenotype Driven Data Augmentation Methods for Transcriptomic Data},
  author={Janakarajan, Nikita and Graziani, Mara and Martinez, Maria Rodriguez},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
DriveMatriX Highway Dataset
kaggle.com
zip
Updated Oct 10, 2024
Cite
Omri Reftov (2024). DriveMatriX Highway Dataset [Dataset]. https://www.kaggle.com/datasets/omrireftov/drivematrix-highway-dataset-1-0
Explore at:
zip (5507322639 bytes)
Dataset updated
Oct 10, 2024
Authors
Omri Reftov
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cognata’s highway driving dataset provides a valuable resource for validating perception models in ADAS and autonomous vehicle (AV) systems. The first version of the dataset includes the original video recorded under clear and sunny conditions, along with augmented versions that simulate challenging scenarios such as fog, rain, and erased lane markings. This allows for comprehensive testing of models across various environmental conditions without the need for additional real-world data collection.

The second version introduces updates to the dataset, including a new augmentation for low sun and a version with class segmentation. This expands the dataset’s utility by enabling even more diverse and precise testing scenarios. Additionally, the dataset license has been updated, ensuring developers have access to the most current terms for usage.

By leveraging this dataset, developers can assess how well their object detection and segmentation models perform under varied conditions, identifying strengths and weaknesses in the models’ ability to handle real-world challenges. The augmentation, powered by DriveMatriX, maintains consistency between the original and augmented videos, enabling seamless transitions between conditions for validation purposes. This ensures perception models can be effectively evaluated in terms of precision, recall, and coverage, providing detailed insights into how well they generalize across different scenarios.

This dataset is particularly suited for validating the robustness of perception algorithms, offering a controlled environment to test how models react to adverse weather conditions. By utilizing this resource, developers can focus on improving model performance based on real-world conditions that typically require extensive and costly data collection efforts.

Please see the blog post on how to use the DriveMatriX Highway Dataset: https://www.cognata.com/enhancing-object-detection-performance-with-data-augmentation-dataset/

Cite

Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk

Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers

Explore at:

zip

Unique identifier

https://doi.org/10.5061/dryad.2ngf1vhwk

Dataset updated

Feb 22, 2024

Dataset provided by

Osaka University
Nagoya University

Authors

Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa

License

https://spdx.org/licenses/CC0-1.0.html

Description

Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
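The description above says that one of several augmentations was randomly applied to each sample during mini-batch training. A minimal sketch of that idea on a window of tri-axial acceleration values follows. It covers only four of the listed options (none, scaling, jittering, permutation); time-warping and rotation are omitted for brevity. All function names, parameter values and the toy window are assumptions, not the authors' implementation.

```python
import random

def identity(window, rng):
    """The "none" option: return an unchanged copy."""
    return [list(sample) for sample in window]

def scale(window, rng, sigma=0.1):
    """Multiply every value by one random factor near 1."""
    factor = rng.gauss(1.0, sigma)
    return [[v * factor for v in sample] for sample in window]

def jitter(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value."""
    return [[v + rng.gauss(0.0, sigma) for v in sample] for sample in window]

def permute(window, rng, n_segments=4):
    """Cut the window into roughly equal segments and shuffle their order."""
    size = max(1, len(window) // n_segments)
    segments = [window[i:i + size] for i in range(0, len(window), size)]
    rng.shuffle(segments)
    return [list(sample) for seg in segments for sample in seg]

AUGMENTATIONS = [identity, scale, jitter, permute]

def random_augment(window, rng):
    """Apply one augmentation, picked uniformly at random, to a single window."""
    return rng.choice(AUGMENTATIONS)(window, rng)

# Toy window: 8 time steps of (x, y, z) acceleration.
rng = random.Random(0)
window = [[float(i), 0.0, 1.0] for i in range(8)]
augmented = random_augment(window, rng)
print(len(augmented), len(augmented[0]))
```

In a training loop, `random_augment` would be called once per sample each time a mini-batch is assembled, so the same window can be seen with a different transformation in every epoch.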
