100+ datasets found

Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
search.dataone.org
+2 more
zip
Updated Feb 22, 2024
Cite
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2ngf1vhwk
Dataset updated
Feb 22, 2024
Dataset provided by
Osaka University
Nagoya University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License
https://spdx.org/licenses/CC0-1.0.html
Description
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
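
The random augmentation described in the abstract (one technique picked at random for each sample during mini-batch training) could be sketched roughly as below. This is a minimal illustration in plain Python, not the authors' code: the function names, the noise levels, the segment count, and the rotation restricted to the z axis are all assumptions, and time-warping is left out for brevity.

```python
import math
import random

# A window is a list of (x, y, z) acceleration samples.

def identity(window, rng):
    """'none' option: return the window unchanged."""
    return list(window)

def scaling(window, rng, sigma=0.1):
    """Multiply the whole window by one random factor (sigma is an assumption)."""
    factor = rng.gauss(1.0, sigma)
    return [tuple(v * factor for v in sample) for sample in window]

def jittering(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value (sigma is an assumption)."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in sample) for sample in window]

def permutation(window, rng, n_segments=4):
    """Cut the window into contiguous segments and shuffle their order."""
    size = math.ceil(len(window) / n_segments)
    segments = [window[i:i + size] for i in range(0, len(window), size)]
    rng.shuffle(segments)
    return [sample for segment in segments for sample in segment]

def rotation(window, rng):
    """Rotate every sample by one random angle about the z axis (a simplification)."""
    theta = rng.uniform(-math.pi, math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in window]

AUGMENTATIONS = [identity, scaling, jittering, permutation, rotation]

def augment_batch(batch, rng):
    """Apply one randomly chosen augmentation to each window in the batch."""
    return [rng.choice(AUGMENTATIONS)(window, rng) for window in batch]

rng = random.Random(0)
window = [(float(i), 1.0, 2.0) for i in range(10)]
augmented = augment_batch([window] * 8, rng)
```

Every function keeps the window length unchanged, so the augmented batch can be fed to the same model input as the raw data.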
Data from: Data augmentation for disruption prediction via robust surrogate...
dataverse.harvard.edu
osti.gov
Updated Aug 31, 2024
Cite
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/FMJCAD
Dataset updated
Aug 31, 2024
Dataset provided by
Harvard Dataverse
Authors
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
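
The step of drawing each dimension independently and imposing correlations afterwards can be illustrated with a small sketch. It assumes Gaussian white samples and a Cholesky factor as the coloring transformation; the description does not say which transformation the authors use, and their samples come from Student-t process regression rather than the toy generator here. All names and the target covariance are hypothetical.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular factor L with L * L^T == cov (cov must be positive definite)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - acc)
            else:
                low[i][j] = (cov[i][j] - acc) / low[j][j]
    return low

def color(samples, cov):
    """Map uncorrelated, unit-variance samples to samples with covariance cov."""
    low = cholesky(cov)
    n = len(cov)
    return [[sum(low[i][k] * x[k] for k in range(n)) for i in range(n)]
            for x in samples]

def covariance(samples, i, j):
    """Empirical covariance between dimensions i and j."""
    n = len(samples)
    mean_i = sum(s[i] for s in samples) / n
    mean_j = sum(s[j] for s in samples) / n
    return sum((s[i] - mean_i) * (s[j] - mean_j) for s in samples) / (n - 1)

rng = random.Random(0)
target = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical target covariance
white = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(5000)]
colored = color(white, target)
```

Comparing `covariance(white, 0, 1)` with `covariance(colored, 0, 1)` shows the effect: the first should be close to zero and the second close to the target off-diagonal entry.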
Data from: Variable Message Signal annotated images for object detection
zenodo.org
portalcientifico.universidadeuropea.com
zip
Updated Oct 2, 2022
Cite
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5904211
Dataset updated
Oct 2, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.

The folder structure of the dataset is as follows:

vms_dataset/
    data.csv
    real_images/
        imgs/
        annotations/
    data-augmentation/
        imgs/
        annotations/

In which:

data.csv: Each row contains the following information separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text.

real_images: Images extracted directly from the videos.

data-augmentation: Images created using data-augmentation

imgs: Image files in .jpg format.

annotations: Annotation files in .xml format.
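
As a rough sketch of how the data.csv rows described above might be read, the snippet below parses the ten listed columns with Python's standard csv module. Several things are assumptions rather than facts from the dataset: whether the file has a header row, that negative images carry empty box coordinates, and the sample row itself, whose values are invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional, Tuple

# Column order as listed in the dataset description.
COLUMNS = ["image_name", "x_min", "y_min", "x_max", "y_max",
           "class_name", "lat", "long", "folder", "text"]

@dataclass
class VmsAnnotation:
    image_name: str
    box: Optional[Tuple[int, int, int, int]]  # (x_min, y_min, x_max, y_max) in pixels
    class_name: str
    lat: float
    long: float
    folder: str
    text: str

def read_annotations(handle, has_header=False):
    """Parse data.csv rows into VmsAnnotation records.

    Assumption: the description does not say whether the file starts with a
    header row, so the caller decides via has_header.
    """
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    if has_header:
        next(reader, None)  # skip the header row
    records = []
    for row in reader:
        # Assumption: rows without a sign leave the coordinates empty.
        box = None
        if row["x_min"]:
            box = tuple(int(float(row[k]))
                        for k in ("x_min", "y_min", "x_max", "y_max"))
        records.append(VmsAnnotation(row["image_name"], box, row["class_name"],
                                     float(row["lat"]), float(row["long"]),
                                     row["folder"], row["text"]))
    return records

# Hypothetical sample row; every value here is invented.
SAMPLE = 'img_0001.jpg,100,50,300,120,vms,40.4,-3.7,real_images,"OBRAS, KM 12"\n'
sample_records = read_annotations(io.StringIO(SAMPLE))
```

Using the csv module rather than a plain split on commas matters for the text column, since sign messages may themselves contain commas when quoted.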
Data Augmentation Tools Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 23, 2025
Cite
Growth Market Reports (2025). Data Augmentation Tools Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-augmentation-tools-market
Available download formats: csv, pdf, pptx
Dataset updated
Aug 23, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Augmentation Tools Market Outlook

As per our latest research, the global Data Augmentation Tools market size reached USD 1.47 billion in 2024, reflecting the rapidly increasing adoption of artificial intelligence and machine learning across diverse sectors. The market is experiencing robust momentum, registering a CAGR of 25.3% from 2025 to 2033. By the end of 2033, the Data Augmentation Tools market is forecasted to reach a substantial value of USD 11.6 billion. This impressive growth is primarily driven by the escalating need for high-quality, diverse datasets to train advanced AI models, coupled with the proliferation of digital transformation initiatives across industries.

The primary growth factor fueling the Data Augmentation Tools market is the exponential rise in AI and machine learning applications, which require vast amounts of labeled data for effective training. As organizations strive to develop more accurate and robust models, the demand for data augmentation solutions that can synthetically expand and diversify datasets has surged. This trend is particularly pronounced in sectors such as healthcare, automotive, and retail, where the quality and quantity of data directly impact the performance and reliability of AI systems. The market is further propelled by the increasing complexity of data types, including images, text, audio, and video, necessitating sophisticated augmentation tools capable of handling multimodal data.

Another significant driver is the growing focus on reducing model bias and improving generalization capabilities. Data augmentation tools enable organizations to generate synthetic samples that account for various real-world scenarios, thereby minimizing overfitting and enhancing the robustness of AI models. This capability is critical in regulated industries like BFSI and healthcare, where the consequences of biased or inaccurate models can be severe. Furthermore, the rise of edge computing and IoT devices has expanded the scope of data augmentation, as organizations seek to deploy AI solutions in resource-constrained environments that require optimized and diverse training datasets.

The proliferation of cloud-based solutions has also played a pivotal role in shaping the trajectory of the Data Augmentation Tools market. Cloud deployment offers scalability, flexibility, and cost-effectiveness, allowing organizations of all sizes to access advanced augmentation capabilities without significant infrastructure investments. Additionally, the integration of data augmentation tools with popular machine learning frameworks and platforms has streamlined adoption, enabling seamless workflow integration and accelerating time-to-market for AI-driven products and services. These factors collectively contribute to the sustained growth and dynamism of the global Data Augmentation Tools market.

From a regional perspective, North America currently dominates the Data Augmentation Tools market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading technology companies, robust investment in AI research, and early adoption of digital transformation initiatives have established North America as a key hub for data augmentation innovation. Meanwhile, Asia Pacific is poised for the fastest growth over the forecast period, driven by the rapid expansion of the IT and telecommunications sector, burgeoning e-commerce industry, and increasing government initiatives to promote AI adoption. Europe also maintains a significant market presence, supported by stringent data privacy regulations and a strong focus on ethical AI development.

Component Analysis

The Component segment of the Data Augmentation Tools market is bifurcated into Software and Services, each playing a critical role in enabling organizations to leverage data augmentation for AI and machine learning initiatives. The software sub-segment comprises
Data archive for paper "Copula-based synthetic data augmentation for...
zenodo.org
zip
Updated Mar 15, 2022
Cite
David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5150327
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5150327
Dataset updated
Mar 15, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
David Meyer
Description
Overview

This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

For the Python tool used to generate synthetic data, please refer to Synthia.

Requirements

Singularity >= 3

Portable Batch System (PBS) job scheduler*

A present-day high-performance computer (e.g. ~32 CPUs @ 2500 MHz with 64 GB of RAM)

*Although PBS is not a strict requirement, it is required to run all helper scripts as included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify PBS parameters at the top of submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

Usage

To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

qsub hpc/fit.sh

then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

qsub hpc/stats.sh
qsub hpc/ml_control.sh
qsub hpc/ml_synth.sh

Finally, to plot all artifacts included in the paper use:

qsub hpc/plot.sh

Licence

Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.
Result of 10-Fold cross-validation on augmented dataset.
plos.figshare.com
xls
Updated Jun 14, 2023
Cite
Sidratul Montaha; Sami Azam; A. K. M. Rakibul Haque Rafid; Sayma Islam; Pronab Ghosh; Mirjam Jonkman (2023). Result of 10-Fold cross-validation on augmented dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0269826.t018
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0269826.t018
Dataset updated
Jun 14, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Sidratul Montaha; Sami Azam; A. K. M. Rakibul Haque Rafid; Sayma Islam; Pronab Ghosh; Mirjam Jonkman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Result of 10-Fold cross-validation on augmented dataset.
Synthetic Data Platform Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 9, 2025
Cite
Data Insights Market (2025). Synthetic Data Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/synthetic-data-platform-1939818
Available download formats: doc, pdf, ppt
Dataset updated
Jun 9, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Synthetic Data Platform market is experiencing robust growth, driven by the increasing need for data privacy, escalating data security concerns, and the rising demand for high-quality training data for AI and machine learning models. The market's expansion is fueled by several key factors: the growing adoption of AI across various industries, the limitations of real-world data availability due to privacy regulations like GDPR and CCPA, and the cost-effectiveness and efficiency of synthetic data generation. We project a market size of approximately $2 billion in 2025, with a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033). This rapid expansion is expected to continue, reaching an estimated market value of over $10 billion by 2033. The market is segmented based on deployment models (cloud, on-premise), data types (image, text, tabular), and industry verticals (healthcare, finance, automotive). Major players are actively investing in research and development, fostering innovation in synthetic data generation techniques and expanding their product offerings to cater to diverse industry needs. Competition is intense, with companies like AI.Reverie, Deep Vision Data, and Synthesis AI leading the charge with innovative solutions. However, several challenges remain, including ensuring the quality and fidelity of synthetic data, addressing the ethical concerns surrounding its use, and the need for standardization across platforms. Despite these challenges, the market is poised for significant growth, driven by the ever-increasing need for large, high-quality datasets to fuel advancements in artificial intelligence and machine learning. The strategic partnerships and acquisitions in the market further accelerate the innovation and adoption of synthetic data platforms. 
The ability to generate synthetic data tailored to specific business problems, combined with the increasing awareness of data privacy issues, is firmly establishing synthetic data as a key component of the future of data management and AI development.
Data Augmentation Tools Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). Data Augmentation Tools Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-augmentation-tools-market
Available download formats: pdf, pptx, csv
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Augmentation Tools Market Outlook

According to our latest research, the global Data Augmentation Tools market size reached USD 1.62 billion in 2024, with a robust year-on-year growth trajectory. The market is poised for accelerated expansion, projected to achieve a CAGR of 26.4% from 2025 to 2033. By the end of 2033, the market is forecasted to reach approximately USD 12.34 billion. This dynamic growth is primarily driven by the rising demand for artificial intelligence (AI) and machine learning (ML) applications across diverse industry verticals, which necessitate vast quantities of high-quality training data. The proliferation of data-centric AI models and the increasing complexity of real-world datasets are compelling enterprises to invest in advanced data augmentation tools to enhance data diversity and model robustness, as per the latest research insights.

One of the principal growth factors fueling the Data Augmentation Tools market is the intensifying adoption of AI-driven solutions across industries such as healthcare, automotive, retail, and finance. Organizations are increasingly leveraging data augmentation to overcome the challenges posed by limited or imbalanced datasets, which are often a bottleneck in developing accurate and reliable AI models. By synthetically expanding training datasets through augmentation techniques, enterprises can significantly improve the generalization capabilities of their models, leading to enhanced performance and reduced risk of overfitting. Furthermore, the surge in computer vision, natural language processing, and speech recognition applications is creating a fertile environment for the adoption of specialized augmentation tools tailored to image, text, and audio data.

Another significant factor contributing to market growth is the rapid evolution of augmentation technologies themselves. Innovations such as Generative Adversarial Networks (GANs), automated data labeling, and domain-specific augmentation pipelines are making it easier for organizations to deploy and scale data augmentation strategies. These advancements are not only reducing the manual effort and expertise required but also enabling the generation of highly realistic synthetic data that closely mimics real-world scenarios. As a result, businesses across sectors are able to accelerate their AI/ML development cycles, reduce costs associated with data collection and labeling, and maintain compliance with stringent data privacy regulations by minimizing the need to use sensitive real-world data.

The growing integration of data augmentation tools within cloud-based AI development platforms is also acting as a major catalyst for market expansion. Cloud deployment offers unparalleled scalability, accessibility, and collaboration capabilities, allowing organizations of all sizes to harness the power of data augmentation without significant upfront infrastructure investments. This democratization of advanced data engineering tools is especially beneficial for small and medium enterprises (SMEs) and academic research institutes, which often face resource constraints. The proliferation of cloud-native augmentation solutions is further supported by strategic partnerships between technology vendors and cloud service providers, driving broader market penetration and innovation.

From a regional perspective, North America continues to dominate the Data Augmentation Tools market, driven by the presence of leading AI technology companies, a mature digital infrastructure, and substantial investments in research and development. However, the Asia Pacific region is emerging as the fastest-growing market, fueled by rapid digital transformation initiatives, a burgeoning startup ecosystem, and increasing government support for AI innovation. Europe also holds a significant share, underpinned by strong regulatory frameworks and a focus on ethical AI development. Meanwhile, Latin America and the Middle East & Africa are witnessing steady adoption, particularly in sectors such as BFSI and healthcare, where data-driven insights are becoming increasingly critical.

Component Analysis

The Data Augmentation Tools market by component is bifurcated into Software and Services. The software segment currently accounts for the largest share of the market, owing to the widespread deployment of standalone and integrated augmentation solutions across enterprises and research institutions. These software plat
Datasets GO ID/attribute p-value q-value.
figshare.com
xls
Updated Jul 22, 2024
Cite
Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu (2024). Datasets GO ID/attribute p-value q-value. [Dataset]. http://doi.org/10.1371/journal.pone.0305857.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0305857.t004
Dataset updated
Jul 22, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Traditional models for identifying differentially expressed genes (DEGs) are limited on small-sample datasets because they require distributional assumptions to hold; otherwise, sample variation leads to high false positive and false negative rates. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution type or sample variation. However, applying DL to RNA-Seq data remains a challenge due to the lack of proper labeling and the small sample size relative to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose a model, TabDEG, to predict DEGs and their up-regulation/down-regulation directions from gene expression data obtained from the Cancer Genome Atlas database. Compared with five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional, small-sample datasets, and validate that TabDEG-predicted DEGs map to important gene ontology terms and pathways associated with cancer.
Data from: Tied-Augment: Controlling Representation Similarity Improves Data...
resodate.org
service.tib.eu
Updated Dec 3, 2024
Cite
Emirhan Kurtulus; Zichao Li; Yann Dauphin; Ekin D. Cubuk (2024). Tied-Augment: Controlling Representation Similarity Improves Data Augmentation [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdGllZC1hdWdtZW50LS1jb250cm9sbGluZy1yZXByZXNlbnRhdGlvbi1zaW1pbGFyaXR5LWltcHJvdmVzLWRhdGEtYXVnbWVudGF0aW9u
Dataset updated
Dec 3, 2024
Dataset provided by
Leibniz Data Manager
Authors
Emirhan Kurtulus; Zichao Li; Yann Dauphin; Ekin D. Cubuk
Description
Data augmentation methods have played an important role in the recent advance of deep learning models, and have become an indispensable component of state-of-the-art models in semi-supervised, self-supervised, and supervised training for vision.
Sample Selection for Data Augmentation in Natural Language Processing -...
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Sample Selection for Data Augmentation in Natural Language Processing - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/sample-selection-for-data-augmentation-in-natural-language-processing
Dataset updated
Dec 16, 2024
Description
Deep learning-based text classification models need abundant labeled data to obtain competitive performance. To address this, several studies use data augmentation to expand the corpus size.
Table1_Enhancing biomechanical machine learning with limited data:...
frontiersin.figshare.com
pdf
Updated Feb 14, 2024
Cite
Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fbioe.2024.1350135.s001
Dataset updated
Feb 14, 2024
Dataset provided by
Frontiers
Authors
Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
Variable Misuse tool: Dataset for data augmentation (4)
zenodo.org
zip
Updated Mar 8, 2022
+ more versions
Cite
Cristian Robledo; Francesca Sallicati; Javier Gutiérrez (2022). Variable Misuse tool: Dataset for data augmentation (4) [Dataset]. http://doi.org/10.5281/zenodo.6090379
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.6090379
Dataset updated
Mar 8, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Cristian Robledo; Francesca Sallicati; Javier Gutiérrez
Description
Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.
SVD-Generated Video Dataset
kaggle.com
zip
Updated May 11, 2025
Cite
Afnan Algharbi (2025). SVD-Generated Video Dataset [Dataset]. https://www.kaggle.com/datasets/afnanalgarby/svd-generated-video-dataset
Available download formats: zip (102546508 bytes)
Dataset updated
May 11, 2025
Authors
Afnan Algharbi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains synthetic video samples generated from a 10-class subset of Tiny ImageNet using Stable Video Diffusion (SVD). It is designed to evaluate the impact of generative temporal augmentation on image classification performance.

Each training and validation video corresponds to a single image augmented into a sequence of frames.

Videos are stored in .mp4 format and labeled via train.csv and val.csv.

Sources:

Tiny ImageNet: Stanford CS231n

SVD model: Stable Video Diffusion

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
audiomentations
kaggle.com
zip
Updated Jan 10, 2023
HyeongChan Kim (2023). audiomentations [Dataset]. https://www.kaggle.com/kozistr/audiomentations
Available download formats: zip (35911 bytes)
Dataset updated
Jan 10, 2023
Authors
HyeongChan Kim
Description
Audiomentations

A Python library for audio data augmentation. Inspired by albumentations. Useful for machine learning.

Official repository: https://github.com/iver56/audiomentations
Data from: Prediction of blood-brain barrier penetrating peptides based on...
figshare.com
datasetcatalog.nlm.nih.gov
application/x-rar
Updated Apr 5, 2024
Zhifeng Gu; Yuduo Hao; Tianyu Wang; Peiling Cai; Yang Zhang; Kejun Deng; Hao Lin; Hao Lv (2024). Prediction of blood-brain barrier penetrating peptides based on data augmentation with Augur [Dataset]. http://doi.org/10.6084/m9.figshare.25466461.v4
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.25466461.v4
Dataset updated
Apr 5, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Zhifeng Gu; Yuduo Hao; Tianyu Wang; Peiling Cai; Yang Zhang; Kejun Deng; Hao Lin; Hao Lv
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The blood-brain barrier serves as a critical interface between the bloodstream and brain tissue, mainly composed of pericytes, neurons, endothelial cells, and tightly connected basal membranes. It plays a pivotal role in safeguarding the brain from harmful substances, thus protecting the integrity of the nervous system and preserving overall brain homeostasis. However, this remarkable selective transmission also poses a formidable challenge in the treatment of central nervous system diseases, hindering the delivery of large-molecule drugs into the brain. In response to this challenge, many researchers have devoted themselves to developing drug delivery systems capable of breaching the blood-brain barrier. Among these, blood-brain barrier penetrating peptides (B3PPs) have emerged as promising candidates. These peptides have the advantages of high biosafety, ease of synthesis, and exceptional penetration efficiency, making them an effective drug delivery solution. While previous studies have developed a few prediction models for B3PPs, their performance has often been hampered by the issue of limited positive data.

In this study, we present Augur, a novel prediction model using borderline-SMOTE-based data augmentation and machine learning. We extract highly interpretable physicochemical properties of blood-brain barrier penetrating peptides while solving the issues of small sample size and imbalance of positive and negative samples. Experimental results demonstrate the superior prediction performance of Augur with an AUC value of 0.932 on the training set and 0.931 on the independent test set.

This newly developed Augur model demonstrates superior performance in predicting blood-brain barrier penetrating peptides, offering valuable insights for drug development targeting neurological disorders. This breakthrough may enhance the efficiency of peptide-based drug discovery and pave the way for innovative treatment strategies for central nervous system diseases.
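To illustrate the borderline-SMOTE idea named in this description, here is a minimal sketch in plain Python. It is not Augur's implementation, which the description does not detail. The function names, the toy 2-D points and the neighbour count `k` are assumptions made for this example. Minority samples whose nearest neighbours are mostly, but not entirely, majority-class are treated as borderline, and synthetic points are interpolated between them and nearby minority samples.

```python
import math
import random


def nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[index], points[i]))
    return others[:k]


def find_borderline(minority, majority, k=3):
    """Minority indices with at least half, but not all, majority neighbours."""
    points = minority + majority          # minority occupies the first indices
    borderline = []
    for i in range(len(minority)):
        n_major = sum(1 for j in nearest(i, points, k) if j >= len(minority))
        if k / 2 <= n_major < k:          # n_major == k is treated as noise
            borderline.append(i)
    return borderline


def borderline_smote(minority, majority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points near the class boundary."""
    rng = random.Random(seed)
    borderline = find_borderline(minority, majority, k)
    synthetic = []
    if not borderline or len(minority) < 2:
        return synthetic                  # nothing to interpolate from
    for _ in range(n_new):
        base = rng.choice(borderline)
        partner = rng.choice(nearest(base, minority, k))
        gap = rng.random()                # position along the connecting segment
        synthetic.append(tuple(
            b + gap * (p - b)
            for b, p in zip(minority[base], minority[partner])
        ))
    return synthetic


# Toy data (hypothetical): two minority points sit next to the majority cluster.
minority = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
majority = [(1.2, 1.1), (1.3, 1.0), (1.0, 1.3), (3.0, 3.0)]
new_points = borderline_smote(minority, majority, n_new=5)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. In practice the same idea would be applied to the peptide feature vectors rather than to 2-D points.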
BURN 2: A Multiclass RGB Dataset for Prescribed Burn Analysis Using...
ieee-dataport.org
Updated Oct 14, 2025
Joon Tai Kim (2025). BURN 2: A Multiclass RGB Dataset for Prescribed Burn Analysis Using Context-Aware Data Augmentation [Dataset]. https://ieee-dataport.org/documents/burn-2-multiclass-rgb-dataset-prescribed-burn-analysis-using-context-aware-data
Dataset updated
Oct 14, 2025
Authors
Joon Tai Kim
Description
BURN 2 contains synthetic wildland fire images produced with a context-aware data augmentation approach designed to yield realistic wildfire imagery.
Supplementary file 1_Data augmented lung cancer prediction framework using...
frontiersin.figshare.com
docx
Updated Feb 25, 2025
Yifan Jiang; Venkata S. K. Manem (2025). Supplementary file 1_Data augmented lung cancer prediction framework using the nested case control NLST cohort.docx [Dataset]. http://doi.org/10.3389/fonc.2025.1492758.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fonc.2025.1492758.s001
Dataset updated
Feb 25, 2025
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Yifan Jiang; Venkata S. K. Manem
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Purpose: In the context of lung cancer screening, the scarcity of well-labeled medical images poses a significant challenge to implementing supervised learning-based deep learning methods. While data augmentation is an effective technique for countering the difficulties caused by insufficient data, it has not been fully explored in the context of lung cancer screening. In this research study, we analyzed the state-of-the-art (SOTA) data augmentation techniques for lung cancer binary prediction.

Methods: To comprehensively evaluate the efficiency of data augmentation approaches, we considered the nested case control National Lung Screening Trial (NLST) cohort comprising 253 individuals who had the commonly used CT scans without contrast. The CT scans were pre-processed into three-dimensional volumes based on the lung nodule annotations. Subsequently, we evaluated five basic (online) and two generative model-based offline data augmentation methods with ten SOTA 3D deep learning-based lung cancer prediction models.

Results: Our results demonstrated that the performance improvement by data augmentation was highly dependent on the approach used. The Cutmix method resulted in the highest average performance improvement across all three metrics: 1.07%, 3.29%, 1.19% for accuracy, F1 score and AUC, respectively. MobileNetV2 with a simple data augmentation approach achieved the best AUC of 0.8719 among all lung cancer predictors, demonstrating a 7.62% improvement compared to baseline. Furthermore, the MED-DDPM data augmentation approach was able to improve prediction performance by rebalancing the training set and adding moderately synthetic data.

Conclusions: The effectiveness of online and offline data augmentation methods was highly sensitive to the prediction model, highlighting the importance of carefully selecting the optimal data augmentation method. Our findings suggest that certain traditional methods can provide more stable and higher performance compared to SOTA online data augmentation approaches. Overall, these results offer meaningful insights for the development and clinical integration of data augmented deep learning tools for lung cancer screening.
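For readers unfamiliar with the Cutmix method mentioned in the results, the following is a minimal sketch of the general technique on a toy 2-D image. The study itself worked with 3-D CT volumes and its code is not shown here, so the function name, the 8x8 toy images and the two-class one-hot labels are all assumptions made for illustration. A rectangular box from one image is pasted into another, and the labels are mixed in proportion to the pasted area.

```python
import random


def cutmix(image_a, label_a, image_b, label_b, lam, rng):
    """Paste a random box from image_b into a copy of image_a and mix the labels."""
    height, width = len(image_a), len(image_a[0])
    cut_ratio = (1.0 - lam) ** 0.5        # box side as a fraction of the image side
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.randrange(height), rng.randrange(width)   # box centre
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)

    mixed = [row[:] for row in image_a]
    for y in range(top, bottom):
        mixed[y][left:right] = image_b[y][left:right]

    # Recompute the mixing weight from the box that actually fit inside the image.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, mixed_label


rng = random.Random(0)
image_a = [[0] * 8 for _ in range(8)]     # toy image of class 0
image_b = [[1] * 8 for _ in range(8)]     # toy image of class 1
mixed, mixed_label = cutmix(
    image_a, [1.0, 0.0], image_b, [0.0, 1.0],
    lam=rng.betavariate(1.0, 1.0), rng=rng,
)
```

Extending the sketch to volumes would mean drawing a box along a third axis as well. The label weight would then be based on the pasted volume fraction instead of the area fraction.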
ECG Augmented Dataset
kaggle.com
zip
Updated Oct 7, 2025
sidali Khelil cherfi (2025). ECG Augmented Dataset [Dataset]. https://www.kaggle.com/datasets/sidalikhelilcherfi/ecg-augmented
Available download formats: zip (5174909523 bytes)
Dataset updated
Oct 7, 2025
Authors
sidali Khelil cherfi
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
🩺 Dataset Description

This dataset is an augmented version of an ECG image dataset created to balance and enrich the original classes for deep learning–based cardiovascular disease classification.

The original training set had unbalanced image counts per class:
- ABH: 233 images
- MI: 239 images
- HMI: 172 images
- NORM: 284 images

To improve class balance and model generalization, each class in the training set was expanded to 500 images using a combination of morphological, noise-based, and geometric data augmentation techniques. Additionally, the test set includes 112 images per class.

⚖️ Final Dataset Composition

Training set: 4 classes × 500 images each → 2,000 images total

Test set: 4 classes × 112 images each → 448 images total
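As a quick check of the figures above, the snippet below works out how many augmented images each class would need to reach 500. It assumes that all original training images are retained and only the remainder is generated, which the description implies but does not state outright.

```python
# Original per-class training counts, as listed in the description.
original_counts = {"ABH": 233, "MI": 239, "HMI": 172, "NORM": 284}
target_per_class = 500

# Assumption: every original image is kept, so the gap is filled by augmentation.
augmented_needed = {
    name: target_per_class - count for name, count in original_counts.items()
}
total_original = sum(original_counts.values())
total_augmented = sum(augmented_needed.values())

for name, extra in augmented_needed.items():
    print(f"{name}: {original_counts[name]} original + {extra} augmented")
print(f"total: {total_original} + {total_augmented} = {total_original + total_augmented}")
```

Under that assumption, slightly more than half of the final training set (1,072 of 2,000 images) would be augmented, with HMI needing the most additions.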

🔬 Data Augmentation Techniques

1. Morphological Alterations
- Erosion
- Dilation
- None (original preserved)

2. Noise Introduction
- augment_noise_black_rain: simulates black streaks
- augment_noise_pixel_dropout_black: random black pixel dropout
- augment_noise_white_rain: simulates white streaks
- augment_noise_pixel_dropout_white: random white pixel dropout

3. Geometric Transformations
- Shift: small translations in all directions
- Scale: random zoom-in/zoom-out between 0.9× and 1.1×
- Rotate: small random rotation between -5° and +5°

These transformations were applied with balanced proportions to ensure diversity and realism while preserving diagnostic features of ECG signals.
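The three families of transformations can be sketched in plain Python on a small grey-scale image stored as nested lists. This is an illustration only and not the dataset author's code. The function names (`morph`, `pixel_dropout`, `shift`, `augment`), the dropout rate and the way one option per family is chosen are assumptions, and the rain-streak, scale and rotate transforms are omitted to keep the example short.

```python
import random


def morph(image, op):
    """3x3 erosion (op=min) or dilation (op=max) on a grey-scale image."""
    h, w = len(image), len(image[0])
    return [[op(image[ny][nx]
                for ny in range(max(y - 1, 0), min(y + 2, h))
                for nx in range(max(x - 1, 0), min(x + 2, w)))
             for x in range(w)]
            for y in range(h)]


def pixel_dropout(image, rate, value, rng):
    """Overwrite a random fraction of pixels with value (0 = black, 255 = white)."""
    return [[value if rng.random() < rate else p for p in row] for row in image]


def shift(image, dy, dx, fill=255):
    """Translate the image by (dy, dx) pixels, filling uncovered pixels."""
    h, w = len(image), len(image[0])
    return [[image[y - dy][x - dx]
             if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)]
            for y in range(h)]


def augment(image, rng):
    """Apply one randomly chosen option from each family (assumed policy)."""
    morph_op = rng.choice([min, max, None])     # erosion, dilation, or none
    if morph_op is not None:
        image = morph(image, morph_op)
    image = pixel_dropout(image, 0.02, rng.choice([0, 255]), rng)
    return shift(image, rng.randint(-1, 1), rng.randint(-1, 1))


# Toy 5x5 image: white background with a single black pixel in the centre.
image = [[255] * 5 for _ in range(5)]
image[2][2] = 0
eroded = morph(image, min)      # the dark pixel grows into a 3x3 block
dilated = morph(image, max)     # the dark pixel disappears
shifted = shift(image, 1, 0)    # the dark pixel moves one row down
augmented = augment(image, random.Random(0))
```

On a dark trace over a light background, the minimum filter thickens the trace and the maximum filter thins it, so kernel size matters if the diagnostic shape of the signal is to be preserved.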

💡 Intended Use

This dataset is designed for:
- Training and evaluating deep learning models (CNNs, ViTs) for ECG image classification
- Research in medical image augmentation, imbalanced data learning, and cardiovascular disease prediction

📘 License

This dataset is released under the CC0 1.0 License, allowing free use and distribution for research and educational purposes.
EDA augmentation parameters.
plos.figshare.com
xls
Updated Sep 26, 2024
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). EDA augmentation parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t009
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t009
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
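To give a sense of what Easy Data Augmentation (EDA) does to a sentence, here is a minimal sketch of two of its four standard operations, random swap and random deletion. The other two, synonym replacement and random insertion, need a thesaurus and are left out. The parameter names `n_swaps` and `p` and the sample sentence are assumptions for this example and are not taken from the parameter table in this record.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen word positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
words = "el servicio fue muy bueno".split()
swapped = random_swap(words, n_swaps=2, rng=rng)
shortened = random_deletion(words, p=0.2, rng=rng)
```

Both operations keep the label unchanged, which is why they suit small sentiment datasets. Aggressive settings can still destroy the cue words a classifier relies on, so the rates are usually kept low.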

Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk

Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers

Available download formats: zip

Unique identifier

https://doi.org/10.5061/dryad.2ngf1vhwk

Dataset updated

Feb 22, 2024

Dataset provided by

Osaka University
Nagoya University

Authors

Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa

License

https://spdx.org/licenses/CC0-1.0.html

Description

Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
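The random augmentation scheme described in the abstract, where one technique is picked at random for each sample during mini-batch training, can be sketched as follows. This is a simplified illustration and not the authors' code. The function names, the noise levels and the segment count are assumptions, and time-warping and rotation are omitted for brevity. A window is a list of (x, y, z) acceleration samples.

```python
import random


def identity(window, rng):
    """The 'none' option: return the window unchanged."""
    return list(window)


def scaling(window, rng, sigma=0.1):
    """Multiply the whole window by one random factor close to 1."""
    factor = rng.gauss(1.0, sigma)
    return [tuple(v * factor for v in sample) for sample in window]


def jittering(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in sample) for sample in window]


def permutation(window, rng, n_segments=4):
    """Cut the window into segments and shuffle their order."""
    size = len(window) // n_segments
    segments = [window[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(window[(n_segments - 1) * size:])   # last segment takes the rest
    rng.shuffle(segments)
    return [sample for segment in segments for sample in segment]


AUGMENTATIONS = [identity, scaling, jittering, permutation]


def random_augment(window, rng):
    """Apply one randomly chosen augmentation to a single window."""
    return rng.choice(AUGMENTATIONS)(window, rng)


rng = random.Random(0)
window = [(0.1 * i, 0.0, 1.0) for i in range(8)]        # toy 8-sample window
batch = [random_augment(window, rng) for _ in range(4)]
```

Drawing the technique per sample means a single mini-batch mixes unchanged and transformed windows, so the model sees several kinds of variation in every training step.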
