100+ datasets found
  1. Data from: Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers

    • data.niaid.nih.gov
    • search.dataone.org
    • +2 more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Osaka University
    Nagoya University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
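    A minimal sketch of the random-augmentation scheme described above (not the authors' code): each training window draws one technique at random from the listed set. Time-warping is omitted for brevity, and the window shape and noise levels are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def scaling(x, sigma=0.1):
        # Multiply each axis by a random factor close to 1.
        return x * rng.normal(1.0, sigma, size=(1, x.shape[1]))

    def jittering(x, sigma=0.05):
        # Add Gaussian noise to every sample.
        return x + rng.normal(0.0, sigma, size=x.shape)

    def permutation(x, n_segments=4):
        # Split the window into segments and shuffle their order.
        segments = np.array_split(x, n_segments)
        rng.shuffle(segments)
        return np.concatenate(segments)

    def rotation(x):
        # Apply a random 3-D rotation (QR of a random matrix) to the axes.
        q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
        return x @ q

    def augment(x):
        # Draw one technique per window, as in mini-batch training.
        ops = [None, scaling, jittering, permutation, rotation]
        op = ops[rng.integers(len(ops))]
        return x if op is None else op(x)

    window = rng.normal(size=(100, 3))   # 100 time steps x 3 acceleration axes
    augmented = augment(window)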

    This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.

  2. Data from: Variable Message Signal annotated images for object detection

    • zenodo.org
    • portalcientifico.universidadeuropea.com
    zip
    Updated Oct 2, 2022
    Cite
    Gonzalo de las Heras de Matías; Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Javier Sánchez-Soriano; Enrique Puertas; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
    Explore at:
    zip (available download formats)
    Dataset updated
    Oct 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gonzalo de las Heras de Matías; Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Javier Sánchez-Soriano; Enrique Puertas; Enrique Puertas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

    This dataset consists of Spanish road images taken from inside a vehicle, together with annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals (VMSs) within them. A CSV file is also attached with information on the geographic position, the folder where each image is located, and the sign text in Spanish. The dataset can be used to train supervised computer vision algorithms, such as convolutional neural networks. The accompanying paper details the process followed to obtain the dataset (image acquisition and labeling) and its specifications. The dataset comprises 1216 instances (888 positives and 328 negatives) in 1152 jpg images with a resolution of 1280x720 pixels, divided into 576 real images and 576 images created via data augmentation. The purpose of this dataset is to support road computer vision research, since no dataset specifically for VMSs previously existed.

    The folder structure of the dataset is as follows:

    • vms_dataset/
      • data.csv
      • real_images/
        • imgs/
        • annotations/
      • data-augmentation/
        • imgs/
        • annotations/

    In which:

    • data.csv: Each row contains the following information, separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text (see the loading sketch after this list).
    • real_images: Images extracted directly from the videos.
    • data-augmentation: Images created using data augmentation.
    • imgs: Image files in .jpg format.
    • annotations: Annotation files in .xml format.
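    For illustration, a minimal loading sketch (assuming the column order above and that data.csv has no header row; drop names= if it does):

    import os
    import pandas as pd

    COLUMNS = ["image_name", "x_min", "y_min", "x_max", "y_max",
               "class_name", "lat", "long", "folder", "text"]

    df = pd.read_csv("vms_dataset/data.csv", names=COLUMNS)

    for row in df.itertuples():
        # Resolve the image path from the folder column described above.
        image_path = os.path.join("vms_dataset", row.folder, "imgs", row.image_name)
        bbox = (row.x_min, row.y_min, row.x_max, row.y_max)
        # e.g. feed (image_path, bbox, row.class_name) into a detection pipeline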
  3. Data from: Data augmentation for disruption prediction via robust surrogate models

    • dataverse.harvard.edu
    • osti.gov
    Updated Aug 31, 2024
    Cite
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The goal of this work is to generate large, statistically representative datasets for training machine learning disruption-prediction models from the data of only a few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results with artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training dataset and to reduce the computational complexity; thus, the method can also be used when the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to that of the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
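    The coloring-transformation step (imposing correlations across independently modelled dimensions) can be sketched generically: if C = L L^T is the target covariance, then y = L x is correlated whenever the channels of x are uncorrelated with unit variance. This is a generic illustration, not the authors' implementation; the covariance matrix is invented.

    import numpy as np

    rng = np.random.default_rng(1)

    # Target cross-channel covariance (in practice estimated from data).
    C = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])

    L = np.linalg.cholesky(C)                # coloring matrix
    white = rng.standard_normal((3, 1000))   # uncorrelated channels
    colored = L @ white                      # covariance is now ~ C

    print(np.round(np.cov(colored), 2))      # close to C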

  4. Result of 10-Fold cross-validation on augmented dataset.

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    Sidratul Montaha; Sami Azam; A. K. M. Rakibul Haque Rafid; Sayma Islam; Pronab Ghosh; Mirjam Jonkman (2023). Result of 10-Fold cross-validation on augmented dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0269826.t018
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sidratul Montaha; Sami Azam; A. K. M. Rakibul Haque Rafid; Sayma Islam; Pronab Ghosh; Mirjam Jonkman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Result of 10-Fold cross-validation on augmented dataset.
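    The table itself carries no code, but one subtlety behind cross-validating an augmented dataset is worth illustrating: augmented copies must stay in the same fold as their source image, or near-duplicates leak into the test folds. A generic sketch (not the authors' protocol), keying each copy to its original with GroupKFold:

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.random.rand(120, 16)            # toy features: originals + augmented copies
    y = np.random.randint(0, 2, size=120)
    groups = np.repeat(np.arange(40), 3)   # 40 originals x 3 augmented variants each

    for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
        # All variants of an image land either all in train or all in test.
        assert set(groups[train_idx]).isdisjoint(groups[test_idx])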

  5. Data from: Augmentation of Semantic Processes for Deep Learning Applications

    • tandf.figshare.com
    txt
    Updated Jun 2, 2025
    Cite
    Maximilian Hoffmann; Lukas Malburg; Ralph Bergmann (2025). Augmentation of Semantic Processes for Deep Learning Applications [Dataset]. http://doi.org/10.6084/m9.figshare.29212617.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Maximilian Hoffmann; Lukas Malburg; Ralph Bergmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The popularity of Deep Learning (DL) methods in business process management research and practice is constantly increasing. One important factor that hinders the adoption of DL in certain areas is the availability of sufficiently large training datasets, particularly in domains where process models are mainly defined manually with a high knowledge-acquisition effort. In this paper, we examine process model augmentation in combination with semi-supervised transfer learning to enlarge existing datasets and train DL models effectively. The use case of similarity learning between manufacturing process models is discussed. Based on a literature study of existing augmentation techniques, a concept is presented with different categories of augmentation, from knowledge-light approaches to knowledge-intensive ones, e.g., those based on automated planning. Specifically, the impacts of the augmentation approaches on the syntactic and semantic correctness of the augmented process models are considered. The concept also proposes a semi-supervised transfer learning approach to integrate augmented and non-augmented process model datasets in a two-phased training procedure (sketched below). The experimental evaluation investigates augmented process model datasets regarding their quality for model training in the context of similarity learning between manufacturing process models. The results indicate a large potential, with a reduction of the prediction error of up to 53%.
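    A heavily simplified sketch of such a two-phased procedure (assuming PyTorch, with random tensors standing in for process-model embeddings; layer sizes, learning rates, and epochs are invented):

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    loss_fn = nn.MSELoss()

    def run_phase(batches, lr, epochs):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in batches:
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()

    # Toy stand-ins for (embedded model pair, similarity score) batches.
    augmented = [(torch.randn(8, 32), torch.randn(8, 1)) for _ in range(50)]
    original = [(torch.randn(8, 32), torch.randn(8, 1)) for _ in range(10)]

    run_phase(augmented + original, lr=1e-3, epochs=5)  # phase 1: all data
    run_phase(original, lr=1e-4, epochs=5)              # phase 2: fine-tune on originals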

  6. SVD-Generated Video Dataset

    • kaggle.com
    zip
    Updated May 11, 2025
    Cite
    Afnan Algharbi (2025). SVD-Generated Video Dataset [Dataset]. https://www.kaggle.com/datasets/afnanalgarby/svd-generated-video-dataset
    Explore at:
    zip (102546508 bytes; available download formats)
    Dataset updated
    May 11, 2025
    Authors
    Afnan Algharbi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains synthetic video samples generated from a 10-class subset of Tiny ImageNet using Stable Video Diffusion (SVD). It is designed to evaluate the impact of generative temporal augmentation on image classification performance.

    Each training and validation video corresponds to a single image augmented into a sequence of frames.

    Videos are stored in .mp4 format and labeled via train.csv and val.csv.
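    A small access sketch (the column names in train.csv are a guess; inspect the file first):

    import cv2
    import pandas as pd

    labels = pd.read_csv("train.csv")
    print(labels.head())                   # check the actual column names

    def read_frames(video_path):
        # Decode all frames of one .mp4 sample.
        cap = cv2.VideoCapture(video_path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        cap.release()
        return frames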

    Sources:

    Tiny ImageNet: Stanford CS231n

    SVD model: Stable Video Diffusion

    License: Creative Commons Attribution 4.0 International (CC BY 4.0)

  7. Variable Misuse tool: Dataset for data augmentation (4)

    • zenodo.org
    zip
    Updated Mar 8, 2022
    + more versions
    Cite
    Cristian Robledo; Cristian Robledo; Francesca Sallicati; Javier Gutiérrez; Francesca Sallicati; Javier Gutiérrez (2022). Variable Misuse tool: Dataset for data augmentation (4) [Dataset]. http://doi.org/10.5281/zenodo.6090379
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristian Robledo; Cristian Robledo; Francesca Sallicati; Javier Gutiérrez; Francesca Sallicati; Javier Gutiérrez
    Description

    Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.

  8. Data Augmentation Tools Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 23, 2025
    Cite
    Growth Market Reports (2025). Data Augmentation Tools Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-augmentation-tools-market
    Explore at:
    csv, pdf, pptx (available download formats)
    Dataset updated
    Aug 23, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Augmentation Tools Market Outlook



    As per our latest research, the global Data Augmentation Tools market size reached USD 1.47 billion in 2024, reflecting the rapidly increasing adoption of artificial intelligence and machine learning across diverse sectors. The market is experiencing robust momentum, registering a CAGR of 25.3% from 2025 to 2033. By the end of 2033, the Data Augmentation Tools market is forecasted to reach a substantial value of USD 11.6 billion. This impressive growth is primarily driven by the escalating need for high-quality, diverse datasets to train advanced AI models, coupled with the proliferation of digital transformation initiatives across industries.
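    As a quick arithmetic check, compounding the 2024 base at the stated CAGR over the nine years 2025-2033 lands close to the forecast figure:

    size_2024 = 1.47                               # USD billion
    cagr = 0.253
    print(round(size_2024 * (1 + cagr) ** 9, 1))   # ~11.2, near the USD 11.6B forecast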




    The primary growth factor fueling the Data Augmentation Tools market is the exponential rise in AI and machine learning applications, which require vast amounts of labeled data for effective training. As organizations strive to develop more accurate and robust models, the demand for data augmentation solutions that can synthetically expand and diversify datasets has surged. This trend is particularly pronounced in sectors such as healthcare, automotive, and retail, where the quality and quantity of data directly impact the performance and reliability of AI systems. The market is further propelled by the increasing complexity of data types, including images, text, audio, and video, necessitating sophisticated augmentation tools capable of handling multimodal data.




    Another significant driver is the growing focus on reducing model bias and improving generalization capabilities. Data augmentation tools enable organizations to generate synthetic samples that account for various real-world scenarios, thereby minimizing overfitting and enhancing the robustness of AI models. This capability is critical in regulated industries like BFSI and healthcare, where the consequences of biased or inaccurate models can be severe. Furthermore, the rise of edge computing and IoT devices has expanded the scope of data augmentation, as organizations seek to deploy AI solutions in resource-constrained environments that require optimized and diverse training datasets.




    The proliferation of cloud-based solutions has also played a pivotal role in shaping the trajectory of the Data Augmentation Tools market. Cloud deployment offers scalability, flexibility, and cost-effectiveness, allowing organizations of all sizes to access advanced augmentation capabilities without significant infrastructure investments. Additionally, the integration of data augmentation tools with popular machine learning frameworks and platforms has streamlined adoption, enabling seamless workflow integration and accelerating time-to-market for AI-driven products and services. These factors collectively contribute to the sustained growth and dynamism of the global Data Augmentation Tools market.




    From a regional perspective, North America currently dominates the Data Augmentation Tools market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading technology companies, robust investment in AI research, and early adoption of digital transformation initiatives have established North America as a key hub for data augmentation innovation. Meanwhile, Asia Pacific is poised for the fastest growth over the forecast period, driven by the rapid expansion of the IT and telecommunications sector, burgeoning e-commerce industry, and increasing government initiatives to promote AI adoption. Europe also maintains a significant market presence, supported by stringent data privacy regulations and a strong focus on ethical AI development.





    Component Analysis



    The Component segment of the Data Augmentation Tools market is bifurcated into Software and Services, each playing a critical role in enabling organizations to leverage data augmentation for AI and machine learning initiatives. The software sub-segment comprises …

  9. FER2013 Augmented Dataset

    • kaggle.com
    zip
    Updated Jan 11, 2025
    Cite
    Manvendra Singh (2025). FER2013 Augmented Dataset [Dataset]. https://www.kaggle.com/datasets/manvendrasingh09/new-fer2013
    Explore at:
    zip (300209622 bytes; available download formats)
    Dataset updated
    Jan 11, 2025
    Authors
    Manvendra Singh
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset is an augmented version of the FER2013 dataset, designed for improving emotion recognition tasks in machine learning and deep learning models. FER2013 was introduced during the ICML 2013 Workshop on Challenges in Representation Learning and contains grayscale facial expression images classified into seven emotion categories: Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise.

    Enhancements in this dataset:

    1. Preprocessing:
       - Applied histogram equalization to improve image contrast.
       - Denoising using Gaussian smoothing to reduce noise.

    2. Augmentation:
       - Augmented the dataset by applying transformations such as rotation, flipping, zooming, and shifting to enhance diversity.
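    A minimal sketch of these preprocessing and augmentation steps with OpenCV (file path, kernel size, and augmentation parameters are assumptions, not the dataset authors' values):

    import cv2

    img = cv2.imread("fer2013_sample.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
    img = cv2.equalizeHist(img)              # 1. histogram equalization
    img = cv2.GaussianBlur(img, (3, 3), 0)   # 1. Gaussian denoising

    # 2. example augmentations: horizontal flip plus a small rotation with zoom
    flipped = cv2.flip(img, 1)
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.1)
    rotated = cv2.warpAffine(img, M, (w, h))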

    Applications:
    - Facial Expression Recognition.
    - Emotion-based Human-Computer Interaction (HCI).
    - Mental health analysis through automated emotion detection.

    Source: The original FER2013 dataset was obtained from the ICML 2013 Workshop and is publicly available on Kaggle.

  10. Data from: Tied-Augment: Controlling Representation Similarity Improves Data Augmentation

    • resodate.org
    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    Emirhan Kurtulus; Zichao Li; Yann Dauphin; Ekin D. Cubuk (2024). Tied-Augment: Controlling Representation Similarity Improves Data Augmentation [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdGllZC1hdWdtZW50LS1jb250cm9sbGluZy1yZXByZXNlbnRhdGlvbi1zaW1pbGFyaXR5LWltcHJvdmVzLWRhdGEtYXVnbWVudGF0aW9u
    Explore at:
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Emirhan Kurtulus; Zichao Li; Yann Dauphin; Ekin D. Cubuk
    Description

    Data augmentation methods have played an important role in the recent advance of deep learning models, and have become an indispensable component of state-of-the-art models in semi-supervised, self-supervised, and supervised training for vision.

  11. Additional file 3 of Which data subset should be augmented for deep learning? A simulation study using urothelial cell carcinoma histopathology images

    • figshare.com
    • springernature.figshare.com
    xlsx
    Updated Jun 21, 2023
    Cite
    Yusra A. Ameen; Dalia M. Badary; Ahmad Elbadry I. Abonnoor; Khaled F. Hussain; Adel A. Sewisy (2023). Additional file 3 of Which data subset should be augmented for deep learning? a simulation study using urothelial cell carcinoma histopathology images [Dataset]. http://doi.org/10.6084/m9.figshare.22622729.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yusra A. Ameen; Dalia M. Badary; Ahmad Elbadry I. Abonnoor; Khaled F. Hussain; Adel A. Sewisy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. A Microsoft® Excel® workbook that details the raw data for the 20 experiments in which no test-set augmentation was done, including all of the image-classification output probabilities.

  12. ECG Augmented Dataset

    • kaggle.com
    zip
    Updated Oct 7, 2025
    Cite
    sidali Khelil cherfi (2025). ECG Augmented Dataset [Dataset]. https://www.kaggle.com/datasets/sidalikhelilcherfi/ecg-augmented
    Explore at:
    zip (5174909523 bytes; available download formats)
    Dataset updated
    Oct 7, 2025
    Authors
    sidali Khelil cherfi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🩺 Dataset Description

    This dataset is an augmented version of an ECG image dataset created to balance and enrich the original classes for deep learning–based cardiovascular disease classification.

    The original dataset consisted of unbalanced image counts per class in the training set:
    - ABH: 233 images
    - MI: 239 images
    - HMI: 172 images
    - NORM: 284 images

    To improve class balance and model generalization, each class in the training set was expanded to 500 images using a combination of morphological, noise-based, and geometric data augmentation techniques. Additionally, the test set includes 112 images per class.

    ⚖️ Final Dataset Composition

    • Training set: 4 classes × 500 images each → 2,000 images total
    • Test set: 4 classes × 112 images each → 448 images total

    🔬 Data Augmentation Techniques

    1. Morphological Alterations
       - Erosion
       - Dilation
       - None (original preserved)

    2. Noise Introduction
       - augment_noise_black_rain: simulates black streaks
       - augment_noise_pixel_dropout_black: random black pixel dropout
       - augment_noise_white_rain: simulates white streaks
       - augment_noise_pixel_dropout_white: random white pixel dropout

    3. Geometric Transformations
       - Shift: small translations in all directions
       - Scale: random zoom-in/zoom-out between 0.9x and 1.1x
       - Rotate: small random rotation between -5° and +5°

    These transformations were applied with balanced proportions to ensure diversity and realism while preserving diagnostic features of ECG signals.
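    For illustration, a minimal sketch of the three augmentation families with OpenCV and NumPy (not the dataset authors' code; kernel size, dropout rate, and border value are assumptions):

    import cv2
    import numpy as np

    rng = np.random.default_rng(0)
    img = cv2.imread("ecg_sample.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

    # 1. Morphological alterations
    kernel = np.ones((2, 2), np.uint8)
    eroded = cv2.erode(img, kernel)
    dilated = cv2.dilate(img, kernel)

    # 2. Noise introduction: random black pixel dropout
    dropout = img.copy()
    dropout[rng.random(img.shape) < 0.01] = 0

    # 3. Geometric transformations: rotate within ±5°, scale within 0.9x-1.1x
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2),
                                rng.uniform(-5, 5), rng.uniform(0.9, 1.1))
    transformed = cv2.warpAffine(img, M, (w, h), borderValue=255)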

    💡 Intended Use

    This dataset is designed for:
    - Training and evaluating deep learning models (CNNs, ViTs) for ECG image classification
    - Research in medical image augmentation, imbalanced data learning, and cardiovascular disease prediction

    📘 License

    This dataset is released under the CC0 1.0 License, allowing free use and distribution for research and educational purposes.

  13. Data from: Deep Graph Learning with Property Augmentation for Predicting Drug-Induced Liver Injury

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Dec 21, 2020
    Cite
    Wang, Yuhong; An, Weizhi; Sun, Hongmao; Ma, Hehuan; Huang, Junzhou; Huang, Ruili (2020). Deep Graph Learning with Property Augmentation for Predicting Drug-Induced Liver Injury [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000475471
    Explore at:
    Dataset updated
    Dec 21, 2020
    Authors
    Wang, Yuhong; An, Weizhi; Sun, Hongmao; Ma, Hehuan; Huang, Junzhou; Huang, Ruili
    Description

    Drug-induced liver injury (DILI) is a crucial factor in determining the qualification of potential drugs. However, the DILI property is excessively difficult to obtain due to the complex testing process. Consequently, an in silico screening in the early stage of drug discovery would help to reduce the total development cost by filtering those drug candidates with a high risk to cause DILI. To serve the screening goal, we apply several computational techniques to predict the DILI property, including traditional machine learning methods and graph-based deep learning techniques. While deep learning models require large amounts of training data to tune their huge numbers of parameters, the DILI dataset contains only a few hundred annotated molecules. To alleviate the data scarcity problem, we propose a property augmentation strategy to include massive training data with other property information. Extensive experiments demonstrate that our proposed method significantly outperforms all existing baselines on the DILI dataset, obtaining an 81.4% accuracy using cross-validation with random splitting, 78.7% using leave-one-out cross-validation, and 76.5% using cross-validation with scaffold splitting (sketched below).
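    For context, scaffold splitting groups molecules by their Bemis-Murcko scaffold so that structurally similar compounds never appear in both train and test. A generic sketch with RDKit (toy SMILES, not the DILI data):

    from collections import defaultdict
    from rdkit.Chem.Scaffolds import MurckoScaffold

    smiles_list = ["CCO", "c1ccccc1O", "c1ccccc1N", "CCN"]   # toy molecules

    scaffold_groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        # Acyclic molecules share the empty scaffold "".
        scaffold_groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)

    # Cross-validation folds are then built from whole scaffold groups.
    print(dict(scaffold_groups))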

  14. Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators"

    • zenodo.org
    zip
    Updated Mar 15, 2022
    Cite
    David Meyer; David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5150327
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Meyer; David Meyer
    Description

    Overview

    This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

    For the Python tool used to generate synthetic data, please refer to Synthia.
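    As background (this is not the Synthia API), the Gaussian-copula idea can be sketched with NumPy/SciPy alone: preserve each marginal empirically and capture the dependence structure through the correlation of the normal scores.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    data = rng.gamma(2.0, size=(500, 3))             # stand-in for training samples

    # 1. Map each column to normal scores via its empirical CDF.
    u = stats.rankdata(data, axis=0) / (len(data) + 1)
    z = stats.norm.ppf(u)

    # 2. Sample new normal scores with the same correlation structure.
    corr = np.corrcoef(z, rowvar=False)
    z_new = rng.multivariate_normal(np.zeros(3), corr, size=1000)

    # 3. Map back through each empirical marginal via quantiles.
    u_new = stats.norm.cdf(z_new)
    synthetic = np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(data.shape[1])]
    )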

    Requirements

    Although PBS is not a strict requirement, it is required to run all the helper scripts included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify the PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

    Usage

    To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

    qsub hpc/fit.sh

    then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

    qsub hpc/stats.sh
    qsub hpc/ml_control.sh
    qsub hpc/ml_synth.sh

    Finally, to plot all artifacts included in the paper use:

    qsub hpc/plot.sh

    Licence

    Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.

  15. Synthetic Data Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 9, 2025
    Cite
    Data Insights Market (2025). Synthetic Data Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/synthetic-data-platform-1939818
    Explore at:
    doc, pdf, ppt (available download formats)
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Synthetic Data Platform market is experiencing robust growth, driven by the increasing need for data privacy, escalating data security concerns, and the rising demand for high-quality training data for AI and machine learning models. The market's expansion is fueled by several key factors: the growing adoption of AI across various industries, the limitations of real-world data availability due to privacy regulations like GDPR and CCPA, and the cost-effectiveness and efficiency of synthetic data generation. We project a market size of approximately $2 billion in 2025, with a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033). This rapid expansion is expected to continue, reaching an estimated market value of over $10 billion by 2033. The market is segmented based on deployment models (cloud, on-premise), data types (image, text, tabular), and industry verticals (healthcare, finance, automotive). Major players are actively investing in research and development, fostering innovation in synthetic data generation techniques and expanding their product offerings to cater to diverse industry needs. Competition is intense, with companies like AI.Reverie, Deep Vision Data, and Synthesis AI leading the charge with innovative solutions. However, several challenges remain, including ensuring the quality and fidelity of synthetic data, addressing the ethical concerns surrounding its use, and the need for standardization across platforms. Despite these challenges, the market is poised for significant growth, driven by the ever-increasing need for large, high-quality datasets to fuel advancements in artificial intelligence and machine learning. The strategic partnerships and acquisitions in the market further accelerate the innovation and adoption of synthetic data platforms. The ability to generate synthetic data tailored to specific business problems, combined with the increasing awareness of data privacy issues, is firmly establishing synthetic data as a key component of the future of data management and AI development.

  16. Augmentation 1 Dataset

    • universe.roboflow.com
    zip
    Updated Jul 3, 2025
    Cite
    Deep learning lab (2025). Augmentation 1 Dataset [Dataset]. https://universe.roboflow.com/deep-learning-lab-8macl/augmentation-1/dataset/1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Deep learning lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Skin Cancer Dermatofibroma 1
    Description

    Augmentation 1

    ## Overview
    
    Augmentation 1 is a dataset for classification tasks - it contains Skin Cancer Dermatofibroma 1 annotations for 234 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Data from: New Deep Learning Methods for Medical Image Analysis and Scientific Data Generation and Compression

    • curate.nd.edu
    pdf
    Updated Nov 11, 2024
    Cite
    Pengfei Gu (2024). New Deep Learning Methods for Medical Image Analysis and Scientific Data Generation and Compression [Dataset]. http://doi.org/10.7274/26156719.v1
    Explore at:
    pdf (available download formats)
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    Pengfei Gu
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    Medical image analysis is critical to biological studies, health research, computer-aided diagnoses, and clinical applications. Recently, deep learning (DL) techniques have achieved remarkable successes in medical image analysis applications. However, these techniques typically require large amounts of annotations to achieve satisfactory performance. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for medical image analysis while reducing annotation efforts? To address this problem, we have outlined two specific aims: (A1) Utilize existing annotations effectively from advanced models; (A2) extract generic knowledge directly from unannotated images.

    To achieve the aim (A1): First, we introduce a new data representation called TopoImages, which encodes the local topology of all the image pixels. TopoImages can be complemented with the original images to improve medical image analysis tasks. Second, we propose a new augmentation method, SAMAug-C, that leverages the Segment Anything Model (SAM) to augment raw image input and enhance medical image classification. Third, we propose two advanced DL architectures, kCBAC-Net and ConvFormer, to enhance the performance of 2D and 3D medical image segmentation. We also present a gate-regularized network training (GrNT) approach to improve multi-scale fusion in medical image segmentation. To achieve the aim (A2), we propose a novel extension of known Masked Autoencoders (MAEs) for self pre-training, i.e., models pre-trained on the same target dataset, specifically for 3D medical image segmentation.

    Scientific visualization is a powerful approach for understanding and analyzing various physical or natural phenomena, such as climate change or chemical reactions. However, the cost of scientific simulations is high when factors like time, ensemble, and multivariate analyses are involved. Additionally, scientists can only afford to sparsely store the simulation outputs (e.g., scalar field data) or visual representations (e.g., streamlines) or visualization images due to limited I/O bandwidths and storage space. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for scientific data generation and compression while reducing simulation and storage costs?

    To tackle this problem: First, we propose a DL framework that generates unsteady vector field data from a set of streamlines. Based on this method, domain scientists only need to store representative streamlines at simulation time and reconstruct vector fields during post-processing. Second, we design a novel DL method that translates scalar fields to vector fields. Using this approach, domain scientists only need to store scalar field data at simulation time and generate vector fields from their scalar field counterparts afterward. Third, we present a new DL approach that compresses a large collection of visualization images generated from time-varying data for communicating volume visualization results.

  18. augmentation data for DAISM

    • data.mendeley.com
    Updated Jun 22, 2022
    Cite
    Yating Lin (2022). augmentation data for DAISM [Dataset]. http://doi.org/10.17632/ysjwjvpnh3.1
    Explore at:
    Dataset updated
    Jun 22, 2022
    Authors
    Yating Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purified dataset for data augmentation for DAISM-DNNXMBD can be downloaded from this repository.

    The pbmc8k dataset downloaded from 10X Genomics was processed and used for data augmentation to create training datasets for training DAISM-DNN models. pbmc8k.h5ad contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells), and pbmc8k_fine.h5ad contains finer-grained cell types (naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells).
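    A small inspection sketch (assuming the AnnData convention of a cell-type column in .obs; the column name here is a guess):

    import scanpy as sc

    adata = sc.read_h5ad("pbmc8k.h5ad")
    print(adata)                                    # dimensions and annotations
    print(adata.obs["cell.type"].value_counts())    # column name assumed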

    For the RNA-seq dataset, it contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells). Raw FASTQ reads were downloaded from the NCBI website, and transcript- and gene-level expression quantification was performed using Salmon (version 0.11.3) with Gencode v29, after quality control of the FASTQ reads using fastp. All tools were used with default parameters.

  19. COVID-19 Chest CT image Augmentation GAN Dataset

    • kaggle.com
    zip
    Updated Jan 31, 2021
    Cite
    Mohamed Loey (2021). COVID-19 Chest CT image Augmentation GAN Dataset [Dataset]. https://www.kaggle.com/mloey1/covid19-chest-ct-image-augmentation-gan-dataset
    Explore at:
    zip (1914822990 bytes; available download formats)
    Dataset updated
    Jan 31, 2021
    Authors
    Mohamed Loey
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Note: please do not claim diagnostic performance of a model without a clinical study! This is not a Kaggle competition dataset. Please read our paper: Loey, M., Manogaran, G. & Khalifa, N.E.M. A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images. Neural Comput & Applic (2020). https://doi.org/10.1007/s00521-020-05437-x

    Khalifa, N.E.M., Smarandache, F., Manogaran, G. et al. A Study of the Neutrosophic Set Significance on Deep Transfer Learning Models: an Experimental Case on a Limited COVID-19 Chest X-ray Dataset. Cogn Comput (2021). https://doi.org/10.1007/s12559-020-09802-9

    Abstract

    The Coronavirus disease 2019 (COVID-19) is the fastest transmittable virus caused by severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2). The detection of COVID-19 using artificial intelligence techniques and especially deep learning will help to detect this virus in early stages which will reflect in increasing the opportunities of fast recovery of patients worldwide. This will lead to release the pressure off the healthcare system around the world. In this research, classical data augmentation techniques along with Conditional Generative Adversarial Nets (CGAN) based on a deep transfer learning model for COVID-19 detection in chest CT scan images will be presented. The limited benchmark datasets for COVID-19 especially in chest CT images are the main motivation of this research. The main idea is to collect all the possible images for COVID-19 that exists until the very writing of this research and use the classical data augmentations along with CGAN to generate more images to help in the detection of the COVID-19. In this study, five different deep convolutional neural network-based models (AlexNet, VGGNet16, VGGNet19, GoogleNet, and ResNet50) have been selected for the investigation to detect the Coronavirus-infected patient using chest CT radiographs digital images. The classical data augmentations along with CGAN improve the performance of classification in all selected deep transfer models. The outcomes show that ResNet50 is the most appropriate deep learning model to detect the COVID-19 from limited chest CT dataset using the classical data augmentation with testing accuracy of 82.91%, sensitivity 77.66%, and specificity of 87.62%.

    Context

    In this dataset, we introduce DTL models to classify limited COVID-19 chest CT scan digital images. To adapt chest CT images as input to the DCNN, we enriched the medical chest CT images using classical data augmentation and CGAN to generate more CT images. After that, a classifier is used to ensemble the class (COVID/NonCOVID) outputs of the classification models. The proposed DTL models were evaluated on the COVID-19 CT scan images dataset. The novelty of this research is as follows: (1) The introduced DTL models have an end-to-end structure without classical feature extraction and selection methods. (2) We show that data augmentation and a conditional generative adversarial network (CGAN) are effective techniques to generate CT images. (3) Chest CT images are one of the best tools for the classification of COVID-19. (4) The DTL models have been shown to yield very high accuracy on the limited COVID-19 dataset.
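    A hedged sketch of a comparable DTL setup (ResNet50 backbone with a binary COVID/NonCOVID head); hyperparameters are assumptions, and this reproduces neither the paper's CGAN nor its exact training protocol:

    import tensorflow as tf

    base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                          input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False                 # freeze the pretrained features

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(1, activation="sigmoid"),   # COVID vs NonCOVID
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])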

    Content

    There are 742 CT images and 2 categories (COVID/NonCOVID).

    | Dataset               | Train COVID | Train NonCOVID | Val COVID | Val NonCOVID | Test COVID | Test NonCOVID |
    |-----------------------|-------------|----------------|-----------|--------------|------------|---------------|
    | COVID-19              | 191         | 234            | 60        | 58           | 94         | 105           |
    | COVID-19 + Aug        | 2292        | 2808           | 720       | 696          | 94         | 105           |
    | COVID-19 + CGAN       | 2191        | 2234           | 210       | 208          | 94         | 105           |
    | COVID-19 + Aug + CGAN | 4292        | 4808           | 870       | 846          | 94         | 105           |

    Acknowledgements

    Cite our papers:

    Loey, M., Manogaran, G. & Khalifa, N.E.M. A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images. Neural Comput & Applic (2020). https://doi.org/10.1007/s00521-020-05437-x

    Loey, Mohamed; Smarandache, Florentin; M. Khalifa, Nour E. 2020. "Within the Lack of Chest COVID-19 X-ray Dataset: A Novel Detection Model Based on GAN and Deep Transfer Learning" Symmetry 12, no. 4: 651. https://doi.org/10.3390/sym12040651

    Khalifa, N.E.M., Smarandache, F., Manogaran, G. et al. A Study of the Neutrosophic Set Significance on Deep Transfer Learning Models: an Experimental Case on a Limited COVID-19 Chest X-ray Dataset. Cogn Comput (2021). https://doi.org/10.1007/s12559-020-09802-9

    Inspiration

    Original Dataset: https://github.com/UCSD-AI4H/COVID-CT

    Creating the proposed database present...

  20. Dataset for: Determination of urban particulates size from occluded scattering patterns using deep learning and data augmentation

    • eprints.soton.ac.uk
    Updated May 6, 2023
    Cite
    Grant-Jacob, James; Praeger, Matthew; Loxham, Matthew; Eason, Robert; Mills, Benjamin (2023). Dataset for: Determination of urban particulates size from occluded scattering patterns using deep learning and data augmentation [Dataset]. http://doi.org/10.5258/SOTON/D1668
    Explore at:
    Dataset updated
    May 6, 2023
    Dataset provided by
    University of Southampton
    Authors
    Grant-Jacob, James; Praeger, Matthew; Loxham, Matthew; Eason, Robert; Mills, Benjamin
    Description

    This dataset supports the publication: James A. Grant-Jacob, Matthew Praeger, Matthew Loxham, Robert W. Eason and Ben Mills. Determination of size of urban particulates from occluded scattering patterns using deep learning and data augmentation. IOP Environmental Research Communications. DOI: 10.1088/2515-7620/abed94
