54 datasets found
  1. Variable Message Signal annotated images for object detection

    • zenodo.org
    • portalcientifico.universidadeuropea.com
    zip
    Updated Oct 2, 2022
    Cite
    Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

    This dataset consists of Spanish road images taken from inside a vehicle, together with annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals (VMSs) within them. A CSV file is also attached with information on the geographic position, the folder where each image is located, and the text in Spanish. The dataset can be used to train supervised computer vision algorithms, such as convolutional neural networks. The accompanying work details the process followed to obtain the dataset (image acquisition and labeling) and its specifications. The dataset comprises 1216 instances (888 positive and 328 negative) in 1152 jpg images with a resolution of 1280x720 pixels, divided into 576 real images and 576 images created by data augmentation. The purpose of this dataset is to support road computer vision research, since no dataset exists specifically for VMSs.

    The folder structure of the dataset is as follows:

    • vms_dataset/
      • data.csv
      • real_images/
        • imgs/
        • annotations/
      • data-augmentation/
        • imgs/
        • annotations/

    In which:

    • data.csv: Each row contains the following information, separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text (a loading sketch follows this list).
    • real_images: Images extracted directly from the videos.
    • data-augmentation: Images created using data augmentation.
    • imgs: Image files in .jpg format.
    • annotations: Annotation files in .xml format.
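
    A minimal loading sketch for the layout above, assuming data.csv has no header row; the annotation file name used here is illustrative:

    import pandas as pd
    import xml.etree.ElementTree as ET

    cols = ["image_name", "x_min", "y_min", "x_max", "y_max",
            "class_name", "lat", "long", "folder", "text"]
    df = pd.read_csv("vms_dataset/data.csv", names=cols)  # drop names= if the file has a header row

    def read_voc_boxes(xml_path):
        # Return (class_name, xmin, ymin, xmax, ymax) tuples from a PASCAL VOC file.
        root = ET.parse(xml_path).getroot()
        boxes = []
        for obj in root.iter("object"):
            bb = obj.find("bndbox")
            boxes.append((obj.findtext("name"),
                          int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                          int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
        return boxes

    boxes = read_voc_boxes("vms_dataset/real_images/annotations/example.xml")  # illustrative file name
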
  2. EDA augmentation parameters.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    + more versions
    Cite
    EDA augmentation parameters. [Dataset]. https://plos.figshare.com/articles/dataset/EDA_augmentation_parameters_/27112619
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
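
    As a concrete illustration of the EDA technique referenced by this record, here is a sketch of two of its four operations (random swap and random deletion). Parameter names and values are illustrative, not the paper's settings; synonym replacement and random insertion would additionally need a lexical resource for Spanish:

    import random

    def random_swap(words, n):
        # Swap two randomly chosen positions, n times.
        words = words[:]
        for _ in range(n):
            i, j = random.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
        return words

    def random_deletion(words, p):
        # Drop each word with probability p, keeping at least one word.
        kept = [w for w in words if random.random() > p]
        return kept if kept else [random.choice(words)]

    def eda_augment(sentence, alpha=0.1, n_aug=4):
        words = sentence.split()
        n = max(1, int(alpha * len(words)))
        ops = [lambda w: random_swap(w, n), lambda w: random_deletion(w, alpha)]
        return [" ".join(random.choice(ops)(words)) for _ in range(n_aug)]

    print(eda_augment("las redes sociales son una fuente crucial de datos"))
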

  3. Enhanced Cardiovascular Disease Dataset with Data Augmentation

    • ieee-dataport.org
    Updated Mar 3, 2025
    Cite
    Jose Luis Lopez-Saynes (2025). Enhanced Cardiovascular Disease Dataset with Data Augmentation [Dataset]. https://ieee-dataport.org/documents/enhanced-cardiovascular-disease-dataset-data-augmentation
    Explore at:
    Dataset updated
    Mar 3, 2025
    Authors
    Jose Luis Lopez-Saynes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    physical

  4. Data from: Equidistant and Uniform Data Augmentation for 3D Objects

    • ieee-dataport.org
    Updated Jan 5, 2022
    Cite
    Alexander Morozov (2022). Equidistant and Uniform Data Augmentation for 3D Objects [Dataset]. http://doi.org/10.21227/ya9z-zk95
    Explore at:
    Dataset updated
    Jan 5, 2022
    Dataset provided by
    IEEE Dataport
    Authors
    Alexander Morozov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data augmentation is commonly used to increase the size and diversity of datasets in machine learning, and it is of particular importance for evaluating the robustness of existing machine learning methods. With progress in geometrical and 3D machine learning, many methods exist to augment a 3D object, from generating random orientations to exploring different perspectives of an object. In high-precision applications, the machine learning model must be robust to small perturbations of the input object. Therefore, there is a need for 3D data augmentation tools that consider the distribution of distance metrics between the original and augmented objects. Here we present Eurecon, the first 3D data augmentation approach with spatial control over the augmented samples. It generates objects uniformly distributed over a sphere of user-defined radius, i.e. at a fixed distance from the original object. Eurecon is applicable to both point cloud and polygon mesh representations of 3D objects, as demonstrated on the ModelNet dataset. The method is particularly useful in assessing and improving a machine learning model's robustness to transformations of small magnitude. We demonstrate the superior performance of a point cloud-based model (PointNet++) and a mesh-based model (MeshNet) when trained on datasets augmented with Eurecon, compared to non-augmented and randomly augmented models.
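
    A simplified sketch of the distance-controlled idea, not Eurecon's actual algorithm (which additionally distributes the augmented samples uniformly over the sphere): perturb each point of a cloud along a random unit direction scaled to a user-defined radius, so the RMSD between original and augmented object equals that radius exactly:

    import numpy as np

    def perturb_at_rmsd(points, radius, rng=None):
        # Displace every point by `radius` along a random direction;
        # the per-point distance is constant, so RMSD == radius.
        rng = rng or np.random.default_rng()
        d = rng.normal(size=points.shape)
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        return points + radius * d

    cloud = np.random.rand(1024, 3)
    aug = perturb_at_rmsd(cloud, radius=0.01)
    rmsd = np.sqrt(np.mean(np.sum((aug - cloud) ** 2, axis=1)))
    print(rmsd)  # ~0.01
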

  5. Distribution of articles by augmentation method.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Distribution of articles by augmentation method. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.

  6. Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators"

    • zenodo.org
    zip
    Updated Mar 15, 2022
    Cite
    David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.4320795
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Meyer
    Description

    Overview

    This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

    For the Python tool used to generate synthetic data, please refer to Synthia.

    Requirements

    Although PBS is not a strict requirement, it is required to run the helper scripts included in this repository. Please note that, depending on your specific system settings and resource availability, you may need to modify the PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

    Usage

    To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

    qsub hpc/fit.sh

    then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

    qsub hpc/stats.sh
    qsub hpc/ml_control.sh
    qsub hpc/ml_synth.sh

    Finally, to plot all artifacts included in the paper use:

    qsub hpc/plot.sh
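
    For the synthetic-data step itself, the paper's companion Python tool is Synthia. As a generic, library-agnostic illustration of the copula idea (this is not the Synthia API): fit per-column empirical marginals plus a Gaussian correlation structure, then sample new rows:

    import numpy as np
    from scipy import stats

    def gaussian_copula_sample(X, n_samples):
        n, d = X.shape
        # Map each column to normal scores through its empirical CDF.
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
        z = stats.norm.ppf(ranks / (n + 1))
        corr = np.corrcoef(z, rowvar=False)
        # Draw correlated normals, convert to uniforms, invert the marginals.
        z_new = np.random.multivariate_normal(np.zeros(d), corr, size=n_samples)
        u_new = stats.norm.cdf(z_new)
        X_sorted = np.sort(X, axis=0)
        idx = (u_new * (n - 1)).astype(int)
        return np.take_along_axis(X_sorted, idx, axis=0)

    real = np.random.gamma(2.0, size=(500, 3))
    synthetic = gaussian_copula_sample(real, n_samples=2000)
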

    Licence

    Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.

  7. Data from: Data augmentation for disruption prediction via robust surrogate models

    • osti.gov
    • dataverse.harvard.edu
    Updated Jun 6, 2022
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2022). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
    Explore at:
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    General Atomics, San Diego, CA (United States)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

    The goal of this work is to generate large, statistically representative datasets to train machine learning models for disruption prediction, given data from only a few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training dataset and to reduce the computational complexity; thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
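
    A minimal sketch of the coloring transformation mentioned above (all values illustrative): generate each dimension independently, then impose a target cross-correlation via the Cholesky factor of the desired covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 1000, 3                        # time steps, signal dimensions
    white = rng.standard_normal((T, d))   # per-dimension, uncorrelated samples

    target_cov = np.array([[1.0, 0.6, 0.2],
                           [0.6, 1.0, 0.4],
                           [0.2, 0.4, 1.0]])
    L = np.linalg.cholesky(target_cov)
    colored = white @ L.T                 # rows now have covariance ~ target_cov

    print(np.round(np.cov(colored, rowvar=False), 2))
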

  8. Fire Detection in YOLO format

    • kaggle.com
    Updated Jun 13, 2021
    Cite
    Ankan Sharma (2021). Fire Detection in YOLO format [Dataset]. https://www.kaggle.com/datasets/ankan1998/fire-detection-in-yolo-format
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ankan Sharma
    Description

    Dataset

    This dataset was created by Ankan Sharma

    Released under GPL 2
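
    The record itself carries no further description, but since the title states the labels are in YOLO format, here is a sketch assuming the standard YOLO layout (one "class x_center y_center width height" line per box, coordinates normalized to [0, 1]); the file name is illustrative:

    def read_yolo_labels(path, img_w, img_h):
        boxes = []
        with open(path) as f:
            for line in f:
                cls, xc, yc, w, h = line.split()
                xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
                x1 = (xc - w / 2) * img_w   # convert to top-left pixel coords
                y1 = (yc - h / 2) * img_h
                boxes.append((int(cls), x1, y1, w * img_w, h * img_h))
        return boxes

    boxes = read_yolo_labels("fire_001.txt", img_w=640, img_h=480)  # illustrative file name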

  9. Synthetic Data Generation Report

    • archivemarketresearch.com
    doc, ppt
    Updated May 7, 2025
    Cite
    Archive Market Research (2025). Synthetic Data Generation Report [Dataset]. https://www.archivemarketresearch.com/reports/synthetic-data-generation-417380
    Explore at:
    Available download formats: doc, ppt
    Dataset updated
    May 7, 2025
    Dataset authored and provided by
    Archive Market Research
    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The synthetic data generation market is experiencing robust growth, driven by increasing demand for data privacy, the need for data augmentation in machine learning models, and the rising adoption of AI across various sectors. The market, valued at approximately $2 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. Firstly, stringent data privacy regulations like GDPR and CCPA are limiting the use of real-world data, making synthetic data a crucial alternative for training and testing AI models. Secondly, the demand for high-quality datasets for training advanced machine learning models is escalating, and synthetic data provides a scalable and cost-effective solution. Lastly, diverse industries, including BFSI, healthcare, and automotive, are actively adopting synthetic data to improve their AI and analytics capabilities, leading to increased market penetration.

    The market segmentation reveals strong growth across various application areas. BFSI and Healthcare & Life Sciences are currently leading the adoption, driven by the need for secure and compliant data analysis and model training. However, significant growth potential exists in sectors like Retail & E-commerce, Automotive & Transportation, and Government & Defense, as these industries increasingly recognize the benefits of synthetic data in enhancing operational efficiency, risk management, and predictive analytics. While the technology is still maturing, and challenges related to data quality and model accuracy need to be addressed, the overall market outlook remains exceptionally positive, fueled by continuous technological advancements and expanding applications. The competitive landscape is diverse, with major players like Microsoft, Google, and IBM alongside innovative startups continuously innovating in this dynamic field. Regional analysis indicates strong growth across North America and Europe, with Asia-Pacific emerging as a rapidly expanding market.

  10. Variable Misuse tool: Dataset for data augmentation (7)

    • zenodo.org
    zip
    Updated Mar 8, 2022
    + more versions
    Cite
    Cristian Robledo; Francesca Sallicati; Javier Gutiérrez (2022). Variable Misuse tool: Dataset for data augmentation (7) [Dataset]. http://doi.org/10.5281/zenodo.6090582
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristian Robledo; Francesca Sallicati; Javier Gutiérrez
    License

    GNU LGPL 3.0: https://www.gnu.org/licenses/lgpl-3.0-standalone.html

    Description

    Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.

  11. Comparison of models trained with traditional and cut-paste data augmentation when application of augmentation during training time is balanced.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Chinmayee Athalye; Rima Arnaout (2023). Comparison of models trained with traditional and cut-paste data augmentation when application of augmentation during training time is balanced. [Dataset]. http://doi.org/10.1371/journal.pone.0282532.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Chinmayee Athalye; Rima Arnaout
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of models trained with traditional and cut-paste data augmentation when application of augmentation during training time is balanced.

  12. Dataset for pest classification in Mango farms from Indonesia

    • data.mendeley.com
    Updated Feb 27, 2020
    Cite
    Kusrini Kusrini (2020). Dataset for pest classification in Mango farms from Indonesia [Dataset]. http://doi.org/10.17632/94jf97jzc8.1
    Explore at:
    Dataset updated
    Feb 27, 2020
    Authors
    Kusrini Kusrini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The infestation of pests affecting Mango cultivation in Indonesia has an economic impact on the region. Following recent developments in machine learning, applying deep-learning models to multi-class pest classification requires a large collection of image samples on which the algorithms can be trained. Addressing this requirement, the paper presents a detailed outline of the dataset collected from Mango farms in Indonesia. The data consists of images captured from Mango farms affected by 15 categories of pests, identifiable through the structural and visual deformities they cause in Mango leaves. Data collection involved low-cost sensing equipment of the kind commonly used by farmers to collect images from the farm.

    The collected data is subjected to two processes: data augmentation and training of the classification model. The collection consists of 510 images covering the 15 pest categories plus the original appearance of the Mango leaves (16 classes in total), gathered over a period of 6 months. For training the deep-learning neural network, the images are subjected to data augmentation to expand the dataset and to closely emulate the large-scale data collection process carried out by farmers. The augmentation process results in a total of 62,047 image samples, which are used to train the network for multi-class classification. The training framework presented in the paper builds on the VGG-16 feature extractor and replaces the last three layers of the network with fully connected neural network layers, resulting in 16 output classes.

    The dataset includes annotations for both the original images captured in the field and the augmented samples. Both the original and augmented data are split into training, validation and testing sets, and the overall dataset is divided into three parts: version 0, version 1 and version 2. Version 0 consists of the original data, with 310 images for training, 103 for validation and 97 for testing. Version 1 includes 46,500 image samples for training, following the application of the data augmentation process, with the 103 original images for validation and 97 for testing. Version 2 uses 47,500 images for training, 15,450 for validation and 97 for testing. All three versions provide images in JPEG format. The visual appearance of the pests captured in the dataset provides an ideal testbed for benchmarking the performance of various deep-learning networks trained to detect specific categories of pests, and the dataset also offers an opportunity to evaluate the impact of data augmentation techniques on the original data.
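
    A hedged sketch of the training framework described above: a VGG-16 feature extractor with its classifier head replaced to emit 16 classes. The record does not name the deep-learning library, so torchvision is used purely for illustration:

    import torch.nn as nn
    from torchvision import models

    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in model.features.parameters():
        p.requires_grad = False          # freeze the convolutional extractor

    model.classifier = nn.Sequential(    # new fully connected head, 16 classes
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 16),
    )
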

  13. ProxyFAUG: Proximity-based Fingerprint Augmentation (data)

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 13, 2022
    Cite
    Grigorios Anagnostopoulos; Alexandros Kalousis (2022). ProxyFAUG: Proximity-based Fingerprint Augmentation (data) [Dataset]. http://doi.org/10.5281/zenodo.4457391
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Grigorios Anagnostopoulos; Alexandros Kalousis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The supplementary data of the paper "ProxyFAUG: Proximity-based Fingerprint Augmentation".

    Open access Author’s accepted manuscript version: https://arxiv.org/abs/2102.02706v2

    Published paper: https://ieeexplore.ieee.org/document/9662590

    The train/validation/test sets used in the paper "ProxyFAUG: Proximity-based Fingerprint Augmentation", after the preprocessing described in the paper, are made available here. Moreover, the augmentations produced by the proposed ProxyFAUG method are also made available in the files x_aug_train.csv and y_aug_train.csv. More specifically:

    x_train_pre.csv : The features side (x) information of the preprocessed training set.

    x_val_pre.csv : The features side (x) information of the preprocessed validation set.

    x_test_pre.csv : The features side (x) information of the preprocessed test set.

    x_aug_train.csv : The features side (x) information of the fingerprints generated by ProxyFAUG.

    y_train.csv : The location ground truth information (y) of the training set.

    y_val.csv : The location ground truth information (y) of the validation set.

    y_test.csv : The location ground truth information (y) of the test set.

    y_aug_train.csv : The location ground truth information (y) of the fingerprints generated by ProxyFAUG.

    Note that in the paper, the original training set (x_train_pre.csv) is used as a baseline, and is compared against the scenario where the concatenation of the original and the generated training sets (concatenation of x_train_pre.csv and x_aug_train.csv) is used.
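
    A minimal pandas sketch of that comparison, using the file names listed in this record:

    import pandas as pd

    x_train = pd.read_csv("x_train_pre.csv")   # baseline features
    y_train = pd.read_csv("y_train.csv")
    x_aug = pd.read_csv("x_aug_train.csv")     # ProxyFAUG-generated fingerprints
    y_aug = pd.read_csv("y_aug_train.csv")

    # Augmented scenario: original training set concatenated with the generated one.
    x_train_full = pd.concat([x_train, x_aug], ignore_index=True)
    y_train_full = pd.concat([y_train, y_aug], ignore_index=True)
    print(len(x_train), "->", len(x_train_full), "training fingerprints")
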

    The full code implementation related to the paper is available here:

    Code: https://zenodo.org/record/4457353

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------

    The original full dataset used in this study is the public dataset sigfox_dataset_antwerp.csv, which can be accessed here:

    https://zenodo.org/record/3904158#.X4_h7y8RpQI

    The above link is related to the publication "Sigfox and LoRaWAN Datasets for Fingerprint Localization in Large Urban and Rural Areas", in which the original full dataset was published. The publication is available here:

    http://www.mdpi.com/2306-5729/3/2/13

    The credit for the creation of the original full dataset goes to Aernouts, Michiel; Berkvens, Rafael; Van Vlaenderen, Koen; and Weyn, Maarten.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------

    The train/validation/test split of the original dataset used in this paper is taken from our previous work "A Reproducible Analysis of RSSI Fingerprinting for Outdoors Localization Using Sigfox: Preprocessing and Hyperparameter Tuning". Using the same train/validation/test split in different works strengthens the consistency of result comparisons. All relevant material from that work is listed below:

    Preprint: https://arxiv.org/abs/1908.06851

    Paper: https://ieeexplore.ieee.org/document/8911792

    Code: https://zenodo.org/record/3228752

    Data: https://zenodo.org/record/3228744

  14. Synthetic Data Software Report

    • archivemarketresearch.com
    doc, ppt
    Updated May 19, 2025
    Cite
    Archive Market Research (2025). Synthetic Data Software Report [Dataset]. https://www.archivemarketresearch.com/reports/synthetic-data-software-560836
    Explore at:
    Available download formats: doc, ppt
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Synthetic Data Software market is experiencing robust growth, driven by increasing demand for data privacy regulations compliance and the need for large, high-quality datasets for AI/ML model training. The market size in 2025 is estimated at $2.5 billion, demonstrating significant expansion from its 2019 value. This growth is projected to continue at a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of $15 billion by 2033. This expansion is fueled by several key factors. Firstly, the increasing stringency of data privacy regulations, such as GDPR and CCPA, is restricting the use of real-world data in many applications. Synthetic data offers a viable solution by providing realistic yet privacy-preserving alternatives. Secondly, the booming AI and machine learning sectors heavily rely on massive datasets for training effective models. Synthetic data can generate these datasets on demand, reducing the cost and time associated with data collection and preparation. Finally, the growing adoption of synthetic data across various sectors, including healthcare, finance, and retail, further contributes to market expansion. The diverse applications and benefits are accelerating the adoption rate in a multitude of industries needing advanced analytics.

    The market segmentation reveals strong growth across cloud-based solutions and the key application segments of healthcare, finance (BFSI), and retail/e-commerce. While on-premises solutions still hold a segment of the market, the cloud-based approach's scalability and cost-effectiveness are driving its dominance. Geographically, North America currently holds the largest market share, but significant growth is anticipated in the Asia-Pacific region due to increasing digitalization and the presence of major technology hubs. The market faces certain restraints, including challenges related to data quality and the need for improved algorithms to generate truly representative synthetic data. However, ongoing innovation and investment in this field are mitigating these limitations, paving the way for sustained market growth. The competitive landscape is dynamic, with numerous established players and emerging startups contributing to the market's evolution.

  15. Examples of sentence transformation, Balakrishnan et al.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Examples of sentence transformation, Balakrishnan et al. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Examples of sentence transformation, Balakrishnan et al.

  16. Data from: Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation

    • zenodo.org
    • redu.unicamp.br
    zip
    Updated Dec 4, 2023
    Cite
    Luís Fernando Lopes Grim; André Leon Sampaio Gradvohl (2023). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. http://doi.org/10.5281/zenodo.10246577
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luís Fernando Lopes Grim; André Leon Sampaio Gradvohl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation".

    Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2-S models with Kinetics-400 weights. To simplify the implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each a sequence of 16 three-channel images resized to 224 × 224 pixels and normalized to [0, 1].
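
    A sketch of that model setup using torchvision; the head replacement and output size below are illustrative, not taken from this record. Input shape follows the text: (batch, channels, frames, height, width) = (10, 3, 16, 224, 224):

    import torch
    import torch.nn as nn
    from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

    model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
    model.head[1] = nn.Linear(model.head[1].in_features, 2)  # e.g. flare / no flare

    x = torch.rand(10, 3, 16, 224, 224)  # values already normalized to [0, 1]
    print(model(x).shape)                # torch.Size([10, 2])
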

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split: the first 50,000 samples undergo 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. We can thus evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples in chronological order (simulating unknown data).

    We develop three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only on the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we apply oversampling only on the training data, keeping the original validation set. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling to both the training and validation sets.

    We also trained a model oversampling the entire dataset. We called it the "SF_MViT_oTV Test" to verify how resampling or adopting a test set with unreal data may bias the results positively.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folders Structure

    In the Root directory of the project, we have two folders:

    • magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; the images in this dataset, however, were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar active regions (ARs). It is essential to note that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
    • Seq_Magnetogram: contains the references to the source images with the corresponding labels for the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.
      • M24/M48: both present the following sub-folders structure:
        • Seqs16;
        • SF_MViT;
        • SF_MViT_oT;
        • SF_MViT_oTV;
        • SF_MViT_oTV_Test.

    There are also two files in root:

    • inst_packages.sh: install the packages and dependencies to run the models.
    • download_MViTS.py: download the pre-trained MViTv2_S from PyTorch and store it in the cache.

    M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files.

    The Seqs16 folder holds reference text files, each containing a sequence of images pointing to the magnetogram_jpg folder.

    All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.

    Naming pattern for the files:

    • magnetogram_jpg: follows the format "hmi.sharp_720s.
    • Seqs16: follows the format "hmi.sharp_720s.<SHARP-ID>.
    • Reference text files in M24 and M48 or inside SF_MViT... folders follows the format "
    • All SF_MViT...folders:
      • Model training codes: "SF_MViT_
      • Job submission files: "jobMViT_
      • Temporary inputs: "Seq16_flare_Mclass_
      • Outputs: "saida_MViT_Adam_10-7
      • Error files: "err_MViT_Adam_10-7
      • Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=

  17. Dataset of Guava Leaf Diseases in Bangladesh

    • data.mendeley.com
    Updated Nov 13, 2024
    + more versions
    Cite
    Sumaia Akter (2024). Dataset of Guava Leaf Diseases in Bangladesh [Dataset]. http://doi.org/10.17632/2ksdzxdvbm.1
    Explore at:
    Dataset updated
    Nov 13, 2024
    Authors
    Sumaia Akter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The dataset, sourced from Vimruli Guava Garden and Floating Market in Jhalakathi, Barisal, categorizes guava leaf and fruit conditions for better crop management. It includes images of healthy and diseased samples, making it a valuable resource for researchers and practitioners working on machine learning models to identify plant diseases. The dataset includes six classes for robust model training.

    Dataset summary:
    • Location: Vimruli Guava Garden & Floating Market, Jhalakathi, Barisal.
    • Subjects: Guava leaves and fruits.
    • Purpose: Classification and detection of guava plant conditions.

    Data distribution (original / augmented / total):
    • Algal Leaves Spot: 100 / 1320 / 1420
    • Dry Leaves: 52 / 676 / 728
    • Healthy Fruit: 50 / 650 / 700
    • Healthy Leaves: 150 / 1600 / 1750
    • Insects Eaten: 164 / 1720 / 1884
    • Red Rust: 90 / 1170 / 1260

    Total samples: 606 original, 7136 augmented, 7742 overall.

    Class details:
    • Algal Leaves Spot: Fungal spots on leaves.
    • Dry Leaves: Leaves dried from environmental/nutrient factors.
    • Healthy Fruit/Leaves: Free of diseases/damage.
    • Insects Eaten: Insect-caused damage on leaves.
    • Red Rust: Reddish spots due to fungal infection.

    This dataset is well-suited for training and evaluating machine learning models to detect and classify various conditions of guava plants, aiding in automated disease identification and better agricultural management.

  18. Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?

    • data.niaid.nih.gov
    • search.dataone.org
    • +2 more
    zip
    Updated May 31, 2024
    Cite
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard (2024). How many specimens make a sufficient training set for automated three dimensional feature extraction? [Dataset]. http://doi.org/10.5061/dryad.1rn8pk12f
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2024
    Dataset provided by
    University of Southampton
    Natural History Museum
    Authors
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation methods. However, there has been limited exploration into optimal training set sizes, including assessing whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal, unsurprisingly, that deep learning models improve with a larger number of training images, with eight specimens being required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and shape measurements for the internal structure poses a greater challenge compared to the external structure, due to low contrast differences between different materials and increased geometric complexity. These results provide novel insight into optimal training set sizes for precise image segmentation of diverse traits and highlight the potential of data augmentation for enhancing multivariate feature extraction from 3D images.

    Methods

    Data collection

    50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on the analysis of morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and a greater number of chambers in the final whorl compared to M. exilis [13]. The samples containing these individuals and species spanned 5.65 million years ago (Ma) to 2.85 Ma [14] and were collected from the Ceara Rise in the Equatorial Atlantic region at Ocean Drilling Program (ODP) Site 925, which comprised Hole 925B (4°12.248'N, 43°29.349'W), Hole 925C (4°12.256'N, 43°29.349'W), and Hole 925D (4°12.260'N, 43°29.363'W). See Curry et al. [15] for more details. This group was chosen to provide inter- and intraspecific species variation, and to provide contemporary data to test how morphological distinctiveness maps to taxonomic hypotheses [16].

    The non-destructive imaging of both internal and external structures of the foraminifera was conducted at the µ-VIS X-ray Imaging Centre, University of Southampton, UK, using a Zeiss Xradia 510 Versa X-ray tomography scanner. Employing a rotational target system, the scanner operated at a voltage of 110 kV and a power of 10 W. Projections were reconstructed using Zeiss Xradia software, resulting in 16-bit greyscale .tiff stacks characterised by a voxel size of 1.75 μm and an average dimension of 992 x 1015 pixels for each 2D slice.

    Generation of training sets

    We extracted the external calcite and internal cavity spaces from the micro-CT scans of the 50 individuals using manual segmentation within Dragonfly v. 2021.3 (Object Research Systems, Canada). This step took approximately 480 minutes per specimen (24,000 minutes total) and involved the manual labelling of 11,947 2D images. Segmentation data for each specimen were exported as multi-label (3 labels: external, internal, and background) 8-bit multipage .tiff stacks and paired with the original CT image data to allow for training (see figure 2). The 50 specimens were categorised into three distinct groups (electronic supplementary material, table S1): 20 training image stacks, 10 validation image stacks, and 20 test image stacks. From the training image category, we generated six distinct training sets, varying in size from 1 to 20 specimens (see table 1). These were used to assess the impact of training set size on segmentation accuracy, as determined through a comparative analysis against the validation set (see Section 2.3).

    From the initial six training sets, we created six additional training sets through data augmentation using the NumPy library [17] in Python. This augmentation method was chosen for its simplicity and accessibility to researchers with limited computational expertise, as it can be easily implemented using a straightforward batch code. The augmentation process entailed rotating the original images five times (the maximum amount permitted using this method), effectively producing six distinct 3D orientations per specimen for each of the original training sets (see figure 3). The augmented training sets comprised between 6 and 120 .tiff stacks (see table 1).

    Training the neural networks

    CNNs were trained using the offline version of Biomedisa, which utilises a 3D U-Net architecture [18], the primary model employed for image segmentation [19], and is optimised using Keras with a TensorFlow backend. We used patches of size 64 x 64 x 64 voxels, which were then scaled to a size of 256 x 256 x 256 voxels. This scaling was performed to improve the network's ability to capture spatial features and mitigate potential information loss during training. We trained 3 networks for each of the training sets to check the extent of stochastic variation in the results [20]. To train our models in Biomedisa, we used stochastic gradient descent with a learning rate of 0.01, a decay of 1 × 10⁻⁶, momentum of 0.9, and Nesterov momentum enabled. A stride size of 32 pixels and a batch size of 24 samples per epoch were used alongside an automated cropping feature, which has been demonstrated to enhance accuracy [21]. The training of each network was performed on a Tesla V100S-PCIE-32GB graphics card with 30989 MB of available memory. All the analyses and training procedures were conducted on the High-Performance Computing (HPC) system at the Natural History Museum, London.

    To measure network accuracy, we used the Dice similarity coefficient (Dice score), a metric commonly used in biomedical image segmentation studies [22, 23]. The Dice score quantifies the level of overlap between two segmentations, providing a value between 0 (no overlap) and 1 (perfect match). We conducted experiments to evaluate the potential efficiency gains of using an early stopping mechanism within Biomedisa. After testing a variety of epoch limits, we opted for an early stopping criterion set at 25 epochs, which was found to be the lowest value at which all models trained correctly for every training set. By "trained correctly" we mean that if there is no increase in Dice score within a 25-epoch window, the optimal network is selected and training is terminated. To gauge the impact of early stopping on network accuracy, we compared the results obtained from the original six training sets under early stopping to those obtained on a full run of 200 epochs.

    Evaluation of feature extraction

    We used the median accuracy network from each of the 12 training sets to produce segmentation data for the external and internal structures of the 20 test specimens. The median accuracy was selected as it provides a more robust estimate of performance by ensuring that outliers had less impact on the overall result. We then compared the volumetric and shape measurements from the manual data to those from each training set. The volumetric measurements were total volume (comprising both external and internal volumes) and percentage calcite (calculated as the ratio of external volume to internal volume, multiplied by 100). To compare shape, mesh data for the external and internal structures was generated from the segmentation data of the 12 training sets and the manual data. Meshes were decimated to 50,000 faces and smoothed before being scaled and aligned using Python and Generalized Procrustes Surface Analysis (GPSA) [24], respectively. Shape was then analysed using the landmark-free morphometry pipeline, as outlined by Toussaint et al. [25]. We used a kernel width of 0.1 mm and a noise parameter of 1.0 for the analysis of shape for both the external and internal data, using a Keops kernel (PyKeops; https://pypi.org/project/pykeops/) as it performs better with large data [25]. The analyses were run for 150 iterations, using an initial step size of 0.01. The manually generated mesh for the individual st049_bl1_fo2 was used as the atlas for both the external and internal shape comparisons.
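
    A sketch of the two ingredients just described: five 90-degree rotations of a 3D stack to yield six orientations (the authors' exact choice of rotation axes is not specified here, so the axis pairs below are illustrative), plus the Dice score used to measure segmentation overlap:

    import numpy as np

    def six_orientations(vol):
        # The identity plus five 90-degree rotations about two axis pairs.
        return [vol,
                np.rot90(vol, k=1, axes=(0, 1)),
                np.rot90(vol, k=2, axes=(0, 1)),
                np.rot90(vol, k=3, axes=(0, 1)),
                np.rot90(vol, k=1, axes=(0, 2)),
                np.rot90(vol, k=3, axes=(0, 2))]

    def dice_score(a, b):
        # Overlap between two binary segmentations: 0 (none) to 1 (perfect).
        inter = np.logical_and(a, b).sum()
        return 2.0 * inter / (a.sum() + b.sum())

    stack = np.random.rand(64, 64, 64) > 0.5
    print(len(six_orientations(stack)), dice_score(stack, stack))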

  19. Spatio-temporal learning from MD simulations for protein-ligand binding affinity prediction

    • zenodo.org
    zip
    Updated Jun 6, 2024
    Cite
    Pierre-Yves Libouban; Camille Parisel; Maxime Song; Samia Aci-Sèche; Jose-Carlos Gómez-Tamayo; Gary Tresadern; Pascal Bonnet (2024). Spatio-temporal learning from MD simulations for protein-ligand binding affinity prediction [Dataset]. http://doi.org/10.5281/zenodo.10390550
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pierre-Yves Libouban; Camille Parisel; Maxime Song; Samia Aci-Sèche; Jose-Carlos Gómez-Tamayo; Gary Tresadern; Pascal Bonnet
    License

    https://github.com/DISIC/politique-de-contribution-open-source/blob/master/LICENSE.pdf

    Time period covered
    Jun 6, 2024
    Description

    This Zenodo repository provides comprehensive resources for the paper titled "Spatio-temporal learning from MD simulations for protein-ligand binding affinity prediction". We created a dataset of 63,000 molecular dynamics simulations by performing 10 simulations of 10 ns on 6,300 complexes. Neural networks were developed to learn from this data in order to predict the binding affinities of protein-ligand complexes; their implementations are available on GitHub. Our collection includes training/benchmark datasets, trained statistical models, and results on test sets (CSV & PDF files).

    Training/benchmark datasets:

    Training, validation and test sets are provided to train and evaluate the following neural networks:

    • Pafnucy, Proli and Densenucy without MD data augmentation (dataset file names contain "initial")
    • Pafnucy, Proli and Densenucy with MD data augmentation (dataset file names contain "MDDA")
    • Pafnucy with/without MD data augmentation, and Proli and Densenucy with MD data augmentation, were also evaluated on the FEP test set (test set file names contain "fep")
    • Timenucy and Videonucy using spatiotemporal learning methods (dataset file names contain "4D")
    • Pafnucy without MD data augmentation and on a reduced training set (dataset file names contain "reduced")

    For each training methodology (MD data augmentation and spatiotemporal learning), we provide the data for the whole complex, only the ligand or only the protein. Additionally for spatiotemporal learning, we provide the data with only the ligand using the tracking mode.

    Statistical models:

    We provide the models trained with Pafnucy, Proli, Densenucy, Timenucy and Videonucy. Each model was trained in 10 replicates.

    For Pafnucy, Proli, Densenucy, we provide the models trained with random and systematic rotations, as well as with or without MD data augmentation.

    For Proli, Densenucy, Timenucy and Videonucy, we provide the models trained on the whole complex, only the ligand or only the protein.

    For Pafnucy we also provide the models trained on the reduced set (5932 complexes).

    Results on test sets (CSV & PDF files):

    We provide the predictions on the PDBbind v.2016 core set.

    • For spatiotemporal learning methods (Timenucy and Videonucy), there are predictions for only 83 complexes, as we did not perform simulations on the whole test set.
    • For models trained with MD DA, predictions were carried out on the crystallographic structures as well as on the frames extracted from the simulations performed on the test set (augmented test).

    Results on the FEP dataset are also provided for Pafnucy, Proli and Densenucy.

    Due to the large size of the raw MD data (~4.5 TB), we are not able to share this data on Zenodo and will provide it upon request.

    This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-A0100712496 & 2022-AD011013521) and CRIANN (Grant 2021002).

  20. Data from: Using distant supervision to augment manually annotated data for relation extraction

    • datasearch.gesis.org
    • lifesciences.datastations.nl
    Updated Jan 24, 2020
    Cite
    Su, PhD P. (University of Delaware), DataCurator (2020). Using distant supervision to augment manually annotated data for relation extraction [Dataset]. http://doi.org/10.17026/dans-xvu-rvk2
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    DANS (Data Archiving and Networked Services)
    Authors
    Su, PhD P. (University of Delaware), DataCurator
    Description

    This is the data for the paper "Using distant supervision to augment manually annotated data for relation extraction"

    Significant progress has been made recently in applying deep learning to natural language processing tasks. However, deep learning models typically require a large amount of annotated training data, while often only small labeled datasets are available for many natural language processing tasks in the biomedical literature. Building large datasets for deep learning is expensive, since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained by distant supervision. However, data obtained by distant supervision is often noisy, so we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
