59 datasets found
  1. Data from: How many specimens make a sufficient training set for automated...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Jun 1, 2024
    Cite
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard (2024). How many specimens make a sufficient training set for automated three dimensional feature extraction? [Dataset]. http://doi.org/10.5061/dryad.1rn8pk12f
    Dataset updated
    Jun 1, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard
    Description

    Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation methods. However, there has been limited exploration into optimal training set sizes, including whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal, unsurprisingly, that deep learning models improve with a larger number of training images, with eight specimens required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and ...

    Data collection: 50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on the analysis of morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and a greater number of chambers in the final whorl compared to M. exilis [13]. The s...

    # Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?

    https://doi.org/10.5061/dryad.1rn8pk12f

    All computer code and final raw data used for this research work are stored in GitHub: https://github.com/JamesMulqueeney/Automated-3D-Feature-Extraction and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.11109348.

    This data is the additional primary data used in each analysis. It includes: CT Image Files, Manual Segmentation Files (used for training or analysis), Inputs and Outputs for Shape Analysis, and an example .h5 file which can be used to practice AI segmentation.

    Description of the data and file structure

    The primary data is arranged into the following:

    1. Image_Files.zip: Foraminiferal CT data used in the analysis.
    2. **I...
  2. Data from: Exploring deep learning techniques for wild animal behaviour...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Available download formats: zip
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Nagoya University
    Osaka University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms, as well as related training techniques, have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached, and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each sample during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with a shortcut connection, showed the best performance among the comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features.
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

    This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
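    The per-sample random augmentation scheme described above (choosing one of none, scaling, jittering, permutation, time-warping or rotation for each window) can be sketched roughly as follows. The function names and parameters are illustrative, not taken from the paper's code, and only three of the six operations are shown:

```python
import random
import numpy as np

def jitter(x, sigma=0.05):
    # Add Gaussian noise to every time step and axis.
    return x + np.random.normal(0.0, sigma, x.shape)

def scale(x, sigma=0.1):
    # Multiply each axis by a random factor drawn around 1.
    return x * np.random.normal(1.0, sigma, (1, x.shape[1]))

def permute(x, n_segments=4):
    # Split the window into segments and shuffle their order.
    segments = np.array_split(x, n_segments)
    random.shuffle(segments)
    return np.concatenate(segments)

def random_augment(x):
    # Pick one augmentation (or none, the identity) per window.
    ops = [lambda a: a, jitter, scale, permute]
    return random.choice(ops)(x)

window = np.random.randn(100, 3)   # 100 time steps, 3 accelerometer axes
augmented = random_augment(window)
```

Applying a single randomly chosen operation per mini-batch sample, rather than stacking all of them, keeps the augmented distribution close to the original data.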

  3. Additional file 4 of Which data subset should be augmented for deep...

    • springernature.figshare.com
    xlsx
    Updated Jun 21, 2023
    + more versions
    Cite
    Yusra A. Ameen; Dalia M. Badary; Ahmad Elbadry I. Abonnoor; Khaled F. Hussain; Adel A. Sewisy (2023). Additional file 4 of Which data subset should be augmented for deep learning? a simulation study using urothelial cell carcinoma histopathology images [Dataset]. http://doi.org/10.6084/m9.figshare.22622732.v1
    Available download formats: xlsx
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    figshare
    Authors
    Yusra A. Ameen; Dalia M. Badary; Ahmad Elbadry I. Abonnoor; Khaled F. Hussain; Adel A. Sewisy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4. A Microsoft® Excel® workbook that details the raw data for the 8 experiments in which either the test set was augmented alone (after its allocation) or augmentation of the whole dataset was done before test-set allocation. All of the image-classification output probabilities are included.

  4. Brain Tumor Paper Dataset and Code

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 8, 2023
    Cite
    Yazan Al-Smadi (2023). Brain Tumor Paper Dataset and Code [Dataset]. http://doi.org/10.5281/zenodo.7619446
    Available download formats: bin
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yazan Al-Smadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brain Tumor Detection Research Paper Code and Dataset

    Paper title: Transforming brain tumor detection: the impact of YOLO models and MRI orientations.

    Authored by: Yazan Al-Smadi, Ahmad Al-Qerem, et al. (2023)


    This project contains a full version of the brain tumor dataset used and a full version of the code for the proposed research methodology.

  5. augmentation data for DAISM

    • data.mendeley.com
    Updated Jun 22, 2022
    + more versions
    Cite
    Yating Lin (2022). augmentation data for DAISM [Dataset]. http://doi.org/10.17632/ysjwjvpnh3.1
    Dataset updated
    Jun 22, 2022
    Authors
    Yating Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purified dataset for data augmentation for DAISM-DNNXMBD can be downloaded from this repository.

    The pbmc8k dataset downloaded from 10X Genomics was processed and used for data augmentation to create training datasets for DAISM-DNN models. pbmc8k.h5ad contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells), and pbmc8k_fine.h5ad contains finer-grained cell types (naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells).

    The RNA-seq dataset contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells). Raw FASTQ reads were downloaded from the NCBI website, and transcript- and gene-level expression quantification was performed using Salmon (version 0.11.3) with Gencode v29 after quality control of the FASTQ reads using fastp. All tools were used with default parameters.

  6. Variable Misuse tool: Dataset for data augmentation (3)

    • zenodo.org
    zip
    Updated Mar 8, 2022
    + more versions
    Cite
    Cristian Robledo; Francesca Sallicati; Javier Gutiérrez (2022). Variable Misuse tool: Dataset for data augmentation (3) [Dataset]. http://doi.org/10.5281/zenodo.6090340
    Available download formats: zip
    Dataset updated
    Mar 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristian Robledo; Francesca Sallicati; Javier Gutiérrez
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description

    Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.

  7. Characterization of datasets.

    • figshare.com
    xls
    Updated Sep 26, 2024
    + more versions
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Characterization of datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t004
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
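    Two of the dictionary-free Easy Data Augmentation (EDA) operations mentioned above, random swap and random deletion, can be sketched as follows. This is a generic illustration of the technique, not the study's implementation, and the sample sentence is invented:

```python
import random

def random_swap(words, n=1):
    # Swap two randomly chosen positions, n times.
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Drop each word with probability p, keeping at least one word.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "la letra de esta canción expresa mucha tristeza".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```

The remaining EDA operations (synonym replacement and random insertion) additionally require a synonym resource such as a Spanish WordNet, which is why they are omitted here.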

  8. Data from: Image-based automated species identification: Can virtual data...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 12, 2021
    Cite
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2021). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
    Available download formats: zip
    Dataset updated
    Jul 12, 2021
    Dataset provided by
    Dryad
    Authors
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
    Time period covered
    2021
    Description

    Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning. In this study, we assessed whether a data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic approaches of data augmentation and generation of artificial images using a Generative Adversarial Network (GAN) approach. Descriptive featu...

  9. Car Highway Dataset

    • universe.roboflow.com
    zip
    Updated Sep 13, 2023
    Cite
    Sallar (2023). Car Highway Dataset [Dataset]. https://universe.roboflow.com/sallar/car-highway/dataset/1
    Available download formats: zip
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Sallar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Vehicles Bounding Boxes
    Description

    Car-Highway Data Annotation Project

    Introduction

    In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.

    Project Goals

    • Collect a diverse dataset of car images from highway scenes.
    • Annotate the dataset to identify and label cars within each image.
    • Organize and format the annotated data for machine learning model training.

    Tools and Technologies

    For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.

    Annotation Process

    1. Upload the raw car images to the Roboflow platform.
    2. Use the annotation tools in Roboflow to draw bounding boxes around each car in the images.
    3. Label each bounding box with the corresponding class (e.g., car).
    4. Review and validate the annotations for accuracy.

    Data Augmentation

    Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.

    Data Export

    Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
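    As an illustration of one of those export formats: YOLO-format labels store one object per line as `class x_center y_center width height`, all normalized to [0, 1]. A minimal parser (a hypothetical helper, not part of Roboflow) might look like:

```python
def parse_yolo_line(line, img_w, img_h):
    # Convert one normalized YOLO row into a class id and
    # pixel-space corner coordinates (x1, y1, x2, y2).
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return int(cls), (x1, y1, x2, y2)

# A car centred in a 1920x1080 frame:
cls_id, box = parse_yolo_line("0 0.5 0.5 0.2 0.4", 1920, 1080)
# box ≈ (768.0, 324.0, 1152.0, 756.0) in pixels
```

COCO and TensorFlow Record store the same boxes differently (JSON with absolute coordinates, and serialized protobufs, respectively), which is why exporters like Roboflow offer several output options.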

    Milestones

    1. Data Collection and Preprocessing
    2. Annotation of Car Images
    3. Data Augmentation
    4. Data Export
    5. Model Training

    Conclusion

    By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.

  10. MultiPatient Elderly Respiration dataset in Digital Twin Technology

    • data.mendeley.com
    Updated Dec 7, 2023
    Cite
    SAGHEER KHAN (2023). MultiPatient Elderly Respiration dataset in Digital Twin Technology [Dataset]. http://doi.org/10.17632/vm8j5dvrxy.1
    Dataset updated
    Dec 7, 2023
    Authors
    SAGHEER KHAN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The research focus for this study is to generate a larger respiration dataset for the creation of elderly respiration Digital Twin (DT) model. Initial experimental data is collected with an unobtrusive Wi-Fi sensor with Channel State Information (CSI) characteristics to collect the subject's respiration rate.

    The generation of a DT model requires extensive and diverse data. Due to limited resources and the need for extensive experimentation, the data is generated by implementing a novel statistical time series data augmentation method on single-subject respiration data. The larger synthetic respiration datasets will allow for testing signal processing methodologies for noise removal, Breaths Per Minute (BPM) estimation, and extensive Artificial Intelligence (AI) implementation.
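    The dataset's own statistical augmentation method is not detailed here; as a generic sketch of the idea, synthetic variants of a single respiration trace can be produced by amplitude scaling plus additive noise (all names and parameters below are illustrative assumptions, not the authors' method):

```python
import numpy as np

def augment_respiration(signal, n_variants=30, noise_sd=0.05, scale_sd=0.1, seed=0):
    # Generate synthetic variants of one respiration trace by
    # random amplitude scaling plus additive Gaussian noise.
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(n_variants):
        scale = rng.normal(1.0, scale_sd)
        noise = rng.normal(0.0, noise_sd, signal.shape)
        variants.append(scale * signal + noise)
    return np.stack(variants)

t = np.linspace(0, 60, 600)                 # 60 s sampled at 10 Hz
breath = np.sin(2 * np.pi * (14 / 60) * t)  # ~14 BPM sinusoidal proxy
synthetic = augment_respiration(breath)     # shape (30, 600), one row per "patient"
```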

    The sensor data covers BPM values from 12 BPM to 25 BPM for a single subject. Normal respiration rate ranges from 12 BPM to 16 BPM; beyond this is considered abnormal. A total of 14 files are present in the dataset, each labeled according to its BPM. Data for all 30 patients are present for each BPM; patients are numbered P1, P2, P3, ..., P30.

    This data can be utilized by researchers and scientists toward the development of novel signal processing methodologies in the respiration DT model. These larger respiration datasets can be utilized for Machine Learning (ML) and Deep Learning (DL) in providing predictive analysis and classification of multi-patient respiration in the DT model for an elderly respiration rate.

  11. Aruzz22.5K: An Image Dataset of Rice Varieties

    • data.mendeley.com
    Updated Mar 12, 2024
    + more versions
    Cite
    Md Masudul Islam (2024). Aruzz22.5K: An Image Dataset of Rice Varieties [Dataset]. http://doi.org/10.17632/3mn9843tz2.4
    Dataset updated
    Mar 12, 2024
    Authors
    Md Masudul Islam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This extensive dataset presents a meticulously curated collection of low-resolution images showcasing 20 well-established rice varieties native to diverse regions of Bangladesh. The rice samples were carefully gathered from both rural areas and local marketplaces, ensuring a comprehensive and varied representation. Serving as a visual compendium, the dataset provides a thorough exploration of the distinct characteristics of these rice varieties, facilitating precise classification.

    Dataset Composition

    The dataset encompasses 20 distinct classes: Subol Lota, Bashmoti (Deshi), Ganjiya, Shampakatari, Sugandhi Katarivog, BR-28, BR-29, Paijam, Bashful, Lal Aush, BR-Jirashail, Gutisharna, Birui, Najirshail, Pahari Birui, Polao (Katari), Polao (Chinigura), Amon, Shorna-5, and Lal Binni. In total, the dataset comprises 4,730 original JPG images and 23,650 augmented images.

    Image Capture and Dataset Organization

    These images were captured using an iPhone 11 camera with a 5x zoom feature. Each image capturing these rice varieties was diligently taken between October 18 and November 29, 2023. To facilitate efficient data management and organization, the dataset is structured into two variants: Original images and Augmented images. Each variant is systematically categorized into 20 distinct sub-directories, each corresponding to a specific rice variety.

    Original Image Dataset

    The primary image set comprises 4,730 JPG images, uniformly sized at 853 × 853 pixels. Owing to the low resolution, the set totalled 268 MB; compression with a zip program reduced the final size to 254 MB.

    Augmented Image Dataset

    To address the substantial image volume requirements of deep learning models for machine vision, data augmentation techniques were implemented. A total of 23,650 images were obtained through augmentation. These augmented images, also in JPG format and uniformly sized at 512 × 512 pixels, initially amounted to 781 MB; after compression, the dataset was streamlined to 699 MB.

    Dataset Storage and Access

    The raw and augmented datasets are stored in two distinct zip files, namely 'Original.zip' and 'Augmented.zip'. Both zip files contain 20 sub-folders representing a unique rice variety, namely 1_Subol_Lota, 2_Bashmoti, 3_Ganjiya, 4_Shampakatari, 5_Katarivog, 6_BR28, 7_BR29, 8_Paijam, 9_Bashful, 10_Lal_Aush, 11_Jirashail, 12_Gutisharna, 13_Red_Cargo,14_Najirshail, 15_Katari_Polao, 16_Lal_Biroi, 17_Chinigura_Polao, 18_Amon, 19_Shorna5, 20_Lal_Binni.

    Train and Test Data Organization

    To ease experimentation for researchers, we have balanced the data and split it in an 80:20 train-test ratio. The ‘Train_n_Test.zip’ folder contains two sub-directories: ‘1_TEST’, which contains 1125 images per class, and ‘2_VALID’, which contains 225 images per class.
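    A balanced per-class 80:20 split like the one described can be reproduced along these lines; the file names and class folder below are illustrative, not the dataset's actual contents:

```python
import random

def split_per_class(files_by_class, train_frac=0.8, seed=42):
    # Shuffle each class independently, then cut at the 80% mark
    # so every class keeps the same train:test ratio (balanced split).
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, files in files_by_class.items():
        files = sorted(files)
        rng.shuffle(files)
        cut = int(len(files) * train_frac)
        train[cls], test[cls] = files[:cut], files[cut:]
    return train, test

demo = {"1_Subol_Lota": [f"img_{i}.jpg" for i in range(10)]}
train, test = split_per_class(demo)   # 8 train / 2 test images per class
```

Fixing the shuffle seed makes the split reproducible across runs, which matters when results are compared between experiments.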

  12. BIRD: Big Impulse Response Dataset

    • data.niaid.nih.gov
    • kaggle.com
    Updated Oct 29, 2020
    Cite
    Michaud, François (2020). BIRD: Big Impulse Response Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4139415
    Dataset updated
    Oct 29, 2020
    Dataset provided by
    Grondin, François
    Michaud, Simon
    Michaud, François
    Lauzon, Jean-Samuel
    Ravanelli, Mirco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BIRD is an open dataset that consists of 100,000 multichannel room impulse responses generated using the image method. This makes it the largest multichannel open dataset currently available. We provide some Python code that shows how to download and use this dataset to perform online data augmentation. The code is compatible with the PyTorch dataset class, which eases integration in existing deep learning projects based on this framework.
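    The core of online reverberation augmentation with a room impulse response (RIR) is convolving the dry signal with the response. The snippet below is a minimal NumPy sketch of that idea, not the dataset's bundled PyTorch code; the exponentially decaying toy RIR is an illustrative stand-in for a real measured or simulated response:

```python
import numpy as np

def reverberate(dry, rir):
    # Simulate a room by convolving the dry signal with an
    # impulse response, then trim back to the original length.
    wet = np.convolve(dry, rir)[: len(dry)]
    # Normalise so the augmented clip keeps a comparable level.
    return wet / (np.max(np.abs(wet)) + 1e-9)

rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)          # 1 s of audio at 16 kHz
rir = np.exp(-np.arange(4000) / 800.0)    # toy exponentially decaying RIR
wet = reverberate(dry, rir)               # same length as the dry input
```

In an online setting, a different RIR is drawn from the dataset for every training example, so the model sees a new acoustic environment at each step.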

  13. Synthetic Data Solution Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 3, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Synthetic Data Solution Report [Dataset]. https://www.marketreportanalytics.com/reports/synthetic-data-solution-55327
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The synthetic data solution market is experiencing robust growth, driven by increasing demand for data privacy and security, coupled with the need for large, high-quality datasets for training AI and machine learning models. The market, currently estimated at $2 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of over $10 billion by 2033. This expansion is fueled by several key factors: stringent data privacy regulations like GDPR and CCPA, which restrict the use of real personal data; the rise of synthetic data generation techniques enabling the creation of realistic, yet privacy-preserving datasets; and the increasing adoption of AI and ML across various industries, particularly financial services, retail, and healthcare, creating a high demand for training data. The cloud-based segment is currently dominating the market, owing to its scalability, accessibility, and cost-effectiveness. The geographical distribution shows North America and Europe as leading regions, driven by early adoption of AI and robust data privacy regulations. However, the Asia-Pacific region is expected to witness significant growth in the coming years, propelled by the rapid expansion of the technology sector and increasing digitalization efforts in countries like China and India. Key players like LightWheel AI, Hanyi Innovation Technology, and Baidu are strategically investing in research and development, fostering innovation and expanding their market presence. While challenges such as the complexity of synthetic data generation and potential biases in generated data exist, the overall market outlook remains highly positive, indicating significant opportunities for growth and innovation in the coming decade. 
The "Others" application segment represents a promising area for future growth, encompassing sectors such as manufacturing, energy, and transportation, where synthetic data can address specific data challenges.

  14. Printed Digits Dataset

    • paperswithcode.com
    Updated Apr 2, 2025
    + more versions
    Cite
    (2025). Printed Digits Dataset [Dataset]. https://paperswithcode.com/dataset/printed-digits-dataset
    Dataset updated
    Apr 2, 2025
    Description

    Description:


    The Printed Digits Dataset is a comprehensive collection of approximately 3,000 grayscale images, specifically curated for numeric digit classification tasks. Originally created with 177 images, this dataset has undergone extensive augmentation to enhance its diversity and utility, making it an ideal resource for machine learning projects such as Sudoku digit recognition.

    Dataset Composition:

    Image Count: The dataset contains around 3,000 images, each representing a single numeric digit from 0 to 9.

    Image Dimensions: Each image is standardized to a 28×28 pixel resolution, maintaining a consistent grayscale format.

    Purpose: This dataset was developed with a specific focus on Sudoku digit classification. Notably, it includes blank images for the digit '0', reflecting the common occurrence of empty cells in Sudoku puzzles.


    Augmentation Details:

    To expand the original dataset from 177 images to 3,000, a variety of data augmentation techniques were applied. These include:

    Rotation: Images were rotated to simulate different orientations of printed digits.

    Scaling: Variations in the size of digits were introduced to mimic real-world printing inconsistencies.

    Translation: Digits were shifted within the image frame to represent slight misalignments often seen in printed text.

    Noise Addition: Gaussian noise was added to simulate varying print quality and scanner imperfections.
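    Two of the augmentations listed above, translation and noise addition, can be sketched for 28×28 grayscale arrays as follows. Names and parameters are illustrative, and rotation and scaling are omitted since they are usually delegated to an imaging library:

```python
import numpy as np

def translate(img, dx, dy):
    # Shift the digit inside the 28x28 frame (wrap-around kept
    # simple here; real pipelines pad with background instead).
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def add_noise(img, sigma=10.0, rng=None):
    # Add Gaussian noise, then clip back to the valid grayscale range.
    rng = rng or np.random.default_rng()
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

digit = np.zeros((28, 28), dtype=np.uint8)
digit[10:18, 12:16] = 255                      # crude printed "1"
augmented = add_noise(translate(digit, 2, -1))  # shifted, noisy variant
```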

    Applications:

    Sudoku Digit Recognition: Given its design, this dataset is particularly well-suited for training models to recognize and classify digits in Sudoku puzzles.

    Handwritten Digit Classification: Although the dataset contains printed digits, it can be adapted and utilized in combination with handwritten digit datasets for broader numeric

    classification tasks.

    Optical Character Recognition (OCR): This dataset can also be valuable for training OCR systems, especially those aimed at processing low-resolution or small-scale printed text.

    Dataset Quality:

    Uniformity: All images are uniformly scaled and aligned, providing a clean and consistent dataset for model training.

    Diversity: Augmentation has significantly increased the diversity of digit representation, making the dataset robust for training deep learning models.

    Usage Notes:

    Zero Representation: Users should note that the digit '0' is represented by a blank image.

    This design choice aligns with the specific application of Sudoku puzzle solving but may require adjustments if the dataset is used for other numeric classification tasks.

    Preprocessing Required: While the dataset is ready for use, additional preprocessing steps, such as normalization or further augmentation, can be applied based on the specific requirements of the intended machine learning model.
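
    As a minimal illustration of the normalization step mentioned above (the function name and the [0, 1] scaling choice are this sketch's assumptions, not part of the dataset):

```python
import numpy as np

def preprocess(images):
    """Scale a batch of 28x28 grayscale digits to [0, 1] and add a
    trailing channel axis, the shape most CNN frameworks expect."""
    x = np.asarray(images, dtype=np.float32) / 255.0
    return x.reshape(-1, 28, 28, 1)

batch = np.random.randint(0, 256, size=(32, 28, 28), dtype=np.uint8)
prepared = preprocess(batch)
print(prepared.shape)  # (32, 28, 28, 1)
```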

    File Format:

    The images are stored in a standardized format compatible with most machine learning frameworks, ensuring ease of integration into existing workflows.

    Conclusion: The Printed Digits Dataset offers a rich resource for those working on digit classification projects, particularly within the context of Sudoku or other numeric-based puzzles. Its extensive augmentation and attention to application-specific details make it a valuable asset for both academic research and practical AI development.

    This dataset is sourced from Kaggle.

  15. f

    Data from: Oxidation Stability of Hydrocarbons: A Machine-Learning-Based...

    • acs.figshare.com
    xlsx
    Updated Feb 24, 2025
    Adrian Venegas-Reynoso; Benoit Creton; Lucia Giarracca-Mehl; Marion Lacoue-Negre; Cyril Ruckebusch; Ludovic Duponchel (2025). Oxidation Stability of Hydrocarbons: A Machine-Learning-Based Study [Dataset]. http://doi.org/10.1021/acs.energyfuels.4c04926.s001
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    ACS Publications
    Authors
    Adrian Venegas-Reynoso; Benoit Creton; Lucia Giarracca-Mehl; Marion Lacoue-Negre; Cyril Ruckebusch; Ludovic Duponchel
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Having fluids that are stable over time is important for many applications, particularly sustainable aviation fuels (SAFs) derived from various renewable sources. Being able to understand this characteristic as early as possible during the development of SAFs would facilitate the blending of renewable sources with or without fossil fuels. Oxidation stability, defined as a hydrocarbon’s resistance to reacting with oxygen at near-ambient temperatures, is one of the most important hydrocarbon-stability-related properties. Indeed, the accumulation of byproducts of oxidation reactions may result in system failures. Assessing this property experimentally remains time-consuming; thus, developing fast and accurate predictive models becomes relevant, and approaches based on machine learning appear as valuable alternatives. The development of quantitative structure–property relationships (QSPRs) is subject to the availability of reference data, and unfortunately, these are currently lacking in the literature. In this study, we first built a database containing consistent experimental results from accelerated oxidation tests conducted on diverse pure hydrocarbons, within the carbon atom number range of SAFs, using the PetroOxy/RapidOxy test method, and second, we applied two machine-learning-based techniques (SVM and XGBoost) to the generated data set to derive QSPR-based models. The contribution of techniques such as data augmentation applied to our data set was also investigated and compared to more classical approaches. The best model (RMSEP = 2.7 h) was obtained after log-transforming the reference Induction Period, performing Smart Data Augmentation to enrich the database content, and using XGBoost with linear learners. While the model’s accuracy is not adequate for quantitative predictions, it allows fast and semiquantitative predictions.
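
    The modelling recipe described above (log-transform the induction-period target, fit a boosted regressor, report error on held-out data) can be sketched as follows. This is a hedged illustration, not the paper's pipeline: scikit-learn's GradientBoostingRegressor stands in for XGBoost, and the synthetic features are placeholders for real molecular descriptors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # stand-in molecular descriptors
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=200))  # positive "induction period" (h)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit on the log-transformed target, then back-transform predictions to hours.
model = GradientBoostingRegressor(random_state=0).fit(X_tr, np.log(y_tr))
pred = np.exp(model.predict(X_te))
rmsep = float(np.sqrt(mean_squared_error(y_te, pred)))
```

    The log transform compresses the heavy right tail that induction-period measurements typically have, so the squared-error objective is not dominated by a few very stable compounds.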

  16. Z

    Data from: A Dataset and Machine Learning Approach to Classify and Augment...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 4, 2023
    Machulla, Tonja-Katrin (2023). A Dataset and Machine Learning Approach to Classify and Augment Interface Elements of Household Appliances to Support People with Visual Impairment [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7586106
    Explore at:
    Dataset updated
    Feb 4, 2023
    Dataset provided by
    Machulla, Tonja-Katrin
    Wieland, Markus
    Schmidt, Albrecht
    Lang, Florian
    Tschakert, Hanna
    Description

    Here, we provide a dataset of images of interfaces from household appliances, where all interface elements are labelled with one of five different types of interface elements. Further, we provide auxiliary materials to use and extend the dataset.

  17. f

    Item response probabilities of DINA, DINO and ACDM models.

    • plos.figshare.com
    xls
    Updated Jan 5, 2024
    Ji-Young Yoon; Gahgene Gweon; Yun Joo Yoo (2024). Item response probabilities of DINA, DINO and ACDM models. [Dataset]. http://doi.org/10.1371/journal.pone.0296464.t001
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ji-Young Yoon; Gahgene Gweon; Yun Joo Yoo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Item response probabilities of DINA, DINO and ACDM models.

  18. m

    Guava leaves diseases datasets Bangladesh

    • data.mendeley.com
    Updated Nov 4, 2024
    Sumaia Akter Sumaia (2024). Guava leaves diseases datasets Bangladesh [Dataset]. http://doi.org/10.17632/2vt22x9s82.2
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Sumaia Akter Sumaia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The dataset, sourced from Vimruli Guava Garden and Floating Market in Jhalakathi, Barisal, categorizes guava leaf and fruit conditions for better crop management. It includes images of healthy and diseased samples, making it a valuable resource for researchers and practitioners working on machine learning models to identify plant diseases. The dataset includes six classes for robust model training.

    Dataset Summary: Location: Vimruli Guava Garden & Floating Market, Jhalakathi, Barisal. Subjects: Guava leaves and fruits. Purpose: Classification and detection of guava plant conditions.

    Data Distribution (Classes):

    1. Algal Leaves Spot: 100 original, 1320 augmented, 1420 total
    2. Dry Leaves: 52 original, 676 augmented, 728 total
    3. Healthy Fruit: 50 original, 650 augmented, 700 total
    4. Healthy Leaves: 150 original, 1600 augmented, 1750 total
    5. Insects Eaten: 164 original, 1720 augmented, 1884 total
    6. Red Rust: 90 original, 1170 augmented, 1260 total

    Total Samples: 606 original, 7136 augmented, 7742 overall

    Class Details:

    1. Algal Leaves Spot: Fungal spots on leaves.
    2. Dry Leaves: Leaves dried from environmental/nutrient factors.
    3. Healthy Fruit/Leaves: Free of diseases/damage.
    4. Insects Eaten: Insect-caused damage on leaves.
    5. Red Rust: Reddish spots due to fungal infection.

    This dataset is well-suited for training and evaluating machine learning models to detect and classify various conditions of guava plants, aiding in automated disease identification and better agricultural management.
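
    As a quick consistency check, the per-class counts reported above can be tallied in a few lines; the class names and counts are taken directly from the dataset description.

```python
# Per-class (original, augmented) counts from the dataset description.
classes = {
    "Algal Leaves Spot": (100, 1320),
    "Dry Leaves": (52, 676),
    "Healthy Fruit": (50, 650),
    "Healthy Leaves": (150, 1600),
    "Insects Eaten": (164, 1720),
    "Red Rust": (90, 1170),
}

original = sum(o for o, _ in classes.values())
augmented = sum(a for _, a in classes.values())
print(original, augmented, original + augmented)  # 606 7136 7742
```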

  19. m

    Ripen Banana Dataset: A Comprehensive Resource for Carbide Detection and...

    • data.mendeley.com
    Updated Feb 5, 2025
    Elman Alam (2025). Ripen Banana Dataset: A Comprehensive Resource for Carbide Detection and Ripening Stage Analysis to Enhance Food Quality and Agricultural Efficiency [Dataset]. http://doi.org/10.17632/j9sp322drp.1
    Explore at:
    Dataset updated
    Feb 5, 2025
    Authors
    Elman Alam
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This "Ripen Banana" dataset explores the phases of ripening of Musa sapientum, commonly known as the Sabri banana, under two conditions: natural ripening and accelerated ripening using calcium carbide. Data were collected in a seven-day controlled experiment running from August 26 to September 2, 2024. A total of 1,404 original photos were taken with a phone camera at two-hour intervals, both following calcium carbide treatment and at specified times for the naturally ripened bananas. The collection comprises 1,093 photos of naturally ripened bananas and 311 photos of carbide-treated bananas, which attained full ripening by the second day. Data augmentation was then applied to achieve class balance, producing 2,814 augmented photos for the naturally ripened batch and 3,596 for the carbide-treated batch.

  20. f

    Dataset of partial discharge and noise signals

    • figshare.com
    bin
    Updated Feb 23, 2024
    Andreas Rauscher (2024). Dataset of partial discharge and noise signals [Dataset]. http://doi.org/10.6084/m9.figshare.24033225.v1
    Explore at:
    bin
    Available download formats
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    figshare
    Authors
    Andreas Rauscher
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets (Tr0, Va0, Te0, Tr1, Va1, Te1, Te2) consisting of partial discharge (PD) and noise signals (NonPD) from electrical machines referred to in the publication "Deep learning and data augmentation for partial discharge detection in electrical machines" (DOI: https://doi.org/10.1016/j.engappai.2024.108074 )

# Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?

https://doi.org/10.5061/dryad.1rn8pk12f

All computer code and final raw data used for this research work are stored in GitHub: https://github.com/JamesMulqueeney/Automated-3D-Feature-Extraction and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.11109348.

This data is the additional primary data used in each analysis. These include: CT Image Files, Manual Segmentation Files (used for training or analysis), Inputs and Outputs for Shape Analysis, and an example .h5 file which can be used to practice AI segmentation.

Description of the data and file structure

The primary data is arranged into the following:

  1. Image_Files.zip: Foraminiferal CT data used in the analysis.
  2. **I...