59 datasets found
  1. Data from: How many specimens make a sufficient training set for automated...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2 more
    Updated Jun 1, 2024
    Cite
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard (2024). How many specimens make a sufficient training set for automated three dimensional feature extraction? [Dataset]. http://doi.org/10.5061/dryad.1rn8pk12f
    Dataset updated
    Jun 1, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard
    Description

    Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation methods. However, there has been limited exploration into optimal training set sizes, including whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal, unsurprisingly, that deep learning models improve with a larger number of training images, with eight specimens required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and ...

    Data collection: 50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on the analysis of morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and a greater number of chambers in the final whorl compared to M. exilis [13]. The s...

    # Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?

    https://doi.org/10.5061/dryad.1rn8pk12f

    All computer code and final raw data used for this research work are stored in GitHub: https://github.com/JamesMulqueeney/Automated-3D-Feature-Extraction and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.11109348.

    This data is the additional primary data used in each analysis. It includes: CT Image Files, Manual Segmentation Files (used for training or analysis), Inputs and Outputs for Shape Analysis, and an example .h5 file which can be used to practice AI segmentation.

    Description of the data and file structure

    The primary data is arranged into the following:

    1. Image_Files.zip: Foraminiferal CT data used in the analysis.
    2. **I...
  2. Data from: Exploring deep learning techniques for wild animal behaviour...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Available download formats: zip
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Nagoya University
    Osaka University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms, as well as related training techniques, have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached, and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each sample during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with a shortcut connection, showed the best performance among the comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features.
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

    This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
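    The per-sample random augmentation scheme described above (choosing one of none, scaling, jittering, permutation, time-warping or rotation for each window) can be sketched roughly as follows. The function names and parameters are illustrative, not taken from the paper's code, and only three of the six operations are shown:

```python
import random
import numpy as np

def jitter(x, sigma=0.05):
    # Add Gaussian noise to every time step and axis.
    return x + np.random.normal(0.0, sigma, x.shape)

def scale(x, sigma=0.1):
    # Multiply each axis by a random factor drawn around 1.
    return x * np.random.normal(1.0, sigma, (1, x.shape[1]))

def permute(x, n_segments=4):
    # Split the window into segments and shuffle their order.
    segments = np.array_split(x, n_segments)
    random.shuffle(segments)
    return np.concatenate(segments)

def random_augment(x):
    # Pick one augmentation (or none, the identity) per window.
    ops = [lambda a: a, jitter, scale, permute]
    return random.choice(ops)(x)

window = np.random.randn(100, 3)   # 100 time steps, 3 accelerometer axes
augmented = random_augment(window)
```

Applying a single randomly chosen operation per mini-batch sample, rather than stacking all of them, keeps the augmented distribution close to the original data.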

  3. Additional file 4 of Which data subset should be augmented for deep...

    • springernature.figshare.com
    xlsx
    Updated Jun 21, 2023
    + more versions
    Cite
    Yusra A. Ameen; Dalia M. Badary; Ahmad Elbadry I. Abonnoor; Khaled F. Hussain; Adel A. Sewisy (2023). Additional file 4 of Which data subset should be augmented for deep learning? a simulation study using urothelial cell carcinoma histopathology images [Dataset]. http://doi.org/10.6084/m9.figshare.22622732.v1
    Available download formats: xlsx
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    figshare
    Authors
    Yusra A. Ameen; Dalia M. Badary; Ahmad Elbadry I. Abonnoor; Khaled F. Hussain; Adel A. Sewisy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4. A Microsoft® Excel® workbook that details the raw data for the 8 experiments in which either the test set was augmented alone (after its allocation) or augmentation of the whole dataset was done before test-set allocation. All of the image-classification output probabilities are included.

  4. Brain Tumor Paper Dataset and Code

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 8, 2023
    Cite
    Yazan Al-Smadi (2023). Brain Tumor Paper Dataset and Code [Dataset]. http://doi.org/10.5281/zenodo.7619446
    Available download formats: bin
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yazan Al-Smadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brain Tumor Detection Research Paper Code and Dataset

    Paper title: Transforming brain tumor detection: the impact of YOLO models and MRI orientations.

    Authored by: Yazan Al-Smadi, Ahmad Al-Qerem, et al. (2023)


    This project contains a full version of the brain tumor dataset used and a full version of the code for the proposed research methodology.

  5. augmentation data for DAISM

    • data.mendeley.com
    Updated Jun 22, 2022
    + more versions
    Cite
    Yating Lin (2022). augmentation data for DAISM [Dataset]. http://doi.org/10.17632/ysjwjvpnh3.1
    Dataset updated
    Jun 22, 2022
    Authors
    Yating Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purified dataset for data augmentation for DAISM-DNNXMBD can be downloaded from this repository.

    The pbmc8k dataset downloaded from 10X Genomics was processed and used for data augmentation to create training datasets for DAISM-DNN models. pbmc8k.h5ad contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells), and pbmc8k_fine.h5ad contains finer-grained cell types (naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells).

    The RNA-seq dataset contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells). Raw FASTQ reads were downloaded from the NCBI website, and transcript- and gene-level expression quantification was performed using Salmon (version 0.11.3) with Gencode v29 after quality control of the FASTQ reads using fastp. All tools were used with default parameters.

  6. Variable Misuse tool: Dataset for data augmentation (3)

    • zenodo.org
    zip
    Updated Mar 8, 2022
    + more versions
    Cite
    Cristian Robledo; Francesca Sallicati; Javier Gutiérrez (2022). Variable Misuse tool: Dataset for data augmentation (3) [Dataset]. http://doi.org/10.5281/zenodo.6090340
    Available download formats: zip
    Dataset updated
    Mar 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristian Robledo; Francesca Sallicati; Javier Gutiérrez
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description

    Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.

  7. Characterization of datasets.

    • figshare.com
    xls
    Updated Sep 26, 2024
    + more versions
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Characterization of datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t004
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
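    Two of the dictionary-free Easy Data Augmentation (EDA) operations mentioned above, random swap and random deletion, can be sketched as follows. This is a generic illustration of the technique, not the study's implementation, and the sample sentence is invented:

```python
import random

def random_swap(words, n=1):
    # Swap two randomly chosen positions, n times.
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Drop each word with probability p, keeping at least one word.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "la letra de esta canción expresa mucha tristeza".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```

The remaining EDA operations (synonym replacement and random insertion) additionally require a synonym resource such as a Spanish WordNet, which is why they are omitted here.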

  8. Data from: Image-based automated species identification: Can virtual data...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 12, 2021
    Cite
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2021). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
    Available download formats: zip
    Dataset updated
    Jul 12, 2021
    Dataset provided by
    Dryad
    Authors
    Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
    Time period covered
    2021
    Description

    Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning. In this study, we assessed whether a data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic approaches of data augmentation and generation of artificial images using a Generative Adversarial Network (GAN) approach. Descriptive featu...

  9. Car Highway Dataset

    • universe.roboflow.com
    zip
    Updated Sep 13, 2023
    Cite
    Sallar (2023). Car Highway Dataset [Dataset]. https://universe.roboflow.com/sallar/car-highway/dataset/1
    Available download formats: zip
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Sallar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Vehicles Bounding Boxes
    Description

    Car-Highway Data Annotation Project

    Introduction

    In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.

    Project Goals

    • Collect a diverse dataset of car images from highway scenes.
    • Annotate the dataset to identify and label cars within each image.
    • Organize and format the annotated data for machine learning model training.

    Tools and Technologies

    For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.

    Annotation Process

    1. Upload the raw car images to the Roboflow platform.
    2. Use the annotation tools in Roboflow to draw bounding boxes around each car in the images.
    3. Label each bounding box with the corresponding class (e.g., car).
    4. Review and validate the annotations for accuracy.

    Data Augmentation

    Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.

    Data Export

    Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
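    As an illustration of one of those export formats: YOLO-format labels store one object per line as `class x_center y_center width height`, all normalized to [0, 1]. A minimal parser (a hypothetical helper, not part of Roboflow) might look like:

```python
def parse_yolo_line(line, img_w, img_h):
    # Convert one normalized YOLO row into a class id and
    # pixel-space corner coordinates (x1, y1, x2, y2).
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return int(cls), (x1, y1, x2, y2)

# A car centred in a 1920x1080 frame:
cls_id, box = parse_yolo_line("0 0.5 0.5 0.2 0.4", 1920, 1080)
# box ≈ (768.0, 324.0, 1152.0, 756.0) in pixels
```

COCO and TensorFlow Record store the same boxes differently (JSON with absolute coordinates, and serialized protobufs, respectively), which is why exporters like Roboflow offer several output options.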

    Milestones

    1. Data Collection and Preprocessing
    2. Annotation of Car Images
    3. Data Augmentation
    4. Data Export
    5. Model Training

    Conclusion

    By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.

  10. MultiPatient Elderly Respiration dataset in Digital Twin Technology

    • data.mendeley.com
    Updated Dec 7, 2023
    Cite
    SAGHEER KHAN (2023). MultiPatient Elderly Respiration dataset in Digital Twin Technology [Dataset]. http://doi.org/10.17632/vm8j5dvrxy.1
    Dataset updated
    Dec 7, 2023
    Authors
    SAGHEER KHAN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The research focus for this study is to generate a larger respiration dataset for the creation of elderly respiration Digital Twin (DT) model. Initial experimental data is collected with an unobtrusive Wi-Fi sensor with Channel State Information (CSI) characteristics to collect the subject's respiration rate.

    The generation of a DT model requires extensive and diverse data. Due to limited resources and the need for extensive experimentation, the data is generated by implementing a novel statistical time series data augmentation method on single-subject respiration data. The larger synthetic respiration datasets will allow for testing signal processing methodologies for noise removal, Breaths Per Minute (BPM) estimation, and extensive Artificial Intelligence (AI) implementation.
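    The dataset's own statistical augmentation method is not detailed here; as a generic sketch of the idea, synthetic variants of a single respiration trace can be produced by amplitude scaling plus additive noise (all names and parameters below are illustrative assumptions, not the authors' method):

```python
import numpy as np

def augment_respiration(signal, n_variants=30, noise_sd=0.05, scale_sd=0.1, seed=0):
    # Generate synthetic variants of one respiration trace by
    # random amplitude scaling plus additive Gaussian noise.
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(n_variants):
        scale = rng.normal(1.0, scale_sd)
        noise = rng.normal(0.0, noise_sd, signal.shape)
        variants.append(scale * signal + noise)
    return np.stack(variants)

t = np.linspace(0, 60, 600)                 # 60 s sampled at 10 Hz
breath = np.sin(2 * np.pi * (14 / 60) * t)  # ~14 BPM sinusoidal proxy
synthetic = augment_respiration(breath)     # shape (30, 600), one row per "patient"
```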

    The sensor data covers BPM values from 12 BPM to 25 BPM for a single subject. Normal respiration rate ranges from 12 BPM to 16 BPM; beyond this is considered abnormal. A total of 14 files are present in the dataset, each labeled according to its BPM. Data for all 30 patients are present for each BPM; patients are numbered P1, P2, P3, ..., P30.

    This data can be utilized by researchers and scientists toward the development of novel signal processing methodologies in the respiration DT model. These larger respiration datasets can be utilized for Machine Learning (ML) and Deep Learning (DL) in providing predictive analysis and classification of multi-patient respiration in the DT model for an elderly respiration rate.

  11. Aruzz22.5K: An Image Dataset of Rice Varieties

    • data.mendeley.com
    Updated Mar 12, 2024
    + more versions
    Cite
    Md Masudul Islam (2024). Aruzz22.5K: An Image Dataset of Rice Varieties [Dataset]. http://doi.org/10.17632/3mn9843tz2.4
    Dataset updated
    Mar 12, 2024
    Authors
    Md Masudul Islam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This extensive dataset presents a meticulously curated collection of low-resolution images showcasing 20 well-established rice varieties native to diverse regions of Bangladesh. The rice samples were carefully gathered from both rural areas and local marketplaces, ensuring a comprehensive and varied representation. Serving as a visual compendium, the dataset provides a thorough exploration of the distinct characteristics of these rice varieties, facilitating precise classification.

    Dataset Composition

    The dataset encompasses 20 distinct classes: Subol Lota, Bashmoti (Deshi), Ganjiya, Shampakatari, Sugandhi Katarivog, BR-28, BR-29, Paijam, Bashful, Lal Aush, BR-Jirashail, Gutisharna, Birui, Najirshail, Pahari Birui, Polao (Katari), Polao (Chinigura), Amon, Shorna-5, and Lal Binni. In total, the dataset comprises 4,730 original JPG images and 23,650 augmented images.

    Image Capture and Dataset Organization

    These images were captured using an iPhone 11 camera with a 5x zoom feature. Each image capturing these rice varieties was diligently taken between October 18 and November 29, 2023. To facilitate efficient data management and organization, the dataset is structured into two variants: Original images and Augmented images. Each variant is systematically categorized into 20 distinct sub-directories, each corresponding to a specific rice variety.

    Original Image Dataset

    The primary image set comprises 4,730 JPG images, uniformly sized at 853 × 853 pixels. Owing to the low resolution, the set totalled 268 MB; compression with a zip program reduced the final size to 254 MB.

    Augmented Image Dataset

    To address the substantial image volume requirements of deep learning models for machine vision, data augmentation techniques were implemented. A total of 23,650 images were obtained through augmentation. These augmented images, also in JPG format and uniformly sized at 512 × 512 pixels, initially amounted to 781 MB; after compression, the dataset was streamlined to 699 MB.

    Dataset Storage and Access

    The raw and augmented datasets are stored in two distinct zip files, namely 'Original.zip' and 'Augmented.zip'. Both zip files contain 20 sub-folders representing a unique rice variety, namely 1_Subol_Lota, 2_Bashmoti, 3_Ganjiya, 4_Shampakatari, 5_Katarivog, 6_BR28, 7_BR29, 8_Paijam, 9_Bashful, 10_Lal_Aush, 11_Jirashail, 12_Gutisharna, 13_Red_Cargo,14_Najirshail, 15_Katari_Polao, 16_Lal_Biroi, 17_Chinigura_Polao, 18_Amon, 19_Shorna5, 20_Lal_Binni.

    Train and Test Data Organization

    To ease experimentation for researchers, we have balanced the data and split it in an 80:20 train-test ratio. The ‘Train_n_Test.zip’ folder contains two sub-directories: ‘1_TEST’, which contains 1125 images per class, and ‘2_VALID’, which contains 225 images per class.
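    A balanced per-class 80:20 split like the one described can be reproduced along these lines; the file names and class folder below are illustrative, not the dataset's actual contents:

```python
import random

def split_per_class(files_by_class, train_frac=0.8, seed=42):
    # Shuffle each class independently, then cut at the 80% mark
    # so every class keeps the same train:test ratio (balanced split).
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, files in files_by_class.items():
        files = sorted(files)
        rng.shuffle(files)
        cut = int(len(files) * train_frac)
        train[cls], test[cls] = files[:cut], files[cut:]
    return train, test

demo = {"1_Subol_Lota": [f"img_{i}.jpg" for i in range(10)]}
train, test = split_per_class(demo)   # 8 train / 2 test images per class
```

Fixing the shuffle seed makes the split reproducible across runs, which matters when results are compared between experiments.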

  12. BIRD: Big Impulse Response Dataset

    • data.niaid.nih.gov
    • kaggle.com
    Updated Oct 29, 2020
    Cite
    Michaud, François (2020). BIRD: Big Impulse Response Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4139415
    Dataset updated
    Oct 29, 2020
    Dataset provided by
    Grondin, François
    Michaud, Simon
    Michaud, François
    Lauzon, Jean-Samuel
    Ravanelli, Mirco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BIRD is an open dataset that consists of 100,000 multichannel room impulse responses generated using the image method. This makes it the largest multichannel open dataset currently available. We provide some Python code that shows how to download and use this dataset to perform online data augmentation. The code is compatible with the PyTorch dataset class, which eases integration in existing deep learning projects based on this framework.
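    The core of online reverberation augmentation with a room impulse response (RIR) is convolving the dry signal with the response. The snippet below is a minimal NumPy sketch of that idea, not the dataset's bundled PyTorch code; the exponentially decaying toy RIR is an illustrative stand-in for a real measured or simulated response:

```python
import numpy as np

def reverberate(dry, rir):
    # Simulate a room by convolving the dry signal with an
    # impulse response, then trim back to the original length.
    wet = np.convolve(dry, rir)[: len(dry)]
    # Normalise so the augmented clip keeps a comparable level.
    return wet / (np.max(np.abs(wet)) + 1e-9)

rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)          # 1 s of audio at 16 kHz
rir = np.exp(-np.arange(4000) / 800.0)    # toy exponentially decaying RIR
wet = reverberate(dry, rir)               # same length as the dry input
```

In an online setting, a different RIR is drawn from the dataset for every training example, so the model sees a new acoustic environment at each step.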

  13. Synthetic Data Solution Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 3, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Synthetic Data Solution Report [Dataset]. https://www.marketreportanalytics.com/reports/synthetic-data-solution-55327
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The synthetic data solution market is experiencing robust growth, driven by increasing demand for data privacy and security, coupled with the need for large, high-quality datasets for training AI and machine learning models. The market, currently estimated at $2 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of over $10 billion by 2033. This expansion is fueled by several key factors: stringent data privacy regulations like GDPR and CCPA, which restrict the use of real personal data; the rise of synthetic data generation techniques enabling the creation of realistic, yet privacy-preserving datasets; and the increasing adoption of AI and ML across various industries, particularly financial services, retail, and healthcare, creating a high demand for training data. The cloud-based segment is currently dominating the market, owing to its scalability, accessibility, and cost-effectiveness. The geographical distribution shows North America and Europe as leading regions, driven by early adoption of AI and robust data privacy regulations. However, the Asia-Pacific region is expected to witness significant growth in the coming years, propelled by the rapid expansion of the technology sector and increasing digitalization efforts in countries like China and India. Key players like LightWheel AI, Hanyi Innovation Technology, and Baidu are strategically investing in research and development, fostering innovation and expanding their market presence. While challenges such as the complexity of synthetic data generation and potential biases in generated data exist, the overall market outlook remains highly positive, indicating significant opportunities for growth and innovation in the coming decade. 
The "Others" application segment represents a promising area for future growth, encompassing sectors such as manufacturing, energy, and transportation, where synthetic data can address specific data challenges.

  14. Printed Digits Dataset

    • paperswithcode.com
    Updated Apr 2, 2025
    + more versions
    Cite
    (2025). Printed Digits Dataset [Dataset]. https://paperswithcode.com/dataset/printed-digits-dataset
    Dataset updated
    Apr 2, 2025
    Description

    Description:


    The Printed Digits Dataset is a comprehensive collection of approximately 3,000 grayscale images, specifically curated for numeric digit classification tasks. Originally created with 177 images, this dataset has undergone extensive augmentation to enhance its diversity and utility, making it an ideal resource for machine learning projects such as Sudoku digit recognition.

    Dataset Composition:

    Image Count: The dataset contains around 3,000 images, each representing a single numeric digit from 0 to 9.

    Image Dimensions: Each image is standardized to a 28×28 pixel resolution, maintaining a consistent grayscale format.

    Purpose: This dataset was developed with a specific focus on Sudoku digit classification. Notably, it includes blank images for the digit '0', reflecting the common occurrence of empty cells in Sudoku puzzles.


    Augmentation Details:

    To expand the original dataset from 177 images to 3,000, a variety of data augmentation techniques were applied. These include:

    Rotation: Images were rotated to simulate different orientations of printed digits.

    Scaling: Variations in the size of digits were introduced to mimic real-world printing inconsistencies.

    Translation: Digits were shifted within the image frame to represent slight misalignments often seen in printed text.

    Noise Addition: Gaussian noise was added to simulate varying print quality and scanner imperfections.
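    Two of the augmentations listed above, translation and noise addition, can be sketched for 28×28 grayscale arrays as follows. Names and parameters are illustrative, and rotation and scaling are omitted since they are usually delegated to an imaging library:

```python
import numpy as np

def translate(img, dx, dy):
    # Shift the digit inside the 28x28 frame (wrap-around kept
    # simple here; real pipelines pad with background instead).
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def add_noise(img, sigma=10.0, rng=None):
    # Add Gaussian noise, then clip back to the valid grayscale range.
    rng = rng or np.random.default_rng()
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

digit = np.zeros((28, 28), dtype=np.uint8)
digit[10:18, 12:16] = 255                      # crude printed "1"
augmented = add_noise(translate(digit, 2, -1))  # shifted, noisy variant
```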

    Applications:

    Sudoku Digit Recognition: Given its design, this dataset is particularly well-suited for training models to recognize and classify digits in Sudoku puzzles.

    Handwritten Digit Classification: Although the dataset contains printed digits, it can be adapted and utilized in combination with handwritten digit datasets for broader numeric

    classification tasks.

    Optical Character Recognition (OCR): This dataset can also be valuable for training OCR systems, especially those aimed at processing low-resolution or small-scale printed text.

    Dataset Quality:

    Uniformity: All images are uniformly scaled and aligned, providing a clean and consistent dataset for model training.

    Diversity: Augmentation has significantly increased the diversity of digit representation, making the dataset robust for training deep learning models.

    Usage Notes:

    Zero Representation: Users should note that the digit '0' is represented by a blank image.

    This design choice aligns with the specific application of Sudoku puzzle solving but may require adjustments if the dataset is used for other numeric classification tasks.

    Preprocessing Required: While the dataset is ready for use, additional preprocessing steps, such as normalization or further augmentation, can be applied based on the specific requirements of the intended machine learning model.
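
    As a minimal illustration of the normalization step mentioned above (the function name and the [0, 1] scaling choice are this sketch's assumptions, not part of the dataset):

```python
import numpy as np

def preprocess(images):
    """Scale a batch of 28x28 grayscale digits to [0, 1] and add a
    trailing channel axis, the shape most CNN frameworks expect."""
    x = np.asarray(images, dtype=np.float32) / 255.0
    return x.reshape(-1, 28, 28, 1)

batch = np.random.randint(0, 256, size=(32, 28, 28), dtype=np.uint8)
prepared = preprocess(batch)
print(prepared.shape)  # (32, 28, 28, 1)
```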

    File Format:

    The images are stored in a standardized format compatible with most machine learning frameworks, ensuring ease of integration into existing workflows.

    Conclusion: The Printed Digits Dataset offers a rich resource for those working on digit classification projects, particularly within the context of Sudoku or other numeric-based puzzles. Its extensive augmentation and attention to application-specific details make it a valuable asset for both academic research and practical AI development.

    This dataset is sourced from Kaggle.

  15. f

    Data from: Oxidation Stability of Hydrocarbons: A Machine-Learning-Based...

    • acs.figshare.com
    xlsx
    Updated Feb 24, 2025
    Adrian Venegas-Reynoso; Benoit Creton; Lucia Giarracca-Mehl; Marion Lacoue-Negre; Cyril Ruckebusch; Ludovic Duponchel (2025). Oxidation Stability of Hydrocarbons: A Machine-Learning-Based Study [Dataset]. http://doi.org/10.1021/acs.energyfuels.4c04926.s001
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    ACS Publications
    Authors
    Adrian Venegas-Reynoso; Benoit Creton; Lucia Giarracca-Mehl; Marion Lacoue-Negre; Cyril Ruckebusch; Ludovic Duponchel
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Having fluids that are stable over time is important for many applications, particularly sustainable aviation fuels (SAFs) derived from various renewable sources. Being able to understand this characteristic as early as possible during the development of SAFs would facilitate the blending of renewable sources with or without fossil fuels. Oxidation stability, defined as a hydrocarbon’s resistance to reacting with oxygen at near-ambient temperatures, is one of the most important hydrocarbon-stability-related properties. Indeed, the accumulation of byproducts of oxidation reactions may result in system failures. Assessing this property experimentally remains time-consuming; thus, developing fast and accurate predictive models becomes relevant, and approaches based on machine learning appear as valuable alternatives. The development of quantitative structure–property relationships (QSPRs) is subject to the availability of reference data, and unfortunately, these are currently lacking in the literature. In this study, we first built a database containing consistent experimental results from accelerated oxidation tests conducted on diverse pure hydrocarbons, within the carbon atom number range of SAFs, using the PetroOxy/RapidOxy test method, and second, we applied two machine-learning-based techniques (SVM and XGBoost) to the generated data set to derive QSPR-based models. The contribution of techniques such as data augmentation applied to our data set was also investigated and compared to more classical approaches. The best model (RMSEP = 2.7 h) was obtained after log-transforming the reference Induction Period, performing Smart Data Augmentation to enrich the database content, and using XGBoost with linear learners. While the model’s accuracy is not adequate for quantitative predictions, it allows fast and semiquantitative predictions.
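
    The modelling recipe described above (log-transform the induction-period target, fit a boosted regressor, report error on held-out data) can be sketched as follows. This is a hedged illustration, not the paper's pipeline: scikit-learn's GradientBoostingRegressor stands in for XGBoost, and the synthetic features are placeholders for real molecular descriptors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # stand-in molecular descriptors
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=200))  # positive "induction period" (h)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit on the log-transformed target, then back-transform predictions to hours.
model = GradientBoostingRegressor(random_state=0).fit(X_tr, np.log(y_tr))
pred = np.exp(model.predict(X_te))
rmsep = float(np.sqrt(mean_squared_error(y_te, pred)))
```

    The log transform compresses the heavy right tail that induction-period measurements typically have, so the squared-error objective is not dominated by a few very stable compounds.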

  16. Z

    Data from: A Dataset and Machine Learning Approach to Classify and Augment...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 4, 2023
    Machulla, Tonja-Katrin (2023). A Dataset and Machine Learning Approach to Classify and Augment Interface Elements of Household Appliances to Support People with Visual Impairment [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7586106
    Explore at:
    Dataset updated
    Feb 4, 2023
    Dataset provided by
    Machulla, Tonja-Katrin
    Wieland, Markus
    Schmidt, Albrecht
    Lang, Florian
    Tschakert, Hanna
    Description

    Here, we provide a dataset of images of interfaces from household appliances, where all interface elements are labelled with one of five different types of interface elements. Further, we provide auxiliary materials to use and extend the dataset.

  17. f

    Item response probabilities of DINA, DINO and ACDM models.

    • plos.figshare.com
    xls
    Updated Jan 5, 2024
    Ji-Young Yoon; Gahgene Gweon; Yun Joo Yoo (2024). Item response probabilities of DINA, DINO and ACDM models. [Dataset]. http://doi.org/10.1371/journal.pone.0296464.t001
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ji-Young Yoon; Gahgene Gweon; Yun Joo Yoo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Item response probabilities of DINA, DINO and ACDM models.

  18. m

    Guava leaves diseases datasets Bangladesh

    • data.mendeley.com
    Updated Nov 4, 2024
    Sumaia Akter Sumaia (2024). Guava leaves diseases datasets Bangladesh [Dataset]. http://doi.org/10.17632/2vt22x9s82.2
    Explore at:
    Dataset updated
    Nov 4, 2024
    Authors
    Sumaia Akter Sumaia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The dataset, sourced from Vimruli Guava Garden and Floating Market in Jhalakathi, Barisal, categorizes guava leaf and fruit conditions for better crop management. It includes images of healthy and diseased samples, making it a valuable resource for researchers and practitioners working on machine learning models to identify plant diseases. The dataset includes six classes for robust model training.

    Dataset Summary: Location: Vimruli Guava Garden & Floating Market, Jhalakathi, Barisal. Subjects: Guava leaves and fruits. Purpose: Classification and detection of guava plant conditions.

    Data Distribution (Classes):

    1. Algal Leaves Spot: 100 original, 1320 augmented, 1420 total
    2. Dry Leaves: 52 original, 676 augmented, 728 total
    3. Healthy Fruit: 50 original, 650 augmented, 700 total
    4. Healthy Leaves: 150 original, 1600 augmented, 1750 total
    5. Insects Eaten: 164 original, 1720 augmented, 1884 total
    6. Red Rust: 90 original, 1170 augmented, 1260 total

    Total Samples: 606 original, 7136 augmented, 7742 overall

    Class Details:

    1. Algal Leaves Spot: Fungal spots on leaves.
    2. Dry Leaves: Leaves dried from environmental/nutrient factors.
    3. Healthy Fruit/Leaves: Free of diseases/damage.
    4. Insects Eaten: Insect-caused damage on leaves.
    5. Red Rust: Reddish spots due to fungal infection.

    This dataset is well-suited for training and evaluating machine learning models to detect and classify various conditions of guava plants, aiding in automated disease identification and better agricultural management.
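
    As a quick consistency check, the per-class counts reported above can be tallied in a few lines; the class names and counts are taken directly from the dataset description.

```python
# Per-class (original, augmented) counts from the dataset description.
classes = {
    "Algal Leaves Spot": (100, 1320),
    "Dry Leaves": (52, 676),
    "Healthy Fruit": (50, 650),
    "Healthy Leaves": (150, 1600),
    "Insects Eaten": (164, 1720),
    "Red Rust": (90, 1170),
}

original = sum(o for o, _ in classes.values())
augmented = sum(a for _, a in classes.values())
print(original, augmented, original + augmented)  # 606 7136 7742
```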

  19. m

    Ripen Banana Dataset: A Comprehensive Resource for Carbide Detection and...

    • data.mendeley.com
    Updated Feb 5, 2025
    Elman Alam (2025). Ripen Banana Dataset: A Comprehensive Resource for Carbide Detection and Ripening Stage Analysis to Enhance Food Quality and Agricultural Efficiency [Dataset]. http://doi.org/10.17632/j9sp322drp.1
    Explore at:
    Dataset updated
    Feb 5, 2025
    Authors
    Elman Alam
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This "Ripen Banana" dataset explores the phases of ripening of Musa sapientum, commonly known as the Sabri banana, under two conditions: natural ripening and accelerated ripening using calcium carbide. Data were collected in a seven-day controlled experiment running from August 26 to September 2, 2024. A total of 1,404 original photos were taken with a phone camera at two-hour intervals, both following calcium carbide treatment and at specified times for the naturally ripened bananas. The collection comprises 1,093 photos of naturally ripened bananas and 311 photos of carbide-treated bananas, which attained full ripening by the second day. Data augmentation was then applied to achieve class balance, producing 2,814 augmented photos for the naturally ripened batch and 3,596 for the carbide-treated batch.

  20. f

    Dataset of partial discharge and noise signals

    • figshare.com
    bin
    Updated Feb 23, 2024
    Andreas Rauscher (2024). Dataset of partial discharge and noise signals [Dataset]. http://doi.org/10.6084/m9.figshare.24033225.v1
    Explore at:
    bin
    Available download formats
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    figshare
    Authors
    Andreas Rauscher
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets (Tr0, Va0, Te0, Tr1, Va1, Te1, Te2) consisting of partial discharge (PD) and noise signals (NonPD) from electrical machines referred to in the publication "Deep learning and data augmentation for partial discharge detection in electrical machines" (DOI: https://doi.org/10.1016/j.engappai.2024.108074 )

# Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?

https://doi.org/10.5061/dryad.1rn8pk12f

All computer code and final raw data used for this research work are stored in GitHub: https://github.com/JamesMulqueeney/Automated-3D-Feature-Extraction and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.11109348.

This data is the additional primary data used in each analysis. These include: CT Image Files, Manual Segmentation Files (used for training or analysis), Inputs and Outputs for Shape Analysis, and an example .h5 file which can be used to practice AI segmentation.

Description of the data and file structure

The primary data is arranged into the following:

  1. Image_Files.zip: Foraminiferal CT data used in the analysis.
  2. **I...