The quality of AI-generated images has rapidly increased, leading to concerns about authenticity and trustworthiness.
CIFAKE is a dataset that contains 60,000 synthetically generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect whether an image is real or has been generated by AI?
Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
The dataset contains two classes: REAL and FAKE.
For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset.
For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4.
There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class).
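As a minimal sketch of working with the split (assuming the Kaggle folder layout of train/ and test/ directories, each holding REAL/ and FAKE/ subfolders; adjust paths if your copy differs), the dataset can be loaded with torchvision:

```python
# Minimal sketch: load CIFAKE as a binary classification dataset.
# Assumed layout: cifake/train/{REAL,FAKE} and cifake/test/{REAL,FAKE}.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),  # CIFAKE images are 32x32 RGB, like CIFAR-10
])

train_set = datasets.ImageFolder("cifake/train", transform=transform)
test_set = datasets.ImageFolder("cifake/test", transform=transform)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=128)

print(train_set.classes)              # e.g. ['FAKE', 'REAL'] (alphabetical)
print(len(train_set), len(test_set))  # expected: 100000, 20000
```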
The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images
If you use this dataset, you must cite the following sources:
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
Real images are from Krizhevsky & Hinton (2009); fake images are from Bird & Lotfi (2024).
The updates to the dataset on the 28th of March 2023 did not change the image data; the ".jpeg" file extensions were renamed to ".jpg" and the root folder was re-uploaded to meet Kaggle's usability requirements.
This dataset is published under the same MIT license as CIFAR-10:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
https://spdx.org/licenses/
TiCaM Synthetic Images: A Time-of-Flight In-Car Cabin Monitoring Dataset is a time-of-flight dataset of car in-cabin images that provides the means to test extensive car cabin monitoring systems based on deep learning methods. The authors provide a synthetic image dataset of car cabin images similar to the real dataset, leveraging advanced simulation software's capability to generate abundant data with little effort. This can be used to test domain adaptation between synthetic and real data for select classes. For both datasets the authors provide ground truth annotations for 2D and 3D object detection, as well as for instance segmentation.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The dataset contains synthetically generated images of bottles scattered around random backgrounds. The download files contain 5,000 images for each available class of bottle. Currently there are five classes: plastic bottles, beer bottles, soda bottles, water bottles, and wine bottles. I will try to add more bottle types in the future.
Previously, the dataset only contained images of plastic bottles and beer bottles. Now I've included images of soda, water, and wine bottles as well. I will be adding more images in the future. You could always check the previous versions of the dataset if you want to retrieve the previous directory. Cheers! :D
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CIFAKE dataset provides 60,000 real and 60,000 AI-generated synthetic images for machine learning, classification, and computer vision research.
https://choosealicense.com/licenses/openrail++/
cutiee82-org/synthetic-images dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Synthetic Images 1.0 is a dataset for object detection tasks - it contains Water Meter Dial Digits annotations for 2,096 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
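As a hedged sketch, a programmatic download via Roboflow's Python package typically looks like the following; the workspace and project identifiers below are placeholders, so copy the exact snippet shown on the dataset's Roboflow page:

```python
# Hedged sketch of pulling a Roboflow dataset (pip install roboflow).
# "your-workspace" and "synthetic-images" are placeholder identifiers.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("synthetic-images")
dataset = project.version(1).download("yolov8")  # pick the export format you need
print(dataset.location)  # local folder containing images and labels
```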
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset consists of two primary categories: real_images and fake_images. The real_images category contains authentic images, while the fake_images category includes synthetic images generated using various advanced generative models. The purpose of this dataset is to facilitate research and development in the field of image classification, focusing on distinguishing between real and synthetic images.
The dataset is organized as follows:
The fake_images folder contains synthetic images generated using various generative models; each subfolder corresponds to a specific image generation model.
The real_images folder contains authentic, real-world images, which serve as the ground truth for comparison with the generated fake_images.
This dataset can be used for training and evaluating image classification models, particularly those focused on distinguishing real images from synthetic ones. It is well-suited for experiments with generative adversarial networks (GANs), diffusion models, and other deep learning techniques.
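As one illustrative setup (not a prescribed recipe), a pretrained backbone can be fine-tuned on the two top-level folders; the root path, image size, and hyperparameters below are assumptions:

```python
# Minimal fine-tuning sketch for real-vs-fake classification.
# Folder names real_images/ and fake_images/ follow the description above;
# "dataset_root" and all hyperparameters are illustrative assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder maps each top-level subfolder (fake_images, real_images) to a
# class, and gathers images recursively, so per-model subfolders are included.
data = datasets.ImageFolder("dataset_root", transform=transform)
loader = DataLoader(data, batch_size=64, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: fake, real

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one pass shown; train for more epochs in practice
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```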
https://choosealicense.com/licenses/cdla-permissive-2.0/
openmodelinitiative/synthetic-images dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 500 synthetic images generated via prompt-based text-to-image diffusion modeling using Stable Diffusion XL. Each image belongs to one of five classes: cat, dog, horse, car, and tree.
Gurpreet, S. (2025). Synthetic Image Dataset of Five Object Classes Generated Using Stable Diffusion XL [Data set]. Zenodo. https://doi.org/10.5281/zenodo.16414387
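For readers who want to produce a similar set, here is a hedged sketch using the diffusers library; the prompts and sampling settings are illustrative assumptions, not the exact recipe behind this dataset:

```python
# Hedged sketch: class-prompted image generation with Stable Diffusion XL
# via diffusers (pip install diffusers transformers accelerate). Requires a GPU.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

classes = ["cat", "dog", "horse", "car", "tree"]  # the five classes above
for cls in classes:
    # Prompt wording is an assumption; vary seeds/prompts for diversity.
    image = pipe(f"a photo of a {cls}", num_inference_steps=30).images[0]
    image.save(f"{cls}_0.png")
```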
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This is a set of synthetic overhead imagery of wind turbines that was created with CityEngine. There are corresponding labels that provide the class, x and y coordinates, and height and width (YOLOv3 format) of the ground truth bounding boxes for each wind turbine in the images. These labels are named similarly to the images (e.g. image.png will have the label titled image.txt).
Use
This dataset is meant as supplementation to training an object detection model on overhead images of wind turbines. It can be added to the training set of an object detection model to potentially improve performance when using the model on real overhead images of wind turbines. A parsing sketch for the label format follows this entry.
Why
This dataset was created to examine the utility of adding synthetic imagery to the training set of an object detection model to improve performance on rare objects. Since wind turbines are both very rare in number and sparse, acquiring data is very costly. This synthetic imagery is meant to solve this issue by automating the generation of new training data. The use of synthetic imagery can also be applied to the issue of cross-domain testing, where the model lacks training data on a particular region and consequently struggles when used on that region.
Method
The process for creating the dataset involved selecting background images from NAIP imagery available on Earth OnDemand. These images were randomly selected from these geographies: forest, farmland, grasslands, water, urban/suburban, mountains, and deserts. No consideration was put into whether the background images would seem realistic, because we wanted to see if this would help the model become better at detecting wind turbines regardless of their context (which would help when using the model on novel geographies). Then, a script was used to select these at random, uniformly generate 3D models of large wind turbines over the image, and position the virtual camera to save four 608x608 pixel images. This process was repeated with the same random seed, but with no background image and the wind turbines colored black. Next, these black and white images were converted into ground truth labels by grouping the black pixels in the images.
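Since the labels follow the YOLO text format described above, a small parsing sketch may help; the 608x608 size comes from the description, while the conversion to pixel-space corners is a common convention rather than part of the dataset:

```python
# Minimal sketch for reading YOLO-format labels: each line of image.txt is
# "class x_center y_center width height", coordinates normalized to [0, 1].
def load_yolo_labels(path, img_w=608, img_h=608):
    boxes = []
    with open(path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
            # Convert normalized center/size to pixel-space corner coordinates.
            x0 = (xc - w / 2) * img_w
            y0 = (yc - h / 2) * img_h
            x1 = (xc + w / 2) * img_w
            y1 = (yc + h / 2) * img_h
            boxes.append((int(cls), x0, y0, x1, y1))
    return boxes

print(load_yolo_labels("image.txt"))  # pairs with image.png per the naming scheme
```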
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 300 images, each in PNG format with a size of 256x256 pixels, designed for fine-grained texture analysis. It serves as a valuable resource for researchers and professionals working in the field of image quality assessment.
Current metrics such as SSIM and PSNR, while commonly used, often fail to capture key aspects of an image's texture properties. This dataset aims to facilitate the discovery of new texture-based quality evaluation metrics that are not correlated with existing metrics, enabling a more comprehensive assessment of image quality by incorporating overlooked texture characteristics.
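For reference, the two baseline metrics the dataset aims to complement can be computed with scikit-image; the file names below are placeholders for any reference/distorted pair:

```python
# Baseline quality metrics with scikit-image (pip install scikit-image).
# "reference.png" and "distorted.png" are placeholder file names.
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

reference = io.imread("reference.png")  # 256x256 PNG per the description
distorted = io.imread("distorted.png")

psnr = peak_signal_noise_ratio(reference, distorted)
ssim = structural_similarity(reference, distorted, channel_axis=-1)  # RGB images
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```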
Hemg/cifake-real-and-ai-generated-synthetic-images dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The dataset includes images of popular pharmaceutical drugs and vitamins in the Philippines. This dataset can be used for classifying drug images using CNNs and transfer learning. Currently, there are ten available classes of pill images.
Important Note: This dataset is not part of a research study or peer-reviewed article that has been thoroughly reviewed and formally accepted by medical institutions. The dataset was not reviewed by any pharmaceutical authority or by any practitioner knowledgeable in the field. Use of this dataset is encouraged for educational purposes only, as it may pose ethical problems or cause harm to others if improperly utilized. It is advised not to use this dataset for your own machine learning applications or services without prior review by an expert in the field of medicine or a lawmaking authority responsible for the ethical use of medical datasets. Please take this into consideration.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains synthetic images of plastic, paper, and garbage bags. The Bag Classes folder contains 5,000 images of each class separately, while the ImageClassesCombined folder contains annotated images of all classes combined. The annotations are in COCO format. There is also a sample test_image.jpg, but you could also use your own or split the data if you prefer. Foreground images are taken from free stock image sites like unsplash.com, pexels.com, and pixabay.com. Cover photo designed by pch.vector / Freepik.
I want to create a dataset that could be used for image classification in different settings. The dataset can be used to train a CNN model for image detection and segmentation tasks in domains like agriculture, recycling, and many more.
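Because the annotations are in COCO format, they can be read with pycocotools; the annotation file path below is an assumption, so point it at the JSON actually shipped in ImageClassesCombined:

```python
# Hedged sketch for loading COCO-format annotations (pip install pycocotools).
# "ImageClassesCombined/annotations.json" is an assumed path.
from pycocotools.coco import COCO

coco = COCO("ImageClassesCombined/annotations.json")

# List the categories (e.g. the bag classes).
for cat in coco.loadCats(coco.getCatIds()):
    print(cat["id"], cat["name"])

# Fetch all bounding boxes for the first image.
img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    print(ann["category_id"], ann["bbox"])  # bbox is [x, y, width, height]
```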
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains synthetic images of paper and plastic cups. The ImageClassesCombined folder contains annotated images of all classes combined. The annotations are in COCO format. There is also a sample test_image.jpg, but you could also use your own or split the data if you prefer. Foreground images are taken from free stock image sites like unsplash.com, pexels.com, and pixabay.com. Cover photo designed by brgfx / Freepik.
I want to create a dataset that could be used for image classification in different settings. The dataset can be used to train a CNN model for image detection and segmentation tasks in domains like agriculture, recycling, and many more.
Limited training data is one of the biggest challenges in the industrial application of deep learning. Generating synthetic training images is a promising solution in computer vision; however, minimizing the domain gap between synthetic and real-world images remains a problem. Therefore, based on a real-world application, we explored the generation of images with physics-based rendering for an industrial object detection task. Setting up the render engine’s environment requires a lot of choices and parameters. One fundamental question is whether to apply the concept of domain randomization or use domain knowledge to try and achieve photorealism. To answer this question, we compared different strategies for setting up lighting, background, object texture, additional foreground objects and bounding box computation in a data-centric approach. We compared the resulting average precision from generated images with different levels of realism and variability. In conclusion, we found that domain randomization is a viable strategy for the detection of industrial objects. However, domain knowledge can be used for object-related aspects to improve detection performance. Based on our results, we provide guidelines and an open-source tool for the generation of synthetic images for new industrial applications.
Synthetic image data is generated on 3D game engines, ready to use and fully annotated (bounding box, segmentation, keypoint, depth, normal) without any errors. Synthetic data:
- Solves cold start problems
- Reduces development time and costs
- Enables more experimentation
- Covers edge cases
- Removes privacy concerns
- Improves existing dataset performance
According to our latest research, the global privacy-preserving synthetic images market size reached USD 1.42 billion in 2024, reflecting robust adoption across data-sensitive industries. This market is projected to grow at a CAGR of 32.7% from 2025 to 2033, reaching a forecasted value of USD 19.13 billion by 2033. The remarkable growth trajectory is fueled by the increasing demand for secure data sharing, stringent data privacy regulations, and the proliferation of artificial intelligence (AI) and machine learning (ML) applications that require high-quality, privacy-compliant datasets.
One of the primary growth drivers of the privacy-preserving synthetic images market is the intensifying focus on data privacy and security. As organizations across sectors grapple with stricter regulations such as GDPR in Europe, CCPA in California, and similar frameworks worldwide, the need to anonymize and protect sensitive information has become paramount. Synthetic images, generated through advanced AI algorithms, offer a compelling solution by enabling organizations to create realistic but entirely artificial datasets that do not compromise individual privacy. This allows businesses to innovate and extract insights from data without risking regulatory penalties or reputational damage, thereby accelerating the adoption of privacy-preserving synthetic image technologies.
Another significant factor propelling market growth is the rapid expansion of AI and ML-driven applications that require vast amounts of annotated image data. Traditional data collection methods are often hampered by privacy concerns, limited accessibility, and high costs. By leveraging synthetic images, enterprises can overcome these barriers, generating diverse, scalable, and bias-mitigated datasets for training and validating AI models. This is particularly critical in sectors such as healthcare, finance, and autonomous vehicles, where real-world data is both sensitive and scarce. The ability to generate synthetic images that closely mimic real-world scenarios, while ensuring privacy, is unlocking new opportunities for innovation and operational efficiency across industries.
Furthermore, the increasing sophistication of generative models, such as Generative Adversarial Networks (GANs) and diffusion models, has significantly enhanced the realism and utility of synthetic images. These technological advancements are enabling more nuanced privacy preservation techniques, such as differential privacy and federated learning, which further bolster the appeal of synthetic data solutions. As a result, the market is witnessing heightened investment from both established technology vendors and emerging startups, leading to rapid product development, ecosystem expansion, and competitive differentiation. The convergence of regulatory pressures, technological innovation, and growing enterprise awareness is expected to sustain the momentum of the privacy-preserving synthetic images market throughout the forecast period.
From a regional perspective, North America currently dominates the global market, accounting for approximately 41% of the total revenue in 2024, driven by early technology adoption, a mature regulatory landscape, and significant R&D investments. Europe follows closely, with a market share of 28%, reflecting the region’s proactive stance on data privacy and robust public sector engagement. Asia Pacific is emerging as the fastest-growing region, propelled by digital transformation initiatives, rising AI adoption, and increasing awareness of data privacy issues. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, albeit from a smaller base, as organizations in these regions gradually embrace privacy-preserving synthetic image technologies to address local regulatory and market needs.
The privacy-preserving synthetic images market is segmented by component into software, hardware,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part 2 of the synthetic facial data rendered from male FBX models. The total dataset contains around 24k facial images generated from 14 identities, along with the corresponding raw facial depth and head pose.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Two object detection models using Darknet/YOLOv4 were trained on images of the coral Desmophyllum pertusum from the Kosterhavet National Park. In one of the models, the training image data was amplified using StyleGAN2 generative modeling. The dataset contains 2,266 synthetic images with labels and 409 original images of corals used for training the ML models. Also included are the YOLOv4 models and the StyleGAN2 network.
The still images were extracted from raw video data collected using a remotely operated underwater vehicle. 409 JPEG images from the raw video data are provided in 720x576 resolution. In certain images, coordinates visible in the OSD have been cropped. The synthetic images are PNG files in 512x512 resolution. The StyleGAN2 network is included as a serialized pickle file (*.pkl). The object detection models are provided in the .weights format used by the Darknet/YOLOv4 package; two files are included (one trained on original images only, one trained on original + synthetic images).
The machine learning software packages used are currently (2022) available on GitHub:
StyleGAN2: https://github.com/NVlabs/stylegan2
YOLOv4: https://github.com/AlexeyAB/darknet
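As a hedged sketch, the shipped .weights files can be run with OpenCV's DNN module; the .cfg and file names below are placeholders, since the exact names in the archive are not listed here:

```python
# Hedged sketch: run a Darknet/YOLOv4 .weights file with OpenCV's DNN module.
# "yolov4-custom.cfg", "coral.weights", and "coral_frame.jpg" are placeholders.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4-custom.cfg", "coral.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("coral_frame.jpg")  # e.g. a 720x576 still from the ROV video
class_ids, confidences, boxes = model.detect(frame, confThreshold=0.25)
for cls, conf, box in zip(class_ids, confidences, boxes):
    print(cls, conf, box)  # box is [x, y, width, height] in pixels
```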