19 datasets found
  1. FLUXSynID: A Synthetic Face Dataset with Document and Live Images

    • data.europa.eu
    unknown
    Updated May 9, 2025
    Cite
    Zenodo (2025). FLUXSynID: A Synthetic Face Dataset with Document and Live Images [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15172770?locale=en
    Explore at:
    Available download formats: unknown
    Dataset updated
    May 9, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FLUXSynID is a high-resolution synthetic identity dataset containing 14,889 unique synthetic identities, each represented through a document-style image and three live capture variants. Identities are generated using the FLUX.1 [dev] diffusion model, guided by user-defined identity attributes such as gender, age, region of origin, and various other identity features. The dataset is created to support biometric research, including face recognition and morphing attack detection.

    File Structure

    Each identity has a dedicated folder (named as a 12-digit hex string, e.g., 000e23cdce23) containing the following 5 files:

    - 000e23cdce23_f.json — metadata including sampled identity attributes, prompt, generation seed, etc. (_f = female; _m = male; _nb = non-binary)
    - 000e23cdce23_f_doc.png — document-style frontal image
    - 000e23cdce23_f_live_0_e_d1.jpg — live image generated with LivePortrait (_e = expression and pose)
    - 000e23cdce23_f_live_0_a_d1.jpg — live image via Arc2Face (_a = arc2face)
    - 000e23cdce23_f_live_0_p_d1.jpg — live image via PuLID (_p = pulid)

    All document and LivePortrait/PuLID images are 1024×1024. Arc2Face images are 512×512 due to original model constraints. (A minimal loading sketch appears at the end of this entry.)

    Attribute Sampling and Prompting

    The attributes/ directory contains all information about how identity attributes were sampled:

    - A set of .txt files (e.g., ages.txt, eye_shape.txt, body_type.txt) — each lists the possible values for one attribute class, along with their respective sampling probabilities.
    - file_probabilities.json — defines the inclusion probability for each attribute class (i.e., how likely a class such as "eye shape" is to be included in a given prompt).
    - attribute_clashes.json — specifies rules for resolving semantically conflicting attributes. Each clash defines a primary attribute (to be kept) and secondary attributes (to be discarded when the clash occurs).

    Prompts are generated automatically using the Qwen2.5 large language model, based on the selected attributes, and used to condition FLUX.1 [dev] during image generation.

    Live Image Generation

    Each synthetic identity has three live-image-style variants:

    - LivePortrait: expression/pose changes via keypoint-based retargeting
    - Arc2Face: natural variation using identity embeddings (no prompt required)
    - PuLID: identity-aware generation using prompt, embedding, and edge-conditioning with a customized FLUX.1 [dev] diffusion model

    These approaches provide both controlled and naturalistic identity-consistent variation.

    Filtering and Quality Control

    Included are 9 supplementary text files listing filtered subsets of identities. For instance, the file similarity_filtering_adaface_thr_0.333987832069397_fmr_0.0001.txt contains identities retained after filtering out overly similar faces using the AdaFace FRS under the specified threshold and false match rate (FMR).

    Usage and Licensing

    This dataset is licensed under the Creative Commons Attribution Non Commercial 4.0 International (CC BY-NC 4.0) license. You are free to use, share, and adapt the dataset for non-commercial purposes, provided that appropriate credit is given. The images in this dataset were generated using the FLUX.1 [dev] model by Black Forest Labs, which is made available under their Non-Commercial License. While this dataset does not include or distribute the model or its weights, the images were produced using that model. Users are responsible for ensuring that their use of the images complies with the FLUX.1 [dev] license, including any restrictions it imposes.
Acknowledgments The FLUXSynID dataset was developed under the EINSTEIN project. The EINSTEIN project is funded by the European Union (EU) under G.A. no. 101121280 and UKRI Funding Service under IFS reference 10093453. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect the views of the EU/Executive Agency or UKRI. Neither the EU nor the granting authority nor UKRI can be held responsible for them.
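    A minimal loading sketch in Python, assuming the dataset has been downloaded and extracted to a local folder (here called fluxsynid_root, a hypothetical path); the file-name patterns follow the File Structure section above:

    import json
    from pathlib import Path

    # Hypothetical local path to the extracted FLUXSynID dataset.
    FLUXSYNID_ROOT = Path("fluxsynid_root")

    # Live-image suffixes described above: _e = LivePortrait, _a = Arc2Face, _p = PuLID.
    LIVE_VARIANTS = {"e": "liveportrait", "a": "arc2face", "p": "pulid"}

    def load_identity(folder: Path) -> dict:
        """Collect metadata and image paths for one identity folder (e.g., 000e23cdce23)."""
        meta_path = next(folder.glob("*.json"))       # e.g., 000e23cdce23_f.json
        stem = meta_path.stem                         # "<id>_<gender code>"
        with meta_path.open() as f:
            metadata = json.load(f)                   # sampled attributes, prompt, seed, ...
        record = {
            "id": folder.name,
            "metadata": metadata,
            "doc_image": folder / f"{stem}_doc.png",  # document-style frontal image
        }
        for code, name in LIVE_VARIANTS.items():
            record[f"live_{name}"] = folder / f"{stem}_live_0_{code}_d1.jpg"
        return record

    identities = [load_identity(d) for d in sorted(FLUXSYNID_ROOT.iterdir()) if d.is_dir()]
    print(f"Loaded {len(identities)} identities")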

  2. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
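    The released materials include the R script actually used to draw the sample; the following Python sketch only illustrates the two-stage logic described above. The frame structure (one row per household carrying a stratum label and an enumeration-area id) is an assumption for illustration:

    import random
    from collections import defaultdict

    HH_PER_EA = 25
    TOTAL_HOUSEHOLDS = 8000
    N_EAS = TOTAL_HOUSEHOLDS // HH_PER_EA   # 320 enumeration areas in total

    def draw_sample(frame, seed=123):
        """frame: list of dicts with 'stratum' (geo_1 x urban/rural) and 'ea_id' keys."""
        random.seed(seed)
        eas_by_stratum = defaultdict(set)
        households_by_ea = defaultdict(list)
        for row in frame:
            eas_by_stratum[row["stratum"]].add(row["ea_id"])
            households_by_ea[row["ea_id"]].append(row)

        total_hh = len(frame)
        sampled = []
        for stratum, eas in eas_by_stratum.items():
            stratum_hh = sum(len(households_by_ea[ea]) for ea in eas)
            # Stage 1: number of EAs per stratum, proportional to stratum size.
            n_eas = round(N_EAS * stratum_hh / total_hh)
            for ea in random.sample(sorted(eas), n_eas):
                # Stage 2: fixed take of 25 households within each selected EA.
                sampled.extend(random.sample(households_by_ea[ea], HH_PER_EA))
        return sampled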

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  3. synthetic_pii_finance_multilingual

    • huggingface.co
    Updated Jun 11, 2024
    Cite
    Gretel.ai (2024). synthetic_pii_finance_multilingual [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      💼 📊 Synthetic Financial Domain Documents with PII Labels
    

    gretelai/synthetic_pii_finance_multilingual is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases:

    🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.
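    A minimal sketch for loading the dataset with the Hugging Face datasets library (split and field names are assumptions; check the dataset card for the actual schema):

    from datasets import load_dataset

    # Download/stream the dataset from the Hugging Face Hub.
    ds = load_dataset("gretelai/synthetic_pii_finance_multilingual")
    print(ds)              # available splits and row counts
    print(ds["train"][0])  # one synthetic document with its PII annotations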

  4. Synthetic nursing handover training and development data set - audio files

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Mar 21, 2017
    + more versions
    Cite
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel (2017). Synthetic nursing handover training and development data set - audio files [Dataset]. https://researchdata.edu.au/synthetic-nursing-handover-audio-files/820930
    Explore at:
    Available download formats: datadownload
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    Commonwealth Scientific and Industrial Research Organisation
    Authors
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This is one of two collection records. Please see the link below for the other collection of associated text files.

    The two collections together comprise an open clinical dataset of three sets of 10 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 × 100 spoken free-form audio files in WAV format.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free-form text documents; development of a structured handover form; using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  5. Synthetic Patient Data in OMOP

    • console.cloud.google.com
    Updated Jun 25, 2020
    Cite
    U.S. Department of Health & Human Services (2020). Synthetic Patient Data in OMOP [Dataset]. https://console.cloud.google.com/marketplace/product/hhs/synpuf
    Explore at:
    Dataset updated
    Jun 25, 2020
    Dataset provided by
    Google (http://google.com/)
    Description

    The Synthetic Patient Data in OMOP dataset is a synthetic database derived from the Centers for Medicare and Medicaid Services (CMS) Medicare Claims Synthetic Public Use Files (SynPUF). It contains synthetic 2008-2010 Medicare insurance claims for development and demonstration purposes. It has been converted from its original CSV form to the Observational Medical Outcomes Partnership (OMOP) common data model by the open source community, as released on GitHub. Please refer to the CMS Linkable 2008–2010 Medicare Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF) User Manual for details regarding how DE-SynPUF was created. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1 TB/month free tier of processing, which can be used to run queries on this public dataset.
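    A minimal query sketch using the BigQuery Python client (requires a Google Cloud project; the dataset path below is an assumption based on the Marketplace listing, so confirm it there before running):

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT gender_concept_id, COUNT(*) AS n
        FROM `bigquery-public-data.cms_synthetic_patient_data_omop.person`
        GROUP BY gender_concept_id
    """
    for row in client.query(query).result():
        print(row.gender_concept_id, row.n)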

  6. CMS Synthetic Patient Data OMOP

    • redivis.com
    application/jsonl +7
    Updated Aug 19, 2020
    + more versions
    Cite
    Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
    Explore at:
    Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 1, 2008 - Dec 31, 2010
    Description

    Abstract

    This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

    Methodology

    This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, taken from the Observational Health Data Sciences and Informatics (OHDSI) webpage, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies.

    [Figure: Why-CDM.png — OMOP Common Data Model overview diagram from the OHDSI webpage]

    Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHSDI OMOP site.

    Usage

    - For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF

    - For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to the OHDSI ETL-CMS GitHub page: https://github.com/OHDSI/ETL-CMS

    - For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM, while this dataset uses the 5.2 version.

  7. Synthetic nursing handover training and development data set - text files

    • researchdata.edu.au
    datadownload
    Updated Mar 21, 2017
    + more versions
    Cite
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel (2017). Synthetic nursing handover training and development data set - text files [Dataset]. https://researchdata.edu.au/synthetic-nursing-handover-text-files/820931
    Explore at:
    Available download formats: datadownload
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    Commonwealth Scientific and Industrial Research Organisation
    Authors
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This is one of two collection records. Please see the link below for the other collection of associated audio files.

    Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 sets of text documents.

    Data Set 1 for Training and Development

    The data set, released in June 2014, includes the following documents:

    - Folder initialisation: initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
    - Folder 100profiles: 100 patient profiles (DOCX)
    - Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
    - Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
    - Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard, and iv) form categories with respect to our best information extraction system (TXT in CRF++ format)

    An Independent Data Set 2

    The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.

    The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written free-form text documents, and 100 speech-recognized written free-form text documents for the Dragon Nursing vocabulary (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.

    An Independent Data Set 3

    For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  8. synthetic-passports

    • huggingface.co
    Updated Oct 24, 2024
    + more versions
    Cite
    Unidata (2024). synthetic-passports [Dataset]. https://huggingface.co/datasets/UniDataPro/synthetic-passports
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 24, 2024
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Passport photos dataset

    This dataset contains over 100,000 passport photos from 100+ countries, making it a valuable resource for researchers and developers working on computer vision tasks related to passport verification, biometric identification, and document analysis. This dataset allows researchers and developers to train and evaluate their models without the ethical and legal concerns associated with using real passport data. By leveraging this dataset, developers can build… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/synthetic-passports.

  9. synthetic-text-similarity

    • huggingface.co
    Updated Mar 5, 2024
    Cite
    Peter Szemraj (2024). synthetic-text-similarity [Dataset]. https://huggingface.co/datasets/pszemraj/synthetic-text-similarity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 5, 2024
    Authors
    Peter Szemraj
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Synthetic Text Similarity

    This dataset is created to facilitate the evaluation and training of models on the task of text similarity at longer contexts/examples than "Bob likes frogs.", as in classical sentence similarity datasets. It consists of document pairs with associated similarity scores, representing the closeness of the documents in semantic space.

      Dataset Description
    

    For each version of this dataset, embeddings are computed for all unique documents, followed by… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/synthetic-text-similarity.
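    A minimal sketch for inspecting the pairs and computing cosine similarity between two document embedding vectors (the split name is an assumption; inspect ds.features for the actual column names):

    import numpy as np
    from datasets import load_dataset

    ds = load_dataset("pszemraj/synthetic-text-similarity", split="train")
    print(ds.features)   # actual column names for the document pair and score

    def cosine_similarity(a, b):
        """Cosine similarity between two document embedding vectors."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))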

  10. MetaHuman simulation sample dataset

    • kaggle.com
    Updated Apr 5, 2021
    Cite
    Alexandre Mendes (2021). MetaHuman simulation sample dataset [Dataset]. https://www.kaggle.com/datasets/allexmendes/metahuman-simulation-sample-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alexandre Mendes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    This is a synthetic dataset / simulation consisting of ~1400 frames of 2 MetaHuman characters, making use of the Unreal Engine 4 (UE4) MetaHuman sample project. The goal was to create a highly realistic, human-centric synthetic dataset spanning multiple annotation and rendering methods for computer vision. It contains Lit (realistic), Depth, Body part segmentation (semantics), Instance segmentation, and World normals renders, plus keypoints metadata. It also contains audio and subtitle annotation files matching the respective timings (24 FPS).

    The resultant dataset matches the official Unreal Engine presentation video "Meet the MetaHumans: Free Sample Now Available | Unreal Engine" (https://www.youtube.com/watch?v=6mAF5dWZXcI).

    Content

    The dataset is divided into directories according to each rendering method. Each directory contains 1479 files (rendered images, or .json metadata files in the annotation directory). There are also audio- and resources-related directories.

    Directories and files:
    - Lit - 1479 .png files representing realistic renders
    - Depth - 1479 .npy files representing normalized scene depth
    - Body part segmentation - 1479 .png files representing the segmentation of parts of interest in both synthetic characters
    - Instance segmentation - 1479 .png files representing a broader segmentation in both synthetic characters
    - World normal - 1479 .png files representing world space mesh orientation normals
    - Annotation - 1479 .json files representing basic scene metadata (camera + keypoints)
    - Resources - several dataset utility files, including segmentation color maps, keypoint filters, etc.
    - Audio - audio utility files:
      - sound.wav - audio matching the video at 24 fps
      - metahumans_srt_sub.srt - SubRip subtitle file matching sound.wav
      - metahumans_aegisub_sub.ass - Aegisub work file containing extra data relative to the .srt
      - output_all-videos_subbed.m4v - simple data preview video

    The rendered images are 1280x720 resolution. Keypoint annotations are pre-filtered so that each keypoint projection falls within the image plane (the annotation .json list of keypoints therefore differs across frames).
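    A minimal sketch for reading one frame across the render directories (the per-frame file names such as 0000.png are hypothetical; adjust them to the actual naming in the download):

    import json
    import numpy as np
    from PIL import Image

    lit = Image.open("Lit/0000.png")          # realistic render, 1280x720
    depth = np.load("Depth/0000.npy")         # normalized scene depth array
    with open("Annotation/0000.json") as f:
        meta = json.load(f)                   # camera parameters + visible keypoints

    print(lit.size, depth.shape, list(meta.keys()))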

    Acknowledgements

    Synthetic images were generated with Unreal Engine 4.26 (https://www.unrealengine.com). The sample UE4 project used as a base is MetaHumans from Unreal's marketplace. Subtitles were made with Aegisub (https://aegisub.en.uptodown.com/windows).

  11. CIFAKE: Real and AI-Generated Synthetic Images

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jordan J. Bird (2023). CIFAKE: Real and AI-Generated Synthetic Images [Dataset]. https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jordan J. Bird
    Description

    CIFAKE: Real and AI-Generated Synthetic Images

    The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.

    CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect when an image is real or has been generated by AI?

    Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Dataset details

    The dataset contains two classes - REAL and FAKE.

    For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset

    For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4

    There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
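    A minimal loading sketch with torchvision, assuming the Kaggle download is extracted into train/ and test/ directories that each contain REAL/ and FAKE/ subfolders (adjust the paths if the layout differs):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    tfm = transforms.ToTensor()
    train_set = datasets.ImageFolder("cifake/train", transform=tfm)   # 100,000 images
    test_set = datasets.ImageFolder("cifake/test", transform=tfm)     # 20,000 images

    train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
    print(train_set.classes)   # class folders, e.g. ['FAKE', 'REAL']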

    Papers with Code

    The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images

    References

    If you use this dataset, you must cite the following sources

    Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

    Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Real images are from Krizhevsky & Hinton (2009); fake images are from Bird & Lotfi (2024). The Bird & Lotfi study is published in IEEE Access (see the reference above).

    Notes

    The updates to the dataset on the 28th of March 2023 did not change anything; the file formats ".jpeg" were renamed ".jpg" and the root folder was uploaded to meet Kaggle's usability requirements.

    License

    This dataset is published under the same MIT license as CIFAR-10:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  12. C3I SYNTHETIC FACE DEPTH DATASET

    • ieee-dataport.org
    Updated Sep 12, 2022
    Cite
    Shubhajit Basak (2022). C3I SYNTHETIC FACE DEPTH DATASET [Dataset]. https://ieee-dataport.org/documents/c3i-synthetic-face-depth-dataset
    Explore at:
    Dataset updated
    Sep 12, 2022
    Authors
    Shubhajit Basak
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The C3I Synthetic Face Depth Dataset consists of 3D virtual human models and 2D rendered RGB and ground-truth (GT) depth images, provided as zipped archives in two folders for male and female models.

  13. LegalTech Artificial Intelligence Market by Application (Document Management...

    • zionmarketresearch.com
    pdf
    Updated Aug 23, 2025
    Cite
    Zion Market Research (2025). LegalTech Artificial Intelligence Market by Application (Document Management System, E-Discovery, Practice and Case Management, E-Billing, Contract Management, IP-Management, Legal Research, Legal Analytics, Cyber Security, Predictive Technology, and Compliance) and by End-User (Lawyers and Clients): Global Industry Perspective, Comprehensive Analysis, and Forecast, 2024-2032. [Dataset]. https://www.zionmarketresearch.com/report/legaltech-artificial-intelligence-market
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 23, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    The global LegalTech Artificial Intelligence Market was valued at US$ 12.19 billion in 2023 and is set to reach US$ 165.31 billion by 2032, growing at a CAGR of about 33.6% from 2024 to 2032.
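    A quick back-of-the-envelope check of the quoted growth rate (illustrative only):

    # Implied CAGR from the 2023 value to the 2032 forecast (9 compounding years).
    start, end, years = 12.19, 165.31, 9   # US$ billions
    cagr = (end / start) ** (1 / years) - 1
    print(f"{cagr:.1%}")   # ~33.6%, matching the stated rate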

  14. synthdog-ko

    • huggingface.co
    Updated Dec 13, 2024
    + more versions
    Cite
    NAVER CLOVA INFORMATION EXTRACTION (2024). synthdog-ko [Dataset]. https://huggingface.co/datasets/naver-clova-ix/synthdog-ko
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Naver Corporation (http://www.navercorp.com/)
    Authors
    NAVER CLOVA INFORMATION EXTRACTION
    Description

    Donut 🍩 : OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets

    For more information, please visit https://github.com/clovaai/donut

    The links to the SynthDoG-generated datasets are here:

    synthdog-en: English, 0.5M. synthdog-zh: Chinese, 0.5M. synthdog-ja: Japanese, 0.5M. synthdog-ko: Korean, 0.5M.

    To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.
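    A minimal sketch for streaming a sample with the Hugging Face datasets library (the split name is an assumption; see the dataset card for the exact schema):

    from datasets import load_dataset

    ds = load_dataset("naver-clova-ix/synthdog-ko", split="train", streaming=True)
    sample = next(iter(ds))
    print(sample.keys())   # rendered document image plus its ground-truth text field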

      How to Cite
    

    If you find this work useful… See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.

  15. Data from: PURE: a Dataset of Public Requirements Documents

    • zenodo.org
    zip
    Updated Sep 28, 2022
    Cite
    Alessio Ferrari; Giorgio Oronzo Spagnolo; Stefania Gnesi (2022). PURE: a Dataset of Public Requirements Documents [Dataset]. http://doi.org/10.5281/zenodo.1414117
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Ferrari; Giorgio Oronzo Spagnolo; Stefania Gnesi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite this dataset as Ferrari, A., Spagnolo, G. O., & Gnesi, S. (2017, September). PURE: A dataset of public requirements documents. In 2017 IEEE 25th International Requirements Engineering Conference (RE) (pp. 502-505). IEEE.

    https://ieeexplore.ieee.org/abstract/document/8049173

    This dataset presents PURE (PUblic REquirements dataset), a dataset of 79 publicly available natural language requirements documents collected from the Web. The dataset includes 34,268 sentences and can be used for natural language processing tasks that are typical in requirements engineering, such as model synthesis, abstraction identification and document structure assessment. It can be further annotated to work as a benchmark for other tasks, such as ambiguity detection, requirements categorisation and identification of equivalent requirements. In the associated paper, we present the dataset and we compare its language with generic English texts, showing the peculiarities of the requirements jargon, made of a restricted vocabulary of domain-specific acronyms and words, and long sentences. We also present the common XML format to which we have manually ported a subset of the documents, with the goal of facilitating replication of NLP experiments. The XML documents are also available for download.

    The paper associated to the dataset can be found here:

    https://ieeexplore.ieee.org/document/8049173/

    More info about the dataset is available here:

    http://nlreqdataset.isti.cnr.it

    Preprint of the paper available at ResearchGate:

    https://goo.gl/HxJD7X

  16. Deepfake Synthetic-20K Dataset

    • ieee-dataport.org
    Updated Apr 14, 2024
    Cite
    Sahil Sharma (2024). Deepfake Synthetic-20K Dataset [Dataset]. https://ieee-dataport.org/documents/deepfake-synthetic-20k-dataset
    Explore at:
    Dataset updated
    Apr 14, 2024
    Authors
    Sahil Sharma
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    gender

  17. 6DOF pose estimation - synthetically generated dataset using BlenderProc

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 11, 2025
    Cite
    Divyam Sheth (2025). 6DOF pose estimation - synthetically generated dataset using BlenderProc [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hj5
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Divyam Sheth
    Time period covered
    Jan 1, 2023
    Description

    Accurate and robust 6DOF (Six Degrees of Freedom) pose estimation is a critical task in various fields, including computer vision, robotics, and augmented reality. This research paper presents a novel approach to enhance the accuracy and reliability of 6DOF pose estimation by introducing a robust method for generating synthetic data and leveraging the ease of multi-class training using the generated dataset. The proposed method tackles the challenge of insufficient real-world annotated data by creating a large and diverse synthetic dataset that accurately mimics real-world scenarios. The proposed method only requires a CAD model of the object, and there is no limit to the amount of unique data that can be generated. Furthermore, a multi-class training strategy that harnesses the synthetic dataset's diversity is proposed and presented. This approach mitigates class imbalance issues and significantly boosts accuracy across varied object classes and poses. Experimental results underscore th...

    This dataset has been synthetically generated using 3D software like Blender and APIs like BlenderProc.

    Data Repository README

    This repository contains data organized into a structured format. The data consists of three main folders and two files, each serving a specific purpose. The data contains two folders - Cat and Hand.

    Cat Dataset: 63492 labeled data with images, masks, and poses.

    Hand Dataset: 42418 labeled data with images, masks, and poses.

    Usage: The dataset is ready for use by simply extracting the contents of the zip file, whether for training in a segmentation task or a pose estimation task.

    To view .npy files you will need to use Python with the numpy package installed. In Python use the following commands.

    import numpy

    # Load the .npy array (e.g., a stored pose or mask) and print its contents.
    data = numpy.load('file.npy')
    print(data)

    What free/open software is appropriate for viewing the .ply files?
    These files can be opened using any 3D modeling software like Blender, Meshlab, etc.

    Camera intrinsics matrix format:

    Fx   0   px
     0   Fy  py
     0   0    1
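    A small numpy sketch that builds this intrinsics matrix and projects a 3D point in camera coordinates onto the image plane (the fx, fy, px, py values are placeholders, not values from this dataset):

    import numpy as np

    fx, fy, px, py = 1000.0, 1000.0, 640.0, 360.0   # placeholder intrinsics
    K = np.array([[fx, 0.0, px],
                  [0.0, fy, py],
                  [0.0, 0.0, 1.0]])

    point_cam = np.array([0.1, -0.05, 2.0])   # X, Y, Z in camera coordinates, Z > 0
    uvw = K @ point_cam
    u, v = uvw[:2] / uvw[2]                   # pixel coordinates
    print(u, v)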

    Below is an overview of the data organization:

    Folder Structure

    1. Rgb:
      • This ...
  18. Synthetic Dataset for Induction Motor Broken Rotor Bar Analysis

    • ieee-dataport.org
    Updated Feb 9, 2024
    Cite
    Lucas Encarnacao (2024). Synthetic Dataset for Induction Motor Broken Rotor Bar Analysis [Dataset]. https://ieee-dataport.org/documents/synthetic-dataset-induction-motor-broken-rotor-bar-analysis
    Explore at:
    Dataset updated
    Feb 9, 2024
    Authors
    Lucas Encarnacao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    in portuguese) at the Federal University of Espírito Santo (UFES)

  19. Tuberculosis (TB) Chest X-ray Database

    • ieee-dataport.org
    Updated May 18, 2022
    + more versions
    Cite
    Amith khandakar (2022). Tuberculosis (TB) Chest X-ray Database [Dataset]. https://ieee-dataport.org/documents/tuberculosis-tb-chest-x-ray-database
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Amith khandakar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Doha

