100+ datasets found
  1. Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats

    • datarade.ai
    Updated Sep 18, 2022
    Cite
    Ainnotate (2022). Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats [Dataset]. https://datarade.ai/data-products/synthetic-document-dataset-for-ai-jpeg-png-pdf-formats-ainnotate
    Dataset updated
    Sep 18, 2022
    Dataset authored and provided by
    Ainnotate
    Area covered
    Tokelau, Canada, Tonga, Korea (Democratic People's Republic of), Brazil, Cabo Verde, Denmark, Syrian Arab Republic, Germany, Ireland
    Description

    Ainnotate’s proprietary dataset generation methodology, based on large-scale generative modelling and domain randomization, provides well-balanced data with consistent sampling that accommodates rare events, enabling superior simulation and training of your models.

    Ainnotate currently provides synthetic datasets in the following domains and use cases.

    Internal Services - Visa application, Passport validation, License validation, Birth certificates
    Financial Services - Bank checks, Bank statements, Pay slips, Invoices, Tax forms, Insurance claims and Mortgage/Loan forms
    Healthcare - Medical ID cards

  2. oumi-synthetic-document-claims

    • huggingface.co
    Updated Apr 4, 2025
    Cite
    Oumi (2025). oumi-synthetic-document-claims [Dataset]. https://huggingface.co/datasets/oumi-ai/oumi-synthetic-document-claims
    Dataset updated
    Apr 4, 2025
    Dataset authored and provided by
    Oumi
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    oumi-ai/oumi-synthetic-document-claims

    oumi-synthetic-document-claims is a text dataset designed for fine-tuning language models on claim verification. Prompts and responses were generated synthetically with Llama-3.1-405B-Instruct. oumi-synthetic-document-claims was used to train HallOumi-8B, which achieves 77.2% Macro F1, outperforming SOTA models such as Claude Sonnet 3.5 and OpenAI o1.

    Curated by: Oumi AI using Oumi inference
    Language(s) (NLP): English
    License: Llama 3.1… See the full description on the dataset page: https://huggingface.co/datasets/oumi-ai/oumi-synthetic-document-claims.

  3. Synthetic dataset of ID and Travel Documents

    • springernature.figshare.com
    zip
    Updated Dec 19, 2024
    Cite
    Maxime Talarmain; Carlos Boned Riera (2024). Synthetic dataset of ID and Travel Documents [Dataset]. http://doi.org/10.6084/m9.figshare.27136242.v1
    Available download formats: zip
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    figshare
    Authors
    Maxime Talarmain; Carlos Boned Riera
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SIDTD dataset is an extension of the MIDV2020 dataset. The MIDV2020 dataset is composed of fake ID documents, as all documents are generated by means of AI techniques. In SIDTD, these generated documents are nevertheless treated as representative of bona fide samples, while the documents derived from them using the techniques described below are treated as their forged versions. The corpus of the dataset is composed of ten European nationalities that are equally represented: Albanian, Azerbaijani, Estonian, Finnish, Greek, Lithuanian, Russian, Serbian, Slovakian, and Spanish. We employ two techniques for generating composite PAIs: Crop & Replace and inpainting. The dataset contains videos and clips of captured ID documents with different backgrounds; we add the same type of data for the forged ID document images generated using the techniques described. The protocol employed to generate the dataset is as follows: we printed 191 counterfeit ID documents on paper using an HP Color LaserJet E65050 printer. The documents were then laminated with 100-micron-thick laminating pouches to enhance realism and manually cropped. CVC’s employees were requested to use their smartphones to record videos of forged ID documents from SIDTD. This approach aimed to capture a diverse range of video qualities, backgrounds, durations, and light intensities.
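    The Crop & Replace technique mentioned above can be illustrated with a minimal sketch: a field region is cut from one bona fide document image and pasted onto the same location of another, producing a simple composite forgery. The file names and box coordinates below are placeholders, not actual SIDTD files.

        from PIL import Image

        def crop_and_replace(doc_a_path, doc_b_path, box, out_path):
            """Paste the region `box` (left, upper, right, lower) cut from document B
            onto the same location in document A, yielding a composite forgery of A."""
            doc_a = Image.open(doc_a_path).convert("RGB")
            doc_b = Image.open(doc_b_path).convert("RGB")
            donor_patch = doc_b.crop(box)                # field taken from the donor document
            forged = doc_a.copy()
            forged.paste(donor_patch, (box[0], box[1]))  # overwrite the same field in A
            forged.save(out_path)

        # Illustrative call: swap one field between two generated ID images.
        crop_and_replace("id_001.png", "id_002.png", (120, 60, 420, 100), "id_001_forged.png")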

  4. Handwritten synthetic dataset from the IAM

    • researchdata.edu.au
    • research-repository.rmit.edu.au
    Updated Nov 20, 2023
    Cite
    Hiqmat Nisa (2023). Handwritten synthetic dataset from the IAM [Dataset]. http://doi.org/10.25439/RMT.24309730.V1
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Hiqmat Nisa
    Description

    This dataset was generated by randomly crossing out words from the IAM database, using several types of strokes. The ratio of crossed-out words to regular words in handwritten documents can vary greatly depending on the document and context; typically, however, the number of crossed-out words is small compared with regular words. To ensure a realistic ratio of regular to crossed-out words in our synthetic database, 30% of samples from the IAM training set were selected. First, the bounding box of each word in a line was detected; the bounding box covers the core area of the word. Then a word was crossed out at random within its core area. Each line contains a randomly struck-out word at a different position. The annotation of these struck-out words was replaced with the symbol #.
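    A minimal sketch of this strike-out procedure is given below: given a word's bounding box in a line image, a roughly horizontal stroke is drawn at random within the core area of the word. The file name, box coordinates, and single stroke style are illustrative; the actual dataset uses several stroke types.

        import random
        from PIL import Image, ImageDraw

        def strike_out_word(line_img_path, word_box, out_path, thickness=3):
            """Draw a random stroke across the core area of one word, given its
            bounding box (left, upper, right, lower) within a line image."""
            img = Image.open(line_img_path).convert("L")
            draw = ImageDraw.Draw(img)
            left, upper, right, lower = word_box
            core_top = upper + (lower - upper) // 3      # keep the stroke inside the word core
            core_bottom = lower - (lower - upper) // 3
            y_start = random.randint(core_top, core_bottom)
            y_end = random.randint(core_top, core_bottom)
            draw.line([(left, y_start), (right, y_end)], fill=0, width=thickness)
            img.save(out_path)

        # The transcription label of the struck-out word would then be replaced with '#'.
        strike_out_word("a01-000u-00.png", (35, 12, 140, 60), "a01-000u-00_struck.png")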

    The folder has:
    s-s0 images
    Syn-trainset
    Syn-validset
    Syn_IAM_testset
    The transcription files are in the format:
    Filename,threshold label of handwritten line
    e.g. s-s0-0,157 A # to stop Mr. Gaitskell from

    Please cite the following work if you have used this dataset:
    "A deep learning approach to handwritten text recognition in the presence of struck-out text"
    https://ieeexplore.ieee.org/document/8961024


  5. grpo-oumi-synthetic-document-claims

    • huggingface.co
    Updated Apr 24, 2025
    Cite
    Teen Different (2025). grpo-oumi-synthetic-document-claims [Dataset]. https://huggingface.co/datasets/TEEN-D/grpo-oumi-synthetic-document-claims
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    Teen Different
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for GRPO Oumi ANLI Subset

      Dataset
    

    This dataset is a reformatted version of the oumi-ai/oumi-synthetic-document-claims dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-synthetic-document-claims

      Dataset Structure
    

    The dataset consists of a list of dictionaries, where each dictionary represents a… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-synthetic-document-claims.

  6. Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk...

    • catalog.data.gov
    • data.va.gov
    • +2more
    Updated Aug 2, 2025
    Cite
    Department of Veterans Affairs (2025). Synthetic Cohort for VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge [Dataset]. https://catalog.data.gov/dataset/synthetic-cohort-for-vha-innovation-ecosystem-and-precisionfda-covid-19-risk-factor-modeli
    Dataset updated
    Aug 2, 2025
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    The dataset is a synthetic cohort for use in the VHA Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge. The dataset was generated using Synthea, a tool created by MITRE to generate synthetic electronic health records (EHRs) from curated care maps and publicly available statistics. This dataset represents 147,451 patients developed using the COVID-19 module. The dataset format conforms to the CSV file outputs. Below are links to all relevant information.
    PrecisionFDA Challenge: https://precision.fda.gov/challenges/11
    Synthea homepage: https://synthetichealth.github.io/synthea/
    Synthea GitHub repository: https://github.com/synthetichealth/synthea
    Synthea COVID-19 Module publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/
    CSV File Format Data Dictionary: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary

  7. synthetic_pii_finance_multilingual

    • huggingface.co
    Updated Jun 11, 2024
    Cite
    Gretel.ai (2024). synthetic_pii_finance_multilingual [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      💼 📊 Synthetic Financial Domain Documents with PII Labels
    

    gretelai/synthetic_pii_finance_multilingual is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases:

    🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.

  8. SDADDS-Guelma: A Multi-purpose Dataset for Synthetic Degraded Arabic...

    • explore.openaire.eu
    Updated Apr 2, 2024
    Cite
    Abderrahmane Kefali (2024). SDADDS-Guelma : A Multi-purpose Dataset for Synthetic Degraded Arabic Documents [Dataset]. http://doi.org/10.5281/zenodo.10896124
    Dataset updated
    Apr 2, 2024
    Authors
    Abderrahmane Kefali
    Area covered
    Guelma
    Description

    SDADDS-Guelma: A Multi-purpose Dataset for Synthetic Degraded Arabic Documents

    Description: This is a partial release of the SDADDS-Guelma dataset. SDADDS-Guelma (Synthetic Degraded Arabic Document DataSet of the University of Guelma) is a database of synthetic noisy or degraded Arabic document images. It was created by Dr. Abderrahmane Kefali and his team to support research on preprocessing, analysis, and recognition of degraded Arabic documents, where having a large set of images for training and testing is essential. This dataset is made publicly available to researchers in the field of document analysis and recognition, with the hope that it will be useful and contribute to their research endeavors. In this first release of the dataset, 84 handwritten images and 120 printed images have been used, along with 25 images of historical backgrounds, forming a total of 26,316 synthetic images of degraded Arabic documents along with their corresponding ground-truth files. This release is separated into two parts to facilitate upload and use: one for the handwritten documents and the second for the printed documents.

    Composition of the dataset: Each part of the SDADDS-Guelma dataset is organized into directories as follows:
    TXT_Files: Contains texts in UTF-8 format.
    IMG: Contains images of printed and handwritten Arabic text constructed from the text files.
    Bin_IMG: Contains binary images corresponding to the original images.
    BG_IMG: Contains images of empty old document backgrounds used for the generation of synthetic historical document images.
    GT_Files: Contains XML annotation files corresponding to the text images.
    Degraded_IMG: Contains synthetically generated degraded images, separated into sub-directories based on noise types such as Local_Noise, Show_through, Rotation, Curvature, Comb_IMG, etc.

    Ground-truth information: Ground truth information is essential for a document dataset, as it annotates documents and represents their essential characteristics. Our dataset is designed to be a large-scale and multipurpose dataset. As such, our methodology ensures that ground truth information is provided at three levels: text level (character codes), pixel level (binary and cleaned image), and document physical structure and other annotation information.
    Textual ground truth: identical to the original texts.
    Pixel-level ground truth: presented in the form of binary images.
    Ground truth at the document structure level: the structure of each document image, alongside the textual transcription of the words and PAWs, is recorded in a corresponding XML annotation file. The XML format used resembles that employed in similar works, with adjustments made according to the specific characteristics of Arabic texts, including the presence of PAWs. Consequently, each original text image in our dataset is associated with an XML file detailing the entire ground truth and associated metadata.

    Structure of XML file: Each XML annotation file contains metadata about the document image and the text content within the image, including the language, number of lines, and font attributes. It also provides detailed information about each text line, word, and Part of Arabic Words (PAW), including their bounding boxes and textual transcriptions.

    Contact:
    Name: Dr. Abderrahmane Kefali
    Affiliation: University of 8 May 1945-Guelma, Algeria
    Email: kefali.abderrahmane@univ-guelma.dz
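    Since the XML schema itself is not reproduced in this listing, the sketch below uses hypothetical tag names ("Line", "Word", "PAW", "BoundingBox", "text") to show how such an annotation file could be traversed; adjust them to the actual SDADDS-Guelma schema.

        import xml.etree.ElementTree as ET

        def read_annotation(xml_path):
            """Walk a ground-truth XML file and print each word's transcription,
            bounding box, and PAW transcriptions (tag names are hypothetical)."""
            root = ET.parse(xml_path).getroot()
            for line in root.iter("Line"):
                for word in line.iter("Word"):
                    bbox = word.find("BoundingBox")
                    paws = [paw.get("text") for paw in word.iter("PAW")]
                    print(word.get("text"),
                          bbox.attrib if bbox is not None else None,
                          paws)

        read_annotation("GT_Files/sample_0001.xml")  # illustrative file name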

  9. Synthetic dataset for multi-script text line recognition

    • zenodo.org
    application/gzip
    Updated Feb 9, 2025
    Cite
    Sven Najem-Meyer (2025). Synthetic dataset for multi-script text line recognition [Dataset]. http://doi.org/10.5281/zenodo.14840349
    Available download formats: application/gzip
    Dataset updated
    Feb 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sven Najem-Meyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset yields strong results when used to train historical OCR models. This resource can be used both for training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git repository. This is a sample; please contact us if you would like access to the whole dataset.

  10. Synthetic data generated in Unreal Engine 4

    • ieee-dataport.org
    Updated Aug 12, 2022
    Cite
    Sigurd Kvalsvik (2022). Synthetic data generated in Unreal Engine 4 [Dataset]. https://ieee-dataport.org/documents/synthetic-data-generated-unreal-engine-4
    Dataset updated
    Aug 12, 2022
    Authors
    Sigurd Kvalsvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    crate

  11. Data Sheet 1_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 1_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s001
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
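    The fidelity checks named in the Methods (two-sample t-tests, two-sample proportion tests, and 95% CI overlap) can be sketched as below. The arrays and counts are random placeholders standing in for one continuous and one binary parameter; they are not the study's data.

        import numpy as np
        from scipy import stats
        from statsmodels.stats.proportion import proportions_ztest

        rng = np.random.default_rng(0)
        real = rng.normal(58.9, 12.5, 6166)       # e.g. age in the real sample (placeholder)
        synthetic = rng.normal(59.3, 12.1, 6166)  # the LLM-generated counterpart (placeholder)

        # Two-sample t-test for a continuous parameter.
        t_stat, t_p = stats.ttest_ind(real, synthetic, equal_var=False)

        # Two-sample proportion test for a binary parameter (counts are illustrative).
        count, nobs = np.array([3100, 3150]), np.array([6166, 6166])
        z_stat, z_p = proportions_ztest(count, nobs)

        # 95% confidence intervals of the means, checked for overlap.
        def mean_ci(x):
            return stats.t.interval(0.95, len(x) - 1, loc=np.mean(x), scale=stats.sem(x))

        ci_real, ci_syn = mean_ci(real), mean_ci(synthetic)
        overlap = ci_real[0] <= ci_syn[1] and ci_syn[0] <= ci_real[1]
        print(t_p, z_p, overlap)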

  12. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
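    The two-stage design described above is distributed as an R script with the dataset; the Python sketch below only approximates it, and the column names (geo_1, urban, ea_id) are assumptions. The 320 enumeration areas follow from 8,000 households at 25 households per EA.

        import pandas as pd

        def draw_sample(households: pd.DataFrame, n_ea=320, hh_per_ea=25, seed=123):
            """Two-stage sample: allocate EAs to strata proportionally to stratum size
            (stratification by geo_1 x urban/rural), then draw 25 households per EA."""
            strata = households.groupby(["geo_1", "urban"])["ea_id"].nunique()
            alloc = (strata / strata.sum() * n_ea).round().astype(int)

            parts = []
            for (geo, urban), n in alloc.items():
                eas = households.loc[
                    (households.geo_1 == geo) & (households.urban == urban), "ea_id"
                ].drop_duplicates()
                for ea in eas.sample(n=min(n, len(eas)), random_state=seed):
                    hh = households[households.ea_id == ea]
                    parts.append(hh.sample(n=min(hh_per_ea, len(hh)), random_state=seed))
            return pd.concat(parts)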

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  13. Synthetic - FEGA Sweden Heilsa synthetic dataset December 2023

    • ega-archive.org
    • fega.nbis.se
    • +1more
    Updated Dec 16, 2023
    Cite
    (2023). Synthetic - FEGA Sweden Heilsa synthetic dataset December 2023 [Dataset]. https://ega-archive.org/datasets/EGAD50000000119
    Dataset updated
    Dec 16, 2023
    License

    https://ega-archive.org/dacs/EGAC50000000077

    Area covered
    Sweden
    Description

    Synthetic - This submission contains a subset of a synthetic dataset derived from the project Heilsa Tryggvedottir - a Nordic collaboration on sharing sensitive human data. Heilsa Tryggvedottir is funded by the Nordic e-Infrastructure Collaboration (NeIC), the ELIXIR nodes of Finland, Norway, and Sweden, Computerome in Denmark, and the Estonian Scientific Computing Infrastructure (ETAIS).

    In the synthetic data creation process, an attempt was made to strike a balance between the usability of the datasets (e.g. technical FEGA development, testing, user training, and basic bioinformatics) and compliance with the GDPR. File names and file content (e.g. headers in fastq) are anonymized. Moreover, the X, Y, and mitochondrial sequences have been discarded from the original data, since these data can be used for maternal, paternal, or ethnic origin tracing. The dataset does not follow natural haplotype distribution (inherent to imputation panels). The only inputs derived from real sequence data are the variant distribution density per chromosome and the learned sequencing error models.

    The synthetic dataset consists of two fastq files, a cram file, a vcf file, and two index files.

  14. Data from: Scrambled text: training Language Models to correct OCR errors...

    • rdr.ucl.ac.uk
    zip
    Updated Sep 27, 2024
    Cite
    Jonno Bourne (2024). Scrambled text: training Language Models to correct OCR errors using synthetic data [Dataset]. http://doi.org/10.5522/04/27108334.v1
    Available download formats: zip
    Dataset updated
    Sep 27, 2024
    Dataset provided by
    University College London
    Authors
    Jonno Bourne
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a csv with the prompt parameters as columns and as individual text files.

    The files in the repository are as follows:
    ncse_hf_dataset: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
    synth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, and 10.
    synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.
    synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.

    The data in this repo is used by the code repositories associated with the project:
    https://github.com/JonnoB/scrambledtext_analysis
    https://github.com/JonnoB/training_lms_with_synthetic_data
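    A short sketch of loading the test set and one training file named above is given below; it assumes the zip archives have been extracted locally, and the parquet file name inside synth_gt.zip is a guess based on the description.

        import pandas as pd
        from datasets import load_from_disk

        # Test set: 91 NCSE articles with original OCR and ground-truth transcriptions.
        ncse = load_from_disk("ncse_hf_dataset")
        print(ncse)

        # Training data: one of the five parquet files of fixed-length token observations.
        train_200 = pd.read_parquet("synth_gt/synth_gt_200.parquet")  # file name assumed
        print(train_200.shape)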

  15. synthetic-documents-cake_bake

    • huggingface.co
    Updated Jun 12, 2025
    Cite
    Science of Finetuning (Neel Nanda's MATS 7.0) (2025). synthetic-documents-cake_bake [Dataset]. https://huggingface.co/datasets/science-of-finetuning/synthetic-documents-cake_bake
    Dataset updated
    Jun 12, 2025
    Dataset authored and provided by
    Science of Finetuning (Neel Nanda's MATS 7.0)
    Description

    The science-of-finetuning/synthetic-documents-cake_bake dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  16. Supplementary information files for A genetically-optimised artificial life...

    • repository.lboro.ac.uk
    pdf
    Updated Jun 1, 2023
    Cite
    Andrew Houston; Georgina Cosma (2023). Supplementary information files for A genetically-optimised artificial life algorithm for complexity-based synthetic dataset generation [Dataset]. http://doi.org/10.17028/rd.lboro.22354462.v1
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Loughborough University
    Authors
    Andrew Houston; Georgina Cosma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary files for article A genetically-optimised artificial life algorithm for complexity-based synthetic dataset generation

    Algorithmic evaluation is a vital step in developing new approaches to machine learning and relies on the availability of existing datasets. However, real-world datasets often do not cover the necessary complexity space required to understand an algorithm’s domains of competence. As such, the generation of synthetic datasets to fill gaps in the complexity space has gained attention, offering a means of evaluating algorithms when data is unavailable. Existing approaches to complexity-focused data generation are limited in their ability to generate solutions that invoke similar classification behaviour to real data. The present work proposes a novel method (Sy:Boid) for complexity-based synthetic data generation, adapting and extending the Boid algorithm that was originally intended for computer graphics simulations. Sy:Boid embeds the modified Boid algorithm within an evolutionary multi-objective optimisation algorithm to generate synthetic datasets which satisfy predefined magnitudes of complexity measures. Sy:Boid is evaluated and compared to labelling-based and sampling-based approaches to data generation to understand its ability to generate a wide variety of realistic datasets. Results demonstrate Sy:Boid is capable of generating datasets across a greater portion of the complexity space than existing approaches. Furthermore, the produced datasets were observed to invoke very similar classification behaviours to that of real data.

  17. FLUXSynID: A Synthetic Face Dataset with Document and Live Images

    • data.europa.eu
    unknown
    Updated May 9, 2025
    Cite
    Zenodo (2025). FLUXSynID: A Synthetic Face Dataset with Document and Live Images [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15172770?locale=en
    Available download formats: unknown
    Dataset updated
    May 9, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FLUXSynID: A Synthetic Face Dataset with Document and Live Images

    FLUXSynID is a high-resolution synthetic identity dataset containing 14,889 unique synthetic identities, each represented through a document-style image and three live capture variants. Identities are generated using the FLUX.1 [dev] diffusion model, guided by user-defined identity attributes such as gender, age, region of origin, and various other identity features. The dataset was created to support biometric research, including face recognition and morphing attack detection.

    File Structure
    Each identity has a dedicated folder (named as a 12-digit hex string, e.g., 000e23cdce23) containing the following 5 files:
    000e23cdce23_f.json — metadata including sampled identity attributes, prompt, generation seed, etc. (_f = female; _m = male; _nb = non-binary)
    000e23cdce23_f_doc.png — document-style frontal image
    000e23cdce23_f_live_0_e_d1.jpg — live image generated with LivePortrait (_e = expression and pose)
    000e23cdce23_f_live_0_a_d1.jpg — live image via Arc2Face (_a = arc2face)
    000e23cdce23_f_live_0_p_d1.jpg — live image via PuLID (_p = pulid)
    All document and LivePortrait/PuLID images are 1024×1024. Arc2Face images are 512×512 due to original model constraints.

    Attribute Sampling and Prompting
    The attributes/ directory contains all information about how identity attributes were sampled:
    A set of .txt files (e.g., ages.txt, eye_shape.txt, body_type.txt) — each lists the possible values for one attribute class, along with their respective sampling probabilities.
    file_probabilities.json — defines the inclusion probability for each attribute class (i.e., how likely a class such as "eye shape" is to be included in a given prompt).
    attribute_clashes.json — specifies rules for resolving semantically conflicting attributes. Each clash defines a primary attribute (to be kept) and secondary attributes (to be discarded when the clash occurs).
    Prompts are generated automatically using the Qwen2.5 large language model, based on the selected attributes, and used to condition FLUX.1 [dev] during image generation (see the sketch after the acknowledgments below).

    Live Image Generation
    Each synthetic identity has three live image-style variants:
    LivePortrait: expression/pose changes via keypoint-based retargeting
    Arc2Face: natural variation using identity embeddings (no prompt required)
    PuLID: identity-aware generation using prompt, embedding, and edge-conditioning with a customized FLUX.1 [dev] diffusion model
    These approaches provide both controlled and naturalistic identity-consistent variation.

    Filtering and Quality Control
    Included are 9 supplementary text files listing filtered subsets of identities. For instance, the file similarity_filtering_adaface_thr_0.333987832069397_fmr_0.0001.txt contains identities retained after filtering out overly similar faces using the AdaFace FRS under the specified threshold and false match rate (FMR).

    Usage and Licensing
    This dataset is licensed under the Creative Commons Attribution Non Commercial 4.0 International (CC BY-NC 4.0) license. You are free to use, share, and adapt the dataset for non-commercial purposes, provided that appropriate credit is given. The images in this dataset were generated using the FLUX.1 [dev] model by Black Forest Labs, which is made available under their Non-Commercial License. While this dataset does not include or distribute the model or its weights, the images were produced using that model. Users are responsible for ensuring that their use of the images complies with the FLUX.1 [dev] license, including any restrictions it imposes.
    Acknowledgments
    The FLUXSynID dataset was developed under the EINSTEIN project. The EINSTEIN project is funded by the European Union (EU) under G.A. no. 101121280 and UKRI Funding Service under IFS reference 10093453. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect the views of the EU/Executive Agency or UKRI. Neither the EU nor the granting authority nor UKRI can be held responsible for them.
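    The attribute sampling described under "Attribute Sampling and Prompting" can be approximated as follows. This is only a sketch: the per-line layout of the attribute .txt files (assumed here to be "value,probability"), the keys of file_probabilities.json, and the omission of clash resolution are all assumptions rather than the released FLUXSynID tooling.

        import json
        import random
        from pathlib import Path

        ATTR_DIR = Path("attributes")

        def load_values(path):
            """Read one attribute class file; each line is assumed to hold 'value,probability'."""
            values, weights = [], []
            for line in path.read_text().splitlines():
                if line.strip():
                    value, prob = line.rsplit(",", 1)
                    values.append(value.strip())
                    weights.append(float(prob))
            return values, weights

        def sample_identity_attributes(seed=None):
            rng = random.Random(seed)
            include_probs = json.loads((ATTR_DIR / "file_probabilities.json").read_text())
            attributes = {}
            for class_name, p_include in include_probs.items():
                if rng.random() > p_include:   # this attribute class is left out of the prompt
                    continue
                values, weights = load_values(ATTR_DIR / f"{class_name}.txt")
                attributes[class_name] = rng.choices(values, weights=weights, k=1)[0]
            return attributes  # resolving attribute_clashes.json conflicts would follow here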

  18. A synthetic data generation pipeline to reproducibly mirror high-resolution...

    • zenodo.org
    csv, txt, xls
    Updated Nov 21, 2024
    Cite
    Maria Frantzi (2024). A synthetic data generation pipeline to reproducibly mirror high-resolution multi-variable peptidomics and real-patient clinical data [Dataset]. http://doi.org/10.1101/2024.10.30.24316342
    Available download formats: csv, xls, txt
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Frantzi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generating high quality, real-world clinical and molecular datasets is challenging, costly, and time intensive. Consequently, such data should be shared with the scientific community, but sharing carries the risk of privacy breaches. This limitation hinders the scientific community’s ability to freely share and access high resolution and high quality data, which are essential especially in the context of personalised medicine.

    In this study, we present an algorithm based on Gaussian copulas to generate synthetic data that retain associations within high dimensional (peptidomics) datasets. For this purpose, 3,881 datasets from 10 cohorts were employed, containing clinical, demographic, molecular (> 21,500 peptide) variables, and outcome data for individuals with a kidney or a heart failure event. High dimensional copulas were developed to portray the distribution matrix between the clinical and peptidomics data in the dataset, and based on these distributions, a data matrix of 2,000 synthetic patients was developed. Synthetic data maintained the capacity to reproducibly correlate the peptidomics data with the clinical variables.

    External validation was performed, using independent multi-centric datasets (n = 2,964) of individuals with chronic kidney disease (CKD, defined as eGFR < 60 mL/min/1.73m²) or those with normal kidney function (eGFR > 90 mL/min/1.73m²). Similarly, the association of the rho-values of single peptides with eGFR between the synthetic and the external validation datasets was significantly reproduced (rho = 0.569, p = 1.8e-218). Subsequent development of classifiers by using the synthetic data matrices, resulted in highly predictive values in external real-patient datasets (AUC values of 0.803 and 0.867 for HF and CKD, respectively), demonstrating robustness of the developed method in the generation of synthetic patient data. The proposed pipeline represents a solution for high-dimensional sharing while maintaining patient confidentiality.
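    The core Gaussian-copula idea (not the authors' exact pipeline) can be sketched as follows: transform each variable to normal scores through its empirical CDF, estimate the correlation of those scores, sample from the fitted multivariate normal, and map the draws back through the empirical quantiles of each variable.

        import numpy as np
        from scipy import stats

        def gaussian_copula_synthesize(X, n_synth, seed=0):
            """Draw n_synth synthetic rows that preserve the rank correlations of X
            (n_samples x n_features) while reproducing each column's marginal."""
            rng = np.random.default_rng(seed)
            n, d = X.shape

            # 1. Normal scores of each column via its empirical CDF (ranks).
            z = stats.norm.ppf(stats.rankdata(X, axis=0) / (n + 1))

            # 2. Correlation matrix of the normal scores (the Gaussian copula).
            corr = np.corrcoef(z, rowvar=False)

            # 3. Sample from the multivariate normal defined by that correlation.
            z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_synth)

            # 4. Map back to the original scale through each column's empirical quantiles.
            u_new = stats.norm.cdf(z_new)
            X_synth = np.empty((n_synth, d))
            for j in range(d):
                X_synth[:, j] = np.quantile(X[:, j], u_new[:, j])
            return X_synth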

    For this study 6,967 peptidomics mass spectrometry datasets were employed and are deposited here, including:

    • 3,881 datasets that were employed for synthetic data generation

    1) File name: hf_peptides_data.csv; size: 45.56 MB; Description: 472 datasets from patients developing a heart failure event

    2) File name: ckd_peptides_data.csv; size: 10.98 MB; Description: 242 datasets from patients developing a kidney event

    3) File name: no_event_peptides_fdata.csv; size: 194.70 MB; Description: 3,266 datasets from patients that did not develop any event

    • 2,964 datasets that were used as external validation datasets (chronic kidney disease group)

    *Study 1: PersTIgAN

    4) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.7MB; Description: Patients with CKD_Study1_export 1

    5) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 2.6 MB; Description: Patients with CKD_Study1_export 2

    *Study 2: CKD_Biobay

    6) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 35.7 MB; Description: Patients with CKD_Study2_export 1

    7) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 26.0 MB; Description: Patients with CKD_Study2_export 2

    *Study 3: DC_Ren
    8) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.96 MB; Description: Patients with CKD_Study3_export 1

    9) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.13 MB; Description: Patients with CKD_Study3_export 2

    10) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.86 MB; Description: Patients with CKD_Study3_export 3

    11) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_4.xls; size: 38.39 MB; Description: Patients with CKD_Study3_export 4

    12) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_5.xls; size: 38.12 MB; Description: Patients with CKD_Study3_export 5

    13) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_6.xls; size: 36.73 MB; Description: Patients with CKD_Study3_export 6

    14) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_7.xls; size: 2.15 MB; Description: Patients with CKD_Study3_export 7

    *Non-CKD

    15) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.72 MB; Description: datasets from patients without CKD_export 1

    16) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.31MB; Description: datasets from patients without CKD_export 2

    17) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.95 MB; Description: datasets from patients without CKD_export 3

    • 122 datasets that were used as external validation datasets (heart failure group)

    7) File name: HF_external_case_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls; size: 3.13 MB; Description: datasets from patients that develop heart failure

    8) File name: HF_external_Control_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls; size: 3.94 MB; Description: datasets from patients that did not develop heart failure

  19. SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest...

    • zenodo.org
    bin, txt
    Updated Nov 27, 2024
    Cite
    Zhengpeng Feng; Yihang She; Keshav Srinivasan (2024). SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest Vision Tasks (Part III) [Dataset]. http://doi.org/10.5281/zenodo.14228467
    Available download formats: bin, txt
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zhengpeng Feng; Yihang She; Keshav Srinivasan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page only provides point clouds.

    This dataset contains point clouds collected from different virtual forest scenes. Data from each scene is stored in a separate .7z file, along with a point_cloud_color_palette.txt file, which contains the Tree_id and corresponding RGB values.

    Specifically, each 7z file includes the following folders:

    • tree: This folder contains the point cloud data of every single tree within the forest scene. Each tree is stored separately in a .ply file including both location and color information. For performance reasons, the maximum number of points for each tree is limited to 10,000.

    • ground: This folder contains a landscape.ply describing the ground information. The color of the point cloud is set to [0,0,0].

    The unit of the point cloud is meters (m).
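    A minimal sketch for reading one per-tree cloud and resolving its colour back to a Tree_id is shown below; the palette file is assumed to hold "tree_id r g b" per line, and the .ply path is illustrative, so adjust both to the released layout.

        import numpy as np
        import open3d as o3d

        # Palette format assumed: "tree_id r g b" per line.
        palette = {}
        with open("point_cloud_color_palette.txt") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 4:
                    tree_id, r, g, b = parts
                    palette[(int(r), int(g), int(b))] = tree_id

        pcd = o3d.io.read_point_cloud("tree/tree_0001.ply")  # illustrative file name
        points = np.asarray(pcd.points)                      # XYZ coordinates in metres
        colors = (np.asarray(pcd.colors) * 255).round().astype(int)  # open3d colours are in [0, 1]
        print(points.shape, {palette.get(tuple(c)) for c in colors[:5]})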

  20. CMS Synthetic Patient Data OMOP

    • redivis.com
    application/jsonl +7
    Updated Aug 19, 2020
    Cite
    Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
    Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 1, 2008 - Dec 31, 2010
    Description

    Abstract

    This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

    Methodology

    This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies. See the diagram below from the Observational Health Data Sciences and Informatics (OHDSI) webpage.

    [Diagram: Why-CDM.png, from the OHDSI webpage, illustrating the purpose of the Common Data Model]

    Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHDSI OMOP site.

    Usage

    For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File (https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF).

    For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to the OHDSI ETL-CMS GitHub page (https://github.com/OHDSI/ETL-CMS).

    For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page (https://github.com/OHDSI/CommonDataModel/wiki). All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM and that this dataset works with the 5.2 version.
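    Since the dataset is accessed via BigQuery, a query against one OMOP table can be sketched as below. The table path is an assumption (substitute the project and dataset through which you access the data), and running it requires Google Cloud credentials.

        from google.cloud import bigquery

        client = bigquery.Client()  # needs GCP credentials configured

        query = """
            SELECT gender_concept_id, COUNT(*) AS n_persons
            FROM `bigquery-public-data.cms_synthetic_patient_data_omop.person`  -- path assumed
            GROUP BY gender_concept_id
            ORDER BY n_persons DESC
        """
        df = client.query(query).to_dataframe()
        print(df)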
