19 datasets found
  1. FLUXSynID: A Synthetic Face Dataset with Document and Live Images

    • data.europa.eu
    unknown
    Updated May 9, 2025
    Cite
    Zenodo (2025). FLUXSynID: A Synthetic Face Dataset with Document and Live Images [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15172770?locale=en
    Explore at:
    Available download formats: unknown
    Dataset updated
    May 9, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FLUXSynID is a high-resolution synthetic identity dataset containing 14,889 unique synthetic identities, each represented through a document-style image and three live capture variants. Identities are generated using the FLUX.1 [dev] diffusion model, guided by user-defined identity attributes such as gender, age, region of origin, and various other identity features. The dataset is created to support biometric research, including face recognition and morphing attack detection.

    File Structure

    Each identity has a dedicated folder (named as a 12-digit hex string, e.g., 000e23cdce23) containing the following 5 files:

    - 000e23cdce23_f.json — metadata including sampled identity attributes, prompt, generation seed, etc. (_f = female; _m = male; _nb = non-binary)
    - 000e23cdce23_f_doc.png — document-style frontal image
    - 000e23cdce23_f_live_0_e_d1.jpg — live image generated with LivePortrait (_e = expression and pose)
    - 000e23cdce23_f_live_0_a_d1.jpg — live image via Arc2Face (_a = arc2face)
    - 000e23cdce23_f_live_0_p_d1.jpg — live image via PuLID (_p = pulid)

    All document and LivePortrait/PuLID images are 1024×1024. Arc2Face images are 512×512 due to original model constraints. (A minimal loading sketch appears at the end of this entry.)

    Attribute Sampling and Prompting

    The attributes/ directory contains all information about how identity attributes were sampled:

    - A set of .txt files (e.g., ages.txt, eye_shape.txt, body_type.txt) — each lists the possible values for one attribute class, along with their respective sampling probabilities.
    - file_probabilities.json — defines the inclusion probability for each attribute class (i.e., how likely a class such as "eye shape" is to be included in a given prompt).
    - attribute_clashes.json — specifies rules for resolving semantically conflicting attributes. Each clash defines a primary attribute (to be kept) and secondary attributes (to be discarded when the clash occurs).

    Prompts are generated automatically using the Qwen2.5 large language model, based on the selected attributes, and used to condition FLUX.1 [dev] during image generation.

    Live Image Generation

    Each synthetic identity has three live-image-style variants:

    - LivePortrait: expression/pose changes via keypoint-based retargeting
    - Arc2Face: natural variation using identity embeddings (no prompt required)
    - PuLID: identity-aware generation using prompt, embedding, and edge-conditioning with a customized FLUX.1 [dev] diffusion model

    These approaches provide both controlled and naturalistic identity-consistent variation.

    Filtering and Quality Control

    Included are 9 supplementary text files listing filtered subsets of identities. For instance, the file similarity_filtering_adaface_thr_0.333987832069397_fmr_0.0001.txt contains identities retained after filtering out overly similar faces using the AdaFace FRS under the specified threshold and false match rate (FMR).

    Usage and Licensing

    This dataset is licensed under the Creative Commons Attribution Non Commercial 4.0 International (CC BY-NC 4.0) license. You are free to use, share, and adapt the dataset for non-commercial purposes, provided that appropriate credit is given. The images in this dataset were generated using the FLUX.1 [dev] model by Black Forest Labs, which is made available under their Non-Commercial License. While this dataset does not include or distribute the model or its weights, the images were produced using that model. Users are responsible for ensuring that their use of the images complies with the FLUX.1 [dev] license, including any restrictions it imposes.
Acknowledgments The FLUXSynID dataset was developed under the EINSTEIN project. The EINSTEIN project is funded by the European Union (EU) under G.A. no. 101121280 and UKRI Funding Service under IFS reference 10093453. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect the views of the EU/Executive Agency or UKRI. Neither the EU nor the granting authority nor UKRI can be held responsible for them.
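    A minimal loading sketch in Python, assuming the dataset has been downloaded and extracted to a local folder (here called fluxsynid_root, a hypothetical path); the file-name patterns follow the File Structure section above:

    import json
    from pathlib import Path

    # Hypothetical local path to the extracted FLUXSynID dataset.
    FLUXSYNID_ROOT = Path("fluxsynid_root")

    # Live-image suffixes described above: _e = LivePortrait, _a = Arc2Face, _p = PuLID.
    LIVE_VARIANTS = {"e": "liveportrait", "a": "arc2face", "p": "pulid"}

    def load_identity(folder: Path) -> dict:
        """Collect metadata and image paths for one identity folder (e.g., 000e23cdce23)."""
        meta_path = next(folder.glob("*.json"))       # e.g., 000e23cdce23_f.json
        stem = meta_path.stem                         # "<id>_<gender code>"
        with meta_path.open() as f:
            metadata = json.load(f)                   # sampled attributes, prompt, seed, ...
        record = {
            "id": folder.name,
            "metadata": metadata,
            "doc_image": folder / f"{stem}_doc.png",  # document-style frontal image
        }
        for code, name in LIVE_VARIANTS.items():
            record[f"live_{name}"] = folder / f"{stem}_live_0_{code}_d1.jpg"
        return record

    identities = [load_identity(d) for d in sorted(FLUXSYNID_ROOT.iterdir()) if d.is_dir()]
    print(f"Loaded {len(identities)} identities")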

  2. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
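    The released materials include the R script actually used to draw the sample; the following Python sketch only illustrates the two-stage logic described above. The frame structure (one row per household carrying a stratum label and an enumeration-area id) is an assumption for illustration:

    import random
    from collections import defaultdict

    HH_PER_EA = 25
    TOTAL_HOUSEHOLDS = 8000
    N_EAS = TOTAL_HOUSEHOLDS // HH_PER_EA   # 320 enumeration areas in total

    def draw_sample(frame, seed=123):
        """frame: list of dicts with 'stratum' (geo_1 x urban/rural) and 'ea_id' keys."""
        random.seed(seed)
        eas_by_stratum = defaultdict(set)
        households_by_ea = defaultdict(list)
        for row in frame:
            eas_by_stratum[row["stratum"]].add(row["ea_id"])
            households_by_ea[row["ea_id"]].append(row)

        total_hh = len(frame)
        sampled = []
        for stratum, eas in eas_by_stratum.items():
            stratum_hh = sum(len(households_by_ea[ea]) for ea in eas)
            # Stage 1: number of EAs per stratum, proportional to stratum size.
            n_eas = round(N_EAS * stratum_hh / total_hh)
            for ea in random.sample(sorted(eas), n_eas):
                # Stage 2: fixed take of 25 households within each selected EA.
                sampled.extend(random.sample(households_by_ea[ea], HH_PER_EA))
        return sampled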

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  3. synthetic_pii_finance_multilingual

    • huggingface.co
    Updated Jun 11, 2024
    Cite
    Gretel.ai (2024). synthetic_pii_finance_multilingual [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      💼 📊 Synthetic Financial Domain Documents with PII Labels
    

    gretelai/synthetic_pii_finance_multilingual is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases:

    🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.
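    A minimal sketch for loading the dataset with the Hugging Face datasets library (split and field names are assumptions; check the dataset card for the actual schema):

    from datasets import load_dataset

    # Download/stream the dataset from the Hugging Face Hub.
    ds = load_dataset("gretelai/synthetic_pii_finance_multilingual")
    print(ds)              # available splits and row counts
    print(ds["train"][0])  # one synthetic document with its PII annotations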

  4. Synthetic nursing handover training and development data set - audio files

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Mar 21, 2017
    + more versions
    Cite
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel (2017). Synthetic nursing handover training and development data set - audio files [Dataset]. https://researchdata.edu.au/synthetic-nursing-handover-audio-files/820930
    Explore at:
    Available download formats: datadownload
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    Commonwealth Scientific and Industrial Research Organisation
    Authors
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This is one of two collection records. Please see the link below for the other collection of associated text files.

    The two collections together comprise an open clinical dataset of three sets of 10 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 × 100 spoken free-form audio files in WAV format.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free-form text documents; development of a structured handover form; using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  5. Synthetic Patient Data in OMOP

    • console.cloud.google.com
    Updated Jun 25, 2020
    Cite
    U.S. Department of Health & Human Services (2020). Synthetic Patient Data in OMOP [Dataset]. https://console.cloud.google.com/marketplace/product/hhs/synpuf
    Explore at:
    Dataset updated
    Jun 25, 2020
    Dataset provided by
    Google (http://google.com/)
    Description

    The Synthetic Patient Data in OMOP dataset is a synthetic database derived from the Centers for Medicare and Medicaid Services (CMS) Medicare Claims Synthetic Public Use Files (SynPUF). It contains synthetic 2008-2010 Medicare insurance claims for development and demonstration purposes. It has been converted from its original CSV form to the Observational Medical Outcomes Partnership (OMOP) common data model by the open source community, as released on GitHub. Please refer to the CMS Linkable 2008–2010 Medicare Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF) User Manual for details regarding how DE-SynPUF was created. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1 TB/month free tier of processing, which can be used to run queries on this public dataset.
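    A minimal query sketch using the BigQuery Python client (requires a Google Cloud project; the dataset path below is an assumption based on the Marketplace listing, so confirm it there before running):

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT gender_concept_id, COUNT(*) AS n
        FROM `bigquery-public-data.cms_synthetic_patient_data_omop.person`
        GROUP BY gender_concept_id
    """
    for row in client.query(query).result():
        print(row.gender_concept_id, row.n)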

  6. CMS Synthetic Patient Data OMOP

    • redivis.com
    application/jsonl +7
    Updated Aug 19, 2020
    + more versions
    Cite
    Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
    Explore at:
    Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 1, 2008 - Dec 31, 2010
    Description

    Abstract

    This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

    Methodology

    This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, taken from the Observational Health Data Sciences and Informatics (OHDSI) webpage, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies.

    [Figure: Why-CDM.png — OMOP Common Data Model overview diagram from the OHDSI webpage]

    Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHSDI OMOP site.

    Usage

    - For documentation regarding the source data format from the Center for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF

    - For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to the OHDSI ETL-CMS GitHub page: https://github.com/OHDSI/ETL-CMS

    - For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM, while this dataset uses the 5.2 version.

  7. Synthetic nursing handover training and development data set - text files

    • researchdata.edu.au
    datadownload
    Updated Mar 21, 2017
    + more versions
    Cite
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel (2017). Synthetic nursing handover training and development data set - text files [Dataset]. https://researchdata.edu.au/synthetic-nursing-handover-text-files/820931
    Explore at:
    Available download formats: datadownload
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    Commonwealth Scientific and Industrial Research Organisation
    Authors
    Leif Hanlen; Liyuan Zhou; Hanna Suominen; Maricel Angel
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This is one of two collection records. Please see the link below for the other collection of associated audio files.

    Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 sets of text documents.

    Data Set 1 for Training and Development

    The data set, released in June 2014, includes the following documents:

    - Folder initialisation: initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
    - Folder 100profiles: 100 patient profiles (DOCX)
    - Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
    - Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
    - Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard, and iv) form categories with respect to our best information extraction system (TXT in CRF++ format)

    An Independent Data Set 2

    The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.

    The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written free-form text documents, and 100 speech-recognized written free-form text documents for the Dragon Nursing vocabulary (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.

    An Independent Data Set 3

    For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  8. synthetic-passports

    • huggingface.co
    Updated Oct 24, 2024
    + more versions
    Cite
    Unidata (2024). synthetic-passports [Dataset]. https://huggingface.co/datasets/UniDataPro/synthetic-passports
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 24, 2024
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Passport photos dataset

    This dataset contains over 100,000 passport photos from 100+ countries, making it a valuable resource for researchers and developers working on computer vision tasks related to passport verification, biometric identification, and document analysis. This dataset allows researchers and developers to train and evaluate their models without the ethical and legal concerns associated with using real passport data. By leveraging this dataset, developers can build… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/synthetic-passports.

  9. synthetic-text-similarity

    • huggingface.co
    Updated Mar 5, 2024
    Cite
    Peter Szemraj (2024). synthetic-text-similarity [Dataset]. https://huggingface.co/datasets/pszemraj/synthetic-text-similarity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 5, 2024
    Authors
    Peter Szemraj
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Synthetic Text Similarity

    This dataset is created to facilitate the evaluation and training of models on the task of text similarity at longer contexts/examples than "Bob likes frogs.", as in classical sentence similarity datasets. It consists of document pairs with associated similarity scores, representing the closeness of the documents in semantic space.

      Dataset Description
    

    For each version of this dataset, embeddings are computed for all unique documents, followed by… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/synthetic-text-similarity.
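    A minimal sketch for inspecting the pairs and computing cosine similarity between two document embedding vectors (the split name is an assumption; inspect ds.features for the actual column names):

    import numpy as np
    from datasets import load_dataset

    ds = load_dataset("pszemraj/synthetic-text-similarity", split="train")
    print(ds.features)   # actual column names for the document pair and score

    def cosine_similarity(a, b):
        """Cosine similarity between two document embedding vectors."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))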

  10. MetaHuman simulation sample dataset

    • kaggle.com
    Updated Apr 5, 2021
    Cite
    Alexandre Mendes (2021). MetaHuman simulation sample dataset [Dataset]. https://www.kaggle.com/datasets/allexmendes/metahuman-simulation-sample-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alexandre Mendes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    This is a synthetic dataset / simulation consisting of ~1400 frames of 2 MetaHuman characters, making use of the Unreal Engine 4 (UE4) MetaHuman sample project. The goal was to create a highly realistic, human-centric synthetic dataset spanning multiple annotation and rendering methods for computer vision. It contains Lit (realistic), Depth, Body part segmentation (semantics), Instance segmentation, and World normals renders, plus keypoints metadata. It also contains audio and subtitle annotation files matching the respective timings (24 FPS).

    The resultant dataset matches the official Unreal Engine presentation video "Meet the MetaHumans: Free Sample Now Available | Unreal Engine" (https://www.youtube.com/watch?v=6mAF5dWZXcI).

    Content

    The dataset is divided into directories according to each rendering method. Each directory contains 1479 files (rendered images, or .json metadata files in the annotation directory). There are also audio- and resources-related directories.

    Directories and files:
    - Lit - 1479 .png files representing realistic renders
    - Depth - 1479 .npy files representing normalized scene depth
    - Body part segmentation - 1479 .png files representing the segmentation of parts of interest in both synthetic characters
    - Instance segmentation - 1479 .png files representing a broader segmentation in both synthetic characters
    - World normal - 1479 .png files representing world space mesh orientation normals
    - Annotation - 1479 .json files representing basic scene metadata (camera + keypoints)
    - Resources - several dataset utility files, including segmentation color maps, keypoint filters, etc.
    - Audio - audio utility files:
      - sound.wav - audio matching the video at 24 fps
      - metahumans_srt_sub.srt - SubRip subtitle file matching sound.wav
      - metahumans_aegisub_sub.ass - Aegisub work file containing extra data relative to the .srt
      - output_all-videos_subbed.m4v - simple data preview video

    The rendered images are 1280x720 resolution. Keypoint annotations are pre-filtered so that each keypoint projection falls within the image plane (the annotation .json list of keypoints therefore differs across frames).
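    A minimal sketch for reading one frame across the render directories (the per-frame file names such as 0000.png are hypothetical; adjust them to the actual naming in the download):

    import json
    import numpy as np
    from PIL import Image

    lit = Image.open("Lit/0000.png")          # realistic render, 1280x720
    depth = np.load("Depth/0000.npy")         # normalized scene depth array
    with open("Annotation/0000.json") as f:
        meta = json.load(f)                   # camera parameters + visible keypoints

    print(lit.size, depth.shape, list(meta.keys()))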

    Acknowledgements

    Synthetic images were generated with Unreal Engine 4.26 (https://www.unrealengine.com). The sample UE4 project used as a base is MetaHumans from Unreal's marketplace. Subtitles were made with Aegisub (https://aegisub.en.uptodown.com/windows).

  11. CIFAKE: Real and AI-Generated Synthetic Images

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jordan J. Bird (2023). CIFAKE: Real and AI-Generated Synthetic Images [Dataset]. https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jordan J. Bird
    Description

    CIFAKE: Real and AI-Generated Synthetic Images

    The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.

    CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect when an image is real or has been generated by AI?

    Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Dataset details

    The dataset contains two classes - REAL and FAKE.

    For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset

    For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4

    There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
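    A minimal loading sketch with torchvision, assuming the Kaggle download is extracted into train/ and test/ directories that each contain REAL/ and FAKE/ subfolders (adjust the paths if the layout differs):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    tfm = transforms.ToTensor()
    train_set = datasets.ImageFolder("cifake/train", transform=tfm)   # 100,000 images
    test_set = datasets.ImageFolder("cifake/test", transform=tfm)     # 20,000 images

    train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
    print(train_set.classes)   # class folders, e.g. ['FAKE', 'REAL']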

    Papers with Code

    The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images

    References

    If you use this dataset, you must cite the following sources

    Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

    Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.

    Real images are from Krizhevsky & Hinton (2009); fake images are from Bird & Lotfi (2024). The Bird & Lotfi study is published in IEEE Access (see the reference above).

    Notes

    The updates to the dataset on the 28th of March 2023 did not change anything; the file formats ".jpeg" were renamed ".jpg" and the root folder was uploaded to meet Kaggle's usability requirements.

    License

    This dataset is published under the same MIT license as CIFAR-10:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  12. C3I SYNTHETIC FACE DEPTH DATASET

    • ieee-dataport.org
    Updated Sep 12, 2022
    Cite
    Shubhajit Basak (2022). C3I SYNTHETIC FACE DEPTH DATASET [Dataset]. https://ieee-dataport.org/documents/c3i-synthetic-face-depth-dataset
    Explore at:
    Dataset updated
    Sep 12, 2022
    Authors
    Shubhajit Basak
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The C3I Synthetic Face Depth Dataset consists of 3D virtual human models and 2D rendered RGB and ground-truth (GT) depth images, provided as zipped archives in two folders for male and female models.

  13. LegalTech Artificial Intelligence Market by Application (Document Management...

    • zionmarketresearch.com
    pdf
    Updated Aug 23, 2025
    Cite
    Zion Market Research (2025). LegalTech Artificial Intelligence Market by Application (Document Management System, E-Discovery, Practice and Case Management, E-Billing, Contract Management, IP-Management, Legal Research, Legal Analytics, Cyber Security, Predictive Technology, and Compliance) and by End-User (Lawyers and Clients): Global Industry Perspective, Comprehensive Analysis, and Forecast, 2024-2032. [Dataset]. https://www.zionmarketresearch.com/report/legaltech-artificial-intelligence-market
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 23, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    The global LegalTech Artificial Intelligence Market was valued at US$ 12.19 billion in 2023 and is set to reach US$ 165.31 billion by 2032, growing at a CAGR of about 33.6% from 2024 to 2032.
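    A quick back-of-the-envelope check of the quoted growth rate (illustrative only):

    # Implied CAGR from the 2023 value to the 2032 forecast (9 compounding years).
    start, end, years = 12.19, 165.31, 9   # US$ billions
    cagr = (end / start) ** (1 / years) - 1
    print(f"{cagr:.1%}")   # ~33.6%, matching the stated rate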

  14. synthdog-ko

    • huggingface.co
    Updated Dec 13, 2024
    + more versions
    Cite
    NAVER CLOVA INFORMATION EXTRACTION (2024). synthdog-ko [Dataset]. https://huggingface.co/datasets/naver-clova-ix/synthdog-ko
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Naver Corporation (http://www.navercorp.com/)
    Authors
    NAVER CLOVA INFORMATION EXTRACTION
    Description

    Donut 🍩 : OCR-Free Document Understanding Transformer (ECCV 2022) -- SynthDoG datasets

    For more information, please visit https://github.com/clovaai/donut

    The links to the SynthDoG-generated datasets are here:

    synthdog-en: English, 0.5M. synthdog-zh: Chinese, 0.5M. synthdog-ja: Japanese, 0.5M. synthdog-ko: Korean, 0.5M.

    To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.
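    A minimal sketch for streaming a sample with the Hugging Face datasets library (the split name is an assumption; see the dataset card for the exact schema):

    from datasets import load_dataset

    ds = load_dataset("naver-clova-ix/synthdog-ko", split="train", streaming=True)
    sample = next(iter(ds))
    print(sample.keys())   # rendered document image plus its ground-truth text field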

      How to Cite
    

    If you find this work useful… See the full description on the dataset page: https://huggingface.co/datasets/naver-clova-ix/synthdog-ko.

  15. Data from: PURE: a Dataset of Public Requirements Documents

    • zenodo.org
    zip
    Updated Sep 28, 2022
    Cite
    Alessio Ferrari; Giorgio Oronzo Spagnolo; Stefania Gnesi (2022). PURE: a Dataset of Public Requirements Documents [Dataset]. http://doi.org/10.5281/zenodo.1414117
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Ferrari; Giorgio Oronzo Spagnolo; Stefania Gnesi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite this dataset as Ferrari, A., Spagnolo, G. O., & Gnesi, S. (2017, September). PURE: A dataset of public requirements documents. In 2017 IEEE 25th International Requirements Engineering Conference (RE) (pp. 502-505). IEEE.

    https://ieeexplore.ieee.org/abstract/document/8049173

    This dataset presents PURE (PUblic REquirements dataset), a dataset of 79 publicly available natural language requirements documents collected from the Web. The dataset includes 34,268 sentences and can be used for natural language processing tasks that are typical in requirements engineering, such as model synthesis, abstraction identification and document structure assessment. It can be further annotated to work as a benchmark for other tasks, such as ambiguity detection, requirements categorisation and identification of equivalent requirements. In the associated paper, we present the dataset and we compare its language with generic English texts, showing the peculiarities of the requirements jargon, made of a restricted vocabulary of domain-specific acronyms and words, and long sentences. We also present the common XML format to which we have manually ported a subset of the documents, with the goal of facilitating replication of NLP experiments. The XML documents are also available for download.

    The paper associated to the dataset can be found here:

    https://ieeexplore.ieee.org/document/8049173/

    More info about the dataset is available here:

    http://nlreqdataset.isti.cnr.it

    Preprint of the paper available at ResearchGate:

    https://goo.gl/HxJD7X

  16. Deepfake Synthetic-20K Dataset

    • ieee-dataport.org
    Updated Apr 14, 2024
    Cite
    Sahil Sharma (2024). Deepfake Synthetic-20K Dataset [Dataset]. https://ieee-dataport.org/documents/deepfake-synthetic-20k-dataset
    Explore at:
    Dataset updated
    Apr 14, 2024
    Authors
    Sahil Sharma
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    gender

  17. 6DOF pose estimation - synthetically generated dataset using BlenderProc

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 11, 2025
    Cite
    Divyam Sheth (2025). 6DOF pose estimation - synthetically generated dataset using BlenderProc [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hj5
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Divyam Sheth
    Time period covered
    Jan 1, 2023
    Description

    Accurate and robust 6DOF (Six Degrees of Freedom) pose estimation is a critical task in various fields, including computer vision, robotics, and augmented reality. This research paper presents a novel approach to enhance the accuracy and reliability of 6DOF pose estimation by introducing a robust method for generating synthetic data and leveraging the ease of multi-class training using the generated dataset. The proposed method tackles the challenge of insufficient real-world annotated data by creating a large and diverse synthetic dataset that accurately mimics real-world scenarios. The proposed method only requires a CAD model of the object, and there is no limit to the amount of unique data that can be generated. Furthermore, a multi-class training strategy that harnesses the synthetic dataset's diversity is proposed and presented. This approach mitigates class imbalance issues and significantly boosts accuracy across varied object classes and poses. Experimental results underscore th...

    This dataset has been synthetically generated using 3D software like Blender and APIs like BlenderProc.

    Data Repository README

    This repository contains data organized into a structured format. The data consists of three main folders and two files, each serving a specific purpose. The data contains two folders - Cat and Hand.

    Cat Dataset: 63492 labeled data with images, masks, and poses.

    Hand Dataset: 42418 labeled data with images, masks, and poses.

    Usage: The dataset is ready for use by simply extracting the contents of the zip file, whether for training in a segmentation task or a pose estimation task.

    To view .npy files you will need to use Python with the numpy package installed. In Python use the following commands.

    import numpy

    # Load the .npy array (e.g., a stored pose or mask) and print its contents.
    data = numpy.load('file.npy')
    print(data)

    What free/open software is appropriate for viewing the .ply files?
    These files can be opened using any 3D modeling software like Blender, Meshlab, etc.

    Camera intrinsics matrix format:

    Fx   0   px
     0   Fy  py
     0   0    1
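    A small numpy sketch that builds this intrinsics matrix and projects a 3D point in camera coordinates onto the image plane (the fx, fy, px, py values are placeholders, not values from this dataset):

    import numpy as np

    fx, fy, px, py = 1000.0, 1000.0, 640.0, 360.0   # placeholder intrinsics
    K = np.array([[fx, 0.0, px],
                  [0.0, fy, py],
                  [0.0, 0.0, 1.0]])

    point_cam = np.array([0.1, -0.05, 2.0])   # X, Y, Z in camera coordinates, Z > 0
    uvw = K @ point_cam
    u, v = uvw[:2] / uvw[2]                   # pixel coordinates
    print(u, v)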

    Below is an overview of the data organization:

    Folder Structure

    1. Rgb:
      • This ...
  18. Synthetic Dataset for Induction Motor Broken Rotor Bar Analysis

    • ieee-dataport.org
    Updated Feb 9, 2024
    Cite
    Lucas Encarnacao (2024). Synthetic Dataset for Induction Motor Broken Rotor Bar Analysis [Dataset]. https://ieee-dataport.org/documents/synthetic-dataset-induction-motor-broken-rotor-bar-analysis
    Explore at:
    Dataset updated
    Feb 9, 2024
    Authors
    Lucas Encarnacao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    in portuguese) at the Federal University of Espírito Santo (UFES)

  19. Tuberculosis (TB) Chest X-ray Database

    • ieee-dataport.org
    Updated May 18, 2022
    + more versions
    Cite
    Amith khandakar (2022). Tuberculosis (TB) Chest X-ray Database [Dataset]. https://ieee-dataport.org/documents/tuberculosis-tb-chest-x-ray-database
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Amith khandakar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Doha

