19 datasets found
  1. Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats

    • datarade.ai
    Updated Sep 18, 2022
    Cite
    Ainnotate (2022). Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats [Dataset]. https://datarade.ai/data-products/synthetic-document-dataset-for-ai-jpeg-png-pdf-formats-ainnotate
    Explore at:
    Dataset updated
    Sep 18, 2022
    Dataset authored and provided by
    Ainnotate
    Area covered
    Tokelau, Canada, Tonga, Korea (Democratic People's Republic of), Brazil, Cabo Verde, Denmark, Syrian Arab Republic, Germany, Ireland
    Description

    Ainnotate’s proprietary dataset generation methodology, based on large-scale generative modelling and domain randomization, provides well-balanced data with consistent sampling that accommodates rare events, enabling superior simulation and training of your models.

    Ainnotate currently provides synthetic datasets in the following domains and use cases.

    Internal Services - Visa applications, Passport validation, License validation, Birth certificates
    Financial Services - Bank checks, Bank statements, Pay slips, Invoices, Tax forms, Insurance claims, and Mortgage/Loan forms
    Healthcare - Medical ID cards
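    The "domain randomization" mentioned above is a standard technique: each synthetic sample is rendered with randomly drawn parameters so the corpus covers a wide range of appearances. A minimal sketch of the idea, with illustrative parameter names and ranges (not Ainnotate's actual pipeline):

```python
import random

# Illustrative domain-randomization sketch for synthetic document rendering.
# Parameter names and ranges are hypothetical, not Ainnotate's pipeline.
def sample_render_params(rng: random.Random) -> dict:
    return {
        "font": rng.choice(["Helvetica", "Times", "Courier"]),
        "font_size_pt": rng.randint(8, 14),
        "rotation_deg": rng.uniform(-3.0, 3.0),  # slight scan skew
        "noise_sigma": rng.uniform(0.0, 0.05),   # simulated sensor noise
        "background": rng.choice(["white", "off_white", "textured"]),
        "format": rng.choice(["jpeg", "png", "pdf"]),
    }

rng = random.Random(0)
params = [sample_render_params(rng) for _ in range(1000)]
```

    Sampling parameters independently per document is what lets the generated corpus include rare combinations that a real scanned collection might miss.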

  2. Synthetic Integrated Services Data

    • data.wprdc.org
    csv, html, pdf, zip
    Updated Jun 25, 2024
    Cite
    Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
    Explore at:
    Available download formats: html, csv (1375554033), zip (39231637), pdf
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    Allegheny County
    Description

    Motivation

    This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

    This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

    Collection

    The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

    Preprocessing

    Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.
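    The fit-then-sample idea described above can be sketched in a few lines. This toy example fits simple marginal models (a Gaussian and an empirical categorical distribution) to stand-in "confidential" data and samples synthetic rows from them; all names and numbers are made up, and a real synthesizer like the one used for this dataset would also model joint structure and evaluate privacy:

```python
import numpy as np

# Toy illustration of the general idea (not this project's actual model):
# fit simple distributions to confidential data, then release samples drawn
# from the fitted model instead of the real records.
rng = np.random.default_rng(42)

# Stand-in "confidential" data: age (numeric) and service type (categorical).
real_age = rng.normal(40, 12, size=500).clip(0, 100)
real_service = rng.choice(["housing", "health", "food"], p=[0.5, 0.3, 0.2], size=500)

# Fit marginal models: Gaussian for age, empirical frequencies for service.
mu, sigma = real_age.mean(), real_age.std()
cats, counts = np.unique(real_service, return_counts=True)
probs = counts / counts.sum()

# Sample a synthetic dataset of the same size from the fitted models.
synth_age = rng.normal(mu, sigma, size=500).clip(0, 100)
synth_service = rng.choice(cats, p=probs, size=500)
```

    This marginal sketch ignores correlations between columns; the utility-versus-privacy evaluation described above is what distinguishes a publishable synthetic dataset from a naive sample like this one.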

    For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

    Recommended Uses

    This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

    Known Limitations/Biases

    Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

    Feedback

    Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

    Further Documentation and Resources

    1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
    2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
    3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
    4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.

  3. Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence

    • frontiersin.figshare.com
    pdf
    Updated Feb 14, 2024
    Cite
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Biomechanical machine learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

    Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

    Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of the synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

    Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
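    The generative core of a VAE, as used in the study above, is sampling a latent code via the reparameterization trick and decoding it. A minimal numpy sketch of just that sampling step (not the paper's architecture; shapes are illustrative):

```python
import numpy as np

# Minimal numpy sketch of the VAE sampling step (reparameterization trick),
# not the paper's actual architecture. Shapes are illustrative.
rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs for a batch of 4 postures, 2-D latent space.
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))  # log variance 0 -> sigma = 1
z = reparameterize(mu, log_var)

# New synthetic postures come from decoding codes z drawn from the prior
# N(0, I); a trained decoder maps z back to a full posture representation.
```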

  4. DataSheet1_Generating synthetic multidimensional molecular time series data for machine learning: considerations

    • figshare.com
    pdf
    Updated Jul 25, 2023
    Cite
    Gary An; Chase Cockrell (2023). DataSheet1_Generating synthetic multidimensional molecular time series data for machine learning: considerations.PDF [Dataset]. http://doi.org/10.3389/fsysb.2023.1188009.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Frontiers
    Authors
    Gary An; Chase Cockrell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The use of synthetic data is recognized as a crucial step in the development of neural network-based Artificial Intelligence (AI) systems. While the methods for generating synthetic data for AI applications in other domains have a role in certain biomedical AI systems, primarily related to image processing, there is a critical gap in the generation of time series data for AI tasks where it is necessary to know how the system works. This is most pronounced in the ability to generate synthetic multi-dimensional molecular time series data (subsequently referred to as synthetic mediator trajectories, or SMTs); this is the type of data that underpins research into biomarkers and mediator signatures for forecasting various diseases and is an essential component of the drug development pipeline.

    We argue that the insufficiency of statistical and data-centric machine learning (ML) means of generating this type of synthetic data is due to a combination of factors: perpetual data sparsity due to the Curse of Dimensionality, the inapplicability of the Central Limit Theorem in terms of making assumptions about the statistical distributions of this type of data, and the inability to use ab initio simulations due to the state of perpetual epistemic incompleteness in cellular/molecular biology. Alternatively, we present a rationale for using complex multi-scale mechanism-based simulation models, constructed and operated on to account for perpetual epistemic incompleteness and the need to provide maximal expansiveness, in concordance with the Maximal Entropy Principle. These procedures provide for the generation of SMTs that minimize the known shortcomings associated with neural network AI systems, namely overfitting and lack of generalizability.

    The generation of synthetic data that accounts for the identified factors of multi-dimensional time series data is an essential capability for the development of mediator-biomarker-based AI forecasting systems, and for therapeutic control development and optimization.
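    The data-sparsity argument above can be illustrated numerically: as dimensionality grows, pairwise distances between random samples concentrate, so any fixed-size dataset covers the space ever more sparsely. A small demonstration (illustrative only, not from the paper):

```python
import numpy as np

# Numerical illustration of the "Curse of Dimensionality" point above:
# as dimensionality grows, pairwise distances between random points
# concentrate around their mean, so a fixed-size sample covers the
# space ever more sparsely.
rng = np.random.default_rng(0)

def relative_distance_spread(dim, n=100):
    pts = rng.uniform(size=(n, dim))
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    d = d[np.triu_indices(n, k=1)]           # unique pairwise distances
    return (d.max() - d.min()) / d.mean()    # shrinks as dim grows

spread_low = relative_distance_spread(2)     # low-dimensional: wide spread
spread_high = relative_distance_spread(500)  # high-dimensional: concentrated
```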

  5. synthetic-from-unit-triple-tasks-danish

    • huggingface.co
    • sprogteknologi.dk
    Updated Jan 26, 2025
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-danish [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish
    Explore at:
    Croissant (a machine-learning dataset format; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
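    In the generation recipe this dataset follows (https://arxiv.org/pdf/2401.00368), each LLM response carries a generated text pair that gets parsed into training examples for the embedding model. A hedged sketch of such parsing on a toy record; the JSON field names here are assumptions for illustration, not the dataset's documented schema — inspect the actual "response" column first:

```python
import json

# Hedged sketch: parse a hypothetical LLM "response" (JSON) into a
# (query, positive) training pair for an embedding model. Field names
# are illustrative assumptions, not this dataset's documented schema.
toy_response = (
    '{"input": "Hvad er hovedstaden i Danmark?",'
    ' "positive_document": "København er Danmarks hovedstad."}'
)

def to_training_pair(response: str):
    record = json.loads(response)
    return record["input"], record["positive_document"]

query, positive = to_training_pair(toy_response)
```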

  6. MatSim Dataset and benchmark for one-shot visual materials and textures recognition

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jun 25, 2025
    Cite
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik (2025). MatSim Dataset and benchmark for one-shot visual materials and textures recognition [Dataset]. http://doi.org/10.5281/zenodo.7390166
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The MatSim Dataset and benchmark

    Latest version

    Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.

    MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures focusing on identifying any material under any conditions using one or a few examples (one-shot learning).

    Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
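    The one-shot setup can be sketched as nearest-neighbor matching in an embedding space learned contrastively: a query image is labeled with the material whose single reference example has the most similar embedding. A toy illustration with made-up vectors (not the paper's trained network):

```python
import numpy as np

# Toy sketch of one-shot material recognition: a contrastively trained
# network maps images to embeddings, and a query is labeled by its nearest
# reference embedding. The vectors below are made-up stand-ins.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

references = {                # one embedding per material, one example each
    "rust":  np.array([0.9, 0.1, 0.0]),
    "glass": np.array([0.1, 0.9, 0.2]),
}
query = np.array([0.8, 0.2, 0.1])  # embedding of an unseen image

best = max(references, key=lambda name: cosine(query, references[name]))
```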

    Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper



    MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset, with CGI images of materials on random objects, as described in the paper.

    MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset, with CGI images of materials inside transparent containers, as described in the paper.

    *Note: these are subsets of the dataset; the full dataset can be found at:
    https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX

    or
    https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF

    Code:

    Up-to-date code for generating the dataset, reading and evaluating it, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net

    Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset. This code might be old; up-to-date code can be found at the GitHub URL above.
    Net_Code_And_Trained_Model.zip: contains reference neural-net code, including loaders, trained models, and evaluation scripts that can be used to read and train with the synthetic dataset or test the model with the benchmark. Note: the code in the ZIP file is not up to date and contains some bugs; for the latest version of this code, see the GitHub URL above.

    Further documentation can be found inside the zip files or in the paper.

  7. synthetic-from-unit-triple-tasks-norwegian

    • huggingface.co
    Updated Jan 26, 2025
    Cite
    Kasper Groes Albin Ludvigsen (2025). synthetic-from-unit-triple-tasks-norwegian [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian
    Explore at:
    Croissant (a machine-learning dataset format; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2025
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset

    The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian.

  8. Benchmark datasets to study fairness in synthetic data generation

    • zenodo.org
    csv, json
    Updated Aug 28, 2024
    Cite
    Joao Fonseca; Joao Fonseca (2024). Benchmark datasets to study fairness in synthetic data generation [Dataset]. http://doi.org/10.5281/zenodo.13375623
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joao Fonseca; Joao Fonseca
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The traveltime dataset is based on the Folktables project covering US census data. The target is a binary variable encoding whether or not the individual needs to travel more than 20 minutes for work; here, having a shorter travel time is the desirable outcome. We use a subset of data from the states of California, Florida, Maine, New York, Utah, and Wyoming in 2018. Although the Folktables dataset does not have any missing values, there are some values recorded as NaN due to the Bureau's data collection methodology. We remove the "esp" column, which encodes the employment status of parents and has 99.55% missing values. We encode the missing values in "povpip", the income-to-poverty ratio (0.85% missing), as -1, in accordance with the methodology in Ding et al. See https://arxiv.org/pdf/2108.04884 for metadata.
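    The preprocessing described above is straightforward to express in pandas. A sketch on a toy frame; "jwmnp" (travel time to work) is the assumed ACS column name behind the target, and the values are made up:

```python
import numpy as np
import pandas as pd

# Sketch of the preprocessing described above, on a toy frame. Column names
# follow the description; "jwmnp" (ACS travel time to work) is an assumption.
df = pd.DataFrame({
    "jwmnp":  [10, 35, 25, 5],                # travel time to work, minutes
    "esp":    [np.nan, np.nan, 1.0, np.nan],  # ~99.55% missing in real data
    "povpip": [150.0, np.nan, 80.0, 300.0],   # income-to-poverty ratio
})

df = df.drop(columns=["esp"])                  # drop the mostly-missing column
df["povpip"] = df["povpip"].fillna(-1)         # encode missing ratio as -1
df["target"] = (df["jwmnp"] > 20).astype(int)  # 1 = travels more than 20 min
```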

    The cardio (a) dataset contains patient data recorded during medical examination, including 3 binary features supplied by the patient. The target class denotes the presence of cardiovascular disease. This dataset represents predictive tasks that allocate access to priority medical care for patients, and has been used for fairness evaluations in the domain.

    The credit dataset contains historical financial data of borrowers, including past non-serious delinquencies. Here, a serious delinquency is considered to be 90 days past due, and this is the target variable.

    The German Credit dataset (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data) contains financial and personal information regarding loan-seeking applicants.

  9. DataSheet1_Removing the Bottleneck: Introducing cMatch - A Lightweight Tool for Construct-Matching in Synthetic Biology

    • frontiersin.figshare.com
    pdf
    Updated Jun 15, 2023
    Cite
    Alexis Casas; Matthieu Bultelle; Charles Motraghi; Richard Kitney (2023). DataSheet1_Removing the Bottleneck: Introducing cMatch - A Lightweight Tool for Construct-Matching in Synthetic Biology.PDF [Dataset]. http://doi.org/10.3389/fbioe.2021.785131.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Frontiers
    Authors
    Alexis Casas; Matthieu Bultelle; Charles Motraghi; Richard Kitney
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a software tool, called cMatch, to reconstruct and identify synthetic genetic constructs from their sequences, or from a set of sub-sequences, based on two practical pieces of information: their modular structure, and libraries of components. Although developed for combinatorial pathway engineering problems and addressing their quality control (QC) bottleneck, cMatch is not restricted to these applications. QC takes place post assembly, transformation, and growth. It has a simple goal: to verify that the genetic material contained in a cell matches what was intended to be built, and when that is not the case, to locate the discrepancies and estimate their severity. In terms of reproducibility/reliability, the QC step is crucial. Failure at this step requires repetition of the construction and/or sequencing steps. When performed manually or semi-manually, QC is an extremely time-consuming, error-prone process, which scales very poorly with the number of constructs and their complexity. To make QC frictionless and more reliable, cMatch performs an operation we have called "construct-matching" and automates it. Construct-matching is more thorough than simple sequence-matching, as it matches at the functional level, and it quantifies the matching at the individual component level and across the whole construct.

    Two algorithms (called CM_1 and CM_2) are presented. They differ according to the nature of their inputs. CM_1 is the core algorithm for construct-matching and is to be used when input sequences are long enough to cover constructs in their entirety (e.g., obtained with methods such as next-generation sequencing). CM_2 is an extension designed to deal with shorter data (e.g., obtained with Sanger sequencing) that need recombining. Both algorithms are shown to yield accurate construct-matching in a few minutes (even on hardware with limited processing power), together with a set of metrics that can be used to improve the robustness of the decision-making process. To ensure reliability and reproducibility, cMatch builds on the highly validated pairwise-matching Smith-Waterman algorithm. All the tests presented have been conducted on synthetic data for challenging yet realistic constructs, and on real data gathered during studies on a metabolic engineering example (lycopene production).
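    The Smith-Waterman local alignment that cMatch builds on can be sketched compactly. A minimal scoring-only implementation with illustrative parameters (+2 match, -1 mismatch, -2 gap; cMatch's actual scoring may differ):

```python
# Minimal Smith-Waterman local-alignment score: the pairwise-matching
# primitive cMatch builds on. Scoring parameters here are illustrative.
def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

score = smith_waterman_score("ACGTTG", "ACGTG")  # 5 matches, 1 gap -> 8
```

    Construct-matching layers component libraries and modular structure on top of this primitive, but the local-alignment score is the underlying similarity measure.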

  10. TRADOC_synthetic_instruct_long_short_context_response_v1

    • huggingface.co
    Updated Jul 6, 2024
    Cite
    Legion Intelligence Inc. (2024). TRADOC_synthetic_instruct_long_short_context_response_v1 [Dataset]. https://huggingface.co/datasets/LegionIntel/TRADOC_synthetic_instruct_long_short_context_response_v1
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset authored and provided by
    Legion Intelligence Inc.
    Description

    This synthetic instruct dataset is curated from select PDF documents obtained from the TRADOC corpus. Instruction generation procedure:

    1. Pull all the sections from the TRADOC PDF files (using a section extractor)
    2. Run TF-IDF on the sections to find clusters of similar questions
    3. Use the Markov shuffling approach to identify sections within the same cluster (lexically similar at the least)
    4. Feed that as context to the LLM to generate a diverse set of instructions
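    Steps 2-3 of the procedure above can be sketched with TF-IDF vectors and cosine similarity. A toy pure-Python illustration (the sections are stand-in strings, not actual TRADOC content, and the real pipeline's extractor and clustering are more involved):

```python
import math
from collections import Counter

# Toy sketch of TF-IDF + cosine similarity for grouping lexically similar
# sections. Stand-in strings, not actual TRADOC content.
sections = [
    "logistics convoy planning and fuel supply",
    "convoy fuel supply and route planning",
    "radio communications protocol for field units",
]

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    # Term frequency weighted by smoothed inverse document frequency.
    return [
        {t: (c / len(toks)) * math.log((1 + n) / (1 + df[t]))
         for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf(sections)
sim_01 = cosine(vecs[0], vecs[1])  # overlapping vocabulary: high similarity
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms: zero similarity
```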

    RAG Context:

    The RAG context for… See the full description on the dataset page: https://huggingface.co/datasets/LegionIntel/TRADOC_synthetic_instruct_long_short_context_response_v1.

  11. MatSeg: Material State Segmentation Dataset and Benchmark

    • zenodo.org
    zip
    Updated May 22, 2025
    Cite
    Zenodo (2025). MatSeg: Material State Segmentation Dataset and Benchmark [Dataset]. http://doi.org/10.5281/zenodo.11331618
    Explore at:
    Available download formats: zip
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    MatSeg Dataset and benchmark for zero-shot material state segmentation.

    The MatSeg Benchmark, containing 1220 real-world images and their annotations, is available in MatSeg_Benchmark.zip; the file contains documentation and Python readers.

    The MatSeg dataset, containing synthetic images infused with natural image patterns, is available in MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).

    MatSeg3D_part_*.zip: contains synthetic 3D scenes

    MatSeg2D_part_*.zip: contains synthetic 2D scenes

    Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip

    Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip

    The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072

    Additional permanent sources for downloading the dataset and metadata: 1, 2

    Evaluation scripts for the Benchmark are now available at:

    https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX

    Description

    Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.

    Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.

    The MatSeg dataset focuses on zero-shot segmentation of materials and their states: identifying the region of an image belonging to a specific material type or state, without previous knowledge of or training on the material type, state, or environment.

    The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.

    Benchmark

    The benchmark contains 1220 real-world images with a wide range of material states and settings: food states (cooked/burned), plants (infected/dry), rocks/soil (minerals/sediment), construction/metals (rusted, worn), liquids (foam/sediment), and many other states, without being limited to a set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of, or pretraining on, the material or setting. The focus is on materials with complex scattered boundaries and gradual transitions (like the level of wetness of a surface).

    Evaluation scripts for the Benchmark are now available at: 1 and 2.
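    Segmentation benchmarks of this kind are typically scored with intersection over union (IoU) between predicted and annotated regions. A minimal sketch of the metric (the benchmark's own evaluation scripts, linked above, define the authoritative protocol):

```python
import numpy as np

# Minimal IoU sketch for binary segmentation masks; illustrative only,
# the benchmark's evaluation scripts define the actual protocol.
def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

gt = np.zeros((4, 4), dtype=bool); gt[:2, :] = True      # annotated region
pred = np.zeros((4, 4), dtype=bool); pred[:3, :] = True  # over-segmented
score = iou(pred, gt)  # intersection 8 / union 12
```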

    Synthetic Dataset

    The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data are infused with patterns, materials, and textures automatically extracted from real images, allowing them to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.

    License

    This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.

    The MatSeg 2D and 3D synthetic scenes were generated using the Open Images dataset, which is licensed under https://www.apache.org/licenses/LICENSE-2.0. For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets under a GNU license.

    Example Usage:

    An example of training and evaluation code for a net trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2

    This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.

    Paper:

    More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".

    Croissant metadata and additional sources for downloading the dataset are available at 1, 2

  12. Synthetic from Retrieval Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Retrieval Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-retrieval-tasks-danish
    Explore at:
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish retrieval tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

    The data generation process described in this paper was followed:

    https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  13. Synthetic from Classification Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Classification Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-classification-tasks-danish
    Explore at:
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/classification-tasks-processed

    The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  14. Synthetic from Text Matching Short Tasks Danish

    • sprogteknologi.dk
    Updated Jan 24, 2025
    Cite
    Danish Data Science Community (2025). Synthetic from Text Matching Short Tasks Danish [Dataset]. https://sprogteknologi.dk/dataset/synthetic-from-text-matching-short-tasks-danish
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/parquet (available download formats)
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Danish Data Science Community
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Danmark
    Description

    The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks on short texts.

    The dataset consists of 100,000 samples generated with gemma-2-27b-it.

    The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.

    Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

    The data generation process described in this paper was followed:

    https://arxiv.org/pdf/2401.00368

    Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

  15. OGC_Hydrogen

    • huggingface.co
    Cite
    racine.ai, OGC_Hydrogen [Dataset]. https://huggingface.co/datasets/racineai/OGC_Hydrogen
    Explore at:
    Dataset provided by
    racine.ai
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OGC - Organized, Grouped, Cleaned

      Hydrogen Vision DSE
    

    Intended for image/text to vector (DSE)

      Dataset Composition
    

    Made with https://github.com/RacineAIOS/OGC_pdf-to-parquet. This dataset was created by scraping PDF documents from online sources and generating relevant synthetic queries. We used Google's Gemini 2.0 Flash Lite model in our custom pipeline to produce the queries, allowing us to create a diverse set of questions based on the document content.… See the full description on the dataset page: https://huggingface.co/datasets/racineai/OGC_Hydrogen.

  16. SynthRAD2025 Grand Challenge dataset: generating synthetic CT for...

    • zenodo.org
    pdf
    Updated Jul 2, 2025
    Cite
    Adrian Thummerer; Erik van der Bijl; Arthur Jr. Galapon; Florian Kamp; Matteo Maspero (2025). SynthRAD2025 Grand Challenge dataset: generating synthetic CT for radiotherapy [Dataset]. http://doi.org/10.5281/zenodo.14918089
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Adrian Thummerer; Erik van der Bijl; Arthur Jr. Galapon; Florian Kamp; Matteo Maspero
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2025
    Description

    Dataset Description

    Dataset Structure

    A detailed description is available in "SynthRAD2025_dataset_description.pdf". A paper describing the dataset has been submitted to Medical Physics and is available as a pre-print at https://arxiv.org/abs/2502.17609. The dataset is divided into two tasks:

    • Task 1 (MRI-to-CT conversion) is provided in Task1.zip.
    • Task 2 (CBCT-to-CT conversion) is provided in Task2.zip.

    After extraction, the dataset is organized as follows:

    Within each task, cases are categorized into three anatomical regions:

    • Head-and-neck (HN)
    • Thorax (TH)
    • Abdomen (AB)

    Each anatomical region contains individual patient folders, named using a unique seven-character alphanumeric code:
    [Task Number][Anatomy][Center][PatientID]
    Example: 1HNA001
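
As a small illustrative sketch (not part of the official dataset tooling), the naming scheme above can be split programmatically; the field widths are inferred from the example 1HNA001:

```python
# Hedged sketch: split a SynthRAD2025 patient code into its documented
# parts, [Task Number][Anatomy][Center][PatientID]. Field widths are
# inferred from the example "1HNA001".
def parse_patient_code(code: str) -> dict:
    if len(code) != 7:
        raise ValueError(f"expected a 7-character code, got {code!r}")
    return {
        "task": int(code[0]),     # 1 = MRI-to-CT, 2 = CBCT-to-CT
        "anatomy": code[1:3],     # HN, TH, or AB
        "center": code[3],        # A-E
        "patient_id": code[4:],   # e.g. "001"
    }

print(parse_patient_code("1HNA001"))
# {'task': 1, 'anatomy': 'HN', 'center': 'A', 'patient_id': '001'}
```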

    Each patient folder in the training dataset contains (for other sets see Table below):

    • ct.mha: preprocessed CT image
    • mr.mha or cbct.mha (depending on the task): preprocessed MR or CBCT image
    • mask.mha: Binary mask of the patient outline (dilated)

    An overview folder within each anatomical region contains:

    • [task]_[anatomy]_parameters.xlsx: Imaging protocol details for each patient.
    • [task][anatomy][center][PatientID]_overview.png: A visualization of axial, coronal, and sagittal slices of CBCT/MR, CT, mask, and difference images.

    Dataset Overview

    The SynthRAD2025 dataset is part of the second edition of the SynthRAD deep learning challenge (https://synthrad2025.grand-challenge.org/), which benchmarks synthetic CT generation for MRI- and CBCT-based radiotherapy workflows.

    • Task 1: MRI-to-CT conversion for MR-only and MR-guided photon/proton radiotherapy, consisting of 890 MRI-CT pairs.
    • Task 2: CBCT-to-CT conversion for daily adaptive radiotherapy workflows, consisting of 1,472 CBCT-CT pairs.

    Imaging data was collected from five European university medical centers:

    • Netherlands: UMC Groningen, UMC Utrecht, Radboud UMC
    • Germany: LMU Klinikum Munich, UK Cologne

    All centers have independently approved the study in accordance with their institutional review boards or medical ethics committee regulations.

    Inclusion criteria:

    • Patients treated with external beam radiotherapy (photon or proton therapy) at one of the data-providing centers.
    • Imaging data available from one of the three anatomical regions.
    • No restrictions on age, sex, tumor characteristics, or staging.

    License

    The dataset is provided under two different licenses:

    • Data from centers A, B, C, and E is provided under a CC-BY-NC 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/).
    • Data from center D is provided under a limited license which permits its use only for the duration of the challenge and remains valid only while the challenge is active (Limited Use License Center D). By downloading Center D's data, participants agree to these terms. Once the challenge ends, access to the data ends, the download link will be deactivated, and all downloaded data must be deleted. After requesting participation in the challenge on the SynthRAD2025 website, participants can access the download link for center D at https://synthrad2025.grand-challenge.org/data/.

    Data Release Schedule

    • Training (Input, CT, Mask), released 01-03-2025: https://doi.org/10.5281/zenodo.14918213
    • Training Center D (Input, CT, Mask), released 01-03-2025: check the download link at https://synthrad2025.grand-challenge.org/data/ (Limited Use License Center D)
    • Validation Input (Input, Mask), released 01-06-2025: https://doi.org/10.5281/zenodo.14918504
    • Validation Input Center D (Input, Mask), released 01-06-2025: check the download link at https://synthrad2025.grand-challenge.org/data/ (Limited Use License Center D)
    • Validation Ground Truth (CT, Deformed CT), released 01-03-2030: https://doi.org/10.5281/zenodo.14918605
    • Test (Input, CT, Deformed CT, Mask), released 01-03-2030: https://doi.org/10.5281/zenodo.14918722
    Dataset Composition

    The number of cases collected at each center for training, validation, and test sets.

    Training Set

    Task 1 (HN / TH / AB / Total):

    • Center A: 91 / 91 / 65 / 247
    • Center B: 0 / 91 / 91 / 182
    • Center C: 65 / 0 / 19 / 84
    • Center D: 65 / 0 / 0 / 65
    • Center E: 0 / 0 / 0 / 0
    • Total: 221 / 182 / 175 / 578

    Task 2 (HN / TH / AB / Total):

    • Center A: 65 / 65 / 64 / 195
    • Center B: 65 / 65 / 65 / 195
    • Center C: 65 / 63 / 62 / 190
    • Center D: 65 / 63 / 53 / 181
    • Center E: 65 / 65 / 65
  17. Zeus-GameOver-synthetic-dataset

    • figshare.com
    zip
    Updated Jun 10, 2023
    Cite
    Dhruba Jyoti Borah (2023). Zeus-GameOver-synthetic-dataset [Dataset]. http://doi.org/10.6084/m9.figshare.20832001.v3
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    figshare
    Authors
    Dhruba Jyoti Borah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The repository contains a synthetic Zeus GameOver dataset generated in a testbed. It is a compressed file containing a simulation of Zeus GameOver botnet traffic flow. The bot software was created by studying the characteristics of Zeus GameOver from two technical reports: "ZeuS-P2P monitoring and analysis" by CERT Polska, published in June 2013 (https://www.cert.pl/en/uploads/2015/12/2013-06-p2p-rap_en.pdf), and "An analysis of the Zeus peer-to-peer protocol" by Dennis Andriesse and Herbert Bos, technical report, VU University Amsterdam, The Netherlands, April 2014 (https://syssec.mistakenot.net/papers/zeus-tech-report-2013.pdf). A testbed was set up with 101 virtual hosts, each with a copy of the bot software installed; the bots then communicate with one another. Network traffic was captured for 24 hours with the tcpdump tool, and the captured traffic was used to generate NetFlow records with the nprobe tool. The source and destination IP addresses were then extracted from the resulting flow dataset. The dataset uploaded here is a text file containing the communication information of the bot nodes, with two fields: source IP address and destination IP address.
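
As a minimal sketch (the field delimiter is an assumption; the description only states that each line has two fields), the flow file can be read into an edge list for graph analysis:

```python
# Hedged sketch: parse the two-field flow file (source IP, destination IP)
# into an edge list. Whitespace separation is an assumption; adjust the
# split() call if the actual file is comma-separated.
def read_edges(lines):
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:        # skip blank or malformed lines
            edges.append((parts[0], parts[1]))
    return edges

# Usage with a few hypothetical records:
sample = ["10.0.0.1 10.0.0.2", "10.0.0.2 10.0.0.3", ""]
print(read_edges(sample))  # [('10.0.0.1', '10.0.0.2'), ('10.0.0.2', '10.0.0.3')]
```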

  18. Data Sheet 1_Training marine species object detectors with synthetic images...

    • frontiersin.figshare.com
    pdf
    Updated Jul 11, 2025
    Cite
    Heather Doig; Oscar Pizarro; Stefan Williams (2025). Data Sheet 1_Training marine species object detectors with synthetic images and unsupervised domain adaptation.pdf [Dataset]. http://doi.org/10.3389/fmars.2025.1581778.s001
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Frontiers
    Authors
    Heather Doig; Oscar Pizarro; Stefan Williams
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Visual surveys by autonomous underwater vehicles (AUVs) and other underwater platforms provide a valuable method for analysing and understanding the benthic environment. Scientists can measure the presence and abundance of benthic species by manually annotating survey images with online annotation software or other tools. Neural network object detectors can reduce the effort involved in this process by locating and classifying species of interest in the images. However, accurate object detectors often rely on large numbers of annotated training images which are not currently available for many marine applications. To address this issue, we propose a novel pipeline for generating large amounts of synthetic annotated training data for a species of interest using 3D modelling and rendering software. The detector is trained with synthetic images and annotations along with real unlabelled images to improve performance through domain adaptation. Our method is demonstrated on a sea urchin detector trained only with synthetic data, achieving a performance slightly lower than an equivalent detector trained with manually labelled real images (AP50 of 84.3 vs 92.3). Using realistic synthetic data for species or objects with few or no annotations is a promising approach to reducing the manual effort required to analyse imaging survey data.

  19. Text S1 - Effective Reduced Diffusion-Models: A Data Driven Approach to the...

    • figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Gustavo Deco; Daniel Martí; Anders Ledberg; Ramon Reig; Maria V. Sanchez Vives (2023). Text S1 - Effective Reduced Diffusion-Models: A Data Driven Approach to the Analysis of Neuronal Dynamics [Dataset]. http://doi.org/10.1371/journal.pcbi.1000587.s001
    Explore at:
    pdf (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Gustavo Deco; Daniel Martí; Anders Ledberg; Ramon Reig; Maria V. Sanchez Vives
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of the network of spiking neurons used to generate synthetic data. (0.09 MB PDF)

  20. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
