Ainnotate’s proprietary dataset generation methodology, based on large-scale generative modelling and domain randomization, provides well-balanced data with consistent sampling that accommodates rare events, enabling superior simulation and training of your models.
Ainnotate currently provides synthetic datasets in the following domains and use cases.
Internal Services - Visa applications, Passport validation, License validation, Birth certificates
Financial Services - Bank checks, Bank statements, Pay slips, Invoices, Tax forms, Insurance claims, and Mortgage/Loan forms
Healthcare - Medical ID cards
This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.
This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.
The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.
Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.
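As a hedged illustration of the model-then-sample mechanism (not the County's actual synthesizer, which used far richer models and explicit privacy safeguards), the sketch below fits an empirical joint distribution to invented service-use counts and samples a synthetic table from it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for confidential records: monthly service-use counts per client.
real = pd.DataFrame({"jan_visits": rng.poisson(3, 500),
                     "feb_visits": rng.poisson(4, 500)})

# Step 1: "model" the data; here, just the empirical joint distribution
# (a real synthesizer would fit a smoothed statistical model instead).
probs = real.value_counts(normalize=True)

# Step 2: sample a synthetic dataset of the same size from that model.
idx = rng.choice(len(probs), size=len(real), p=probs.to_numpy())
synthetic = pd.DataFrame(list(probs.index[idx]), columns=real.columns)

# Step 3: evaluate utility, e.g., by comparing marginal statistics.
print(real.mean())
print(synthetic.mean())
```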
For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.
This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of the data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.
Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.
Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).
1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of the synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
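As a hedged illustration of the VAE approach described in the abstract (not the authors' implementation), the PyTorch sketch below encodes posture vectors into a latent space and generates synthetic postures by decoding samples from the prior; the dimensions are invented:

```python
import torch
import torch.nn as nn

INPUT_DIM, LATENT_DIM = 300, 16  # hypothetical sizes, not the study's

class PostureVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(INPUT_DIM, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, LATENT_DIM)
        self.to_logvar = nn.Linear(128, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, INPUT_DIM))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# After training, synthetic postures are produced by decoding z ~ N(0, I):
model = PostureVAE()
with torch.no_grad():
    synthetic_batch = model.decoder(torch.randn(64, LATENT_DIM))
```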
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of synthetic data is recognized as a crucial step in the development of neural network-based Artificial Intelligence (AI) systems. While the methods for generating synthetic data for AI applications in other domains have a role in certain biomedical AI systems, primarily related to image processing, there is a critical gap in the generation of time series data for AI tasks where it is necessary to know how the system works. This is most pronounced in the ability to generate synthetic multi-dimensional molecular time series data (subsequently referred to as synthetic mediator trajectories or SMTs); this is the type of data that underpins research into biomarkers and mediator signatures for forecasting various diseases and is an essential component of the drug development pipeline. We argue the insufficiency of statistical and data-centric machine learning (ML) means of generating this type of synthetic data is due to a combination of factors: perpetual data sparsity due to the Curse of Dimensionality, the inapplicability of the Central Limit Theorem in terms of making assumptions about the statistical distributions of this type of data, and the inability to use ab initio simulations due to the state of perpetual epistemic incompleteness in cellular/molecular biology. Alternatively, we present a rationale for using complex multi-scale mechanism-based simulation models, constructed and operated on to account for perpetual epistemic incompleteness and the need to provide maximal expansiveness in concordance with the Maximal Entropy Principle. These procedures provide for the generation of SMT that minimizes the known shortcomings associated with neural network AI systems, namely overfitting and lack of generalizability. The generation of synthetic data that accounts for the identified factors of multi-dimensional time series data is an essential capability for the development of mediator-biomarker based AI forecasting systems, and therapeutic control development and optimization.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
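For orientation, the recipe in https://arxiv.org/pdf/2401.00368 first samples a task definition, then prompts the LLM for a structured example of that task. The sketch below is a simplified, hypothetical rendering; `call_llm` and the seed tasks stand in for the gemma-2-27b-it pipeline:

```python
import json
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a gemma-2-27b-it inference call here")

seed_tasks = [  # illustrative; the dataset samples from a seed-task collection
    "Retrieve Danish news articles relevant to a short user question.",
    "Match a Danish product review to a one-sentence summary.",
]

task = random.choice(seed_tasks)
prompt = (
    f"You have been assigned a text similarity task: {task}\n"
    "Write one JSON object with keys 'input_text', 'positive_document' and "
    "'hard_negative_document', all in Danish."
)
sample = json.loads(call_llm(prompt))  # one (prompt, response) row of the dataset
```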
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The MatSim Dataset and benchmark
Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.
MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures, focused on identifying any material under any conditions from one or a few examples (one-shot learning).
Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
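As a rough illustration of the one-shot setting (not the paper's training method), a contrastively trained encoder can label a query image by its nearest reference embedding; `embed` is a hypothetical stand-in for such an encoder:

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    raise NotImplementedError("a contrastively trained encoder goes here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def one_shot_classify(query: np.ndarray, references: dict) -> str:
    # references: {material_name: example_image}, one example per material
    q = embed(query)
    return max(references, key=lambda name: cosine(q, embed(references[name])))
```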
Benchmark_MATSIM.zip: contains the benchmark of real-world images described in the paper.
MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.
MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.
*Note: these are subsets of the dataset; the full dataset can be found at:
https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX
or
https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF
Code:
Up-to-date code for generating the dataset, readers, evaluation scripts, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net
Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset. This code might be old; up-to-date code can be found at the GitHub URL above.
Net_Code_And_Trained_Model.zip: contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model against the benchmark. Note: the code in this ZIP file is not up to date and contains some bugs; for the latest version, see the GitHub URL above.
Further documentation can be found inside the zip files or in the paper.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The traveltime dataset is based on the Folktables project covering US census data. The target is a binary variable encoding whether or not the individual needs to travel more than 20 minutes for work; here, having a shorter travel time is the desirable outcome. We use a subset of data from the states of California, Florida, Maine, New York, Utah, and Wyoming in 2018. Although the folktables dataset does not have any missing values, there are some values recorded as NaN due to the Bureau's data collection methodology. We remove the "esp" column, which encodes the employment status of parents and has 99.55% missing values. We encode the missing values in povpip, the income-to-poverty ratio (0.85% missing), as -1 in accordance with the methodology in Ding et al. See https://arxiv.org/pdf/2108.04884 for metadata.
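A minimal pandas sketch of this preprocessing, assuming lowercase PUMS column names (`esp`, `povpip`, and `jwmnp` for travel time to work) and a hypothetical input file:

```python
import pandas as pd

df = pd.read_csv("traveltime_raw.csv")  # hypothetical file name

# Drop the parents' employment status column (99.55% missing).
df = df.drop(columns=["esp"])

# Encode the missing income-to-poverty ratio values as -1, per Ding et al.
df["povpip"] = df["povpip"].fillna(-1)

# Binary target: does the individual travel more than 20 minutes for work?
df["target"] = (df["jwmnp"] > 20).astype(int)
```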
The cardio (a) dataset contains patient data recorded during medical examination, including 3 binary features supplied by the patient. The target class denotes the presence of cardiovascular disease. This dataset represents predictive tasks that allocate access to priority medical care for patients, and has been used for fairness evaluations in the domain.
The credit dataset contains historical financial data of borrowers, including past non-serious delinquencies. Here, a serious delinquency is considered to be 90 days past due, and this is the target variable.
The German Credit dataset (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data) contains financial and personal information regarding loan-seeking applicants.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a software tool, called cMatch, to reconstruct and identify synthetic genetic constructs from their sequences, or from a set of sub-sequences, based on two practical pieces of information: their modular structure and libraries of components. Although developed for combinatorial pathway engineering problems and addressing their quality control (QC) bottleneck, cMatch is not restricted to these applications. QC takes place post assembly, transformation, and growth. It has a simple goal: to verify that the genetic material contained in a cell matches what was intended to be built, and, when it does not, to locate the discrepancies and estimate their severity. In terms of reproducibility and reliability, the QC step is crucial; failure at this step requires repetition of the construction and/or sequencing steps. When performed manually or semi-manually, QC is an extremely time-consuming, error-prone process, which scales very poorly with the number of constructs and their complexity. To make QC frictionless and more reliable, cMatch performs and automates an operation we have called "construct-matching". Construct-matching is more thorough than simple sequence-matching, as it matches at the functional level and quantifies the matching at the individual component level and across the whole construct. Two algorithms (called CM_1 and CM_2) are presented. They differ according to the nature of their inputs. CM_1 is the core algorithm for construct-matching and is to be used when input sequences are long enough to cover constructs in their entirety (e.g., obtained with methods such as next-generation sequencing). CM_2 is an extension designed to deal with shorter data (e.g., obtained with Sanger sequencing) that need recombining. Both algorithms are shown to yield accurate construct-matching in a few minutes (even on hardware with limited processing power), together with a set of metrics that can be used to improve the robustness of the decision-making process. To ensure reliability and reproducibility, cMatch builds on the highly validated pairwise-matching Smith-Waterman algorithm. All the tests presented have been conducted on synthetic data for challenging yet realistic constructs, and on real data gathered during studies on a metabolic engineering example (lycopene production).
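For intuition, the pairwise primitive that cMatch builds on is Smith-Waterman local alignment; the hedged sketch below uses Biopython's `PairwiseAligner` in local mode with illustrative scores and sequences (the real tool's scoring and component-matching logic differ):

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"          # Smith-Waterman
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

component = "ATGGCTAGCTAA"       # a library component (hypothetical)
read = "CCATGGCTAGCTAAGG"        # a sequencing read covering the construct

alignment = aligner.align(read, component)[0]
# A component "matches" when its score approaches the perfect-match score.
perfect = aligner.match_score * len(component)
print(alignment.score / perfect)  # 1.0 here: exact component match
```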
This synthetic instruct dataset is curated from select PDF documents obtained from the TRADOC corpus. Instruction generation procedure:
1) Pull all the sections from the TRADOC PDF files (using a section extractor).
2) Run TF-IDF on the sections to find clusters of similar content (see the sketch after this list).
3) Use the Markov shuffling approach to identify sections within the same cluster (lexically similar at the least).
4) Feed these sections as context to the LLM to generate a diverse set of instructions.
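A hedged sketch of the clustering step, with illustrative section texts and cluster count; the actual pipeline's parameters are not published here:

```python
# TF-IDF vectors over extracted sections, then k-means to group lexically
# similar sections that can share one LLM generation context.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sections = ["...section text...", "...another section..."]  # from the extractor

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(sections)

labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(X)

# Sections sharing a label become candidates for the same generation context.
clusters = {}
for section, label in zip(sections, labels):
    clusters.setdefault(label, []).append(section)
print(clusters)
```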
RAG Context:
The RAG context for… See the full description on the dataset page: https://huggingface.co/datasets/LegionIntel/TRADOC_synthetic_instruct_long_short_context_response_v1.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
MatSeg Dataset and benchmark for zero-shot material state segmentation.
The MatSeg Benchmark, containing 1220 real-world images and their annotations, is available at MatSeg_Benchmark.zip; the file contains documentation and Python readers.
The MatSeg dataset, containing synthetic images infused with natural image patterns, is available at MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).
MatSeg3D_part_*.zip: contains synthetic 3D scenes
MatSeg2D_part_*.zip: contains synthetic 2D scenes
Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip
Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip
The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072
Additional permanent sources for downloading the dataset and metadata: 1, 2
Evaluation scripts for the Benchmark are now available at:
https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX
Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.
Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.
The MatSeg dataset focuses on zero-shot segmentation of materials and their states, meaning identifying the region of an image belonging to a specific material type of state, without previous knowledge or training of the material type, states, or environment.
The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.
The benchmark contains 1220 real-world images with a wide range of material states and settings, for example: food states (cooked/burned), plants (infected/dry), rocks/soil (minerals/sediment), construction/metals (rusted, worn), and liquids (foam/sediment), among many other states, without being limited to a set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of or pretraining on the material or setting. The focus is on materials with complex, scattered boundaries and gradual transitions (like the level of wetness of a surface).
Evaluation scripts for the Benchmark are now available at: 1 and 2.
The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data is infused with patterns, materials, and textures automatically extracted from real images, allowing it to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.
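As a rough, hypothetical illustration of the infusion idea (the actual pipeline runs inside Blender), a texture patch cropped from a real photograph can serve as the material map applied to synthetic scene geometry:

```python
import numpy as np
from PIL import Image

def random_patch(image_path: str, size: int = 256) -> np.ndarray:
    # Crop a random square patch from a real photo to use as a material map.
    img = np.asarray(Image.open(image_path).convert("RGB"))
    y = np.random.randint(0, img.shape[0] - size)
    x = np.random.randint(0, img.shape[1] - size)
    return img[y:y + size, x:x + size]

# A real-world photo supplies the pattern; in the full pipeline this patch
# would be assigned as a texture in Blender, so segmentation masks stay exact
# while the appearance keeps real-world complexity.
pattern = random_patch("real_photo.jpg")  # hypothetical file name
```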
License
This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.
The MatSeg 2D and 3D synthetic data were generated using the Open Images dataset, which is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets, which carry a GNU license.
An example of training and evaluation code for a net trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2. This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.
Paper:
More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".
Croissant metadata and additional sources for downloading the dataset are available at 1,2
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The purpose of this dataset is to pre- or post-train embedding models for Danish retrieval tasks.
The dataset consists of 100,000 samples generated with gemma-2-27b-it.
The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed
The data generation process described in this paper was followed:
https://arxiv.org/pdf/2401.00368
Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks.
The dataset consists of 100,000 samples generated with gemma-2-27b-it.
The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/classification-tasks-processed
The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368
Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks on short texts.
The dataset consists of 100,000 samples generated with gemma-2-27b-it.
The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed
The data generation process described in this paper was followed:
https://arxiv.org/pdf/2401.00368
Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OGC - Organized, Grouped, Cleaned
Hydrogen Vision DSE
Intended for image/text to vector (DSE)
Dataset Composition
Made with https://github.com/RacineAIOS/OGC_pdf-to-parquet. This dataset was created by scraping PDF documents from online sources and generating relevant synthetic queries. We used Google's Gemini 2.0 Flash Lite model in our custom pipeline to produce the queries, allowing us to create a diverse set of questions based on the document content.… See the full description on the dataset page: https://huggingface.co/datasets/racineai/OGC_Hydrogen.
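A hedged sketch of the query-generation step; `generate_query` is a placeholder for the Gemini 2.0 Flash Lite call, and the file name is invented:

```python
from pypdf import PdfReader

def generate_query(prompt: str) -> str:
    raise NotImplementedError("LLM call (e.g., Gemini 2.0 Flash Lite) goes here")

reader = PdfReader("hydrogen_report.pdf")  # hypothetical scraped document
for page in reader.pages:
    text = page.extract_text()
    if text and text.strip():
        query = generate_query(
            "Write one question a researcher might ask that this page "
            f"answers:\n\n{text[:2000]}"
        )
        print(query)  # becomes one (query, document) training pair
```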
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A detailed description is available in "SynthRAD2025_dataset_description.pdf". A paper describing the dataset has been submitted to Medical Physics and is available as a pre-print at https://arxiv.org/abs/2502.17609. The dataset is divided into two tasks:
After extraction, the dataset is organized as follows:
Within each task, cases are categorized into three anatomical regions:
Each anatomical region contains individual patient folders, named using a unique seven-character alphanumeric code: [Task Number][Anatomy][Center][PatientID]
Example: 1HNA001
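A small sketch parsing this folder code; the field layout follows the description above:

```python
def parse_case_code(code: str) -> dict:
    # e.g. "1HNA001" -> task 1, anatomy HN, center A, patient 001
    assert len(code) == 7, "expected [Task][Anatomy][Center][PatientID]"
    return {
        "task": int(code[0]),     # 1 or 2
        "anatomy": code[1:3],     # HN, TH, or AB
        "center": code[3],        # A-E
        "patient_id": code[4:7],  # zero-padded number
    }

print(parse_case_code("1HNA001"))
```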
Each patient folder in the training dataset contains (for other sets see Table below):
ct.mha: preprocessed CT image
mr.mha or cbct.mha (depending on the task): preprocessed MR or CBCT image
mask.mha: binary mask of the patient outline (dilated)

An overview folder within each anatomical region contains:

[task]_[anatomy]_parameters.xlsx: imaging protocol details for each patient
[task][anatomy][center][PatientID]_overview.png: a visualization of axial, coronal, and sagittal slices of CBCT/MR, CT, mask, and difference images

The SynthRAD2025 dataset is part of the second edition of the SynthRAD deep learning challenge (https://synthrad2025.grand-challenge.org/), which benchmarks synthetic CT generation for MRI- and CBCT-based radiotherapy workflows.
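A hedged sketch of loading one training case with SimpleITK, which reads .mha natively; the extraction path is an assumption:

```python
import SimpleITK as sitk

case = "Task1/HN/1HNA001"                 # hypothetical extraction path
ct = sitk.ReadImage(f"{case}/ct.mha")
mr = sitk.ReadImage(f"{case}/mr.mha")     # cbct.mha for Task 2
mask = sitk.ReadImage(f"{case}/mask.mha")

ct_arr = sitk.GetArrayFromImage(ct)       # (z, y, x) numpy array
print(ct_arr.shape, ct.GetSpacing())
```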
Imaging data was collected from five European university medical centers:
All centers have independently approved the study in accordance with their institutional review boards or medical ethics committee regulations.
Inclusion criteria:
The dataset is provided under two different licenses:
| Subset | Files | Release Date | Link |
|---|---|---|---|
| Training | Input, CT, Mask | 01-03-2025 | |
| Training Center D | Input, CT, Mask | 01-03-2025 | Check the download link at: |
| Validation Input | Input, Mask | 01-06-2025 | |
| Validation Input Center D | Input, Mask | 01-06-2025 | Check the download link at: |
| Validation Ground Truth | CT, Deformed CT | 01-03-2030 | |
| Test | Input, CT, Deformed CT, Mask | 01-03-2030 | |
The number of cases collected at each center for training, validation, and test sets.
| Task | Center | HN | TH | AB | Total |
|---|---|---|---|---|---|
| 1 | A | 91 | 91 | 65 | 247 |
| 1 | B | 0 | 91 | 91 | 182 |
| 1 | C | 65 | 0 | 19 | 84 |
| 1 | D | 65 | 0 | 0 | 65 |
| 1 | E | 0 | 0 | 0 | 0 |
| 1 | Total | 221 | 182 | 175 | 578 |
| 2 | A | 65 | 65 | 64 | 195 |
| 2 | B | 65 | 65 | 65 | 195 |
| 2 | C | 65 | 63 | 62 | 190 |
| 2 | D | 65 | 63 | 53 | 181 |
| 2 | E | 65 | 65 | 65 | |
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The repository contains a synthetic Zeus GameOver dataset generated in a testbed. This is a compressed file containing the Zeus GameOver botnet traffic flow simulation. The Zeus bot software was created by studying the characteristics of Zeus GameOver from the technical reports "ZeuS-P2P monitoring and analysis" by CERT Polska, published in June 2013 (https://www.cert.pl/en/uploads/2015/12/2013-06-p2p-rap_en.pdf), and "An analysis of the Zeus peer-to-peer protocol" by Dennis Andriesse and Herbert Bos, technical report, VU University Amsterdam, The Netherlands, April 2014 (https://syssec.mistakenot.net/papers/zeus-tech-report-2013.pdf). A testbed was set up with 101 virtual hosts, each with the bot software installed. The bots then communicate with one another. The network traffic was captured for 24 hours using the tcpdump tool. The captured traffic was then used to generate netflow records with the nprobe tool, from which the source and destination IP addresses were extracted. The dataset uploaded here is a text file containing the communication information of the bot nodes, with two fields: source IP address and destination IP address.
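A minimal sketch of loading the two-column flow file into a directed communication graph with networkx; the file name and whitespace separator are assumptions:

```python
import networkx as nx

G = nx.DiGraph()
with open("zeus_gameover_flows.txt") as f:  # hypothetical file name
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            src_ip, dst_ip = parts
            G.add_edge(src_ip, dst_ip)

print(G.number_of_nodes(), "hosts,", G.number_of_edges(), "directed edges")
# Degree statistics hint at the P2P topology: each bot peers with many others.
print(sorted(dict(G.degree()).values())[-5:])
```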
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Visual surveys by autonomous underwater vehicles (AUVs) and other underwater platforms provide a valuable method for analysing and understanding the benthic environment. Scientists can measure the presence and abundance of benthic species by manually annotating survey images with online annotation software or other tools. Neural network object detectors can reduce the effort involved in this process by locating and classifying species of interest in the images. However, accurate object detectors often rely on large numbers of annotated training images which are not currently available for many marine applications. To address this issue, we propose a novel pipeline for generating large amounts of synthetic annotated training data for a species of interest using 3D modelling and rendering software. The detector is trained with synthetic images and annotations along with real unlabelled images to improve performance through domain adaptation. Our method is demonstrated on a sea urchin detector trained only with synthetic data, achieving a performance slightly lower than an equivalent detector trained with manually labelled real images (AP50 of 84.3 vs 92.3). Using realistic synthetic data for species or objects with few or no annotations is a promising approach to reducing the manual effort required to analyse imaging survey data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of the network of spiking neurons used to generate synthetic data. (0.09 MB PDF)