Ainnotate’s proprietary dataset generation methodology, based on large-scale generative modelling and domain randomization, provides well-balanced data with consistent sampling that accommodates rare events, enabling superior simulation and training of your models.
Ainnotate currently provides synthetic datasets in the following domains and use cases.
Internal Services - Visa applications, Passport validation, License validation, Birth certificates
Financial Services - Bank checks, Bank statements, Pay slips, Invoices, Tax forms, Insurance claims, and Mortgage/Loan forms
Healthcare - Medical ID cards
This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.
This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.
The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.
Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.
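As a hedged illustration of the model-then-sample mechanism (not the County's actual synthesizer, which used far richer models and explicit privacy safeguards), the sketch below fits an empirical joint distribution to invented service-use counts and samples a synthetic table from it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for confidential records: monthly service-use counts per client.
real = pd.DataFrame({"jan_visits": rng.poisson(3, 500),
                     "feb_visits": rng.poisson(4, 500)})

# Step 1: "model" the data; here, just the empirical joint distribution
# (a real synthesizer would fit a smoothed statistical model instead).
probs = real.value_counts(normalize=True)

# Step 2: sample a synthetic dataset of the same size from that model.
idx = rng.choice(len(probs), size=len(real), p=probs.to_numpy())
synthetic = pd.DataFrame(list(probs.index[idx]), columns=real.columns)

# Step 3: evaluate utility, e.g., by comparing marginal statistics.
print(real.mean())
print(synthetic.mean())
```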
For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.
This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of the data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.
Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.
Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).
1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of the synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
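As a hedged illustration of the VAE approach described in the abstract (not the authors' implementation), the PyTorch sketch below encodes posture vectors into a latent space and generates synthetic postures by decoding samples from the prior; the dimensions are invented:

```python
import torch
import torch.nn as nn

INPUT_DIM, LATENT_DIM = 300, 16  # hypothetical sizes, not the study's

class PostureVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(INPUT_DIM, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, LATENT_DIM)
        self.to_logvar = nn.Linear(128, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, INPUT_DIM))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# After training, synthetic postures are produced by decoding z ~ N(0, I):
model = PostureVAE()
with torch.no_grad():
    synthetic_batch = model.decoder(torch.randn(64, LATENT_DIM))
```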
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of synthetic data is recognized as a crucial step in the development of neural network-based Artificial Intelligence (AI) systems. While the methods for generating synthetic data for AI applications in other domains have a role in certain biomedical AI systems, primarily related to image processing, there is a critical gap in the generation of time series data for AI tasks where it is necessary to know how the system works. This is most pronounced in the ability to generate synthetic multi-dimensional molecular time series data (subsequently referred to as synthetic mediator trajectories or SMTs); this is the type of data that underpins research into biomarkers and mediator signatures for forecasting various diseases and is an essential component of the drug development pipeline. We argue the insufficiency of statistical and data-centric machine learning (ML) means of generating this type of synthetic data is due to a combination of factors: perpetual data sparsity due to the Curse of Dimensionality, the inapplicability of the Central Limit Theorem in terms of making assumptions about the statistical distributions of this type of data, and the inability to use ab initio simulations due to the state of perpetual epistemic incompleteness in cellular/molecular biology. Alternatively, we present a rationale for using complex multi-scale mechanism-based simulation models, constructed and operated on to account for perpetual epistemic incompleteness and the need to provide maximal expansiveness in concordance with the Maximal Entropy Principle. These procedures provide for the generation of SMT that minimizes the known shortcomings associated with neural network AI systems, namely overfitting and lack of generalizability. The generation of synthetic data that accounts for the identified factors of multi-dimensional time series data is an essential capability for the development of mediator-biomarker based AI forecasting systems, and therapeutic control development and optimization.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish.
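For orientation, the recipe in https://arxiv.org/pdf/2401.00368 first samples a task definition, then prompts the LLM for a structured example of that task. The sketch below is a simplified, hypothetical rendering; `call_llm` and the seed tasks stand in for the gemma-2-27b-it pipeline:

```python
import json
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a gemma-2-27b-it inference call here")

seed_tasks = [  # illustrative; the dataset samples from a seed-task collection
    "Retrieve Danish news articles relevant to a short user question.",
    "Match a Danish product review to a one-sentence summary.",
]

task = random.choice(seed_tasks)
prompt = (
    f"You have been assigned a text similarity task: {task}\n"
    "Write one JSON object with keys 'input_text', 'positive_document' and "
    "'hard_negative_document', all in Danish."
)
sample = json.loads(call_llm(prompt))  # one (prompt, response) row of the dataset
```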
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The MatSim Dataset and benchmark
Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.
MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures, focused on identifying any material under any conditions from one or a few examples (one-shot learning).
Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
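As a rough illustration of the one-shot setting (not the paper's training method), a contrastively trained encoder can label a query image by its nearest reference embedding; `embed` is a hypothetical stand-in for such an encoder:

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    raise NotImplementedError("a contrastively trained encoder goes here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def one_shot_classify(query: np.ndarray, references: dict) -> str:
    # references: {material_name: example_image}, one example per material
    q = embed(query)
    return max(references, key=lambda name: cosine(q, embed(references[name])))
```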
Benchmark_MATSIM.zip: contains the benchmark of real-world images described in the paper.
MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.
MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.
*Note: these are subsets of the dataset; the full dataset can be found at:
https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX
or
https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF
Code:
Up-to-date code for generating the dataset, readers, evaluation scripts, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net
Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset. This code might be old; up-to-date code can be found at the GitHub URL above.
Net_Code_And_Trained_Model.zip: contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model against the benchmark. Note: the code in this ZIP file is not up to date and contains some bugs; for the latest version, see the GitHub URL above.
Further documentation can be found inside the zip files or in the paper.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to Arrow Denmark and Nvidia for sponsoring the compute used to generate this dataset
The purpose of this dataset is to pre- or post-train embedding models on text similarity tasks. The dataset consists of 100,000 samples generated with gemma-2-27b-it. The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output. The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368 Compute sponsored by Arrow Denmark and… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-norwegian.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The traveltime dataset is based on the Folktables project covering US census data. The target is a binary variable encoding whether or not the individual needs to travel more than 20 minutes for work; here, having a shorter travel time is the desirable outcome. We use a subset of data from the states of California, Florida, Maine, New York, Utah, and Wyoming in 2018. Although the folktables dataset does not have any missing values, there are some values recorded as NaN due to the Bureau's data collection methodology. We remove the "esp" column, which encodes the employment status of parents and has 99.55% missing values. We encode the missing values in povpip, the income-to-poverty ratio (0.85% missing), as -1 in accordance with the methodology in Ding et al. See https://arxiv.org/pdf/2108.04884 for metadata.
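A minimal pandas sketch of this preprocessing, assuming lowercase PUMS column names (`esp`, `povpip`, and `jwmnp` for travel time to work) and a hypothetical input file:

```python
import pandas as pd

df = pd.read_csv("traveltime_raw.csv")  # hypothetical file name

# Drop the parents' employment status column (99.55% missing).
df = df.drop(columns=["esp"])

# Encode the missing income-to-poverty ratio values as -1, per Ding et al.
df["povpip"] = df["povpip"].fillna(-1)

# Binary target: does the individual travel more than 20 minutes for work?
df["target"] = (df["jwmnp"] > 20).astype(int)
```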
The cardio (a) dataset contains patient data recorded during medical examination, including 3 binary features supplied by the patient. The target class denotes the presence of cardiovascular disease. This dataset represents predictive tasks that allocate access to priority medical care for patients, and has been used for fairness evaluations in the domain.
The credit dataset contains historical financial data of borrowers, including past non-serious delinquencies. Here, a serious delinquency is considered to be 90 days past due, and this is the target variable.
The German Credit dataset (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data) contains financial and personal information regarding loan-seeking applicants.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a software tool, called cMatch, to reconstruct and identify synthetic genetic constructs from their sequences, or from a set of sub-sequences, based on two practical pieces of information: their modular structure and libraries of components. Although developed for combinatorial pathway engineering problems and addressing their quality control (QC) bottleneck, cMatch is not restricted to these applications. QC takes place post assembly, transformation, and growth. It has a simple goal: to verify that the genetic material contained in a cell matches what was intended to be built, and, when it does not, to locate the discrepancies and estimate their severity. In terms of reproducibility and reliability, the QC step is crucial; failure at this step requires repetition of the construction and/or sequencing steps. When performed manually or semi-manually, QC is an extremely time-consuming, error-prone process, which scales very poorly with the number of constructs and their complexity. To make QC frictionless and more reliable, cMatch performs and automates an operation we have called "construct-matching". Construct-matching is more thorough than simple sequence-matching, as it matches at the functional level and quantifies the matching at the individual component level and across the whole construct. Two algorithms (called CM_1 and CM_2) are presented. They differ according to the nature of their inputs. CM_1 is the core algorithm for construct-matching and is to be used when input sequences are long enough to cover constructs in their entirety (e.g., obtained with methods such as next-generation sequencing). CM_2 is an extension designed to deal with shorter data (e.g., obtained with Sanger sequencing) that need recombining. Both algorithms are shown to yield accurate construct-matching in a few minutes (even on hardware with limited processing power), together with a set of metrics that can be used to improve the robustness of the decision-making process. To ensure reliability and reproducibility, cMatch builds on the highly validated pairwise-matching Smith-Waterman algorithm. All the tests presented have been conducted on synthetic data for challenging yet realistic constructs, and on real data gathered during studies on a metabolic engineering example (lycopene production).
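For intuition, the pairwise primitive that cMatch builds on is Smith-Waterman local alignment; the hedged sketch below uses Biopython's `PairwiseAligner` in local mode with illustrative scores and sequences (the real tool's scoring and component-matching logic differ):

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"          # Smith-Waterman
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

component = "ATGGCTAGCTAA"       # a library component (hypothetical)
read = "CCATGGCTAGCTAAGG"        # a sequencing read covering the construct

alignment = aligner.align(read, component)[0]
# A component "matches" when its score approaches the perfect-match score.
perfect = aligner.match_score * len(component)
print(alignment.score / perfect)  # 1.0 here: exact component match
```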
This synthetic instruct dataset is curated from select PDF documents obtained from the TRADOC corpus. Instruction generation procedure:
1) Pull all the sections from the TRADOC PDF files (using a section extractor).
2) Run TF-IDF on the sections to find clusters of similar content (see the sketch after this list).
3) Use the Markov shuffling approach to identify sections within the same cluster (lexically similar at the least).
4) Feed these sections as context to the LLM to generate a diverse set of instructions.
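A hedged sketch of the clustering step, with illustrative section texts and cluster count; the actual pipeline's parameters are not published here:

```python
# TF-IDF vectors over extracted sections, then k-means to group lexically
# similar sections that can share one LLM generation context.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sections = ["...section text...", "...another section..."]  # from the extractor

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(sections)

labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(X)

# Sections sharing a label become candidates for the same generation context.
clusters = {}
for section, label in zip(sections, labels):
    clusters.setdefault(label, []).append(section)
print(clusters)
```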
RAG Context:
The RAG context for… See the full description on the dataset page: https://huggingface.co/datasets/LegionIntel/TRADOC_synthetic_instruct_long_short_context_response_v1.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
MatSeg Dataset and benchmark for zero-shot material state segmentation.
The MatSeg Benchmark, containing 1220 real-world images and their annotations, is available at MatSeg_Benchmark.zip; the file contains documentation and Python readers.
The MatSeg dataset, containing synthetic images infused with natural image patterns, is available at MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).
MatSeg3D_part_*.zip: contains synthetic 3D scenes
MatSeg2D_part_*.zip: contains synthetic 2D scenes
Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip
Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip
The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072
Additional permanent sources for downloading the dataset and metadata: 1, 2
Evaluation scripts for the Benchmark are now available at:
https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX
Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.
Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.
The MatSeg dataset focuses on zero-shot segmentation of materials and their states, meaning identifying the region of an image belonging to a specific material type of state, without previous knowledge or training of the material type, states, or environment.
The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.
The benchmark contains 1220 real-world images with a wide range of material states and settings, for example: food states (cooked/burned), plants (infected/dry), rocks/soil (minerals/sediment), construction/metals (rusted, worn), and liquids (foam/sediment), among many other states, without being limited to a set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of or pretraining on the material or setting. The focus is on materials with complex, scattered boundaries and gradual transitions (like the level of wetness of a surface).
Evaluation scripts for the Benchmark are now available at: 1 and 2.
The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data is infused with patterns, materials, and textures automatically extracted from real images, allowing it to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.
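As a rough, hypothetical illustration of the infusion idea (the actual pipeline runs inside Blender), a texture patch cropped from a real photograph can serve as the material map applied to synthetic scene geometry:

```python
import numpy as np
from PIL import Image

def random_patch(image_path: str, size: int = 256) -> np.ndarray:
    # Crop a random square patch from a real photo to use as a material map.
    img = np.asarray(Image.open(image_path).convert("RGB"))
    y = np.random.randint(0, img.shape[0] - size)
    x = np.random.randint(0, img.shape[1] - size)
    return img[y:y + size, x:x + size]

# A real-world photo supplies the pattern; in the full pipeline this patch
# would be assigned as a texture in Blender, so segmentation masks stay exact
# while the appearance keeps real-world complexity.
pattern = random_patch("real_photo.jpg")  # hypothetical file name
```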
License
This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.
The MatSeg 2D and 3D synthetic data were generated using the Open Images dataset, which is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets, which carry a GNU license.
An example of training and evaluation code for a net trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2. This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.
Paper:
More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".
Croissant metadata and additional sources for downloading the dataset are available at 1,2
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The purpose of this dataset is to pre- or post-train embedding models for Danish retrieval tasks.
The dataset consists of 100,000 samples generated with gemma-2-27b-it.
The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed
The data generation process described in this paper was followed:
https://arxiv.org/pdf/2401.00368
Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The purpose of this dataset is to pre- or post-train embedding models for Danish text classification tasks.
The dataset consists of 100,000 samples generated with gemma-2-27b-it.
The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/classification-tasks-processed
The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368
Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks on short texts.
The dataset consists of 100,000 samples generated with gemma-2-27b-it.
The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed
The data generation process described in this paper was followed:
https://arxiv.org/pdf/2401.00368
Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OGC - Organized, Grouped, Cleaned
Hydrogen Vision DSE
Intended for image/text to vector (DSE)
Dataset Composition
Made with https://github.com/RacineAIOS/OGC_pdf-to-parquet. This dataset was created by scraping PDF documents from online sources and generating relevant synthetic queries. We used Google's Gemini 2.0 Flash Lite model in our custom pipeline to produce the queries, allowing us to create a diverse set of questions based on the document content.… See the full description on the dataset page: https://huggingface.co/datasets/racineai/OGC_Hydrogen.
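A hedged sketch of the query-generation step; `generate_query` is a placeholder for the Gemini 2.0 Flash Lite call, and the file name is invented:

```python
from pypdf import PdfReader

def generate_query(prompt: str) -> str:
    raise NotImplementedError("LLM call (e.g., Gemini 2.0 Flash Lite) goes here")

reader = PdfReader("hydrogen_report.pdf")  # hypothetical scraped document
for page in reader.pages:
    text = page.extract_text()
    if text and text.strip():
        query = generate_query(
            "Write one question a researcher might ask that this page "
            f"answers:\n\n{text[:2000]}"
        )
        print(query)  # becomes one (query, document) training pair
```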
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A detailed description is available in "SynthRAD2025_dataset_description.pdf". A paper describing the dataset has been submitted to Medical Physics and is available as a pre-print at https://arxiv.org/abs/2502.17609. The dataset is divided into two tasks:
After extraction, the dataset is organized as follows:
Within each task, cases are categorized into three anatomical regions:
Each anatomical region contains individual patient folders, named using a unique seven-character alphanumeric code: [Task Number][Anatomy][Center][PatientID]
Example: 1HNA001
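A small sketch parsing this folder code; the field layout follows the description above:

```python
def parse_case_code(code: str) -> dict:
    # e.g. "1HNA001" -> task 1, anatomy HN, center A, patient 001
    assert len(code) == 7, "expected [Task][Anatomy][Center][PatientID]"
    return {
        "task": int(code[0]),     # 1 or 2
        "anatomy": code[1:3],     # HN, TH, or AB
        "center": code[3],        # A-E
        "patient_id": code[4:7],  # zero-padded number
    }

print(parse_case_code("1HNA001"))
```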
Each patient folder in the training dataset contains (for other sets see Table below):
ct.mha: preprocessed CT image
mr.mha or cbct.mha (depending on the task): preprocessed MR or CBCT image
mask.mha: binary mask of the patient outline (dilated)

An overview folder within each anatomical region contains:

[task]_[anatomy]_parameters.xlsx: imaging protocol details for each patient
[task][anatomy][center][PatientID]_overview.png: a visualization of axial, coronal, and sagittal slices of CBCT/MR, CT, mask, and difference images

The SynthRAD2025 dataset is part of the second edition of the SynthRAD deep learning challenge (https://synthrad2025.grand-challenge.org/), which benchmarks synthetic CT generation for MRI- and CBCT-based radiotherapy workflows.
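A hedged sketch of loading one training case with SimpleITK, which reads .mha natively; the extraction path is an assumption:

```python
import SimpleITK as sitk

case = "Task1/HN/1HNA001"                 # hypothetical extraction path
ct = sitk.ReadImage(f"{case}/ct.mha")
mr = sitk.ReadImage(f"{case}/mr.mha")     # cbct.mha for Task 2
mask = sitk.ReadImage(f"{case}/mask.mha")

ct_arr = sitk.GetArrayFromImage(ct)       # (z, y, x) numpy array
print(ct_arr.shape, ct.GetSpacing())
```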
Imaging data was collected from five European university medical centers:
All centers have independently approved the study in accordance with their institutional review boards or medical ethics committee regulations.
Inclusion criteria:
The dataset is provided under two different licenses:
| Subset | Files | Release Date | Link |
|---|---|---|---|
| Training | Input, CT, Mask | 01-03-2025 | |
| Training Center D | Input, CT, Mask | 01-03-2025 | Check the download link at: |
| Validation Input | Input, Mask | 01-06-2025 | |
| Validation Input Center D | Input, Mask | 01-06-2025 | Check the download link at: |
| Validation Ground Truth | CT, Deformed CT | 01-03-2030 | |
| Test | Input, CT, Deformed CT, Mask | 01-03-2030 | |
The number of cases collected at each center for training, validation, and test sets.
| Task | Center | HN | TH | AB | Total |
|---|---|---|---|---|---|
| 1 | A | 91 | 91 | 65 | 247 |
| 1 | B | 0 | 91 | 91 | 182 |
| 1 | C | 65 | 0 | 19 | 84 |
| 1 | D | 65 | 0 | 0 | 65 |
| 1 | E | 0 | 0 | 0 | 0 |
| 1 | Total | 221 | 182 | 175 | 578 |
| 2 | A | 65 | 65 | 64 | 195 |
| 2 | B | 65 | 65 | 65 | 195 |
| 2 | C | 65 | 63 | 62 | 190 |
| 2 | D | 65 | 63 | 53 | 181 |
| 2 | E | 65 | 65 | 65 | |
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The repository contains a synthetic Zeus GameOver dataset generated in a testbed. This is a compressed file containing the Zeus GameOver botnet traffic flow simulation. The Zeus bot software was created by studying the characteristics of Zeus GameOver from the technical reports "ZeuS-P2P monitoring and analysis" by CERT Polska, published in June 2013 (https://www.cert.pl/en/uploads/2015/12/2013-06-p2p-rap_en.pdf), and "An analysis of the Zeus peer-to-peer protocol" by Dennis Andriesse and Herbert Bos, technical report, VU University Amsterdam, The Netherlands, April 2014 (https://syssec.mistakenot.net/papers/zeus-tech-report-2013.pdf). A testbed was set up with 101 virtual hosts, each with the bot software installed. The bots then communicate with one another. The network traffic was captured for 24 hours using the tcpdump tool. The captured traffic was then used to generate netflow records with the nprobe tool, from which the source and destination IP addresses were extracted. The dataset uploaded here is a text file containing the communication information of the bot nodes, with two fields: source IP address and destination IP address.
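A minimal sketch of loading the two-column flow file into a directed communication graph with networkx; the file name and whitespace separator are assumptions:

```python
import networkx as nx

G = nx.DiGraph()
with open("zeus_gameover_flows.txt") as f:  # hypothetical file name
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            src_ip, dst_ip = parts
            G.add_edge(src_ip, dst_ip)

print(G.number_of_nodes(), "hosts,", G.number_of_edges(), "directed edges")
# Degree statistics hint at the P2P topology: each bot peers with many others.
print(sorted(dict(G.degree()).values())[-5:])
```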
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Visual surveys by autonomous underwater vehicles (AUVs) and other underwater platforms provide a valuable method for analysing and understanding the benthic environment. Scientists can measure the presence and abundance of benthic species by manually annotating survey images with online annotation software or other tools. Neural network object detectors can reduce the effort involved in this process by locating and classifying species of interest in the images. However, accurate object detectors often rely on large numbers of annotated training images which are not currently available for many marine applications. To address this issue, we propose a novel pipeline for generating large amounts of synthetic annotated training data for a species of interest using 3D modelling and rendering software. The detector is trained with synthetic images and annotations along with real unlabelled images to improve performance through domain adaptation. Our method is demonstrated on a sea urchin detector trained only with synthetic data, achieving a performance slightly lower than an equivalent detector trained with manually labelled real images (AP50 of 84.3 vs 92.3). Using realistic synthetic data for species or objects with few or no annotations is a promising approach to reducing the manual effort required to analyse imaging survey data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of the network of spiking neurons used to generate synthetic data. (0.09 MB PDF)