Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Synthetic Data Generation For Ocean Environment With Raycast is a dataset for object detection tasks - it contains Human Boat annotations for 6,299 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.datainsightsmarket.com/privacy-policy
The synthetic data generation market is experiencing explosive growth, driven by the increasing need for high-quality data in various applications, including AI/ML model training, data privacy compliance, and software testing. The market, currently estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033.

This significant expansion is fueled by several key factors. Firstly, the rising adoption of artificial intelligence and machine learning across industries demands large, high-quality datasets, which are often unavailable due to privacy concerns or data scarcity. Synthetic data provides a solution by generating realistic, privacy-preserving datasets that mirror real-world data without compromising sensitive information. Secondly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to explore alternative data solutions, making synthetic data a crucial tool for compliance. Finally, advancements in generative AI models and algorithms are improving the quality and realism of synthetic data, expanding its applicability across domains. Major players like Microsoft, Google, and AWS are actively investing in this space, driving further market expansion.

The market segmentation reveals a diverse landscape with numerous specialized solutions. While large technology firms dominate the broader market, smaller, more agile companies are making significant inroads with specialized offerings focused on specific industry needs or data types. The geographical distribution is expected to be skewed towards North America and Europe initially, given the high concentration of technology companies and early adoption of advanced data technologies. However, growing awareness and increasing data needs in other regions are expected to drive substantial market growth in Asia-Pacific and other emerging markets in the coming years. The competitive landscape is characterized by a mix of established players and innovative startups, leading to continuous innovation and expansion of market applications. This dynamic environment indicates sustained growth in the foreseeable future, driven by an increasing recognition of synthetic data's potential to address critical data challenges across industries.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present a synthetic Medicare claims dataset linked to environmental exposures and potential confounders. In most environmental health studies relying on claims data, data restrictions exist and the data cannot be shared publicly. The Centers for Medicare and Medicaid Services (CMS) has generated synthetic, publicly available Medicare claims data for 2008-2010. In this dataset, we link the 2010 synthetic Medicare claims data to environmental exposures and potential confounders. We aggregated the 2010 synthetic Medicare claims data to the county level. Data is compiled for the contiguous United States, which in 2010 included 3,109 counties. We merged the synthetic Medicare claims data with air pollution exposure data, specifically estimates of PM2.5 exposure obtained from Di et al. (2019, 2021), which provide daily and annual estimates of PM2.5 exposure at 1 km × 1 km grid cells in the contiguous United States. We use the Census Bureau (United States Census Bureau, 2021), the Centers for Disease Control and Prevention (CDC, 2021), and GridMET (Abatzoglou, 2013) to obtain data on potential confounders. The mortality rate, as the outcome, was computed using the synthetic Medicare data (CMS, 2021). We use the average of surrounding counties to impute missing observations, except in the case of the CDC confounders, where we imputed missing values by fitting a normal distribution for each state and randomly drawing from this distribution. The steps for generating the merged dataset are provided at the NSAPH Synthetic Data Github Repository (https://github.com/NSAPH/synthetic_data). Analytic inferences based on this synthetic dataset should not be made. The aggregated dataset is composed of 46 columns and 3,109 rows.
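As a rough illustration of the two imputation strategies described above (averaging over neighbouring counties, and per-state normal sampling for the CDC confounders), a minimal pandas/NumPy sketch follows; the column names and the use of same-state counties as a stand-in for geographically surrounding counties are assumptions, not the dataset's actual schema or procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy county-level table; real column names in the merged dataset may differ.
df = pd.DataFrame({
    "county":    ["A", "B", "C", "D"],
    "state":     ["MA", "MA", "NH", "NH"],
    "pm25":      [7.1, np.nan, 6.3, 6.8],
    "cdc_value": [0.21, 0.25, np.nan, 0.30],
})

# Strategy 1 (most confounders): impute from the average of surrounding counties.
# "Surrounding" is approximated here by same-state neighbours for brevity.
df["pm25"] = df.groupby("state")["pm25"].transform(lambda s: s.fillna(s.mean()))

# Strategy 2 (CDC confounders): fit a normal distribution per state and
# randomly impute missing values from it.
def impute_from_state_normal(s: pd.Series) -> pd.Series:
    mu, sigma = s.mean(), s.std(ddof=0)
    fill = rng.normal(mu, sigma if sigma > 0 else 1e-6, size=s.isna().sum())
    s = s.copy()
    s.loc[s.isna()] = fill
    return s

df["cdc_value"] = df.groupby("state")["cdc_value"].transform(impute_from_state_normal)
print(df)
```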
SDNist (v1.3) is a set of benchmark data and metrics for the evaluation of synthetic data generators on structured tabular data. This version (1.3) reproduces the challenge environment from Sprints 2 and 3 of the Temporal Map Challenge. These benchmarks are distributed as a simple open-source Python package to allow standardized and reproducible comparison of synthetic generator models on real-world data and use cases. The data and metrics were developed for and vetted through the NIST PSCR Differential Privacy Temporal Map Challenge, where the evaluation tools, k-marginal and Higher Order Conjunction, proved effective in distinguishing competing models in the competition environment.

SDNist is available via pip for Python >= 3.6 (`pip install sdnist==1.2.8`) or from USNIST/Github. The sdnist Python module downloads data from NIST as necessary, so users are not required to download data manually.
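For orientation, a quickstart in the style of the package's published examples is sketched below; the `sdnist.census()` and `sdnist.score()` calls and their signatures are assumptions that may differ between releases, so consult the USNIST/Github README for the installed version.

```python
# Minimal sdnist 1.x-style quickstart (function names assumed from the
# package's published examples; verify against the installed version).
import sdnist

# Load the benchmark ("target") data and its schema; data is downloaded
# from NIST on first use.
dataset, schema = sdnist.census()

# Stand-in "synthetic" data: here just a subsample of the target, purely
# to show the scoring call. A real run would use a generator's output.
synthetic = dataset.sample(frac=0.1, random_state=0)

# Evaluate with the k-marginal metric used in the Temporal Map Challenge.
score = sdnist.score(dataset, synthetic, schema, challenge="census")
print(score)
```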
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the semantic and instance segmentation experiments in the "replicAnt - generating annotated images of animals in complex environments using Unreal Engine" manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data were generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform were representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.
For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR Blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded at 55 fps and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.
The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.
Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. Forty-six key points were subsequently hand-annotated with the DLC annotation GUI in 805 frames for the platform case and 200 frames for the handheld case.
Synthetic data
We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6-core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.
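The randomisation itself happens inside the replicAnt/Unreal pipeline; purely to illustrate how per-sample parameters for 10 sub-datasets of 1,000 samples each could be drawn, here is a small sketch in which the hue/brightness/contrast/saturation ranges are placeholder assumptions, not the manuscript's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
SUB_DATASETS, SAMPLES_PER_SUB = 10, 1000

def sample_params():
    """Draw one sample's randomisation parameters (illustrative ranges only)."""
    return {
        # +/- 70 % scale variation around the nominal model size
        "scale": 1.0 + rng.uniform(-0.7, 0.7),
        # colour shifts; ranges are placeholders, not the manuscript's values
        "hue_shift_deg": rng.uniform(-20, 20),
        "brightness":    rng.uniform(0.7, 1.3),
        "contrast":      rng.uniform(0.7, 1.3),
        "saturation":    rng.uniform(0.7, 1.3),
    }

dataset_config = [
    [sample_params() for _ in range(SAMPLES_PER_SUB)]
    for _ in range(SUB_DATASETS)
]
print(len(dataset_config), "sub-datasets of", len(dataset_config[0]), "samples")
```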
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
About Dataset: The file contains 3D point cloud data of a fabricated (synthetic) plant with 10 sequences. Each sequence contains data for days 0-19, covering every growth stage of that sequence.
The Importance of Synthetic Plant Datasets: Synthetic plant files are carefully curated collections of computer-generated images that mimic the diverse appearance and growth stages of real plants.
Training and Evaluation: With fabricated plant files, researchers can train and evaluate machine learning models in a controlled environment, free from the limitations of real-world data collection. This controlled setting enables more efficient development and ensures consistent performance across various environmental conditions.
Applications in Agricultural Technology - Plant Phenotyping: Synthetic plant files enable researchers to analyze plant traits and characteristics at scale, facilitating plant phenotyping studies aimed at understanding genetic traits, environmental influences, and crop performance.
Crop Monitoring: With the rise of precision agriculture, fabricated plant files play a crucial role in developing remote sensing techniques for monitoring crop health, detecting pest infestations, and optimizing irrigation strategies.
Advancements in Computer Vision and Machine Learning - Object Detection: These files serve as benchmarking tools for training and evaluating object detection algorithms tailored to identifying plants, fruits, and diseases in agricultural settings.
Future Directions and Challenges - Dataset Diversity: As the demand for more diverse and realistic synthetic data grows, researchers face the challenge of generating data that accurately reflects the variability observed in real-world agricultural environments.
Researchers continue to explore techniques for bridging the gap between synthetic and real data to enhance model robustness and applicability.
Conclusion: Synthetic plant datasets represent a cornerstone in the development of cutting-edge technologies for agricultural monitoring, plant phenotyping, and disease diagnosis. By harnessing the power of synthetic data generation and machine learning, researchers can unlock new insights into plant biology and revolutionize the future of agriculture.
This dataset is sourced from Kaggle.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Toxic Agent - Qwen Synthetic Data: Magpie-like Climate Disinformation Dataset
Dataset Description
Overview
This dataset contains synthetic climate change-related statements, including various forms of climate disinformation and denial. It was created by generating variations and transformations of real climate-related statements, producing a diverse set of synthetic examples across different categories of climate disinformation. Total examples from… See the full description on the dataset page: https://huggingface.co/datasets/DataTonic/climate-guard-synthetic_data_qwen_toxic_agent.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) models often require large volumes of data to learn a given task. However, training data can be difficult to access, or may not exist at all, due to privacy laws and limited availability. A solution is to generate synthetic data that represents the real data. In the maritime environment, the ability to generate realistic vessel positional data is important for the development of ML models in ocean areas with scarce amounts of data, such as the Arctic, or for generating an abundance of anomalous or unique events needed for training detection models. This research explores the use of conditional generative adversarial networks (CGAN) to generate vessel displacement tracks over a 24-hour period in a constraint-free environment. The model is trained using Automatic Identification System (AIS) data that contains vessel tracking information. The results show that the CGAN is able to generate vessel displacement tracks for two different vessel types, cargo ships and pleasure crafts, for three months of the year (May, July, and September). To evaluate the usability of the generated data and the robustness of the CGAN model, three ML vessel classification models using displacement track data are developed with generated data and tested with real data.
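As a structural illustration of a conditional GAN for this task (not the authors' architecture), the sketch below conditions a generator and discriminator on a combined vessel-type/month label and produces fixed-length 24-hour displacement tracks; the track length, latent size, and layer widths are assumptions.

```python
import torch
import torch.nn as nn

TRACK_LEN = 24 * 2      # assumed: (dx, dy) displacement per hour over 24 h
LATENT_DIM = 64
NUM_CLASSES = 6         # assumed: {cargo, pleasure craft} x {May, July, September}

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, 16)
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 16, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, TRACK_LEN),
        )

    def forward(self, z, labels):
        # Concatenate noise with the condition embedding, then map to a track.
        x = torch.cat([z, self.embed(labels)], dim=1)
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, 16)
        self.net = nn.Sequential(
            nn.Linear(TRACK_LEN + 16, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),   # real/fake logit
        )

    def forward(self, tracks, labels):
        x = torch.cat([tracks, self.embed(labels)], dim=1)
        return self.net(x)

# One generation step: sample noise plus a condition and produce tracks.
G = Generator()
z = torch.randn(8, LATENT_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))
fake_tracks = G(z, labels)          # shape: (8, TRACK_LEN)
print(fake_tracks.shape)
```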
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Synthetic data for DCASE 2019 task 4
Freesound dataset [1,2]: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology [3].
SINS dataset [4]: The derivative of the SINS dataset used for DCASE 2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.
The synthetic set is composed of 10-second audio clips generated with Scaper [5]. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip, and we checked whether the event onset and offset were present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class.
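A minimal Scaper sketch along these lines is shown below; the folder layout and the distribution parameters (event time, duration, SNR) are illustrative assumptions, not the exact settings used to build the DCASE 2019 task 4 synthetic set.

```python
import scaper

# Assumed local folder layout: one sub-folder per foreground (FSD) label
# and per background (SINS) label.
FG_PATH, BG_PATH = "soundbank/foreground", "soundbank/background"

sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH, random_state=0)
sc.ref_db = -50

# One background spanning the whole 10 s clip, chosen from the SINS-derived material.
sc.add_background(label=("choose", []),
                  source_file=("choose", []),
                  source_time=("const", 0))

# One foreground event with randomised onset, duration and SNR
# (distribution parameters are placeholders).
sc.add_event(label=("choose", []),
             source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0.0, 8.0),
             event_duration=("truncnorm", 3.0, 1.0, 0.5, 5.0),
             snr=("uniform", 6, 30),
             pitch_shift=None,
             time_stretch=None)

# Writes the 10 s soundscape plus a JAMS file with the ground-truth annotation.
sc.generate(audio_path="clip_0.wav", jams_path="clip_0.jams")
```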
License:
All sounds coming from FSD are released under Creative Commons licences. Synthetic sounds can only be used for competition purposes until the full CC licence list is made available at the end of the competition.
Further information is available on the DCASE website.
References:
[1] F. Font, G. Roma & X. Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
[2] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter & X. Serra. Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
[3] Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings IEEE ICASSP 2017, New Orleans, LA, 2017.
[4] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: A library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.
http://dcat-ap.de/def/licenses/cc-by
The Synset Signset Germany dataset contains a total of 211,000 synthetically generated images of current German traffic signs (including the 2020 update) for machine learning methods (ML) in the area of application (task) of traffic sign recognition.
The dataset contains 211 German traffic sign classes with 500 images each, and is divided into two sub-datasets, which were generated with different rendering engines. In addition to the classification annotations, the data set also contains label images for segmentation of traffic signs, binary masks, as well as extensive information on image and scene properties, in particular on image artifacts.
The dataset was presented in September 2024 by Anne Sielemann, Lena Lörcher, Max-Lion Schumacher, Stefan Wolf, Jens Ziehn, Masoud Roschani and Jürgen Beyerer in the publication: Sielemann, A., Loercher, L., Schumacher, M., Wolf, S., Roschani, M., Ziehn, J., and Beyerer, J. (2024). Synset Signset Germany: A Synthetic Dataset for German Traffic Sign Recognition. In 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC).
The traffic sign designs are based on the Wikipedia chart of traffic signs in the Federal Republic of Germany since 2017 (https://en.wikipedia.org/wiki/Bildtafel_der_Verkehrszeichen_in_der_Bundesrepublik_Deutschland_seit_2017).
The data was generated with the simulation environment OCTANE (www.octane.org). One subset uses the Cycles Raytracer of the Blender project (www.cycles-renderer.org), the other (otherwise identical) subset uses the 3D rasterization engine OGRE3D (www.ogre3d.org).
The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.
The dataset was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, and was additionally funded by the "New Vehicle and System Technologies" funding program of the Federal Ministry for Economic Affairs and Climate Protection of the Federal Republic of Germany (BMWK) as part of the "AVEAS" research project (www.aveas.org).
The generative synthesis of dirt and wear textures for the traffic signs was trained on real traffic sign data, which was collected with the kind support of the Civil Engineering Office Karlsruhe.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here. Dataset Description: ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, captured in different environments and lighting conditions and at different camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images for every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D bounding box, 3D bounding box and keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.
Folder configuration: The dataset consists of 3 folders:
- JSON Data: Contains all the generated JSON files.
- RGB Images: Contains the generated RGB images.
- Semantic Segmentation Images: Contains the generated semantic segmentation images.
Essential Terminology
- Annotation: Recorded data describing a single capture.
- Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG, etc.).
- Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
- Ego coordinate system: Coordinates with respect to the ego.
- Global coordinate system: Coordinates with respect to the global origin in Unity.
- Sensor: Device that captures the dataset (in this instance the sensor is a camera).
- Sensor coordinate system: Coordinates with respect to the sensor.
- Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
- UUID: Universally Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.
Dataset Data: The dataset includes 4 types of JSON annotation files:
annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
- id: Integer identifier of the annotation's definition.
- name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
- description: Description of the annotation's specifications.
- format: Format of the file containing the annotation specifications (e.g., json, PNG).
- spec: Format-specific specifications for the annotation values generated by each Labeler.
Most Labelers generate different annotation specifications in the spec key-value pair:
BoundingBox2DLabeler/BoundingBox3DLabeler:

- label_id: Integer identifier of a label.
- label_name: String identifier of a label.

KeypointLabeler:

- template_id: Keypoint template UUID.
- template_name: Name of the keypoint template.
- key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
  - label: Joint label.
  - index: Joint index.
  - color: RGBA values of the keypoint.
  - color_code: Hex color code of the keypoint.
- skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
  - label1: Label of the first joint.
  - label2: Label of the second joint.
  - joint1: Index of the first joint.
  - joint2: Index of the second joint.
  - color: RGBA values of the connection.
  - color_code: Hex color code of the connection.

SemanticSegmentationLabeler:

- label_name: String identifier of a label.
- pixel_value: RGBA values of the label.
- color_code: Hex color code of the label.
captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor present in the scene. Each array entry contains the following key-value pairs (a short parsing sketch follows this list):

- id: UUID of the capture.
- sequence_id: UUID of the sequence.
- step: Index of the capture within a sequence.
- timestamp: Timestamp (in ms) since the beginning of a sequence.
- sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  - sensor_id: Sensor UUID.
  - ego_id: Ego UUID.
  - modality: Modality of the sensor (e.g., camera, radar).
  - translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
  - camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  - projection: Projection type used by the camera (e.g., orthographic, perspective).
- ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  - ego_id: Ego UUID.
  - translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable containing the ego's orientation.
  - velocity: 3D vector containing the ego's velocity (in meters per second).
  - acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
- format: Format of the file captured by the sensor (e.g., PNG, JPG).
- annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  - id: Annotation UUID.
  - annotation_definition: Integer identifier of the annotation's definition.
  - filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  - values: List of key-value pairs containing annotation data for the current Labeler.
Each Labeler generates different annotation specifications in the values key-value pair:
BoundingBox2DLabeler:

- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
- x: Position of the 2D bounding box on the X axis.
- y: Position of the 2D bounding box on the Y axis.
- width: Width of the 2D bounding box.
- height: Height of the 2D bounding box.

BoundingBox3DLabeler:

- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
- translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
- size: 3D vector containing the size of the 3D bounding box (in meters).
- rotation: Quaternion variable containing the orientation of the 3D bounding box.
- velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
- acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).

KeypointLabeler:

- label_id: Integer identifier of a label.
- instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
- template_id: UUID of the keypoint template.
- pose: Pose label for that particular capture.
- keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry contains:
  - index: Index of the keypoint in the keypoint template file.
  - x: Pixel coordinates of the keypoint on the X axis.
  - y: Pixel coordinates of the keypoint on the Y axis.
  - state: State of the keypoint.
The SemanticSegmentationLabeler does not contain a values list.
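As a convenience for working with these files, here is a minimal Python sketch that reads one captures file and collects all 2D bounding boxes; it assumes the standard Unity Perception layout in which the entries sit under a top-level "captures" key, and the file name is a placeholder.

```python
import json
from pathlib import Path

# Read one captures_xyz.json file and pull out all 2D bounding boxes,
# following the schema described above. The path is a placeholder.
captures_path = Path("JSON Data/captures_000.json")
captures = json.loads(captures_path.read_text())

boxes = []
for capture in captures["captures"]:
    for annotation in capture.get("annotations", []):
        for value in annotation.get("values", []):
            # 2D bounding-box entries carry x/y/width/height pixel values.
            if {"x", "y", "width", "height"}.issubset(value):
                boxes.append({
                    "capture_id": capture["id"],
                    "label": value["label_name"],
                    "x": value["x"], "y": value["y"],
                    "width": value["width"], "height": value["height"],
                })

print(f"found {len(boxes)} 2D bounding boxes")
```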
egos.json: Contains collections of key-value pairs for each ego. These include:

- id: UUID of the ego.
- description: Description of the ego.

sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

- id: UUID of the sensor.
- ego_id: UUID of the ego on which the sensor is attached.
- modality: Modality of the sensor (e.g., camera, radar, sonar).
- description: Description of the sensor (e.g., camera, radar).
Image names: The RGB and semantic segmentation images share the same naming convention. However, the semantic segmentation images also contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:
- e denotes the id of the environment.
- h denotes the id of the person.
- l denotes the id of the lighting condition.
- d denotes the camera distance at which the image was captured.
- r denotes the camera angle at which the image was captured.
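A small helper that decodes this naming convention might look as follows (a sketch; the exact encoding of each field is defined by the dataset itself, so values are returned as strings):

```python
from pathlib import Path

def parse_activehuman_name(path: str) -> dict:
    """Decode an ActiveHuman image name of the form e_h_l_d_r.jpg."""
    stem = Path(path).stem
    # Semantic segmentation images carry a "Semantic_" prefix on the same pattern.
    if stem.startswith("Semantic_"):
        stem = stem[len("Semantic_"):]
    e, h, l, d, r = stem.split("_")
    return {"environment": e, "human": h, "lighting": l,
            "camera_distance": d, "camera_angle": r}

# Hypothetical file name, purely to show the split:
print(parse_activehuman_name("Semantic_3_12_2_2_140.jpg"))
```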
http://dcat-ap.de/def/licenses/cc-by
The Synset Boulevard dataset contains a total of 259,200 synthetically generated images of cars from a frontal traffic camera perspective, annotated by vehicle makes, models and years of construction for machine learning methods (ML) in the scope (task) of vehicle make and model recognition (VMMR).
The dataset contains 162 vehicle models from 43 brands, with 200 images each in each of 8 sub-datasets that allow different imaging qualities to be investigated. In addition to the classification annotations, the dataset also contains label images for semantic segmentation, information on image and scene properties, and the vehicle color.
The dataset was presented in May 2024 by Anne Sielemann, Stefan Wolf, Masoud Roschani, Jens Ziehn and Jürgen Beyerer in the publication: Sielemann, A., Wolf, S., Roschani, M., Ziehn, J. and Beyerer, J. (2024). Synset Boulevard: A Synthetic Image Dataset for VMMR. In 2024 IEEE International Conference on Robotics and Automation (ICRA).
The model information is based on information from the ADAC online database (www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle).
The data was generated using the simulation environment OCTANE (www.octane.org), which uses the Cycles ray tracer of the Blender project.
The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.
The dataset was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, and was additionally funded by the "Invest BW" funding program of the Ministry of Economic Affairs, Labour and Tourism as part of the "FeinSyn" research project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Occupational stress is a major concern for employers and organizations as it compromises decision-making and overall safety of workers. Studies indicate that work-stress contributes to severe mental strain, increased accident rates, and in extreme cases, even suicides. This study aims to enhance early detection of occupational stress through machine learning (ML) methods, providing stakeholders with better insights into the underlying causes of stress to improve occupational safety. Utilizing a newly published workplace survey dataset, we developed a novel feature selection pipeline identifying 39 key indicators of work-stress. An ensemble of three ML models achieved a state-of-the-art accuracy of 90.32%, surpassing existing studies. The framework’s generalizability was confirmed through a three-step validation technique: holdout-validation, 10-fold cross-validation, and external-validation with synthetic data generation, achieving an accuracy of 89% on unseen data. We also introduced a 1D-CNN to enable hierarchical and temporal learning from the data. Additionally, we created an algorithm to convert tabular data into texts with 100% information retention, facilitating domain analysis with large language models, revealing that occupational stress is more closely related to the biomedical domain than clinical or generalist domains. Ablation studies reinforced our feature selection pipeline, and revealed sociodemographic features as the most important. Explainable AI techniques identified excessive workload and ambiguity (27%), poor communication (17%), and a positive work environment (16%) as key stress factors. Unlike previous studies relying on clinical settings or biomarkers, our approach streamlines stress detection from simple survey questions, offering a real-time, deployable tool for periodic stress assessment in workplaces.
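As an illustration of the tabular-to-text idea mentioned above (a generic sketch, not the authors' specific algorithm), one lossless serialisation simply keeps every column name and value verbatim:

```python
import pandas as pd

def row_to_text(row: pd.Series) -> str:
    """Serialise one survey row as text without losing information:
    every column name and value is kept verbatim."""
    return "; ".join(f"{col} is {row[col]}" for col in row.index) + "."

# Toy survey columns, purely to show the conversion.
df = pd.DataFrame({"workload": ["high", "low"], "communication": ["poor", "good"]})
texts = df.apply(row_to_text, axis=1).tolist()
print(texts[0])   # "workload is high; communication is poor."
```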
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CPIQA is a large-scale QA dataset focused on figures extracted from scientific research papers from various peer-reviewed venues in the climate science domain. The extracted figures include tables, graphs and diagrams, which inform the generation of questions using large language models (LLMs). Notably, this dataset includes questions for 3 audiences: the general public, climate skeptics and climate experts. Four types of questions are generated with various focuses, including figure-based, numerical, text-only and general questions. This results in 12 questions generated per scientific paper. Alongside the figures, descriptions of the figures generated using multimodal LLMs are included and used.
This work was funded through the WCSSP South Africa project, a collaborative initiative between the Met Office, South African and UK partners, supported by the International Science Partnership Fund (ISPF) from the UK's Department for Science, Innovation and Technology (DSIT). It is also supported by the Natural Environment Research Council (grant NE/S015604/1) project GloSAT.
Mutalik, R., Panchalingam, A., Loitongbam, G., Osborn, T. J., Hawkins, E., and Middleton, S. E. CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion. ClimateNLP-2025, ACL, 31 July 2025. https://nlp4climate.github.io/
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Ray-tracing-based image generation methodology to render realistic images of particle image velocimetry (PIV) and background-oriented schlieren (BOS) experiments in the presence of density/refractive-index gradients.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record provides a dataset created as part of the study presented in the following publication and is made publicly available for research purposes. The associated article provides a comprehensive description of the dataset, its structure, and the methodology used in its creation. If you use this dataset, please cite the following article published in the journal IEEE Communications Magazine:
A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.
More specifically, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter flows within a realistic virtualized network environment. Heavy Hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes the following flow statistics commonly used in the literature on network traffic classification:
To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:
- All at Once
- Balanced Traffic Generation
- DDoS at Intervals
- Only Benign HH Traffic
- Only DDoS Traffic
- Only Normal Traffic
- Unbalanced Traffic Generation
For each variation, the output of each packet aggregator is provided in its respective folder.
Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Wetlands in the Prairie Pothole region of Canada are critical ecosystems that support biodiversity and provide essential ecosystem services, yet they face increasing threats from agricultural expansion and climate change. Remote sensing offers a powerful tool for monitoring these landscapes over time, enabling large-scale, consistent land cover classification compared to conventional field-based methods. However, machine learning approaches often struggle with rare wetland classes, such as fens, due to limited training data. To address this challenge, a CycleGAN model, a generative adversarial network (GAN) designed for image-to-image translation, was used to generate synthetic four-band orthophoto imagery of fens from more readily available marsh imagery. The model was trained using classified wetland imagery from central Saskatchewan within the ArcGIS deep learning framework, and the resulting synthetic fen images were statistically compared to real fen images. A t-test (p < 0.05) revealed significant differences in mean pixel intensity across all bands except blue, while Jensen-Shannon divergence values (Blue: 0.1288, Green: 0.2077, Red: 0.2339, IR: 0.1885) indicated relative similarity between real and synthetic histograms. Additionally, synthetic images exhibited significantly higher mean entropy values in all four bands (p < 0.05), suggesting increased variability. These results demonstrate that CycleGAN-generated images retain key spectral characteristics of real fens while introducing additional diversity, offering a potential solution for improving wetland classification models.
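For context, histogram-based comparisons of this kind can be reproduced with standard tooling. The sketch below computes the Jensen-Shannon divergence and Shannon entropy for one image band, with random arrays standing in for the real and CycleGAN-generated fen pixels (the values produced here are not the study's results).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

rng = np.random.default_rng(1)
# Stand-ins for one band (e.g. green) of real vs. CycleGAN-generated fen pixels.
real_band = rng.normal(120, 25, size=100_000).clip(0, 255)
synth_band = rng.normal(128, 30, size=100_000).clip(0, 255)

# Normalised 256-bin histograms of pixel intensity for each band.
bins = np.arange(257)
p, _ = np.histogram(real_band, bins=bins, density=True)
q, _ = np.histogram(synth_band, bins=bins, density=True)

# scipy returns the Jensen-Shannon *distance*; square it to get the divergence.
jsd = jensenshannon(p, q, base=2) ** 2
print(f"Jensen-Shannon divergence: {jsd:.4f}")

# Shannon entropy of each band's normalised histogram.
print(f"entropy real: {entropy(p, base=2):.3f}, "
      f"entropy synthetic: {entropy(q, base=2):.3f}")
```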
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Protein synthesis in natural cells involves intricate interactions between chemical environments, protein-protein interactions, and protein machinery. Replicating such interactions in artificial and cell-free environments can control the precision of protein synthesis, elucidate complex cellular mechanisms, create synthetic cells, and discover new therapeutics. Yet, creating artificial synthesis environments, particularly for membrane proteins, is challenging due to the poorly defined chemical-protein-lipid interactions. Here, we introduce MEMPLEX (Membrane Protein Learning and Expression), which utilizes machine learning and a fluorescent reporter to rapidly design artificial synthesis environments of membrane proteins. MEMPLEX generates over 20,000 different artificial chemical-protein environments spanning 28 membrane proteins. It captures the interdependent impact of lipid types, chemical environments, chaperone proteins, and protein structures on membrane protein synthesis. As a result, MEMPLEX creates new artificial environments that successfully synthesize membrane proteins of broad interest but previously intractable. In addition, we identify a quantitative metric, based on the hydrophobicity of the membrane-contacting amino acids, that predicts membrane protein synthesis in artificial environments. Our work allows others to rapidly study and resolve the “dark” proteome using predictive generation of artificial chemical-protein environments. Furthermore, the results represent a new frontier in artificial intelligence-guided approaches to creating synthetic environments for protein synthesis.