Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Synthetic Data Generation For Ocean Environment With Raycast is a dataset for object detection tasks - it contains Human Boat annotations for 6,299 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.datainsightsmarket.com/privacy-policy
The synthetic data generation market is experiencing explosive growth, driven by the increasing need for high-quality data in various applications, including AI/ML model training, data privacy compliance, and software testing. The market, currently estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033.

This significant expansion is fueled by several key factors. Firstly, the rising adoption of artificial intelligence and machine learning across industries demands large, high-quality datasets, which are often unavailable due to privacy concerns or data scarcity. Synthetic data provides a solution by generating realistic, privacy-preserving datasets that mirror real-world data without compromising sensitive information. Secondly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to explore alternative data solutions, making synthetic data a crucial tool for compliance. Finally, advancements in generative AI models and algorithms are improving the quality and realism of synthetic data, expanding its applicability across domains. Major players like Microsoft, Google, and AWS are actively investing in this space, driving further market expansion.

The market segmentation reveals a diverse landscape with numerous specialized solutions. While large technology firms dominate the broader market, smaller, more agile companies are making significant inroads with specialized offerings focused on specific industry needs or data types. The geographical distribution is expected to be skewed towards North America and Europe initially, given the high concentration of technology companies and early adoption of advanced data technologies. However, growing awareness and increasing data needs in other regions are expected to drive substantial market growth in Asia-Pacific and other emerging markets in the coming years. The competitive landscape is characterized by a mix of established players and innovative startups, leading to continuous innovation and expansion of market applications. This dynamic environment indicates sustained growth in the foreseeable future, driven by an increasing recognition of synthetic data's potential to address critical data challenges across industries.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present a synthetic Medicare claims dataset linked to environmental exposures and potential confounders. In most environmental health studies relying on claims data, data restrictions exist and the data cannot be shared publicly. The Centers for Medicare and Medicaid Services (CMS) has generated synthetic, publicly available Medicare claims data for 2008-2010. In this dataset, we link the 2010 synthetic Medicare claims data to environmental exposures and potential confounders. We aggregated the 2010 synthetic Medicare claims data to the county level. Data is compiled for the contiguous United States, which in 2010 included 3,109 counties. We merged the synthetic Medicare claims data with air pollution exposure data, specifically estimates of PM2.5 exposure obtained from Di et al. (2019, 2021), which provide daily and annual estimates of PM2.5 exposure at 1 km × 1 km grid cells in the contiguous United States. We use the Census Bureau (United States Census Bureau, 2021), the Centers for Disease Control and Prevention (CDC, 2021), and GridMET (Abatzoglou, 2013) to obtain data on potential confounders. The mortality rate, as the outcome, was computed using the synthetic Medicare data (CMS, 2021). We use the average of surrounding counties to impute missing observations, except in the case of the CDC confounders, where we imputed missing values by fitting a normal distribution for each state and randomly drawing from this distribution. The steps for generating the merged dataset are provided at the NSAPH Synthetic Data Github Repository (https://github.com/NSAPH/synthetic_data). Analytic inferences based on this synthetic dataset should not be made. The aggregated dataset is composed of 46 columns and 3,109 rows.
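As a rough illustration of the two imputation strategies described above (averaging over neighbouring counties, and per-state normal sampling for the CDC confounders), a minimal pandas/NumPy sketch follows; the column names and the use of same-state counties as a stand-in for geographically surrounding counties are assumptions, not the dataset's actual schema or procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy county-level table; real column names in the merged dataset may differ.
df = pd.DataFrame({
    "county":    ["A", "B", "C", "D"],
    "state":     ["MA", "MA", "NH", "NH"],
    "pm25":      [7.1, np.nan, 6.3, 6.8],
    "cdc_value": [0.21, 0.25, np.nan, 0.30],
})

# Strategy 1 (most confounders): impute from the average of surrounding counties.
# "Surrounding" is approximated here by same-state neighbours for brevity.
df["pm25"] = df.groupby("state")["pm25"].transform(lambda s: s.fillna(s.mean()))

# Strategy 2 (CDC confounders): fit a normal distribution per state and
# randomly impute missing values from it.
def impute_from_state_normal(s: pd.Series) -> pd.Series:
    mu, sigma = s.mean(), s.std(ddof=0)
    fill = rng.normal(mu, sigma if sigma > 0 else 1e-6, size=s.isna().sum())
    s = s.copy()
    s.loc[s.isna()] = fill
    return s

df["cdc_value"] = df.groupby("state")["cdc_value"].transform(impute_from_state_normal)
print(df)
```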
SDNist (v1.3) is a set of benchmark data and metrics for the evaluation of synthetic data generators on structured tabular data. This version (1.3) reproduces the challenge environment from Sprints 2 and 3 of the Temporal Map Challenge. These benchmarks are distributed as a simple open-source Python package to allow standardized and reproducible comparison of synthetic generator models on real-world data and use cases. The data and metrics were developed for and vetted through the NIST PSCR Differential Privacy Temporal Map Challenge, where the evaluation tools, k-marginal and Higher Order Conjunction, proved effective in distinguishing competing models in the competition environment.

SDNist is available via pip for Python >= 3.6 (`pip install sdnist==1.2.8`) or from USNIST/Github. The sdnist Python module downloads data from NIST as necessary, so users are not required to download data manually.
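For orientation, a quickstart in the style of the package's published examples is sketched below; the `sdnist.census()` and `sdnist.score()` calls and their signatures are assumptions that may differ between releases, so consult the USNIST/Github README for the installed version.

```python
# Minimal sdnist 1.x-style quickstart (function names assumed from the
# package's published examples; verify against the installed version).
import sdnist

# Load the benchmark ("target") data and its schema; data is downloaded
# from NIST on first use.
dataset, schema = sdnist.census()

# Stand-in "synthetic" data: here just a subsample of the target, purely
# to show the scoring call. A real run would use a generator's output.
synthetic = dataset.sample(frac=0.1, random_state=0)

# Evaluate with the k-marginal metric used in the Temporal Map Challenge.
score = sdnist.score(dataset, synthetic, schema, challenge="census")
print(score)
```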
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the semantic and instance segmentation experiments in the "replicAnt - generating annotated images of animals in complex environments using Unreal Engine" manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data were generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform were representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.
For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR Blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded at 55 fps and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.
The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.
Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. Forty-six key points were subsequently hand-annotated with the DLC annotation GUI in 805 frames for the platform case and 200 frames for the handheld case.
Synthetic data
We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6-core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.
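The randomisation itself happens inside the replicAnt/Unreal pipeline; purely to illustrate how per-sample parameters for 10 sub-datasets of 1,000 samples each could be drawn, here is a small sketch in which the hue/brightness/contrast/saturation ranges are placeholder assumptions, not the manuscript's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
SUB_DATASETS, SAMPLES_PER_SUB = 10, 1000

def sample_params():
    """Draw one sample's randomisation parameters (illustrative ranges only)."""
    return {
        # +/- 70 % scale variation around the nominal model size
        "scale": 1.0 + rng.uniform(-0.7, 0.7),
        # colour shifts; ranges are placeholders, not the manuscript's values
        "hue_shift_deg": rng.uniform(-20, 20),
        "brightness":    rng.uniform(0.7, 1.3),
        "contrast":      rng.uniform(0.7, 1.3),
        "saturation":    rng.uniform(0.7, 1.3),
    }

dataset_config = [
    [sample_params() for _ in range(SAMPLES_PER_SUB)]
    for _ in range(SUB_DATASETS)
]
print(len(dataset_config), "sub-datasets of", len(dataset_config[0]), "samples")
```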
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
About Dataset: The file contains 3D point cloud data of a fabricated (synthetic) plant with 10 sequences. Each sequence contains data for days 0-19, covering every growth stage of that sequence.
The Importance of Synthetic Plant Datasets: Synthetic plant files are carefully curated collections of computer-generated images that mimic the diverse appearance and growth stages of real plants.
Training and Evaluation: With fabricated plant files, researchers can train and evaluate machine learning models in a controlled environment, free from the limitations of real-world data collection. This controlled setting enables more efficient development and ensures consistent performance across various environmental conditions.
Applications in Agricultural Technology - Plant Phenotyping: Synthetic plant files enable researchers to analyze plant traits and characteristics at scale, facilitating plant phenotyping studies aimed at understanding genetic traits, environmental influences, and crop performance.
Crop Monitoring: With the rise of precision agriculture, fabricated plant files play a crucial role in developing remote sensing techniques for monitoring crop health, detecting pest infestations, and optimizing irrigation strategies.
Advancements in Computer Vision and Machine Learning - Object Detection: These files serve as benchmarking tools for training and evaluating object detection algorithms tailored to identifying plants, fruits, and diseases in agricultural settings.
Future Directions and Challenges - Dataset Diversity: As the demand for more diverse and realistic synthetic data grows, researchers face the challenge of generating data that accurately reflects the variability observed in real-world agricultural environments.
Researchers continue to explore techniques for bridging the gap between synthetic and real data to enhance model robustness and applicability.
Conclusion: Synthetic plant datasets represent a cornerstone in the development of cutting-edge technologies for agricultural monitoring, plant phenotyping, and disease diagnosis. By harnessing the power of synthetic data generation and machine learning, researchers can unlock new insights into plant biology and revolutionize the future of agriculture.
This dataset is sourced from Kaggle.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Toxic Agent - Qwen Synthetic Data: Magpie-like Climate Disinformation Dataset
Dataset Description
Overview
This dataset contains synthetic climate change-related statements, including various forms of climate disinformation and denial. It was created by generating variations and transformations of real climate-related statements, producing a diverse set of synthetic examples across different categories of climate disinformation. Total examples from… See the full description on the dataset page: https://huggingface.co/datasets/DataTonic/climate-guard-synthetic_data_qwen_toxic_agent.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) models often require large volumes of data to learn a given task. However, training data can be difficult to access, or may not exist at all, due to privacy laws and limited availability. A solution is to generate synthetic data that represents the real data. In the maritime environment, the ability to generate realistic vessel positional data is important for the development of ML models in ocean areas with scarce amounts of data, such as the Arctic, or for generating an abundance of anomalous or unique events needed for training detection models. This research explores the use of conditional generative adversarial networks (CGAN) to generate vessel displacement tracks over a 24-hour period in a constraint-free environment. The model is trained using Automatic Identification System (AIS) data that contains vessel tracking information. The results show that the CGAN is able to generate vessel displacement tracks for two different vessel types, cargo ships and pleasure crafts, for three months of the year (May, July, and September). To evaluate the usability of the generated data and the robustness of the CGAN model, three ML vessel classification models using displacement track data are developed with generated data and tested with real data.
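As a structural illustration of a conditional GAN for this task (not the authors' architecture), the sketch below conditions a generator and discriminator on a combined vessel-type/month label and produces fixed-length 24-hour displacement tracks; the track length, latent size, and layer widths are assumptions.

```python
import torch
import torch.nn as nn

TRACK_LEN = 24 * 2      # assumed: (dx, dy) displacement per hour over 24 h
LATENT_DIM = 64
NUM_CLASSES = 6         # assumed: {cargo, pleasure craft} x {May, July, September}

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, 16)
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 16, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, TRACK_LEN),
        )

    def forward(self, z, labels):
        # Concatenate noise with the condition embedding, then map to a track.
        x = torch.cat([z, self.embed(labels)], dim=1)
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, 16)
        self.net = nn.Sequential(
            nn.Linear(TRACK_LEN + 16, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),   # real/fake logit
        )

    def forward(self, tracks, labels):
        x = torch.cat([tracks, self.embed(labels)], dim=1)
        return self.net(x)

# One generation step: sample noise plus a condition and produce tracks.
G = Generator()
z = torch.randn(8, LATENT_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))
fake_tracks = G(z, labels)          # shape: (8, TRACK_LEN)
print(fake_tracks.shape)
```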
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Synthetic data for DCASE 2019 task 4
Freesound dataset [1,2]: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology [3].
SINS dataset [4]: The derivative of the SINS dataset used for DCASE 2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.
The synthetic set is composed of 10-second audio clips generated with Scaper [5]. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip, and we checked whether the event onset and offset were present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class.
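A minimal Scaper sketch along these lines is shown below; the folder layout and the distribution parameters (event time, duration, SNR) are illustrative assumptions, not the exact settings used to build the DCASE 2019 task 4 synthetic set.

```python
import scaper

# Assumed local folder layout: one sub-folder per foreground (FSD) label
# and per background (SINS) label.
FG_PATH, BG_PATH = "soundbank/foreground", "soundbank/background"

sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH, random_state=0)
sc.ref_db = -50

# One background spanning the whole 10 s clip, chosen from the SINS-derived material.
sc.add_background(label=("choose", []),
                  source_file=("choose", []),
                  source_time=("const", 0))

# One foreground event with randomised onset, duration and SNR
# (distribution parameters are placeholders).
sc.add_event(label=("choose", []),
             source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0.0, 8.0),
             event_duration=("truncnorm", 3.0, 1.0, 0.5, 5.0),
             snr=("uniform", 6, 30),
             pitch_shift=None,
             time_stretch=None)

# Writes the 10 s soundscape plus a JAMS file with the ground-truth annotation.
sc.generate(audio_path="clip_0.wav", jams_path="clip_0.jams")
```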
License:
All sounds coming from FSD are released under Creative Commons licences. Synthetic sounds can only be used for competition purposes until the full CC licence list is made available at the end of the competition.
Further information is available on the DCASE website.
References:
[1] F. Font, G. Roma & X. Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
[2] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter & X. Serra. Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
[3] Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings IEEE ICASSP 2017, New Orleans, LA, 2017.
[4] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: A library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.
http://dcat-ap.de/def/licenses/cc-by
The Synset Signset Germany dataset contains a total of 211,000 synthetically generated images of current German traffic signs (including the 2020 update) for machine learning methods (ML) in the area of application (task) of traffic sign recognition.
The dataset contains 211 German traffic sign classes with 500 images each, and is divided into two sub-datasets, which were generated with different rendering engines. In addition to the classification annotations, the data set also contains label images for segmentation of traffic signs, binary masks, as well as extensive information on image and scene properties, in particular on image artifacts.
The dataset was presented in September 2024 by Anne Sielemann, Lena Lörcher, Max-Lion Schumacher, Stefan Wolf, Jens Ziehn, Masoud Roschani and Jürgen Beyerer in the publication: Sielemann, A., Loercher, L., Schumacher, M., Wolf, S., Roschani, M., Ziehn, J., and Beyerer, J. (2024). Synset Signset Germany: A Synthetic Dataset for German Traffic Sign Recognition. In 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC).
The traffic sign designs are based on the Wikipedia chart of traffic signs in the Federal Republic of Germany since 2017 (https://en.wikipedia.org/wiki/Bildtafel_der_Verkehrszeichen_in_der_Bundesrepublik_Deutschland_seit_2017).
The data was generated with the simulation environment OCTANE (www.octane.org). One subset uses the Cycles Raytracer of the Blender project (www.cycles-renderer.org), the other (otherwise identical) subset uses the 3D rasterization engine OGRE3D (www.ogre3d.org).
The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.
The dataset was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, and was additionally funded by the "New Vehicle and System Technologies" funding program of the Federal Ministry for Economic Affairs and Climate Protection of the Federal Republic of Germany (BMWK) as part of the "AVEAS" research project (www.aveas.org).
The generative synthesis of dirt and wear textures for the traffic signs was trained on real traffic sign data, which was collected with the kind support of the Civil Engineering Office Karlsruhe.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here. Dataset Description: ActiveHuman was generated using Unity's Perception package. It consists of 175,428 RGB images and their semantic segmentation counterparts, captured in different environments and lighting conditions and at different camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images for every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D bounding box, 3D bounding box and keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the Perception package.
Folder configuration: The dataset consists of 3 folders:
- JSON Data: Contains all the generated JSON files.
- RGB Images: Contains the generated RGB images.
- Semantic Segmentation Images: Contains the generated semantic segmentation images.
Essential Terminology
- Annotation: Recorded data describing a single capture.
- Capture: One completed rendering process of a Unity sensor which stored the rendered result to data files (e.g. PNG, JPG, etc.).
- Ego: Object or person to which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone is the ego and the camera is the sensor).
- Ego coordinate system: Coordinates with respect to the ego.
- Global coordinate system: Coordinates with respect to the global origin in Unity.
- Sensor: Device that captures the dataset (in this instance the sensor is a camera).
- Sensor coordinate system: Coordinates with respect to the sensor.
- Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
- UUID: Universally Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.
Dataset Data: The dataset includes 4 types of JSON annotation files:
annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
- id: Integer identifier of the annotation's definition.
- name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
- description: Description of the annotation's specifications.
- format: Format of the file containing the annotation specifications (e.g., json, PNG).
- spec: Format-specific specifications for the annotation values generated by each Labeler.
Most Labelers generate different annotation specifications in the spec key-value pair:
BoundingBox2DLabeler/BoundingBox3DLabeler:

- label_id: Integer identifier of a label.
- label_name: String identifier of a label.

KeypointLabeler:

- template_id: Keypoint template UUID.
- template_name: Name of the keypoint template.
- key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
  - label: Joint label.
  - index: Joint index.
  - color: RGBA values of the keypoint.
  - color_code: Hex color code of the keypoint.
- skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
  - label1: Label of the first joint.
  - label2: Label of the second joint.
  - joint1: Index of the first joint.
  - joint2: Index of the second joint.
  - color: RGBA values of the connection.
  - color_code: Hex color code of the connection.

SemanticSegmentationLabeler:

- label_name: String identifier of a label.
- pixel_value: RGBA values of the label.
- color_code: Hex color code of the label.
captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describe the state of each active sensor present in the scene. Each array entry contains the following key-value pairs (a short parsing sketch follows this list):

- id: UUID of the capture.
- sequence_id: UUID of the sequence.
- step: Index of the capture within a sequence.
- timestamp: Timestamp (in ms) since the beginning of a sequence.
- sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  - sensor_id: Sensor UUID.
  - ego_id: Ego UUID.
  - modality: Modality of the sensor (e.g., camera, radar).
  - translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
  - camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  - projection: Projection type used by the camera (e.g., orthographic, perspective).
- ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  - ego_id: Ego UUID.
  - translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable containing the ego's orientation.
  - velocity: 3D vector containing the ego's velocity (in meters per second).
  - acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
- format: Format of the file captured by the sensor (e.g., PNG, JPG).
- annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  - id: Annotation UUID.
  - annotation_definition: Integer identifier of the annotation's definition.
  - filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  - values: List of key-value pairs containing annotation data for the current Labeler.
Each Labeler generates different annotation specifications in the values key-value pair:
BoundingBox2DLabeler:

- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
- x: Position of the 2D bounding box on the X axis.
- y: Position of the 2D bounding box on the Y axis.
- width: Width of the 2D bounding box.
- height: Height of the 2D bounding box.

BoundingBox3DLabeler:

- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible in the same capture has a different instance_id value.
- translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
- size: 3D vector containing the size of the 3D bounding box (in meters).
- rotation: Quaternion variable containing the orientation of the 3D bounding box.
- velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
- acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).

KeypointLabeler:

- label_id: Integer identifier of a label.
- instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible in the same capture have different instance_id values.
- template_id: UUID of the keypoint template.
- pose: Pose label for that particular capture.
- keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry contains:
  - index: Index of the keypoint in the keypoint template file.
  - x: Pixel coordinates of the keypoint on the X axis.
  - y: Pixel coordinates of the keypoint on the Y axis.
  - state: State of the keypoint.
The SemanticSegmentationLabeler does not contain a values list.
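As a convenience for working with these files, here is a minimal Python sketch that reads one captures file and collects all 2D bounding boxes; it assumes the standard Unity Perception layout in which the entries sit under a top-level "captures" key, and the file name is a placeholder.

```python
import json
from pathlib import Path

# Read one captures_xyz.json file and pull out all 2D bounding boxes,
# following the schema described above. The path is a placeholder.
captures_path = Path("JSON Data/captures_000.json")
captures = json.loads(captures_path.read_text())

boxes = []
for capture in captures["captures"]:
    for annotation in capture.get("annotations", []):
        for value in annotation.get("values", []):
            # 2D bounding-box entries carry x/y/width/height pixel values.
            if {"x", "y", "width", "height"}.issubset(value):
                boxes.append({
                    "capture_id": capture["id"],
                    "label": value["label_name"],
                    "x": value["x"], "y": value["y"],
                    "width": value["width"], "height": value["height"],
                })

print(f"found {len(boxes)} 2D bounding boxes")
```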
egos.json: Contains collections of key-value pairs for each ego. These include:

- id: UUID of the ego.
- description: Description of the ego.

sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:

- id: UUID of the sensor.
- ego_id: UUID of the ego on which the sensor is attached.
- modality: Modality of the sensor (e.g., camera, radar, sonar).
- description: Description of the sensor (e.g., camera, radar).
Image names: The RGB and semantic segmentation images share the same naming convention. However, the semantic segmentation images also contain the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:
- e denotes the id of the environment.
- h denotes the id of the person.
- l denotes the id of the lighting condition.
- d denotes the camera distance at which the image was captured.
- r denotes the camera angle at which the image was captured.
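A small helper that decodes this naming convention might look as follows (a sketch; the exact encoding of each field is defined by the dataset itself, so values are returned as strings):

```python
from pathlib import Path

def parse_activehuman_name(path: str) -> dict:
    """Decode an ActiveHuman image name of the form e_h_l_d_r.jpg."""
    stem = Path(path).stem
    # Semantic segmentation images carry a "Semantic_" prefix on the same pattern.
    if stem.startswith("Semantic_"):
        stem = stem[len("Semantic_"):]
    e, h, l, d, r = stem.split("_")
    return {"environment": e, "human": h, "lighting": l,
            "camera_distance": d, "camera_angle": r}

# Hypothetical file name, purely to show the split:
print(parse_activehuman_name("Semantic_3_12_2_2_140.jpg"))
```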
http://dcat-ap.de/def/licenses/cc-by
The Synset Boulevard dataset contains a total of 259,200 synthetically generated images of cars from a frontal traffic camera perspective, annotated by vehicle makes, models and years of construction for machine learning methods (ML) in the scope (task) of vehicle make and model recognition (VMMR).
The dataset contains 162 vehicle models from 43 brands, with 200 images each in each of 8 sub-datasets that allow different imaging qualities to be investigated. In addition to the classification annotations, the dataset also contains label images for semantic segmentation, information on image and scene properties, and the vehicle color.
The dataset was presented in May 2024 by Anne Sielemann, Stefan Wolf, Masoud Roschani, Jens Ziehn and Jürgen Beyerer in the publication: Sielemann, A., Wolf, S., Roschani, M., Ziehn, J. and Beyerer, J. (2024). Synset Boulevard: A Synthetic Image Dataset for VMMR. In 2024 IEEE International Conference on Robotics and Automation (ICRA).
The model information is based on information from the ADAC online database (www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle).
The data was generated using the simulation environment OCTANE (www.octane.org), which uses the Cycles ray tracer of the Blender project.
The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.
The dataset was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, and was additionally funded by the "Invest BW" funding program of the Ministry of Economic Affairs, Labour and Tourism as part of the "FeinSyn" research project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Occupational stress is a major concern for employers and organizations as it compromises decision-making and overall safety of workers. Studies indicate that work-stress contributes to severe mental strain, increased accident rates, and in extreme cases, even suicides. This study aims to enhance early detection of occupational stress through machine learning (ML) methods, providing stakeholders with better insights into the underlying causes of stress to improve occupational safety. Utilizing a newly published workplace survey dataset, we developed a novel feature selection pipeline identifying 39 key indicators of work-stress. An ensemble of three ML models achieved a state-of-the-art accuracy of 90.32%, surpassing existing studies. The framework’s generalizability was confirmed through a three-step validation technique: holdout-validation, 10-fold cross-validation, and external-validation with synthetic data generation, achieving an accuracy of 89% on unseen data. We also introduced a 1D-CNN to enable hierarchical and temporal learning from the data. Additionally, we created an algorithm to convert tabular data into texts with 100% information retention, facilitating domain analysis with large language models, revealing that occupational stress is more closely related to the biomedical domain than clinical or generalist domains. Ablation studies reinforced our feature selection pipeline, and revealed sociodemographic features as the most important. Explainable AI techniques identified excessive workload and ambiguity (27%), poor communication (17%), and a positive work environment (16%) as key stress factors. Unlike previous studies relying on clinical settings or biomarkers, our approach streamlines stress detection from simple survey questions, offering a real-time, deployable tool for periodic stress assessment in workplaces.
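As an illustration of the tabular-to-text idea mentioned above (a generic sketch, not the authors' specific algorithm), one lossless serialisation simply keeps every column name and value verbatim:

```python
import pandas as pd

def row_to_text(row: pd.Series) -> str:
    """Serialise one survey row as text without losing information:
    every column name and value is kept verbatim."""
    return "; ".join(f"{col} is {row[col]}" for col in row.index) + "."

# Toy survey columns, purely to show the conversion.
df = pd.DataFrame({"workload": ["high", "low"], "communication": ["poor", "good"]})
texts = df.apply(row_to_text, axis=1).tolist()
print(texts[0])   # "workload is high; communication is poor."
```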
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CPIQA is a large-scale QA dataset focused on figures extracted from scientific research papers from various peer-reviewed venues in the climate science domain. The extracted figures include tables, graphs and diagrams, which inform the generation of questions using large language models (LLMs). Notably, this dataset includes questions for 3 audiences: the general public, climate skeptics and climate experts. Four types of questions are generated with various focuses, including figure-based, numerical, text-only and general questions. This results in 12 questions generated per scientific paper. Alongside the figures, descriptions of the figures generated using multimodal LLMs are included and used.
This work was funded through the WCSSP South Africa project, a collaborative initiative between the Met Office, South African and UK partners, supported by the International Science Partnership Fund (ISPF) from the UK's Department for Science, Innovation and Technology (DSIT). It is also supported by the Natural Environment Research Council (grant NE/S015604/1) project GloSAT.
Mutalik, R., Panchalingam, A., Loitongbam, G., Osborn, T. J., Hawkins, E., and Middleton, S. E. CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion. ClimateNLP-2025, ACL, 31 July 2025. https://nlp4climate.github.io/
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Ray-tracing-based image generation methodology to render realistic images of particle image velocimetry (PIV) and background-oriented schlieren (BOS) experiments in the presence of density/refractive-index gradients.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record provides a dataset created as part of the study presented in the following publication and is made publicly available for research purposes. The associated article provides a comprehensive description of the dataset, its structure, and the methodology used in its creation. If you use this dataset, please cite the following article published in the journal IEEE Communications Magazine:
A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.
More specifically, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter flows within a realistic virtualized network environment. Heavy Hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes the following flow statistics commonly used in the literature on network traffic classification:
To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:
- All at Once
- Balanced Traffic Generation
- DDoS at Intervals
- Only Benign HH Traffic
- Only DDoS Traffic
- Only Normal Traffic
- Unbalanced Traffic Generation
For each variation, the output of each packet aggregator is provided in its respective folder.
Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Wetlands in the Prairie Pothole region of Canada are critical ecosystems that support biodiversity and provide essential ecosystem services, yet they face increasing threats from agricultural expansion and climate change. Remote sensing offers a powerful tool for monitoring these landscapes over time, enabling large-scale, consistent land cover classification compared to conventional field-based methods. However, machine learning approaches often struggle with rare wetland classes, such as fens, due to limited training data. To address this challenge, a CycleGAN model, a generative adversarial network (GAN) designed for image-to-image translation, was used to generate synthetic four-band orthophoto imagery of fens from more readily available marsh imagery. The model was trained using classified wetland imagery from central Saskatchewan within the ArcGIS deep learning framework, and the resulting synthetic fen images were statistically compared to real fen images. A t-test (p < 0.05) revealed significant differences in mean pixel intensity across all bands except blue, while Jensen-Shannon divergence values (Blue: 0.1288, Green: 0.2077, Red: 0.2339, IR: 0.1885) indicated relative similarity between real and synthetic histograms. Additionally, synthetic images exhibited significantly higher mean entropy values in all four bands (p < 0.05), suggesting increased variability. These results demonstrate that CycleGAN-generated images retain key spectral characteristics of real fens while introducing additional diversity, offering a potential solution for improving wetland classification models.
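For context, histogram-based comparisons of this kind can be reproduced with standard tooling. The sketch below computes the Jensen-Shannon divergence and Shannon entropy for one image band, with random arrays standing in for the real and CycleGAN-generated fen pixels (the values produced here are not the study's results).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

rng = np.random.default_rng(1)
# Stand-ins for one band (e.g. green) of real vs. CycleGAN-generated fen pixels.
real_band = rng.normal(120, 25, size=100_000).clip(0, 255)
synth_band = rng.normal(128, 30, size=100_000).clip(0, 255)

# Normalised 256-bin histograms of pixel intensity for each band.
bins = np.arange(257)
p, _ = np.histogram(real_band, bins=bins, density=True)
q, _ = np.histogram(synth_band, bins=bins, density=True)

# scipy returns the Jensen-Shannon *distance*; square it to get the divergence.
jsd = jensenshannon(p, q, base=2) ** 2
print(f"Jensen-Shannon divergence: {jsd:.4f}")

# Shannon entropy of each band's normalised histogram.
print(f"entropy real: {entropy(p, base=2):.3f}, "
      f"entropy synthetic: {entropy(q, base=2):.3f}")
```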
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Protein synthesis in natural cells involves intricate interactions between chemical environments, protein-protein interactions, and protein machinery. Replicating such interactions in artificial and cell-free environments can control the precision of protein synthesis, elucidate complex cellular mechanisms, create synthetic cells, and discover new therapeutics. Yet, creating artificial synthesis environments, particularly for membrane proteins, is challenging due to the poorly defined chemical-protein-lipid interactions. Here, we introduce MEMPLEX (Membrane Protein Learning and Expression), which utilizes machine learning and a fluorescent reporter to rapidly design artificial synthesis environments of membrane proteins. MEMPLEX generates over 20,000 different artificial chemical-protein environments spanning 28 membrane proteins. It captures the interdependent impact of lipid types, chemical environments, chaperone proteins, and protein structures on membrane protein synthesis. As a result, MEMPLEX creates new artificial environments that successfully synthesize membrane proteins of broad interest but previously intractable. In addition, we identify a quantitative metric, based on the hydrophobicity of the membrane-contacting amino acids, that predicts membrane protein synthesis in artificial environments. Our work allows others to rapidly study and resolve the “dark” proteome using predictive generation of artificial chemical-protein environments. Furthermore, the results represent a new frontier in artificial intelligence-guided approaches to creating synthetic environments for protein synthesis.