The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
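For readers who want to see the mechanics before opening that script, the following is a minimal base-R sketch of the two-stage design described above (it is not the distributed script); it assumes hypothetical frames ea_frame (one row per enumeration area, with columns geo_1, urban_rural, and ea_id) and hh_frame (one row per household, with column ea_id).

```r
set.seed(2024)
n_hh_total <- 8000
hh_per_ea  <- 25
n_ea_total <- n_hh_total / hh_per_ea   # 320 enumeration areas in total

# Stage 1: allocate EAs to strata proportionally to stratum size
# (stratification by geo_1 and urban/rural), then draw EAs at random
# within each stratum (rounding kept deliberately simple here).
strata <- aggregate(ea_id ~ geo_1 + urban_rural, data = ea_frame, FUN = length)
names(strata)[3] <- "n_ea_stratum"
strata$n_ea_sampled <- round(n_ea_total * strata$n_ea_stratum / sum(strata$n_ea_stratum))

sampled_eas <- do.call(rbind, lapply(seq_len(nrow(strata)), function(i) {
  pool <- subset(ea_frame,
                 geo_1 == strata$geo_1[i] & urban_rural == strata$urban_rural[i])
  pool[sample(nrow(pool), strata$n_ea_sampled[i]), ]
}))

# Stage 2: select 25 households at random within each sampled EA
sample_hh <- do.call(rbind, lapply(sampled_eas$ea_id, function(id) {
  pool <- subset(hh_frame, ea_id == id)
  pool[sample(nrow(pool), hh_per_ea), ]
}))
```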
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.
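As a purely illustrative example (not the actual validators used), a consistency check of this kind might look like the following R sketch, where the household/person tables and variable names are hypothetical:

```r
# Accept a synthetic household only if basic consistency checks pass;
# rejected households would be regenerated by the synthesis pipeline.
validate_household <- function(hh, persons) {
  roster_ok <- nrow(persons) == hh$household_size                        # roster matches reported size
  head_ok   <- any(persons$relationship == "head" & persons$age >= 15)   # an adult household head exists
  age_ok    <- all(persons$age >= 0 & persons$age <= 110)                # ages within a plausible range
  roster_ok && head_ok && age_ok
}
```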
This is a synthetic dataset; the "response rate" is 100%.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains simulation and analysis code used in the study "[Manuscript Title]" submitted to International Journal of Geo-Information (IJGI). The research investigates pathfinding through stochastic obstacle environments using fully synthetic data generated via spatial point processes and parameterized simulation models.
The simulation/ folder includes R scripts to generate the traversal cost data under various configurations, while the analysis/ folder provides code for statistical modeling (robust regression, random forest, and zero-inflated negative binomial regression) and visualization of the results. The results/ folder contains sample simulation outputs and figures used in the manuscript. All data used in the paper can be exactly reproduced using the provided code.
For inquiries or reproducibility questions, please contact the corresponding author.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developing robust deep learning models for fetal ultrasound image analysis requires comprehensive, high-quality datasets to effectively learn informative data representations within the domain. However, the scarcity of labelled ultrasound images poses substantial challenges, especially in low-resource settings. To tackle this challenge, we leverage synthetic data to enhance the generalizability of deep learning models. This study proposes a diffusion-based method, Fetal Ultrasound LoRA (FU-LoRA), which involves fine-tuning latent diffusion models using the LoRA technique to generate synthetic fetal ultrasound images. These synthetic images are integrated into a hybrid dataset that combines real-world and synthetic images to improve the performance of zero-shot classifiers in low-resource settings. Our experimental results on fetal ultrasound images from African cohorts demonstrate that FU-LoRA outperforms the baseline method, with a 13.73% increase in zero-shot classification accuracy. Furthermore, FU-LoRA achieves the highest accuracy of 82.40%, the highest F-score of 86.54%, and the highest AUC of 89.78%. These results demonstrate that the FU-LoRA method is effective for zero-shot classification of fetal ultrasound images in low-resource settings. Our code and data are publicly accessible on GitHub.
Our FU-LoRA method: Fine-tuning the pre-trained latent diffusion model (LDM) [2] using the LoRA method on a small fetal ultrasound dataset from high-resource settings (HRS). This approach integrates synthetic images to enhance generalization and performance of deep learning models. We conduct three fine-tuning sessions for the diffusion model to generate three LoRA models with different hyper-parameters: alpha in [8, 32, 128], and r in [8, 32, 128]. The merging rate alpha/r is fixed to 1. The purpose of this operation is to delve deeper into LoRA to uncover optimizations that can improve the model's performance and evaluate the effectiveness of parameter r in generating synthetic images.
The Spanish dataset (URL) in HRS includes 1,792 patient records in Spain [1]. All images are acquired during screening in the second and third trimesters of pregnancy using six different machines operated by operators with similar expertise. We randomly selected 20 Spanish ultrasound images from each of the five maternal–fetal planes (Abdomen, Brain, Femur, Thorax, and Other) to fine-tune the LDM using the LoRA technique, and 1,150 Spanish images (230 x 5 planes) to create the hybrid dataset. In summary, fine-tuning the LDM utilizes 100 images from 85 patients. Training downstream classifiers uses 6,148 images from 612 patients. Within the 6,148 images used for training, a subset of 200 images is randomly selected for validation purposes. The hybrid dataset employed in this study has a total of 1,150 Spanish images, representing 486 patients.
We create a synthetic dataset comprising 5,000 fetal ultrasound images (500 x 2 samplers x 5 planes), accessible to the open-source community. The generation process utilizes our LoRA model with rank r = 128 and the Euler and UniPC samplers, known for their efficiency. Subsequently, we integrate this synthetic dataset with a small amount of Spanish data to create a hybrid dataset.
The hyper-parameters of the LoRA models are defined as follows: batch size set to 2; LoRA learning rate to 1e-4; total training steps to 10,000; LoRA dimension to 128; mixed precision to fp16; learning-rate scheduler to constant; and input size (resolution) to 512. The model is trained on a single NVIDIA RTX A5000 (24 GB) with the 8-bit Adam optimizer in PyTorch.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Background: In 2020, 287,000 mothers died from complications of pregnancy or childbirth; one-third of these deaths (30%) occurred during the first 6 weeks after birth. Precision public health approaches leverage risk prediction to identify the most vulnerable patients and inform decisions around the use of scarce resources, including the frequency, intensity, and type of postnatal care follow-up visits. However, these approaches may not accurately or precisely predict risk for specific sub-groups of women who are statistically underrepresented in the total population, such as women who experience stillbirths.

Methods: We leverage our existing dataset of sociodemographic and clinical variables and health outcomes for mother and baby dyads in Uganda to generate a synthetic dataset to enhance our risk prediction model for identifying women at high risk of death or readmission in the 6 weeks after a hospital delivery.

Data Collection Methods: The original mom and baby project data were collected at the point of care using encrypted study tablets, and these data were then uploaded to a Research Electronic Data Capture (REDCap) database hosted at the BC Children's Hospital Research Institute (Vancouver, Canada). Following delivery and obtaining informed written consent, trained study nurses collected data grouped according to four periods of care: admission, delivery, discharge, and six-week post-discharge follow-up. Data from admission and delivery were captured from the hospital medical record where possible, and by direct observation, direct measurement, or patient interview when not. Discharge and post-discharge data were collected by observation, measurement, or interview. Six weeks after delivery, field officers contacted every mother and/or caregiver of newborns who survived to discharge to determine vital status, readmission, and care seeking for illnesses and routine postnatal care. In-person visits were completed in situations where participants could not be reached by phone. Mothers who had experienced a stillbirth were filtered from the overall dataset. The synthetic dataset was subsequently based on this stillbirth cohort and evaluated to ensure its statistical properties were maintained.

Data Processing Methods: Synthetic data and evaluation metrics were generated using the synthpop R package. The first variable (column) in the dataset is generated via random sampling with replacement, with subsequent variables generated conditional on all previously synthesized variables using a pre-specified algorithm. We used the classification and regression tree (CART) algorithm as it is non-parametric and compatible with all data types (continuous, categorical, ordinal). Additional setup for generating the synthetic dataset included identifying eligible and relevant variables for synthesis and outlining rules for variables that have branching logic (i.e., variables that are only entered if a previous variable has a specific response). For evaluation, we used the utility metric recommended by the authors of the synthpop package, the standardized propensity-score mean squared error (pMSE) ratio, which measures how easy it is to tell whether a data point comes from the original data or the synthetic dataset. All the standardized pMSE ratios were below 10, the suggested cut-off for acceptable utility proposed by the synthpop authors. Plots were also generated to visually compare the univariate distribution of each variable in the synthetic dataset against the original dataset.
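A minimal synthpop sketch of this workflow, assuming a hypothetical data frame stillbirth_cohort and an illustrative branching-logic rule (not the study's actual variables), might look like:

```r
library(synthpop)

# Branching-logic rule (illustrative): a follow-up variable is only filled in
# when a preceding variable takes a specific value.
rules   <- list(readmission_reason = "readmitted == 'no'")
rvalues <- list(readmission_reason = NA)

syn_obj <- syn(stillbirth_cohort,
               method  = "cart",     # classification and regression trees for every variable
               rules   = rules,
               rvalues = rvalues,
               seed    = 2023)

# Utility: standardized propensity-score MSE ratio (values below 10 taken as acceptable)
u <- utility.gen(syn_obj, stillbirth_cohort)
u$S_pMSE

# Visual comparison of univariate distributions, synthetic vs. original
compare(syn_obj, stillbirth_cohort)
```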
Ethics Declaration: Ethics approvals have been obtained from the Makerere University School of Public Health (MakSPH) Institutional Review Board (SPH-2021-177), the Uganda National Council of Science and Technology (UNCST) in Uganda (HS2174ES), and the University of British Columbia in Canada (H21-03709). This study has been registered at clinicaltrials.gov (NCT05730387).

Abbreviations:
JRRH: Jinja Regional Referral Hospital
MRRH: Mbarara Regional Referral Hospital
PNC: Post-natal care
SES: Socio-economic index
SpO2: Oxygen saturation

Study Protocol & Supplementary Materials: Smart Discharges for Mom & Baby 2.0: A cohort study to develop prognostic algorithms for post-discharge readmission and mortality among mother-infant dyads

NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator at sepsiscolab@bcchr.ca or visit our website.
This is a synthetic dataset that can be used by users interested in benchmarking methods of explainable artificial intelligence (XAI) for geoscientific applications. The dataset is specifically inspired by a climate forecasting setting (seasonal timescales), where the task is to predict regional climate variability given global climate information lagged in time. The dataset consists of a synthetic input X (a series of 2D arrays of random fields drawn from a multivariate normal distribution) and a synthetic output Y (a scalar series) generated using a nonlinear function F: R^d -> R.
The synthetic input aims to represent temporally independent realizations of anomalous global fields of sea surface temperature; the synthetic output series represents some type of regional climate variability of interest (temperature, precipitation totals, etc.); and the function F is a simplification of the climate system.
Since the nonlinear function F used to generate the output given the input is known, we also derive and provide the attribution of each output value to the corresponding input features. Using this synthetic dataset, users can train any AI model to predict Y given X and then implement XAI methods to interpret it. Based on the "ground truth" attribution of F, the user can assess the faithfulness of any XAI method.
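As a purely illustrative R sketch of this setup (not the authors' exact function F or spatial covariance), one can generate random input fields, a nonlinear scalar output, and the exact per-feature attributions as follows:

```r
set.seed(1)
n_samples <- 1000                   # temporally independent realizations
n_lat <- 36; n_lon <- 72
d <- n_lat * n_lon                  # flattened 2D field -> d input features

# Synthetic input X: here i.i.d. standard normal fields (the released dataset
# uses a multivariate normal with spatial covariance).
X <- matrix(rnorm(n_samples * d), nrow = n_samples)

# A simple nonlinear F: R^d -> R built from element-wise terms, so the exact
# contribution (attribution) of each input feature is known.
w <- rnorm(d, sd = 1 / sqrt(d))
contrib <- sweep(tanh(X), 2, w, `*`)   # n_samples x d ground-truth attribution maps
Y <- rowSums(contrib)                  # scalar output series
```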
NOTE: the spatial configuration of the observations in the NetCDF database file conforms to the planetocentric coordinate system (89.5N - 89.5S, 0.5E - 359.5E), where longitude is measured positive heading east from the prime meridian.
Abstract copyright UK Data Service and data collection copyright owner.
The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset to do so, and instead using only publicly available analytical output (i.e. output that was cleared for publication) to create the synthetic data. Such synthetic data may allow users to gain familiarity with and practise on data that is like the original before they gain access to the original data (where time in a secure setting may be limited).
The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide.
Variables available in this study relate to synthetic employment, earnings and demographic information for adults employed in England and Wales in 2011.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The synthetic predictor tables and fully synthetic neuroimaging data produced for the fully synthetic data analysis in the current study are available as Research Data from Mendeley Data. Ten fully synthetic datasets include synthetic gray matter images (NIfTI files) that were generated for analysis with simulated participant data (text files). An archive file, predictor_tables.tar.gz, contains ten fully synthetic predictor tables with information for 264 simulated subjects. Due to large file sizes, a separate archive was created for each set of synthetic gray matter image data: RBS001.tar.gz, …, RBS010.tar.gz. Regression analyses were performed for each synthetic dataset; average statistic maps were then made for each contrast and smoothed (see the accompanying paper for additional information).
The supplementary materials also include commented MATLAB and R code to implement the current neuroimaging data synthesis methods (SKexample.zip). The example data were selected from an earlier fMRI study (Kuchinsky et al., 2012) to demonstrate that the current approach can be used with other types of neuroimaging data. The example code can also be adapted to produce fully synthetic group-level datasets based on observed neuroimaging data from other sources. The zip archive includes a document with important information for performing the example analyses, and details that should be communicated with recipients of a synthetic neuroimaging dataset.
Kuchinsky, S.E., Vaden, K.I., Keren, N.I., Harris, K.C., Ahlstrom, J.B., Dubno, J.R., Eckert, M.A., 2012. Word intelligibility and age predict visual cortex activity during word listening. Cerebral Cortex 22, 1360–71. https://doi.org/10.1093/cercor/bhr211
This repository contains scripts, input files, and some example output files for the Residential Population Generator (RPGen), an R-based tool to generate synthetic human residential populations for use in estimating near-field chemical exposures. This tool is most readily adapted for use in the workflow of CHEM, the Combined Human Exposure Model, available in two other GitHub repositories in the HumanExposure project, including ProductUseScheduler and source2dose. CHEM is currently best suited to estimating exposure to product use. Outputs from RPGen are translated into ProductUseScheduler, whose outputs are subsequently used in source2dose.
AirGapAgent-R 🛡️🧠 A Benchmark for Evaluating Contextual Privacy of Personal LLM Agents
Code Repository: parameterlab/leaky_thoughts Paper: Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers Original Paper that detailed the procedure to create the dataset: AirGapAgent: Protecting Privacy-Conscious Conversational Agents (Bagdasarian et al.)
🧠 What is AirGapAgent-R? AirGapAgent-R is a probing benchmark designed to test contextual privacy in personal LLM agents, reconstructed from the original (unreleased) benchmark used in the AirGapAgent paper (Bagdasarian et al.). It simulates real-world data-sharing decisions where models must reason about whether user-specific data (e.g., age, medical history) should be revealed based on a specific task context.
The procedure used to create the dataset is detailed in Appendix C of our paper (see below).
📦 Dataset Structure
Profiles: 20 synthetic user profiles
Fields per Profile: 26 personal data fields (e.g., name, phone, medication)
Scenarios: 8 task contexts (e.g., doctor appointment, travel booking)
Total Prompts: 4,160 (20 profiles × 8 scenarios × 26 fields)
Each example includes:
- The user profile
- The scenario context
- The domain
- The data field that the model should consider whether to share or not
- A ground-truth label (should share / should not share the specific data field)
The prompt field is empty, as the prompts depend on the specific model / reasoning type being used. All available prompts are in the prompts folder of the code repository (parameterlab/leaky_thoughts).
We also include a smaller variant used in some of our experiments (e.g., in RAnA experiments) together with the two datasets used in the swapping experiments detailed in Appendix A.3 of our paper.
🧪 Use Cases Use this dataset to evaluate:
Reasoning trace privacy leakage
Trade-offs between utility (task performance) and privacy
Prompting strategies and anonymization techniques
Susceptibility to prompt injection and reasoning-based attacks
📊 Metrics In the associated paper, we evaluate:
Utility Score: % of correct data sharing decisions
Privacy Score: % of cases with no inappropriate leakage in either answer or reasoning
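As a rough sketch of how these two scores could be computed from per-prompt evaluation results (the data frame and column names below are hypothetical, not part of the released files):

```r
# results: one row per prompt, with logical columns for whether the sharing
# decision was correct and whether private data leaked in the final answer
# or in the reasoning trace.
utility_score <- 100 * mean(results$decision_correct)
privacy_score <- 100 * mean(!results$leak_in_answer & !results$leak_in_reasoning)
```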
📥 Clone via Hugging Face CLI

```bash
huggingface-cli download --repo-type dataset parameterlab/leaky_thoughts
```
Introduction
Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time-consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics, making them more expensive, less portable, and limited to a reduced field of view (FOV) compared to fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.
We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.
In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".
Fundus Imaging
We include 45-degree macula-centered fundus images that cover both the macula and optic disc. All images were acquired using an OptoVue iVue fundus camera without pupil dilation.
The full images are available in the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/fundus/disc and cropped/fundus/macula.
Enface OCT-A
We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low quality images with errors in the retina layer segmentations were not included.
En-face OCT-A images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images where only vessels are included. This has been performed automatically using the ROSE algorithm (Ma et al., 2021). These can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh; the former contains the probabilities of the ROSE algorithm, the latter a binary map.
Synthetic OCT-A
We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.
The full images are available in the fov45/synthetic_octa directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we applied the same ROSE denoising algorithm (Ma et al., 2021) used for the original en-face OCT-A images; the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh, the former containing the probabilities of the ROSE algorithm, the latter a binary map.
Other Fundus Vessel Segmentations Included
In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations: SA-Unet (Guo et al., 2021) and IterNet (Li et al., 2020).
SA-Unet. The full images are available in the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.
IterNet. The full images are available in the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/Iternet/disc and cropped/Iternet/macula.
Train/Validation/Test Replication
In order to replicate or compare your model to the results of our paper, we report below the data split used.
Training subjects IDs: 1 - 25
Validation subjects IDs: 26 - 30
Testing subjects IDs: 31 - 112
Data Acquisition
This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.
User Agreement
The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In case of any publication, the following paper needs to be cited:
Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
Funding
This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.
Research Team and Acknowledgements
Here are the people behind this data acquisition effort:
Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo
We would also like to acknowledge for their support: the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at University of Dundee, UK and Memorial Hermann Hospital System.
References
Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346.
L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621.
Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code to generate synthetic data. (R 1 kb)
We created a synthetic administrative dataset to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin); the dataset mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography, and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-White) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
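A rough R sketch of the perturbation step described above (all object and column names are hypothetical, and the real proportions differ by ethnic group and gender) might look like:

```r
set.seed(42)

# Perturb one ethnic-group x gender cell of the census extract: delete some
# records, move some to a different geography, and duplicate some into a
# different geography, according to pre-specified proportions.
perturb_cell <- function(cell, p_delete, p_move, p_duplicate, geographies) {
  del <- sample(nrow(cell), round(p_delete * nrow(cell)))
  if (length(del) > 0) cell <- cell[-del, ]

  mv <- sample(nrow(cell), round(p_move * nrow(cell)))
  cell$geography[mv] <- sample(geographies, length(mv), replace = TRUE)

  dup <- cell[sample(nrow(cell), round(p_duplicate * nrow(cell))), ]
  dup$geography <- sample(geographies, nrow(dup), replace = TRUE)

  rbind(cell, dup)
}
# The function would be applied to each ethnic-group x gender subset of the
# synthetic census records, and the perturbed subsets recombined.
```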
https://spdx.org/licenses/etalab-2.0.html
This folder contains an R script to generate synthetic tables (at seasonal and annual scales) of climate indicators relevant to supporting vegetable farmers as they consider adaptation to climate change in the short term (2021-2040) and longer term (2060). The script takes as input climate projections from the DRIAS portal. An example of input data and output tables is given for the Saclay area (Essonne, France). This work was carried out within the project "CLIMALEG: adaptation des producteurs de légumes au changement climatique", from 2021 to 2022 in the Île-de-France region, France. The files are organized into folders, so to see them, use the "Tree" view.
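As an illustration of the kind of seasonal indicator such a script computes (this is not the distributed script; the data frame proj and its columns are hypothetical stand-ins for daily DRIAS projections):

```r
# proj: daily climate projections with columns date (Date) and tasmax
# (daily maximum temperature, degrees C).
proj$year   <- as.integer(format(proj$date, "%Y"))
proj$month  <- as.integer(format(proj$date, "%m"))
proj$season <- c("DJF", "DJF", "MAM", "MAM", "MAM", "JJA",
                 "JJA", "JJA", "SON", "SON", "SON", "DJF")[proj$month]

# One seasonal indicator: number of hot days (tasmax > 30 degrees C) per year and season
hot_days <- aggregate(tasmax ~ year + season, data = proj,
                      FUN = function(x) sum(x > 30))
names(hot_days)[3] <- "days_above_30C"
```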
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of observed SRTR and synthetically generated candidate populations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper (currently under submission process), which outlines the complete details of the dataset and the methodology used for its generation:
Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.
This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes the following flow statistics commonly used in the literature on network traffic classification:
To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:
All at Once
Balanced Traffic Generation
DDoS at Intervals
Only Benign HH Traffic
Only DDoS Traffic
Only Normal Traffic
Unbalanced Traffic Generation
For each variation, the output of each packet aggregator is provided in its own folder.
Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the replication dataset for "A Synthetic Control Analysis of Philadelphia, Pennsylvania’s Excise Tax on Sugar-Sweetened and Artificially Sweetened Beverages and Supplemental Nutrition Assistance Program Benefit Redemption", published in the American Journal of Public Health (accepted 18-Jun-2021). Included are analyses (.csv), codebook (.docx), Supplemental Materials (.docx), and R code used to generate synthetic controls and conduct robustness checks (.R).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This CO2 data is available in the R package `fields', except that the data here has been converted into a more user-friendly long-table format. The package contains the following data description: this is an example of a moderately large spatial data set consisting of simulated CO2 concentrations that are irregularly sampled from a lon/lat grid. Also included is the complete CO2 field used to generate the synthetic observations. These data were generously provided by Dorit Hammerling and Randy Kawa as a test example for the spatial analysis of remotely sensed (i.e., satellite) and irregular observations. The synthetic data are based on a true CO2 field simulated from a geophysical, numerical model.

Format:
co2sim.csv - the CSV file has three columns: lon (longitude coordinate), lat (latitude coordinate), and z (CO2 concentration in parts per million).
co2true.csv - the CSV file has three columns: lon (longitude coordinate).
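For example, the long-table file can be read and displayed directly with the 'fields' package (file name as in the Format description above; the plotting call is just one convenient option):

```r
library(fields)

co2sim <- read.csv("co2sim.csv")
# quilt.plot() grids irregular lon/lat observations onto a regular grid for display
quilt.plot(co2sim$lon, co2sim$lat, co2sim$z,
           main = "Simulated CO2 concentrations (ppm)")
```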
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The synthetic event log described in this document consists of 25,000 traces, generated using the process model outlined in Geyer et al. (2024) [1] and the DALG tool [2]. This event log simulates the treatment process of malignant melanoma patients, adhering to clinical guidelines. Each trace in the log represents a unique patient journey through various stages of melanoma treatment, providing detailed insights into decision points, treatments, and outcomes.
The DALG tool [2] was employed to generate this data-aware event log, ensuring realistic data distribution and variability.
DALG: https://github.com/DavidJilg/DALG
[1] Geyer, T., Grüger, J., & Kuhn, M. (2024). Clinical Guideline-based Model for the Treatment of Malignant Melanoma (Data Petri Net) (1.0). Zenodo. https://doi.org/10.5281/zenodo.10785431
[2] Jilg, D., Grüger, J., Geyer, T., Bergmann, R.: DALG: the data aware event log generator. In: BPM 2023 - Demos & Resources. CEUR Workshop Proceedings, vol. 3469, pp. 142–146. CEUR-WS.org (2023)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Motivated by the challenge of applying deep learning in the low-data regime and the urgent demand for the intelligent design of highly energetic materials, we explore a correlated deep learning framework, which consists of three recurrent neural networks (RNNs) correlated by a transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity when only very limited data are available. To avoid dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is utilized to produce 500,000 molecules to pretrain the RNN, through which the model can learn sufficient structural knowledge. The pretrained RNN is then fine-tuned on the 303 energetic compounds to generate 7,153 molecules similar to the energetic compounds. In order to more reliably screen the molecules with a high detonation velocity, SMILES enumeration augmentation coupled with the pretrained knowledge is utilized to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. The comparable performance with the transfer learning strategy based on an existing big database (ChEMBL) to produce energetic molecules and drug-like ones further supports the effectiveness and generality of our strategy in the low-data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in detonation velocity. All the source codes and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Geographic location of the outer margin of the Antarctic ice cover for the sector between longitudes 79E and 108E, including margins of ice shelves, glaciers, and iceberg tongues. The data set does not in general include the grounding zone at the inland margin of the ice shelves or glaciers.

The margin was defined by interpretation of an image mosaic generated from Synthetic Aperture Radar data. The image mosaic was built using navigation data accompanying the SAR images to transform the images to a map projection. The image navigation data were adjusted so that overlapping images were registered to one another, the individual images were merged into a mosaic, and the overall process was adjusted so that the mosaic was tied to the few ground control points available in this large sector. Two separate mosaics were used to span the whole sector.

The majority of the SAR data were acquired by the ERS-SAR instruments in August 1996, some ERS data were acquired in August 1993, and one Radarsat scene was acquired in September 1997. The data were pre-processed to produce a mosaic with a 100 m pixel size, and adjusted so that the majority of the coastline positions refer to the August 1996 epoch.

The location data are internally consistent, and extracted at nominally 200 m intervals. The external position accuracy is generally better than 600 m. The coverage is complete over the whole sector. The coordinate set includes some island/ice rise features. Two very large grounded icebergs are included.

Data are in an ASCII Arc/Info export file format as geographic coordinates on the ITRF1996 system and contain attribute information.

ERS-SAR data, copyright ESA, 1993, 1996.

Radarsat data, copyright Canadian Space Agency / Agence spatiale canadienne, 1997.

This work was completed as part of ASAC projects 454, 1125 and 2224 (ASAC_454, ASAC_1125 and ASAC_2224).