35 datasets found
  1. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World, World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
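    The R script actually used to draw the sample is distributed with the dataset as an external resource; the snippet below is only a minimal base-R sketch of the two-stage design described above, assuming a household frame `frame` with columns geo_1, urban, and ea_id, and at least 25 households per enumeration area.

```r
set.seed(2023)

n_households <- 8000
hh_per_ea    <- 25
n_ea_total   <- n_households / hh_per_ea        # 320 enumeration areas in total

# Stage 1: allocate EAs to strata (geo_1 x urban/rural) proportionally to the
# number of households in each stratum, then draw that many EAs per stratum.
frame$stratum <- interaction(frame$geo_1, frame$urban, drop = TRUE)
stratum_size  <- table(frame$stratum)
ea_alloc      <- round(n_ea_total * stratum_size / sum(stratum_size))

selected_eas <- unlist(lapply(names(ea_alloc), function(s) {
  eas <- unique(frame$ea_id[frame$stratum == s])
  eas[sample.int(length(eas), min(ea_alloc[[s]], length(eas)))]
}))

# Stage 2: select a fixed 25 households at random within each selected EA.
sampled_hh <- do.call(rbind, lapply(selected_eas, function(ea) {
  hh <- frame[frame$ea_id == ea, ]
  hh[sample(nrow(hh), hh_per_ea), ]
}))
```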

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  2. Simulation and Analysis Code for "Adversarial Obstacle Placement under...

    • zenodo.org
    zip
    Updated Jun 10, 2025
    Cite
    Elvan Ceyhan; Elvan Ceyhan; Li Zhou; Li Zhou; Polat Charyyev; Polat Charyyev (2025). Simulation and Analysis Code for "Adversarial Obstacle Placement under Spatial Point Processes for Optimal Path Disruption" [Dataset]. http://doi.org/10.5281/zenodo.15634344
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elvan Ceyhan; Elvan Ceyhan; Li Zhou; Li Zhou; Polat Charyyev; Polat Charyyev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive contains simulation and analysis code used in the study "[Manuscript Title]" submitted to International Journal of Geo-Information (IJGI). The research investigates pathfinding through stochastic obstacle environments using fully synthetic data generated via spatial point processes and parameterized simulation models.

    The simulation/ folder includes R scripts to generate the traversal cost data under various configurations, while the analysis/ folder provides code for statistical modeling (robust regression, random forest, and zero-inflated negative binomial regression) and visualization of the results. The results/ folder contains sample simulation outputs and figures used in the manuscript. All data used in the paper can be exactly reproduced using the provided code.

    For inquiries or reproducibility questions, please contact the corresponding author.

  3. FU-LoRA: Synthetic Fetal Ultrasound Images for Standard Anatomical Planes...

    • zenodo.org
    zip
    Updated Aug 7, 2024
    Cite
    Fangyijie Wang; Fangyijie Wang (2024). FU-LoRA: Synthetic Fetal Ultrasound Images for Standard Anatomical Planes Classification [Dataset]. http://doi.org/10.48550/arxiv.2407.20072
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fangyijie Wang; Fangyijie Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 21, 2024
    Description

    Abstract

    Developing robust deep learning models for fetal ultrasound image analysis requires comprehensive, high-quality datasets to effectively learn informative data representations within the domain. However, the scarcity of labelled ultrasound images poses substantial challenges, especially in low-resource settings. To tackle this challenge, we leverage synthetic data to enhance the generalizability of deep learning models. This study proposes a diffusion-based method, Fetal Ultrasound LoRA (FU-LoRA), which involves fine-tuning latent diffusion models using the LoRA technique to generate synthetic fetal ultrasound images. These synthetic images are integrated into a hybrid dataset that combines real-world and synthetic images to improve the performance of zero-shot classifiers in low-resource settings. Our experimental results on fetal ultrasound images from African cohorts demonstrate that FU-LoRA outperforms the baseline method by a 13.73% increase in zero-shot classification accuracy. Furthermore, FU-LoRA achieves the highest accuracy of 82.40%, the highest F-score of 86.54%, and the highest AUC of 89.78%. It demonstrates that the FU-LoRA method is effective in the zero-shot classification of fetal ultrasound images in low-resource settings. Our code and data are publicly accessible on GitHub.

    Method

    Our FU-LoRA method fine-tunes a pre-trained latent diffusion model (LDM) [2] with the LoRA technique on a small fetal ultrasound dataset from high-resource settings (HRS). This approach integrates synthetic images to enhance the generalization and performance of deep learning models. We conduct three fine-tuning sessions for the diffusion model to generate three LoRA models with different hyper-parameters: alpha in [8, 32, 128] and r in [8, 32, 128]. The merging rate alpha/r is fixed at 1. The purpose of this is to delve deeper into LoRA to uncover optimizations that can improve the model's performance and to evaluate the effectiveness of the parameter r in generating synthetic images.

    Datasets

    The Spanish dataset (URL) in HRS includes 1,792 patient records in Spain [1]. All images were acquired during screening in the second and third trimesters of pregnancy using six different machines operated by operators with similar expertise. We randomly selected 20 Spanish ultrasound images from each of the five maternal–fetal planes (Abdomen, Brain, Femur, Thorax, and Other) to fine-tune the LDM using the LoRA technique, and 1,150 Spanish images (230 x 5 planes) to create the hybrid dataset. In summary, fine-tuning the LDM utilizes 100 images from 85 patients. Training downstream classifiers uses 6,148 images from 612 patients. Within the 6,148 images used for training, a subset of 200 images is randomly selected for validation purposes. The hybrid dataset employed in this study has a total of 1,150 Spanish images, representing 486 patients.

    We create the synthetic dataset comprising 5000 fetal ultrasound images (500 x 2 samplers x 5 planes) accessible to the open-source community. The generation process utilizes our LoRA model Rank r = 128 with Euler and UniPC samplers known for their efficiency. Subsequently, we integrate this synthetic dataset with a small amount of Spanish data to create a hybrid dataset.

    Implementation Details

    The hyper-parameters of the LoRA models are as follows: batch size 2; LoRA learning rate 1e-4; total training steps 10,000; LoRA dimension 128; mixed precision fp16; learning-rate scheduler constant; and input size (resolution) 512. The model is trained on a single NVIDIA RTX A5000 (24 GB) with the 8-bit Adam optimizer in PyTorch.

  4. The development of a synthetic dataset of women at risk of readmission...

    • borealisdata.ca
    Updated Feb 18, 2025
    + more versions
    Cite
    Obed Twinamatsiko; Vuong Nguyen; Matthew O Wiens (2025). The development of a synthetic dataset of women at risk of readmission following stillbirth deliveries in Uganda [Dataset]. http://doi.org/10.5683/SP3/W3TBZY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Borealis
    Authors
    Obed Twinamatsiko; Vuong Nguyen; Matthew O Wiens
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Uganda
    Dataset funded by
    Canadian Institutes of Health Research
    Lacuna Fund
    Michael Smith Health Research BC
    Description

    Background: In 2020, 287,000 mothers died from complications of pregnancy or childbirth; one-third of these deaths (30%) occurred during the first 6 weeks after birth. Precision public health approaches leverage risk prediction to identify the most vulnerable patients and inform decisions around the use of scarce resources, including the frequency, intensity, and type of postnatal care follow-up visits. However, these approaches may not accurately or precisely predict risk for specific sub-groups of women who are statistically underrepresented in the total population, such as women who experience stillbirths.

    Methods: We leverage our existing dataset of sociodemographic and clinical variables and health outcomes for mother and baby dyads in Uganda to generate a synthetic dataset to enhance our risk prediction model for identifying women at high risk of death or readmission in the 6 weeks after a hospital delivery.

    Data Collection Methods: The original mom and baby project data were collected at the point of care using encrypted study tablets and were then uploaded to a Research Electronic Data Capture (REDCap) database hosted at the BC Children's Hospital Research Institute (Vancouver, Canada). Following delivery and obtaining informed written consent, trained study nurses collected data grouped according to four periods of care: admission, delivery, discharge, and six-week post-discharge follow-up. Data from admission and delivery were captured from the hospital medical record where possible, and by direct observation, direct measurement, or patient interview when not. Discharge and post-discharge data were collected by observation, measurement, or interview. Six weeks after delivery, field officers contacted every mother and/or caregiver of newborns who survived to discharge to determine vital status, readmission, and care-seeking for illnesses and routine postnatal care. In-person visits were completed in situations where participants could not be reached by phone. Mothers who had experienced a stillbirth were filtered from the overall dataset. The synthetic dataset was subsequently based on this stillbirth cohort and evaluated to ensure its statistical properties were maintained.

    Data Processing Methods: Synthetic data and evaluation metrics were generated using the synthpop R package. The first variable (column) in the dataset is generated via random sampling with replacement, with subsequent variables generated conditional on all previously synthesized variables using a pre-specified algorithm. We used the classification and regression tree (CART) algorithm as it is non-parametric and compatible with all data types (continuous, categorical, ordinal). Additional setup for generating the synthetic dataset included identifying eligible and relevant variables for synthesis and outlining rules for variables that have branching logic (i.e., variables that are only entered if a previous variable has a specific response). For evaluation, we used the utility metric recommended by the authors of the synthpop package, the standardized propensity-score mean squared error (pMSE) ratio, which measures how easy it is to tell whether a data point comes from the original or the synthetic dataset. All the standardized pMSE ratios were below 10, the suggested cut-off for acceptable utility proposed by the synthpop authors. Plots were also generated to visually compare the univariate distribution of each variable in the synthetic dataset against the original dataset.

    Ethics Declaration: Ethics approvals have been obtained from the Makerere University School of Public Health (MakSPH) Institutional Review Board (SPH-2021-177), the Uganda National Council of Science and Technology (UNCST) in Uganda (HS2174ES), and the University of British Columbia in Canada (H21-03709). This study has been registered at clinicaltrials.gov (NCT05730387).

    Abbreviations: JRRH: Jinja Regional Referral Hospital; MRRH: Mbarara Regional Referral Hospital; PNC: Post-natal care; SES: Socio-economic index; SpO2: Oxygen saturation.

    Study Protocol & Supplementary Materials: Smart Discharges for Mom & Baby 2.0: A cohort study to develop prognostic algorithms for post-discharge readmission and mortality among mother-infant dyads.

    NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator at sepsiscolab@bcchr.ca or visit our website.
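    As a rough illustration of the synthpop workflow described above (not the study code), the sketch below assumes the restricted stillbirth cohort sits in a data frame `stillbirth_df`; the real synthesis additionally restricted the variable set and encoded the branching-logic rules.

```r
library(synthpop)

# Illustrative only: `stillbirth_df` stands in for the (restricted) stillbirth
# cohort; the study also passed branching-logic constraints via syn()'s
# rules/rvalues arguments.
syn_obj <- syn(stillbirth_df, method = "cart", seed = 2025)

# Utility check: standardized propensity-score MSE ratio (values below ~10 are
# the cut-off for acceptable utility suggested by the synthpop authors).
u <- utility.gen(syn_obj, stillbirth_df)
u$S_pMSE

# Visual comparison of univariate distributions, synthetic vs. original.
compare(syn_obj, stillbirth_df)
```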

  5. CSU Synthetic Attribution Benchmark Dataset

    • cmr.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). CSU Synthetic Attribution Benchmark Dataset [Dataset]. http://doi.org/10.34911/rdnt.8snx6c
    Explore at:
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    This is a synthetic dataset that can be used by users interested in benchmarking methods of explainable artificial intelligence (XAI) for geoscientific applications. The dataset is specifically inspired by a climate forecasting setting (seasonal timescales) where the task is to predict regional climate variability given global climate information lagged in time. The dataset consists of a synthetic input X (a series of 2D arrays of random fields drawn from a multivariate normal distribution) and a synthetic output Y (a scalar series) generated by a nonlinear function F: R^d -> R.

    The synthetic input aims to represent temporally independent realizations of anomalous global fields of sea surface temperature, the synthetic output series represents some type of regional climate variability that is of interest (temperature, precipitation totals, etc.) and the function F is a simplification of the climate system.

    Since the nonlinear function F that is used to generate the output given the input is known, we also derive and provide the attribution of each output value to the corresponding input features. Using this synthetic dataset users can train any AI model to predict Y given X and then implement XAI methods to interpret it. Based on the “ground truth” of attribution of F the user can assess the faithfulness of any XAI method.

    NOTE: the spatial configuration of the observations in the NetCDF database file conforms to the planetocentric coordinate system (89.5N - 89.5S, 0.5E - 359.5E), where longitude is measured positively heading east from the prime meridian.
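    For orientation only, the sketch below is not the benchmark's generation code; it mimics the structure described above (multivariate-normal input fields, a known additive nonlinear F, and exact per-feature attributions) on a toy 8 x 8 grid.

```r
library(MASS)  # mvrnorm()

set.seed(1)
n_cell <- 8 * 8          # toy 8 x 8 "global" grid (the real benchmark is larger)
n_obs  <- 1000

# Spatially correlated Gaussian fields via an exponential covariance.
grid  <- expand.grid(x = 1:8, y = 1:8)
Sigma <- exp(-as.matrix(dist(grid)) / 3)
X     <- mvrnorm(n_obs, mu = rep(0, n_cell), Sigma = Sigma)

# A known nonlinear map F: additive and piecewise-linear per grid cell, so the
# exact attribution of Y to each input feature is simply the per-cell term.
beta_pos <- runif(n_cell, 0.5, 1.5)
beta_neg <- runif(n_cell, 0.5, 1.5)
contrib  <- ifelse(X > 0, X * beta_pos[col(X)], X * beta_neg[col(X)])
Y        <- rowSums(contrib)

attribution <- contrib   # "ground truth" to score XAI methods against
```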

  6. Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Nov 29, 2024
    Cite
    Little, C.; Elliott, M.; Allmendinger, R., University of Manchester (2024). Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data [Dataset]. http://doi.org/10.5255/UKDA-SN-9282-1
    Explore at:
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    University of Manchester
    Manchester Business School
    Authors
    Little, C.; Elliott, M.; Allmendinger, R., University of Manchester
    Time period covered
    Jan 1, 2023 - Dec 31, 2023
    Area covered
    England and Wales
    Variables measured
    Individuals, National
    Measurement technique
    Compilation/Synthesis
    Description

    Abstract copyright UK Data Service and data collection copyright owner.


    The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset to do so, and instead using only publicly available analytical output (i.e. output that was cleared for publication) to create the synthetic data. Such synthetic data may allow users to gain familiarity with and practise on data that is like the original before they gain access to the original data (where time in a secure setting may be limited).

    The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide.


    Main Topics:

    Variables available in this study relate to synthetic employment, earnings and demographic information for adults employed in England and Wales in 2011.

  7. The Residential Population Generator (RPGen): A tool to parameterize...

    • catalog.data.gov
    Updated Mar 9, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). The Residential Population Generator (RPGen): A tool to parameterize residential, demographic, and physiological data to model intraindividual exposure, dose, and risk [Dataset]. https://catalog.data.gov/dataset/the-residential-population-generator-rpgen-a-tool-to-parameterize-residential-demographic-
    Explore at:
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This repository contains scripts, input files, and some example output files for the Residential Population Generator (RPGen), an R-based tool to generate synthetic human residential populations for use in making estimates of near-field chemical exposures. This tool is most readily adapted for use in the workflow of CHEM, the Combined Human Exposure Model, available in two other GitHub repositories in the HumanExposure project, including ProductUseScheduler and source2dose. CHEM is currently best suited to estimating exposure to product use. Outputs from RPGen are passed to ProductUseScheduler, whose outputs are in turn used in source2dose.

  8. Data for: A principled approach to synthesize neuroimaging data for...

    • musc.digitalcommonsdata.com
    Updated Apr 26, 2021
    + more versions
    Cite
    Kenneth Vaden (2021). Data for: A principled approach to synthesize neuroimaging data for replication and exploration [Dataset]. http://doi.org/10.17632/3w9662wjpr.1
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Kenneth Vaden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The synthetic predictor tables and fully synthetic neuroimaging data produced for the analysis of fully synthetic data in the current study are available as Research Data from Mendeley Data. Ten fully synthetic datasets include synthetic gray matter images (NIfTI files) that were generated for analysis with simulated participant data (text files). An archive file, predictor_tables.tar.gz, contains ten fully synthetic predictor tables with information for 264 simulated subjects. Due to large file sizes, a separate archive was created for each set of synthetic gray matter image data: RBS001.tar.gz, …, RBS010.tar.gz. Regression analyses were performed for each synthetic dataset, then average statistic maps were made for each contrast, which were then smoothed (see accompanying paper for additional information).
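    As a sketch of the pooling step described above (not the distributed SKexample code), the per-dataset statistic maps for a contrast could be averaged voxel-wise in R; the file names below are hypothetical.

```r
library(oro.nifti)

# Hypothetical file names: one statistic map per synthetic dataset, one contrast.
stat_files <- sprintf("RBS%03d_contrast1_tmap.nii.gz", 1:10)
maps       <- lapply(stat_files, readNIfTI, reorient = FALSE)

# Voxel-wise average across the ten synthetic-data analyses.
mean_arr <- Reduce(`+`, lapply(maps, function(m) m@.Data)) / length(maps)

avg_map       <- maps[[1]]        # reuse geometry/header of the first map
avg_map@.Data <- mean_arr         # calibration slots left stale; fine for a sketch
writeNIfTI(avg_map, filename = "contrast1_tmap_average")
# Spatial smoothing of the averaged map is applied separately (see paper).
```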

    The supplementary materials also include commented MATLAB and R code to implement the current neuroimaging data synthesis methods (SKexample.zip). The example data were selected from an earlier fMRI study (Kuchinsky et al., 2012) to demonstrate that the current approach can be used with other types of neuroimaging data. The example code can also be adapted to produce fully synthetic group-level datasets based on observed neuroimaging data from other sources. The zip archive includes a document with important information for performing the example analyses, and details that should be communicated with recipients of a synthetic neuroimaging dataset.

    Kuchinsky, S.E., Vaden, K.I., Keren, N.I., Harris, K.C., Ahlstrom, J.B., Dubno, J.R., Eckert, M.A., 2012. Word intelligibility and age predict visual cortex activity during word listening. Cerebral Cortex 22, 1360–71. https://doi.org/10.1093/cercor/bhr211

  9. Comparison of observed SRTR and synthetically generated candidate...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Mar 21, 2024
    Cite
    Paul R. Gunsalus; Johnie Rose; Carli J. Lehr; Maryam Valapour; Jarrod E. Dalton (2024). Comparison of observed SRTR and synthetically generated candidate populations. [Dataset]. http://doi.org/10.1371/journal.pone.0296839.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Paul R. Gunsalus; Johnie Rose; Carli J. Lehr; Maryam Valapour; Jarrod E. Dalton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of observed SRTR and synthetically generated candidate populations.

  10. AirGapAgent-R Dataset

    • paperswithcode.com
    Updated Jun 20, 2025
    Cite
    Tommaso Green; Martin Gubri; Haritz Puerto; Sangdoo Yun; Seong Joon Oh (2025). AirGapAgent-R Dataset [Dataset]. https://paperswithcode.com/dataset/airgapagent-r
    Explore at:
    Dataset updated
    Jun 20, 2025
    Authors
    Tommaso Green; Martin Gubri; Haritz Puerto; Sangdoo Yun; Seong Joon Oh
    Description

    AirGapAgent-R 🛡️🧠 A Benchmark for Evaluating Contextual Privacy of Personal LLM Agents

    Code Repository: parameterlab/leaky_thoughts
    Paper: Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
    Original paper detailing the procedure used to create the dataset: AirGapAgent: Protecting Privacy-Conscious Conversational Agents (Bagdasarian et al.)

    🧠 What is AirGapAgent-R? AirGapAgent-R is a probing benchmark designed to test contextual privacy in personal LLM agents, reconstructed from the original (unreleased) benchmark used in the AirGapAgent paper (Bagdasarian et al.). It simulates real-world data-sharing decisions where models must reason about whether user-specific data (e.g., age, medical history) should be revealed based on a specific task context.

    The procedure used to create the dataset is detailed in Appendix C of our paper (see below).

    📦 Dataset Structure

    Profiles: 20 synthetic user profiles
    Fields per Profile: 26 personal data fields (e.g., name, phone, medication)
    Scenarios: 8 task contexts (e.g., doctor appointment, travel booking)
    Total Prompts: 4,160 (user profile × scenario × question)

    Each example includes:

    The user profile
    The scenario context
    The domain
    The data field that the model should consider whether to share or not
    A ground-truth label (should share / should not share the specific data field)

    The prompt field is left empty, as the prompts depend on the specific model / reasoning type being used. All available prompts are in the prompts folder of the code repository (parameterlab/leaky_thoughts).

    We also include a smaller variant used in some of our experiments (e.g., in RAnA experiments) together with the two datasets used in the swapping experiments detailed in Appendix A.3 of our paper.

    🧪 Use Cases Use this dataset to evaluate:

    Reasoning trace privacy leakage
    Trade-offs between utility (task performance) and privacy
    Prompting strategies and anonymization techniques
    Susceptibility to prompt injection and reasoning-based attacks

    📊 Metrics In the associated paper, we evaluate:

    Utility Score: % of correct data sharing decisions
    Privacy Score: % of cases with no inappropriate leakage in either answer or reasoning

    📥 Clone via Hugging Face CLI:

    huggingface-cli download --repo-type dataset parameterlab/leaky_thoughts

  11. UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA)

    • zenodo.org
    bin, zip
    Updated Dec 11, 2023
    Cite
    Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo; Luca Giancardo; Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth (2023). UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA) [Dataset]. http://doi.org/10.5281/zenodo.6476639
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo; Luca Giancardo; Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth
    Description

    Introduction

    Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time-consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics, making them more expensive, less portable, and limited to a reduced field of view (FOV) compared to fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.

    We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.

    In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".

    Fundus Imaging

    We include 45-degree macula-centered fundus images that cover both the macula and the optic disc. All images were acquired using an OptoVue iVue fundus camera without pupil dilation.

    The full images are available at the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/fundus/disc and cropped/fundus/macula.

    Enface OCT-A

    We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low quality images with errors in the retina layer segmentations were not included.

    En-face OCT-A images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images where only vessels are included. This was performed automatically using the ROSE algorithm (Ma et al. 2021). These can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh; the former contains the probabilities from the ROSE algorithm, the latter a binary map.

    Synthetic OCT-A

    We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.

    The full images are available in the fov45/synthetic_octa directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we applied the same ROSE denoising algorithm (Ma et al. 2021) used for the original en-face OCT-A images; the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh, where the former contains the probabilities from the ROSE algorithm and the latter a binary map.

    Other Fundus Vessel Segmentations Included

    In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations: SA-UNet (Guo et al., 2021) and IterNet (Li et al., 2020).

    • SA-Unet. The full images are available at the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.

    • IterNet. The full images are available at the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/Iternet/disc and cropped/Iternet/macula.

    Train/Validation/Test Replication

    In order to replicate or compare your model to the results of our paper, we report below the data split used.

    • Training subjects IDs: 1 - 25

    • Validation subjects IDs: 26 - 30

    • Testing subjects IDs: 31 - 112

    Data Acquisition

    This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.

    User Agreement

    The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In the case of any publication, the following paper needs to be cited:

    
    Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
    

    Funding

    This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.

    Research Team and Acknowledgements

    Here are the people behind this data acquisition effort:

    Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo

    We would also like to acknowledge for their support: the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at University of Dundee, UK and Memorial Hermann Hospital System.

    References

    Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
    
    
    C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346.
    
    L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621.
    
    Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.
    
  12. Additional file 7 of Bayesian modeling of plant drought resistance pathway

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    + more versions
    Cite
    Aditya Lahiri; Priyadharshini S. Venkatasubramani; Aniruddha Datta (2023). Additional file 7 of Bayesian modeling of plant drought resistance pathway [Dataset]. http://doi.org/10.6084/m9.figshare.7842716.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aditya Lahiri; Priyadharshini S. Venkatasubramani; Aniruddha Datta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code to generate synthetic data. (R 1 kb)

  13. Synthetic XES Event Log of Malignant Melanoma Treatment

    • data.niaid.nih.gov
    Updated Sep 23, 2024
    Cite
    Kuhn, Martin (2024). Synthetic XES Event Log of Malignant Melanoma Treatment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13828518
    Explore at:
    Dataset updated
    Sep 23, 2024
    Dataset provided by
    Grüger, Joscha
    Kuhn, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The synthetic event log described in this document consists of 25,000 traces, generated using the process model outlined in Geyer et al. (2024) [1] and the DALG tool [2]. This event log simulates the treatment process of malignant melanoma patients, adhering to clinical guidelines. Each trace in the log represents a unique patient journey through various stages of melanoma treatment, providing detailed insights into decision points, treatments, and outcomes.

    The DALG tool [2] was employed to generate this data-aware event log, ensuring realistic data distribution and variability.

    DALG: https://github.com/DavidJilg/DALG

    [1] Geyer, T., Grüger, J., & Kuhn, M. (2024). Clinical Guideline-based Model for the Treatment of Malignant Melanoma (Data Petri Net) (1.0). Zenodo. https://doi.org/10.5281/zenodo.10785431

    [2] Jilg, D., Grüger, J., Geyer, T., Bergmann, R.: DALG: the data aware event log generator. In: BPM 2023 - Demos & Resources. CEUR Workshop Proceedings, vol. 3469, pp. 142–146. CEUR-WS.org (2023)

  14. Data from: Synthetic Administrative Data: Census 1991, 2023

    • beta.ukdataservice.ac.uk
    Updated 2023
    Cite
    UK Data Service (2023). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/ukda-sn-856310
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    datacite
    Description

    We create a synthetic administrative dataset that mimics the properties of a real administrative dataset, according to specifications by the ONS, to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin). Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography, and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
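    An illustrative sketch of the perturbation steps described above is given below (not the production code). It assumes the synthetic census extract is a data frame `census` with columns id, geography, ethnic_group ("White"/"Non-white"), and sex; the proportions shown are placeholders, not the ONS-specified ones.

```r
library(dplyr)

perturb_group <- function(df, p_delete, p_move, p_duplicate, geographies) {
  # Delete a share of records.
  idx_del <- sample(nrow(df), round(p_delete * nrow(df)))
  if (length(idx_del) > 0) df <- df[-idx_del, ]

  # Move a share of records to a different geography.
  idx_move <- sample(nrow(df), round(p_move * nrow(df)))
  df$geography[idx_move] <- sample(geographies, length(idx_move), replace = TRUE)

  # Duplicate a share of records into a different geography.
  dup <- df[sample(nrow(df), round(p_duplicate * nrow(df))), ]
  dup$geography <- sample(geographies, nrow(dup), replace = TRUE)

  bind_rows(df, dup)
}

admin <- census %>%
  group_by(ethnic_group, sex) %>%
  group_modify(~ perturb_group(.x,
                               p_delete = 0.02, p_move = 0.03, p_duplicate = 0.01,
                               geographies = unique(census$geography))) %>%
  ungroup()
```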

  15. Data from: Correlated RNN Framework to Quickly Generate Molecules with...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu (2023). Correlated RNN Framework to Quickly Generate Molecules with Desired Properties for Energetic Materials in the Low Data Regime [Dataset]. http://doi.org/10.1021/acs.jcim.2c00997.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Motivated by the challenge of applying deep learning in the low-data regime and the urgent demand for intelligent design of highly energetic materials, we explore a correlated deep learning framework, which consists of three recurrent neural networks (RNNs) correlated by a transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity when very limited data are available. To avoid dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is utilized to produce 500,000 molecules to pretrain the RNN, through which the model can learn sufficient structural knowledge. The pretrained RNN is then fine-tuned on the 303 energetic compounds to generate 7,153 molecules similar to the energetic compounds. In order to more reliably screen the molecules with a high detonation velocity, SMILES enumeration augmentation coupled with the pretrained knowledge is utilized to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. The comparable performance of a transfer learning strategy based on an existing big database (ChEMBL) in producing energetic molecules and drug-like ones further supports the effectiveness and generality of our strategy in the low-data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in detonation velocity. All the source code and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.

  16. Network Digital Twin-Generated Dataset for Machine Learning-based Detection...

    • zenodo.org
    zip
    Updated Feb 17, 2025
    + more versions
    Cite
    Amit Karamchandani Batra; Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Alberto Mozo Velasco; Antonio Pastor Perales; Antonio Pastor Perales; Diego R. López; Diego R. López; Javier Nuñez Fuente; Yenny Moreno Meneses (2025). Network Digital Twin-Generated Dataset for Machine Learning-based Detection of Benign and Malicious Heavy Hitter Flows [Dataset]. http://doi.org/10.5281/zenodo.14841650
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amit Karamchandani Batra; Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Alberto Mozo Velasco; Antonio Pastor Perales; Antonio Pastor Perales; Diego R. López; Diego R. López; Javier Nuñez Fuente; Yenny Moreno Meneses
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 11, 2024
    Description

    Overview

    The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper (currently under submission process), which outlines the complete details of the dataset and the methodology used for its generation:

    Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.

    This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.

    To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

    Feature Set:

    The feature set includes the following flow statistics commonly used in the literature on network traffic classification:

    • The protocol used for the connection, identifying whether it is TCP, UDP, ICMP, or OSPF.
    • The time (relative to the connection start) of the most recent packet sent from source to destination at the time of each snapshot.
    • The time (relative to the connection start) of the most recent packet sent from destination to source at the time of each snapshot.
    • The cumulative count of data packets sent from source to destination at the time of each snapshot.
    • The cumulative count of data packets sent from destination to source at the time of each snapshot.
    • The cumulative bytes sent from source to destination at the time of each snapshot.
    • The cumulative bytes sent from destination to source at the time of each snapshot.
    • The time difference between the first packet sent from source to destination and the first packet sent from destination to source.

    Dataset Variations:

    To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:

    1. All at Once:

      1. Contains a synthetic dataset where all traffic types, including benign, normal, and malicious DDoS heavy hitter (HH) flows, are combined into a single dataset.
      2. This version represents a holistic view of the traffic environment, simulating real-world scenarios where all traffic occurs simultaneously.
    2. Balanced Traffic Generation:

      1. Represents a balanced traffic dataset with an equal proportion of benign, normal, and malicious DDoS traffic.
      2. Designed for scenarios where a balanced dataset is needed for fair training and evaluation of machine learning models.
    3. DDoS at Intervals:

      1. Contains traffic data where malicious DDoS HH traffic occurs at specific time intervals, mimicking real-world attack patterns.
      2. Useful for studying the impact and detection of intermittent malicious activities.
    4. Only Benign HH Traffic:

      1. Includes only benign HH traffic flows.
      2. Suitable for training and evaluating models to identify and differentiate benign heavy hitter traffic patterns.
    5. Only DDoS Traffic:

      1. Contains only malicious DDoS HH traffic.
      2. Helps in isolating and analyzing attack characteristics for targeted threat detection.
    6. Only Normal Traffic:

      1. Comprises only regular, non-HH traffic flows.
      2. Useful for understanding baseline network behavior in the absence of heavy hitters.
    7. Unbalanced Traffic Generation:

      1. Features an unbalanced dataset with varying proportions of benign, normal, and malicious traffic.
      2. Simulates real-world scenarios where certain types of traffic dominate, providing insights into model performance in unbalanced conditions.

    For each variation, the output of the different packet aggregators is provided separately, each in its respective folder.

    Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
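    Since the dataset is intended for training such classifiers, a baseline sketch in R is shown below; the file path, file layout, and the presence of a `label` column are assumptions about the CSV export, not part of the dataset documentation.

```r
library(randomForest)

# Assumed layout: one CSV per variation holding the flow statistics listed
# above plus a `label` column (benign HH / malicious HH / normal traffic).
flows <- read.csv("balanced_traffic_generation/flows.csv")   # hypothetical path
flows[] <- lapply(flows, function(x) if (is.character(x)) factor(x) else x)

set.seed(42)
train_idx <- sample(nrow(flows), floor(0.8 * nrow(flows)))

# Baseline random forest discriminating the three traffic classes.
rf <- randomForest(label ~ ., data = flows[train_idx, ], ntree = 200)

pred <- predict(rf, flows[-train_idx, ])
table(predicted = pred, actual = flows$label[-train_idx])
```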

  17. R script to generate climate indicators to support adaptation of vegetable...

    • entrepot.recherche.data.gouv.fr
    pdf, tsv, txt +1
    Updated Feb 13, 2023
    Cite
    Kevin Morel; Nabil Touili; Kevin Morel; Nabil Touili (2023). R script to generate climate indicators to support adaptation of vegetable farms [Dataset]. http://doi.org/10.57745/0BWBPD
    Explore at:
    Available download formats: tsv(435317), tsv(3187), txt(2496608), tsv(389), txt(33174), tsv(2015), txt(3848), txt(427799), type/x-r-syntax(33227), pdf(261998), tsv(22002), tsv(2561071), txt(35742), txt(4210), tsv(292), tsv(3385)
    Dataset updated
    Feb 13, 2023
    Dataset provided by
    Recherche Data Gouv
    Authors
    Kevin Morel; Nabil Touili; Kevin Morel; Nabil Touili
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    Conseil départemental de l'Essonne
    LEADER
    Labex BASC
    Description

    This folder contains an R script to generate synthetic tables (at seasonal and annual scale) of climate indicators relevant to supporting vegetable farmers in anticipating climate change in the short term (2021-2040) and the longer term (2060). The input data are climate projections from the DRIAS platform. An example of input data and output tables is given for the Saclay area (Essonne, France). This work was carried out in the framework of the project "CLIMALEG: adaptation des producteurs de légumes au changement climatique", from 2021 to 2022 in the Île-de-France region, France. The files are organized into folders, so use the "Tree" view to browse them.
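    For orientation, a minimal R sketch in the spirit of the distributed script (not the script itself) is given below; the input file, column names (date, tasmax in degC, pr in mm), and indicator definitions are assumptions, and the real script reads the DRIAS export format and computes a larger set of indicators.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily DRIAS extract for one location and scenario.
drias <- read.table("drias_saclay.txt", header = TRUE)

indicators <- drias %>%
  mutate(date   = as.Date(date),
         year   = year(date),
         season = case_when(month(date) %in% 3:5  ~ "spring",
                            month(date) %in% 6:8  ~ "summer",
                            month(date) %in% 9:11 ~ "autumn",
                            TRUE                  ~ "winter")) %>%
  group_by(year, season) %>%
  summarise(hot_days   = sum(tasmax >= 30),   # days with Tmax >= 30 degC
            total_rain = sum(pr),             # cumulative precipitation (mm)
            mean_tmax  = mean(tasmax),
            .groups = "drop")
```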

  18. Replication Data: A Synthetic Control Analysis of Philadelphia,...

    • ora.ox.ac.uk
    document, zip
    Updated Jan 1, 2021
    Cite
    Chrisinger, B (2021). Replication Data: A Synthetic Control Analysis of Philadelphia, Pennsylvania’s Excise Tax on Sugar-Sweetened and Artificially Sweetened Beverages and Supplemental Nutrition Assistance Program Benefit Redemption [Dataset]. http://doi.org/10.5287/bodleian:0oqGkDBdy
    Explore at:
    Available download formats: document(19858), zip(8563135)
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    University of Oxford
    Authors
    Chrisinger, B
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2005 - 2019
    Area covered
    Pennsylvania, Philadelphia, United States
    Description

    This is the replication dataset for "A Synthetic Control Analysis of Philadelphia, Pennsylvania’s Excise Tax on Sugar-Sweetened and Artificially Sweetened Beverages and Supplemental Nutrition Assistance Program Benefit Redemption", published in the American Journal of Public Health (accepted 18-Jun-2021). Included are analyses (.csv), codebook (.docx), Supplemental Materials (.docx), and R code used to generate synthetic controls and conduct robustness checks (.R).
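    The replication archive contains the actual R code; as a rough orientation only, a synthetic control run with the Synth package typically looks like the sketch below, where the panel `snap_panel`, the predictor names, unit IDs, and time windows are all placeholders rather than those of the published analysis.

```r
library(Synth)

dp <- dataprep(
  foo                   = snap_panel,                            # hypothetical panel
  predictors            = c("unemployment_rate", "median_income", "population"),
  dependent             = "snap_redemptions",
  unit.variable         = "county_id",
  unit.names.variable   = "county_name",
  time.variable         = "year",
  treatment.identifier  = 101,                                   # treated unit
  controls.identifier   = setdiff(unique(snap_panel$county_id), 101),
  time.predictors.prior = 2005:2016,                             # pre-intervention years
  time.optimize.ssr     = 2005:2016,
  time.plot             = 2005:2019
)

fit <- synth(dp)
path.plot(synth.res = fit, dataprep.res = dp)   # treated vs. synthetic control path
```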

  19. Simulated CO2 data - Dataset - NIASRA

    • hpc.niasra.uow.edu.au
    Updated Dec 3, 2014
    Cite
    (2014). Simulated CO2 data - Dataset - NIASRA [Dataset]. https://hpc.niasra.uow.edu.au/ckan/dataset/simulated-co2-data
    Explore at:
    Dataset updated
    Dec 3, 2014
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This CO2 data is available in the R package `fields', except that the data here has been converted into a more user-friendly long-table format. The package gives the following data description: this is an example of a moderately large spatial data set and consists of simulated CO2 concentrations that are irregularly sampled from a lon/lat grid. Also included is the complete CO2 field used to generate the synthetic observations. This data was generously provided by Dorit Hammerling and Randy Kawa as a test example for the spatial analysis of remotely sensed (i.e. satellite) and irregular observations. The synthetic data is based on a true CO2 field simulated from a geophysical, numerical model.

    Format:

    co2sim.csv: three columns: lon (longitude coordinate), lat (latitude coordinate), z (CO2 concentration in parts per million).

    co2true.csv: three columns: lon (longitude coordinate).
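    A rough sketch of the conversion described above is given below. The component names (lon.lat, y) follow the fields documentation for its simulated CO2 example data; check help("CO2", package = "fields") in your installed version before relying on them.

```r
library(fields)

data(CO2, package = "fields")   # simulated CO2 observations shipped with fields

co2sim <- data.frame(lon = CO2$lon.lat[, 1],
                     lat = CO2$lon.lat[, 2],
                     z   = CO2$y)
write.csv(co2sim, "co2sim.csv", row.names = FALSE)
# The complete field (co2true.csv here) can be flattened to the same
# three-column layout in the same way.
```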

  20. Margin of the Antarctic ice cover derived from Synthetic Aperture Radar...

    • researchdata.edu.au
    Updated Dec 10, 2015
    + more versions
    Cite
    Australian Antarctic Division (2015). Margin of the Antarctic ice cover derived from Synthetic Aperture Radar images for the sector 79E-108E [Dataset]. https://researchdata.edu.au/margin-antarctic-ice-79e-108e/3530928
    Explore at:
    Dataset updated
    Dec 10, 2015
    Dataset provided by
    data.gov.au
    Authors
    Australian Antarctic Division
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Description

    Geographic location of the outer margin of the Antarctic ice cover for the sector between longitudes 79E and 108E, including margins of ice shelves, glaciers, and iceberg tongues. The data set does not in general include the grounding zone at the inland margin of the ice shelves or glaciers.

    The margin was defined by interpretation of an image mosaic generated from Synthetic Aperture Radar data. The image mosaic was built using navigation data accompanying the SAR images to transform the images to a map projection. The image navigation data were adjusted so that overlapping images were registered to one another, the individual images merged into a mosaic, and the overall process adjusted so that the mosaic was tied to the few ground control points available in this large sector. Two separate mosaics were used to span the whole sector.

    The majority of the SAR data were acquired by the ERS-SAR instruments in August 1996, some ERS data were acquired in August 1993, and one Radarsat scene was acquired in September 1997. The data were pre-processed to produce a mosaic with a 100 m pixel size and adjusted so that the majority of the coastline positions refer to the August 1996 epoch.

    The location data are internally consistent and extracted at nominally 200 m intervals. The external position accuracy is generally better than 600 m. The coverage is complete over the whole sector. The coordinate set includes some island/ice rise features. Two very large grounded icebergs are included.

    Data are in an ASCII arc/info export file format as geographic coordinates on the ITRF1996 system and contain attribute information.

    ERS-SAR data, copyright ESA, 1993, 1996.

    Radarsat data, copyright Canadian Space Agency, Agence spatiale canadienne, 1997.

    This work was completed as part of ASAC projects 454, 1125 and 2224 (ASAC_454, ASAC_1125 and ASAC_2224).
