35 datasets found
  1. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. It contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods, for the purpose of training and simulation; it is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
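    The two-stage procedure described above can be sketched in code. The following Python illustration (the actual sample was drawn with an R script distributed with the dataset; the strata names and EA counts here are invented) allocates enumeration areas proportionally to stratum size, then draws a fixed 25 households per selected EA:

    ```python
    import random

    random.seed(1)

    # Hypothetical sampling frame: enumeration areas (EAs) grouped into
    # strata defined by province (geo_1) x urban/rural. Sizes invented.
    strata = {
        ("prov1", "urban"): 120, ("prov1", "rural"): 200,
        ("prov2", "urban"): 80,  ("prov2", "rural"): 240,
    }
    TOTAL_HH = 8000
    HH_PER_EA = 25
    n_eas_total = TOTAL_HH // HH_PER_EA  # 320 EAs overall

    total_size = sum(strata.values())
    sample = []
    for stratum, size in strata.items():
        # Stage 1: number of EAs per stratum, proportional to stratum size
        n_eas = round(n_eas_total * size / total_size)
        for ea in random.sample(range(size), n_eas):
            # Stage 2: 25 households drawn at random within each selected EA
            # (assume each EA lists 100 households, ids 0..99)
            for hh in random.sample(range(100), HH_PER_EA):
                sample.append((stratum, ea, hh))

    print(len(sample))  # 8000 households in total
    ```

    With these (invented) stratum sizes the proportional allocation is exact, so the sketch returns exactly 8,000 households.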

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are typically collected through sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  2. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Sep 17, 2025
    Cite
    Antonio Gasparrini; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Available download formats: bin, csv, zip, pdf
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Gasparrini; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

    • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available here, with code provided in this GitHub repo]
    • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available here, with code provided in this GitHub repo]

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data and must not be presented as such.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The synthetic datasets (each stored in two versions, csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
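    The second step, simulating death events year by year from fitted hazard relationships, can be sketched as follows. This is an illustrative Python sketch only (the actual pipeline uses R, synthpop, and a Cox model fitted on the original data); the coefficients, baseline hazard, and variable names below are invented:

    ```python
    import math
    import random

    random.seed(7)

    # Hypothetical log hazard ratios, standing in for the Cox model fit;
    # values are invented for illustration.
    beta = {"pm25": 0.08, "age_c": 0.09, "smoker": 0.45}
    baseline_hazard = 0.002  # assumed yearly baseline hazard

    def simulate_death_year(subject, years=10):
        """Return the simulated year of death (1-based) or None if the
        subject survives the follow-up window."""
        lp = sum(beta[k] * subject[k] for k in beta)  # linear predictor
        # Yearly death probability from a constant hazard within the year
        p_death = 1 - math.exp(-baseline_hazard * math.exp(lp))
        for year in range(1, years + 1):
            if random.random() < p_death:
                return year
        return None

    # Example subject: PM2.5 exposure 12 ug/m3, age centred (60 - 50), smoker
    subject = {"pm25": 12.0, "age_c": 10.0, "smoker": 1}
    print(simulate_death_year(subject))
    ```

    Repeating this draw for every synthetic subject and year yields the death records dataset, with mortality risks that track the fitted relationships.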

  3. FU-LoRA: Synthetic Fetal Ultrasound Images for Standard Anatomical Planes...

    • zenodo.org
    zip
    Updated Aug 7, 2024
    Cite
    Fangyijie Wang (2024). FU-LoRA: Synthetic Fetal Ultrasound Images for Standard Anatomical Planes Classification [Dataset]. http://doi.org/10.48550/arxiv.2407.20072
    Available download formats: zip
    Dataset updated
    Aug 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fangyijie Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 21, 2024
    Description

    Abstract

    Developing robust deep learning models for fetal ultrasound image analysis requires comprehensive, high-quality datasets to effectively learn informative data representations within the domain. However, the scarcity of labelled ultrasound images poses substantial challenges, especially in low-resource settings. To tackle this challenge, we leverage synthetic data to enhance the generalizability of deep learning models. This study proposes a diffusion-based method, Fetal Ultrasound LoRA (FU-LoRA), which involves fine-tuning latent diffusion models using the LoRA technique to generate synthetic fetal ultrasound images. These synthetic images are integrated into a hybrid dataset that combines real-world and synthetic images to improve the performance of zero-shot classifiers in low-resource settings. Our experimental results on fetal ultrasound images from African cohorts demonstrate that FU-LoRA outperforms the baseline method by a 13.73% increase in zero-shot classification accuracy. Furthermore, FU-LoRA achieves the highest accuracy of 82.40%, the highest F-score of 86.54%, and the highest AUC of 89.78%. It demonstrates that the FU-LoRA method is effective in the zero-shot classification of fetal ultrasound images in low-resource settings. Our code and data are publicly accessible on GitHub.

    Method

    Our FU-LoRA method fine-tunes the pre-trained latent diffusion model (LDM) [2] using the LoRA method on a small fetal ultrasound dataset from high-resource settings (HRS). This approach integrates synthetic images to enhance the generalization and performance of deep learning models. We conduct three fine-tuning sessions for the diffusion model to generate three LoRA models with different hyper-parameters: alpha in [8, 32, 128] and r in [8, 32, 128], with the merging rate alpha/r fixed to 1. The purpose is to examine LoRA more deeply, uncover optimizations that improve the model's performance, and evaluate the effect of the parameter r on the generated synthetic images.

    Datasets

    The Spanish dataset (URL) in HRS includes 1,792 patient records from Spain [1]. All images were acquired during screening in the second and third trimesters of pregnancy using six different machines operated by operators with similar expertise. We randomly selected 20 Spanish ultrasound images from each of the five maternal-fetal planes (Abdomen, Brain, Femur, Thorax, and Other) to fine-tune the LDM using the LoRA technique, and 1,150 Spanish images (230 x 5 planes) to create the hybrid dataset. In summary, fine-tuning the LDM uses 100 images from 85 patients. Training downstream classifiers uses 6,148 images from 612 patients; within these, a subset of 200 images is randomly selected for validation. The hybrid dataset employed in this study has a total of 1,150 Spanish images, representing 486 patients.

    We create the synthetic dataset comprising 5000 fetal ultrasound images (500 x 2 samplers x 5 planes) accessible to the open-source community. The generation process utilizes our LoRA model Rank r = 128 with Euler and UniPC samplers known for their efficiency. Subsequently, we integrate this synthetic dataset with a small amount of Spanish data to create a hybrid dataset.

    Implementation Details

    The hyper-parameters of LoRA models are defined as follows: batch size to 2; LoRA learning rate to 1e-4; total training steps to 10000; LoRA dimension to 128; mixed precision selection to fp16; learning scheduler to constant; and input size (resolution) to 512. The model is trained on a single NVIDIA RTX A5000, 24 GB with 8-bit Adam optimizer on PyTorch.

  4. HaDR: Dataset for hands instance segmentation

    • kaggle.com
    zip
    Updated Mar 7, 2023
    Cite
    Ales Vysocky (2023). HaDR: Dataset for hands instance segmentation [Dataset]. https://www.kaggle.com/datasets/alevysock/hadr-dataset-for-hands-instance-segmentation
    Available download formats: zip (10662295286 bytes)
    Dataset updated
    Mar 7, 2023
    Authors
    Ales Vysocky
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    If you use this dataset for your work, please cite the related papers: A. Vysocky, S. Grushko, T. Spurny, R. Pastor and T. Kot, Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localisation, in IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3206948.

    S. Grushko, A. Vysocký, J. Chlebek, P. Prokop, HaDR: Applying Domain Randomization for Generating Synthetic Multimodal Dataset for Hand Instance Segmentation in Cluttered Industrial Environments. preprint in arXiv, 2023, https://doi.org/10.48550/arXiv.2304.05826

    The HaDR dataset is a multimodal dataset designed for human-robot gesture-based interaction research, consisting of RGB and depth frames with binary masks for each hand instance (i1, i2; single-class data). The dataset is entirely synthetic, generated using the Domain Randomization technique in CoppeliaSim 3D. The dataset can be used to train deep learning models to recognize hands using either a single modality (RGB or depth) or both simultaneously. The training and validation splits comprise 95K and 22K samples, respectively, with annotations provided in COCO format. The instances are uniformly distributed across the image boundaries. The vision sensor captures depth and color images of the scene, with depth pixel values scaled into a single-channel 8-bit grayscale image over the range [0.2, 1.0] m. The following aspects of the scene were randomly varied during dataset generation:

    • Number, colors, textures, scales, and types of distractor objects, selected from a set of 3D models of general tools and geometric primitives; a special type of distractor is an articulated dummy without hands (for instance-free samples).
    • Hand gestures (9 options).
    • Hand models' positions and orientations.
    • Texture and surface properties (diffuse, specular, and emissive) and number (from none to 2) of the object of interest, as well as its background.
    • Number and locations of directional light sources (from 1 to 4), in addition to a planar light for ambient illumination.

    The sample resolution is 320×256, encoded in lossless PNG format. The dataset contains only right-hand meshes (we suggest using flip augmentations during training), with a maximum of two instances per sample.

    Test dataset (real camera images): The test dataset, containing 706 images, was captured using a real RGB-D camera (RealSense L515) in a cluttered and unstructured industrial environment. It comprises various scenarios with diverse lighting conditions, backgrounds, obstacles, numbers of hands, and different types of work gloves (red, green, white, yellow, no gloves) with varying sleeve lengths. The dataset is assumed to have only one user, and the maximum number of hand instances per sample was limited to two. The dataset was manually labelled, and we provide hand instance segmentation COCO annotations in instances_hands_full.json (separately for train and val) and full arm instance annotations in instances_arms_full.json. The sample resolution was set to 640×480, and depth images were encoded in the same way as those of the synthetic dataset.

    Channel-wise normalization and standardization parameters for datasets

    Dataset      Mean (R, G, B, D)                  STD (R, G, B, D)
    Train        98.173, 95.456, 93.858, 55.872     67.539, 67.194, 67.796, 47.284
    Validation   99.321, 97.284, 96.318, 58.189     67.814, 67.518, 67.576, 47.186
    Test         123.675, 116.28, 103.53, 35.3792   58.395, 57.12, 57.375, 45.978
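    Applying the channel-wise statistics above is straightforward. A minimal Python sketch (NumPy only; the sample array here is random data standing in for a real RGB-D frame) standardizes an HxWx4 sample with the Train-split parameters:

    ```python
    import numpy as np

    # Per-channel statistics (R, G, B, D) for the Train split, taken from
    # the table above.
    mean = np.array([98.173, 95.456, 93.858, 55.872])
    std = np.array([67.539, 67.194, 67.796, 47.284])

    def normalize(sample: np.ndarray) -> np.ndarray:
        """Channel-wise standardization of an HxWx4 RGB-D sample."""
        return (sample - mean) / std

    # Random stand-in for a 320x256 RGB-D frame (4 channels, 8-bit range)
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(256, 320, 4)).astype(np.float64)
    out = normalize(img)
    print(out.shape)  # (256, 320, 4)
    ```

    Broadcasting over the last axis applies each channel's mean and standard deviation independently.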


  5. Variant data used during semi-synthetic data generation.

    • plos.figshare.com
    xls
    Updated Sep 5, 2025
    Cite
    Eric V. Strobl; Eric R. Gamazon (2025). Variant data used during semi-synthetic data generation. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013461.t001
    Available download formats: xls
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Eric V. Strobl; Eric R. Gamazon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Variant data used during semi-synthetic data generation.

  6. Network Digital Twin-Generated Dataset for Machine Learning-Based Detection...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Nov 13, 2024
    Cite
    Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Antonio Pastor Perales; Diego R. López (2024). Network Digital Twin-Generated Dataset for Machine Learning-Based Detection of Benign and Malicious Heavy Hitter Flows [Dataset]. http://doi.org/10.5281/zenodo.14134646
    Available download formats: bin
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amit Karamchandani Batra; Javier Nuñez Fuente; Luis de la Cal García; Yenny Moreno Meneses; Alberto Mozo Velasco; Antonio Pastor Perales; Diego R. López
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 11, 2024
    Description

    The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:

    Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.

    This dataset contains a synthetic dataset generated to differentiate between benign and malicious heavy hitter flows within complex network environments. Heavy Hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.

    To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

    The feature set includes flow statistics commonly used in network analysis, such as:

    • Traffic protocol type,
    • Flow duration (the time between the initial and final packet in both directions),
    • Total count of payload packets transmitted in both directions,
    • Cumulative bytes transmitted in both directions,
    • Time discrepancy between the first packet observations at the source and destination,
    • Packet and byte transmission rates per second within each interval, and
    • Total packet and byte counts within each interval in both directions.
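    The per-flow features listed above map naturally onto a small record type. A hedged Python sketch (field names are illustrative, not the dataset's actual column names) shows how duration and per-second rates derive from the raw counters:

    ```python
    from dataclasses import dataclass

    @dataclass
    class FlowFeatures:
        protocol: str     # traffic protocol type
        first_ts: float   # timestamp of the first packet (s)
        last_ts: float    # timestamp of the last packet (s)
        packets: int      # payload packets, both directions
        n_bytes: int      # cumulative bytes, both directions

        @property
        def duration(self) -> float:
            """Time between the initial and final packet."""
            return self.last_ts - self.first_ts

        def rates(self):
            """Packet and byte transmission rates per second."""
            d = self.duration or 1.0  # guard against zero-length flows
            return self.packets / d, self.n_bytes / d

    f = FlowFeatures("tcp", 0.0, 10.0, 5000, 6_000_000)
    print(f.rates())  # (500.0, 600000.0)
    ```

    A heavy hitter classifier would consume such records, with the label (benign HH, malicious HH, regular) supplied by the NDT generation pipeline.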
  7. Study Hours vs Grades Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Andrey Silva (2025). Study Hours vs Grades Dataset [Dataset]. https://www.kaggle.com/datasets/andreylss/study-hours-vs-grades-dataset
    Available download formats: zip (33964 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Andrey Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.

    Dataset Features

    • student_id: Unique identifier for each student (1-5000)
    • study_hours: Hours spent studying (0-12 hours, continuous)
    • grade: Final exam score (0-100 points, continuous)

    Potential Use Cases

    • Linear regression modeling and practice
    • Data visualization exercises
    • Statistical analysis tutorials
    • Machine learning for beginners
    • Educational research simulations

    Data Quality

    • No missing values
    • Normally distributed residuals
    • Realistic educational scenario
    • Ready for immediate analysis

    Data Generation Code

    This dataset was generated using R.

    R Code

    # Set seed for reproducibility
    set.seed(42)
    
    # Define number of observations (students)
    n <- 5000
    
    # Generate study hours (independent variable)
    # Uniform distribution between 0 and 12 hours
    study_hours <- runif(n, min = 0, max = 12)
    
    # Create relationship between study hours and grade
    # Base grade: 40 points
    # Each study hour adds an average of 5 points
    # Add normal noise (standard deviation = 10)
    theoretical_grade <- 40 + 5 * study_hours
    
    # Add normal noise to make it realistic
    noise <- rnorm(n, mean = 0, sd = 10)
    
    # Calculate final grade
    grade <- theoretical_grade + noise
    
    # Limit grades between 0 and 100
    grade <- pmin(pmax(grade, 0), 100)
    
    # Create the dataframe
    dataset <- data.frame(
     student_id = 1:n,
     study_hours = round(study_hours, 2),
     grade = round(grade, 2)
    )
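    As a quick cross-check (not part of the original R workflow), the same generation can be mirrored in Python to confirm that the designed slope of roughly 5 points per study hour is recoverable by linear regression; a different RNG means the individual values differ from the R output:

    ```python
    import numpy as np

    # Mirror the R generation above: uniform hours, linear grade + noise,
    # clipped to [0, 100].
    rng = np.random.default_rng(42)
    n = 5000
    study_hours = rng.uniform(0, 12, n)
    grade = np.clip(40 + 5 * study_hours + rng.normal(0, 10, n), 0, 100)

    # Fit a simple linear regression; clipping at 100 attenuates the
    # slope slightly below the designed value of 5.
    slope, intercept = np.polyfit(study_hours, grade, 1)
    print(round(slope, 2), round(intercept, 2))
    ```

    The recovered slope lands near 5 and the intercept near 40, matching the generating model.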
    
  8. PeopleSansPeople (PeopleSansPeople: A Synthetic Data Generator for...

    • opendatalab.com
    zip
    Updated Dec 20, 2021
    Cite
    Unity Technologies (2021). PeopleSansPeople (PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision) [Dataset]. https://opendatalab.com/OpenDataLab/PeopleSansPeople
    Available download formats: zip (1547423033 bytes)
    Dataset updated
    Dec 20, 2021
    Dataset provided by
    Unity Technologies (https://unity.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We release a human-centric synthetic data generator PeopleSansPeople which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using synthetic data and fine-tuning on target real-world data (few-shot transfer to limited subsets of COCO-person train [2]) resulted in a keypoint AP of 60.37±0.48 (COCO test-dev2017) outperforming models trained with the same real data alone (keypoint AP of 55.80) and pre-trained with ImageNet (keypoint AP of 57.50). This freely-available data generator should enable a wide range of research into the emerging field of simulation to real transfer learning in the critical area of human-centric computer vision.

  9. The Residential Population Generator (RPGen): A tool to parameterize...

    • catalog.data.gov
    Updated Mar 9, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). The Residential Population Generator (RPGen): A tool to parameterize residential, demographic, and physiological data to model intraindividual exposure, dose, and risk [Dataset]. https://catalog.data.gov/dataset/the-residential-population-generator-rpgen-a-tool-to-parameterize-residential-demographic-
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This repository contains scripts, input files, and some example output files for the Residential Population Generator (RPGen), an R-based tool that generates synthetic human residential populations for use in estimating near-field chemical exposures. The tool is most readily adapted for use in the workflow of CHEM, the Combined Human Exposure Model, available in two other GitHub repositories in the HumanExposure project: ProductUseScheduler and source2dose. CHEM is currently best suited to estimating exposure from product use. Outputs from RPGen feed into ProductUseScheduler, whose outputs are then used by source2dose.

  10. Description of the generator parameters.

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Christine Largeron; Pierre-Nicolas Mougel; Reihaneh Rabbany; Osmar R. Zaïane (2023). Description of the generator parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0122777.t001
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Christine Largeron; Pierre-Nicolas Mougel; Reihaneh Rabbany; Osmar R. Zaïane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of the generator parameters.

  11. Simulation Machine - 1198888 claims (R 3.5)

    • kaggle.com
    zip
    Updated Jan 24, 2022
    Cite
    floser (2022). Simulation Machine - 1198888 claims (R 3.5) [Dataset]. https://www.kaggle.com/datasets/floser/simulation-machine-1198888-claims-r-35
    Available download formats: zip (17310149 bytes)
    Dataset updated
    Jan 24, 2022
    Authors
    floser
    Description

    1198888 Non-Life Insurance Claims Cash Flows for a Synthetic Portfolio

    Data Generator:

    "An Individual Claims History Simulation Machine" by Andrea Gabrielli and Mario V. Wüthrich

    Data generated with:

    "A Neural Network Boosted Double Over-Dispersed Poisson Claims Reserving Model" Data Generation.R by Andrea Gabrielli, 28.05.2020
    Path: https://github.com/gabrielliandrea/neuralnetworkdoubleODP/blob/master/RCodeNeuralNetworkDoubleODP.zip

    Remark: Random number generation has changed with R Version 3.6

    To get published results (pre 3.6) the following option is needed: RNGkind(sample.kind = "Rounding") (source: https://stackoverflow.com/questions/47199415/is-set-seed-consistent-over-different-versions-of-r-and-ubuntu/56381613#56381613)

    Claim counts per LoB (1 to 6):

    LoB:     1       2       3      4       5       6      Sum
    Claims:  250040  250197  99969  249683  249298  99701  1198888

    Read:

    "Claims Reserving and Neural Networks" (Doctoral Thesis) by Andrea Gabrielli, ETHZ 2020.

  12. Synthetic Temperature Data for Predicting Flashover Occurrence Using...

    • catalog.data.gov
    • datasets.ai
    Updated Oct 28, 2022
    Cite
    National Institute of Standards and Technology (2022). Synthetic Temperature Data for Predicting Flashover Occurrence Using Surrogate Temperature Data [Dataset]. https://catalog.data.gov/dataset/synthetic-temperature-data-for-predicting-flashover-occurrence-using-surrogate-temperature
    Explore at:
    Dataset updated
    Oct 28, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This data set provides heat detector temperatures in a single-story ranch structure with a living room, kitchen, dining room, and three bedrooms. 20,000 sets of detector temperatures are generated using CData [1]. The data are obtained from simulation runs with various fire conditions and door-opening conditions. The fire is described by a t-squared growth law. The peak heat release rate and the time to peak range from approximately 1667 kW to 4620 kW and from 210 s to 1540 s, respectively. A detailed description of this work can be found in Ref. [2].

    [1] Reneke, P.A., Peacock, R.D., Gilbert, S.W. and Cleary, T.G., 2021. CFAST Consolidated Fire and Smoke Transport (Version 7) Volume 5: CFAST Fire Data Generator (CData). NIST Technical Note 1889v5. Gaithersburg, MD.
    [2] Fu, E.Y., Tam, W.C., Wang, J., Peacock, R., Reneke, P., Ngai, G., Leong, H.V. and Cleary, T., 2021, May. Predicting Flashover Occurrence using Surrogate Temperature Data. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 17, pp. 14785-14794).
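    The t-squared growth law mentioned above sets the heat release rate to Q(t) = alpha * t^2, capped at the peak. A minimal sketch, using one point from the quoted ranges (peak 1667 kW reached at 210 s; the alpha derivation is standard, not taken from this dataset):

    ```python
    def hrr(t: float, q_peak: float = 1667.0, t_peak: float = 210.0) -> float:
        """Heat release rate (kW) at time t (s) for a t-squared fire.

        alpha is chosen so the fire reaches q_peak exactly at t_peak.
        """
        alpha = q_peak / t_peak**2  # growth coefficient (kW/s^2)
        return min(alpha * t * t, q_peak)

    # At half the rise time, a t-squared fire is at a quarter of its peak
    print(round(hrr(105.0), 1))
    ```

    Varying q_peak and t_peak over the quoted ranges (1667-4620 kW, 210-1540 s) reproduces the family of fire conditions behind the simulated detector temperatures.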

  13. Insider Threat Test Dataset

    • kilthub.cmu.edu
    txt
    Updated May 30, 2023
    + more versions
    Cite
    Brian Lindauer (2023). Insider Threat Test Dataset [Dataset]. http://doi.org/10.1184/R1/12841247.v1
    Available download formats: txt
    Dataset provided by
    Carnegie Mellon University
    Authors
    Brian Lindauer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/

    Description

    The Insider Threat Test Dataset is a collection of synthetic insider threat test datasets that provide both synthetic background data and data from synthetic malicious actors. The CERT Division, in partnership with ExactData, LLC, and under sponsorship from DARPA I2O, generated the collection. For more background on this data, please see the paper "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data."

    Datasets are organized according to the data generator release that created them. Most releases include multiple datasets (e.g., r3.1 and r3.2), and later releases generally include a superset of the data generation functionality of earlier releases. Each dataset file contains a readme that provides detailed notes about the features of that release. The answer key file answers.tar.bz2 contains the details of the malicious activity included in each dataset, including descriptions of the scenarios enacted and the identifiers of the synthetic users involved.
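The bz2-compressed answer key can be inspected with Python's standard library alone; a minimal sketch (point the path at your downloaded copy of the archive):

```python
import tarfile

def list_answer_key(path):
    """Enumerate (name, size) for every file in a bz2-compressed tar archive,
    such as the answers.tar.bz2 answer key, without extracting it to disk."""
    with tarfile.open(path, mode="r:bz2") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]

# Usage sketch:
# for name, size in list_answer_key("answers.tar.bz2"):
#     print(name, size)
```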

  14. Synthetic XES Event Log of Malignant Melanoma Treatment

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 23, 2024
    Cite
    Grüger, Joscha; Kuhn, Martin (2024). Synthetic XES Event Log of Malignant Melanoma Treatment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13828518
    Dataset provided by
    German Research Centre for Artificial Intelligence
    Authors
    Grüger, Joscha; Kuhn, Martin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/

    Description

    The synthetic event log described in this document consists of 25,000 traces, generated using the process model outlined in Geyer et al. (2024) [1] and the DALG tool [2]. This event log simulates the treatment process of malignant melanoma patients, adhering to clinical guidelines. Each trace in the log represents a unique patient journey through various stages of melanoma treatment, providing detailed insights into decision points, treatments, and outcomes.

    The DALG tool [2] was employed to generate this data-aware event log, ensuring realistic data distribution and variability.

    DALG: https://github.com/DavidJilg/DALG

    [1] Geyer, T., Grüger, J., & Kuhn, M. (2024). Clinical Guideline-based Model for the Treatment of Malignant Melanoma (Data Petri Net) (1.0). Zenodo. https://doi.org/10.5281/zenodo.10785431

    [2] Jilg, D., Grüger, J., Geyer, T., Bergmann, R.: DALG: the data aware event log generator. In: BPM 2023 - Demos & Resources. CEUR Workshop Proceedings, vol. 3469, pp. 142–146. CEUR-WS.org (2023)
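XES is plain XML, so activity sequences can be pulled from such a log with the standard library alone. A minimal sketch, assuming a namespace-free log and the standard concept:name event attribute (process-mining libraries such as pm4py offer richer loaders):

```python
import xml.etree.ElementTree as ET

def activity_sequences(xes_path):
    """Return one list of activity names per trace in an XES event log.

    Assumes the log is written without an XML namespace; the activity label
    is the <string key="concept:name" .../> attribute of each <event>.
    """
    root = ET.parse(xes_path).getroot()
    sequences = []
    for trace in root.iter("trace"):
        seq = [s.get("value")
               for event in trace.iter("event")
               for s in event.iter("string")
               if s.get("key") == "concept:name"]
        sequences.append(seq)
    return sequences
```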

  15. UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA)

    • zenodo.org
    bin, zip
    Updated Dec 11, 2023
    Cite
    Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo (2023). UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA) [Dataset]. http://doi.org/10.5281/zenodo.6476639
    Available download formats: zip, bin
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Coronado; Samiksha Pachade; Rania Abdelkhaleq; Juntao Yan; Sergio Salazar-Marioni; Amanda Jagolino; Mozhdeh Bahrainian; Roomasa Channa; Sunil Sheth; Luca Giancardo
    Description

    Introduction

    Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time-consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics, making them more expensive and less portable, with a reduced field of view (FOV) compared to fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.

    We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.

    In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".

    Fundus Imaging

    We include 45-degree macula-centered fundus images that cover both the macula and the optic disc. All images were acquired using an OptoVue iVue fundus camera without pupil dilation.

    The full images are available at the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/fundus/disc and cropped/fundus/macula.

    Enface OCT-A

    We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low-quality images with errors in the retinal layer segmentations were not included.

    En-face OCT-A images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images in which only vessels are retained. Denoising was performed automatically using the ROSE algorithm (Ma et al. 2021). The results can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh; the former contains the probabilities output by the ROSE algorithm, the latter a binary map.
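Given the directory layout above, the fundus and en-face OCT-A crops can be paired for a training pipeline. A minimal sketch, assuming matching file names across the two folders (the naming convention is an assumption; check the download):

```python
from pathlib import Path

def paired_crops(root, region="macula"):
    """Yield (fundus_path, octa_path) pairs for one region ("macula" or
    "disc"), matching files in cropped/fundus/<region> and
    cropped/octa/<region> by file name."""
    fundus_dir = Path(root) / "cropped" / "fundus" / region
    octa_dir = Path(root) / "cropped" / "octa" / region
    for fundus in sorted(fundus_dir.iterdir()):
        octa = octa_dir / fundus.name  # assumes identical file names
        if octa.exists():
            yield fundus, octa
```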

    Synthetic OCT-A

    We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.

    The full images are available in the fov45/synthetic_octa directory, and the FOVs corresponding to the en-face OCT-A images are in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we applied the same ROSE denoising algorithm (Ma et al. 2021) used for the original en-face OCT-A images; the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh, where the former contains the probabilities output by the ROSE algorithm and the latter a binary map.

    Other Fundus Vessel Segmentations Included

    In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations: SA-UNet (Guo et al., 2021) and IterNet (Li et al., 2020).

    • SA-Unet. The full images are available at the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.

    • IterNet. The full images are available at the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/Iternet/disc and cropped/Iternet/macula.

    Train/Validation/Test Replication

    In order to replicate or compare your model to the results of our paper, we report below the data split used.

    • Training subjects IDs: 1 - 25

    • Validation subjects IDs: 26 - 30

    • Testing subjects IDs: 31 - 112
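The subject-level split above is easy to reproduce programmatically; a sketch using only the ID ranges listed:

```python
# Subject-level split reported for the accompanying paper.
train_ids = list(range(1, 26))    # subjects 1-25
val_ids = list(range(26, 31))     # subjects 26-30
test_ids = list(range(31, 113))   # subjects 31-112

# Sanity checks: the three sets are disjoint and cover all 112 subjects.
assert set(train_ids).isdisjoint(val_ids) and set(val_ids).isdisjoint(test_ids)
assert sorted(train_ids + val_ids + test_ids) == list(range(1, 113))
print(len(train_ids), len(val_ids), len(test_ids))  # 25 5 82
```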

    Data Acquisition

    This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.

    User Agreement

    The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In case of any publication, the following paper needs to be cited:

    
    Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
    

    Funding

    This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.

    Research Team and Acknowledgements

    Here are the people behind this data acquisition effort:

    Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo

    We would also like to acknowledge the support of the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at the University of Dundee, UK, and the Memorial Hermann Hospital System.

    References

    Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
    
    
    C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346.
    
    L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621.
    
    Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.
    
  16. YRS Synthetic Forecast Generation Dataset

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Jun 3, 2025
    Cite
    Zachary Paul Brodeur (2025). YRS Synthetic Forecast Generation Dataset [Dataset]. http://doi.org/10.4211/hs.29a7c696ee4e4766883078ca0d681884
    Available download formats: zip (1006.5 MB)
    Dataset provided by
    HydroShare
    Authors
    Zachary Paul Brodeur
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/

    Time period covered
    Oct 2, 1985 - Sep 30, 2019
    Area covered
    Description

    Pre-processed subset of raw HEFS hindcast data for the Feather-Yuba system (YRS), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models, available at https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and the README files in both GitHub repos include instructions on how to set up the data contained in this resource.

    Contains HEFS hindcast .csv files and observed full-natural-flow files for the following sites:

    • ORDC1 - main reservoir inflow to Oroville Lake
    • NBBC1 - main reservoir inflow to New Bullards Bar
    • MRYC1L - downstream local flows at Marysville junction

    The data also contain the R scripts used to preprocess the raw HEFS data. These raw data are too large for easy storage in a public repository (YRS has 30+ modeled sites) but are available upon reasonable request from Zach Brodeur, zpb4@cornell.edu.
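A hedged sketch of summarizing one of the hindcast files with only the standard library. The column layout (a datetime column followed by one column per ensemble member) and the file name are assumptions; consult the repo READMEs for the actual structure:

```python
import csv
from statistics import mean

def ensemble_means(csv_path):
    """Map each timestamp to the ensemble-mean forecast flow from an HEFS
    hindcast CSV (assumed layout: datetime column, then ensemble members)."""
    out = {}
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        next(rows)  # skip the header row
        for row in rows:
            out[row[0]] = mean(float(v) for v in row[1:])
    return out

# Usage sketch (file name is hypothetical):
# print(ensemble_means("ORDC1_hefs_hindcast.csv"))
```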

  17. ADO Synthetic Forecast Generation Dataset

    • hydroshare.org
    • search.dataone.org
    zip
    Updated May 22, 2025
    Cite
    Zachary Paul Brodeur (2025). ADO Synthetic Forecast Generation Dataset [Dataset]. http://doi.org/10.4211/hs.b6788237717c41e0bcc69bcaa851694f
    Available download formats: zip (265.4 MB)
    Dataset provided by
    HydroShare
    Authors
    Zachary Paul Brodeur
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/

    Time period covered
    Oct 4, 1940 - Sep 30, 2019
    Area covered
    Description

    Pre-processed subset of raw HEFS hindcast data for the Prado Dam system (ADO), configured for compatibility with the repository structure of the version 1 and version 2 synthetic forecast models, available at https://github.com/zpb4/Synthetic-Forecast-v1-FIRO-DISES and https://github.com/zpb4/Synthetic-Forecast-v2-FIRO-DISES. The data are pre-structured for the repository setup, and the README files in both GitHub repos include instructions on how to set up the data contained in this resource.

    Contains HEFS hindcast .csv files and observed full-natural-flow files for the following site:

    • ADOC1 - main reservoir inflow to Prado Dam

    The data also contain the R scripts used to preprocess the raw HEFS data contained in the associated public HydroShare resource: https://www.hydroshare.org/resource/3b665049965c4a2b88f6f6c1abb0ff94/

  18. Synthetic gene expression data with underlying gene network

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 15, 2023
    Cite
    Hu, Jianchang; Szymczak, Silke (2023). Synthetic gene expression data with underlying gene network [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_8242660
    Dataset provided by
    Institute of Medical Biometry and Statistics, University of Lübeck, Lübeck, Germany
    Authors
    Hu, Jianchang; Szymczak, Silke
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/

    Description

    This is the synthetic gene expression data along with the underlying gene network used in the simulation studies of Hu and Szymczak (2023) for evaluating network-guided random forest.

    In this dataset we consider 1000 genes and 1000 samples each for the training and testing sets. Each file contains a list of 100 replications of the scenario identified by the file name. In particular, we consider 6 different scenarios, depending on the number of disease modules and on how the effects of disease genes are distributed within the disease module. When there are disease genes, we also consider 3 different levels of effect sizes. The binary responses are then generated via a logistic regression model. More details on these scenarios and the data generation mechanism can be found in Hu and Szymczak (2023).

    The data are generated by the gen_data function in the R package networkRF, which can be accessed at https://github.com/imbs-hl/networkRF. To obtain the datasets with 3000 genes, the other part of the data used in the simulation studies of Hu and Szymczak (2023), simply modify the num.var argument of gen_data. More details on the implementation and the format of the output can be found in the help page of the R package.
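The logistic response-generation step can be sketched in a few lines. This is an illustration of the mechanism only: the actual networkRF gen_data function also simulates correlated expression along the gene network, and the module size and effect size used here are hypothetical:

```python
import math
import random

random.seed(1)
n_samples, n_genes = 1000, 1000   # sizes used in the simulation study
module_size, effect = 10, 0.5     # hypothetical disease module and effect size

# Nonzero effects only for genes in the disease module.
beta = [effect] * module_size + [0.0] * (n_genes - module_size)

responses = []
for _ in range(n_samples):
    x = [random.gauss(0.0, 1.0) for _ in range(n_genes)]  # expression values
    eta = sum(b * v for b, v in zip(beta, x) if b)        # linear predictor
    p = 1.0 / (1.0 + math.exp(-eta))                      # logistic link
    responses.append(1 if random.random() < p else 0)
```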

  19. Community homogeneity measures for θ varying.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Christine Largeron; Pierre-Nicolas Mougel; Reihaneh Rabbany; Osmar R. Zaïane (2023). Community homogeneity measures for θ varying. [Dataset]. http://doi.org/10.1371/journal.pone.0122777.t005
    Available download formats: xls
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Christine Largeron; Pierre-Nicolas Mougel; Reihaneh Rabbany; Osmar R. Zaïane
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/

    Description

    Community homogeneity measures for θ varying.

  20. Synthetic Industrial Parts dataset (SIP-17)

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Xiaomeng (Mandy) Zhu (2023). Synthetic Industrial Parts dataset (SIP-17) [Dataset]. https://www.kaggle.com/datasets/mandymm/synthetic-industrial-parts-dataset-sip-17/data
    Available download formats: zip (7516467671 bytes)
    Authors
    Xiaomeng (Mandy) Zhu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/

    Description

    The Synthetic Industrial Parts dataset (SIP-17) is designed for the Sim-to-Real challenge in Industrial Parts Classification.

    The dataset comprises 17 objects that represent six typical industry use cases. The first four use cases require the classification of isolated parts, and the remaining two (Oring_assembly and Wheel_assembly) require the classification of assembled parts.

    For each object, we provided three types of images: Syn_O, synthetic images without random backgrounds and post-processing; Syn_R, synthetic images with random backgrounds and post-processing; and Real, images captured from cameras in real industrial scenarios.

    To facilitate model training and validation, we generated 1,200 synthetic images for each object for training and 300 synthetic images for validation. In total, we have created 33,000 images for both Syn_O and Syn_R. For testing, we captured 566 real images from various industrial scenarios.

    For more detailed information, please refer to our paper.

    To evaluate the performance of the dataset, we benchmarked it using five state-of-the-art models: ResNet, EfficientNet, ConvNeXt, ViT, and DINO. Notably, we trained the models only on synthetic data and tested them on real data. Our code can be found on GitHub.
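The train-on-synthetic / test-on-real protocol amounts to assembling file lists from the image variants described above. A minimal sketch; the directory names and image extension are assumptions about the download layout:

```python
from pathlib import Path

def sim_to_real_split(root, synthetic_variant="Syn_R"):
    """Collect image paths for the Sim-to-Real protocol: train and validate
    on a synthetic variant (Syn_O or Syn_R), test on the Real images."""
    root = Path(root)
    return {
        "train": sorted((root / synthetic_variant / "train").rglob("*.png")),
        "val": sorted((root / synthetic_variant / "val").rglob("*.png")),
        "test": sorted((root / "Real").rglob("*.png")),
    }
```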

    Samples of each category of the dataset are shown on the dataset page (image omitted here).

    The number of images per category in the SIP-17 dataset is shown on the dataset page (image omitted here).

    Acknowledgement

    If you use this dataset in your research, please cite our paper:

    @InProceedings{Zhu_2023_CVPR,
      author = {Zhu, Xiaomeng and Bilal, Talha and M\r{a}rtensson, P\"ar and Hanson, Lars and Bj\"orkman, M\r{a}rten and Maki, Atsuto},
      title = {Towards Sim-to-Real Industrial Parts Classification With Synthetic Dataset},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
      month = {June},
      year = {2023},
      pages = {4453-4462}
    }

Cite
Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906

Synthetic Data for an Imaginary Country, Sample, 2023 - World

Dataset updated
Jul 7, 2023
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World
Description

Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

ssd

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
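The two-stage design described above can be sketched as follows. The data structures are hypothetical stand-ins (the actual procedure is in the R script distributed with the dataset):

```python
import random

def two_stage_sample(households_by_ea_by_stratum, n_households=8000, per_ea=25):
    """Two-stage sample: allocate enumeration areas (EAs) to strata
    proportionally to stratum size, then draw a fixed number of households
    within each selected EA. Input maps stratum -> {ea_id: [household_ids]}."""
    n_eas = n_households // per_ea  # e.g. 320 EAs for 8,000 households
    total_eas = sum(len(eas) for eas in households_by_ea_by_stratum.values())
    sample = []
    for eas in households_by_ea_by_stratum.values():
        n_take = round(n_eas * len(eas) / total_eas)  # proportional allocation
        for ea_id in random.sample(sorted(eas), min(n_take, len(eas))):
            sample.extend(random.sample(eas[ea_id], per_ea))
    return sample
```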

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are typically collected through sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.

Response rate

This is a synthetic dataset; the "response rate" is 100%.
