98 datasets found
  1. Creating_simple_Sintetic_dataset

    • kaggle.com
    zip
    Updated Jan 20, 2025
    Cite
    Lala Ibadullayeva (2025). Creating_simple_Sintetic_dataset [Dataset]. https://www.kaggle.com/datasets/lalaibadullayeva/creating-simple-sintetic-dataset
    Explore at:
    zip (476698 bytes)
    Dataset updated
    Jan 20, 2025
    Authors
    Lala Ibadullayeva
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Description

    Overview: This dataset contains three distinct fake datasets generated using the Faker and Mimesis libraries. These libraries are commonly used for generating realistic-looking synthetic data for testing, prototyping, and data science projects. The datasets were created to simulate real-world scenarios while ensuring no sensitive or private information is included.

    Data Generation Process: The data creation process is documented in the accompanying notebook, Creating_simple_Sintetic_data.ipynb. This notebook showcases the step-by-step procedure for generating synthetic datasets with customizable structures and fields using the Faker and Mimesis libraries.

    File Contents:

    Datasets: CSV files containing the three synthetic datasets. Notebook: Creating_simple_Sintetic_data.ipynb detailing the data generation process and the code used to create these datasets.
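
    The notebook itself is not reproduced in this listing. As a purely illustrative sketch (not the author's code; the column names and row count are arbitrary), Faker and Mimesis can be combined to write a small synthetic CSV like so:

    import csv
    from faker import Faker      # pip install Faker
    from mimesis import Person   # pip install mimesis

    fake = Faker()
    person = Person()

    # Write a small synthetic table; the columns are placeholders, not the dataset's schema.
    with open("synthetic_people.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "address", "occupation"])
        for _ in range(100):
            writer.writerow([
                person.full_name(),                   # Mimesis: realistic full name
                fake.email(),                         # Faker: random email address
                fake.address().replace("\n", ", "),   # Faker: postal address on one line
                person.occupation(),                  # Mimesis: random occupation
            ])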

  2. LLM Prompt Recovery - Synthetic Datastore

    • kaggle.com
    zip
    Updated Feb 29, 2024
    Cite
    Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
    Explore at:
    zip (988448 bytes)
    Dataset updated
    Feb 29, 2024
    Authors
    Darien Schettler
    License

    https://www.licenses.ai/ai-licenses

    Description

    High Level Description

    This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.

    Contributors

    Please go upvote these other datasets as my work is not possible without them

    First Dataset - 1000 Examples From @thedrcat

    Update 1 - February 29, 2024

    The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

    The file below is the file Darek created, with two additional columns appended: the first is the raw output of Gemma 7B-IT (generated per the instructions below, vs. the 2B-IT model Darek used), and the second is the same output with the leading 'Sure... blah blah' sentence removed.

    I generated things using the following setup:

    # I used a vLLM server to host Gemma 7B on paperspace (A100)
    
    # Step 1 - Install vLLM
    >>> pip install vllm
    
    # Step 2 - Authenticate HuggingFace CLI (for model weights)
    >>> huggingface-cli login --token
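
    The listing truncates the setup here. For context only, here is a hedged sketch of what the remaining steps might look like, assuming vLLM's OpenAI-compatible server and the google/gemma-7b-it weights (the port, prompt, and generation settings are placeholders, not the author's actual configuration):

    # Step 3 (assumed) - Serve Gemma 7B-IT with vLLM's OpenAI-compatible server:
    #   python -m vllm.entrypoints.openai.api_server --model google/gemma-7b-it

    # Step 4 (assumed) - Query the local server for each row of gemma1000.csv:
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="google/gemma-7b-it",
        messages=[{"role": "user", "content": "Rewrite the following text ..."}],  # placeholder prompt
        max_tokens=512,
    )
    print(response.choices[0].message.content)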
    
  3. Synthetic Data for Khmer Word Detection

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Chanveasna ENG (2025). Synthetic Data for Khmer Word Detection [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/synthetic-data-for-khmer-word-detection
    Explore at:
    zip (8863660119 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Chanveasna ENG
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Synthetic Data for Khmer Word Detection

    This dataset contains 10,000 synthetic images and corresponding bounding box labels for training object detection models to detect Khmer words.

    The dataset is generated using a custom tool designed to create diverse and realistic training data for computer vision tasks, especially where real annotated data is scarce.

    ✨ Highlights

    • 100,000 images (.png) with random backgrounds and styles.
    • Bounding boxes provided in YOLO (.txt) and Pascal VOC (.xml) formats.
    • 50+ real background images + unlimited random background colors.
    • 250+ different Khmer fonts.
    • Randomized effects: brightness, contrast, blur, color jitter, and more.
    • Wide variety of text sizes, positions, and layouts.

    📂 Folder Structure

    /
    ├── synthetic_images/   # Synthetic images (.png)
    ├── synthetic_labels/   # YOLO format labels (.txt)
    ├── synthetic_xml_labels/ # Pascal VOC format labels (.xml)
    

    Each image has corresponding .txt and .xml files with the same filename.

    📏 Annotation Formats

    • YOLO Format (.txt):
      Each line represents one word, in the format: class_id center_x center_y width height. All values are normalized between 0 and 1.
      Example: 0 0.235 0.051 0.144 0.081

    • Pascal VOC Format (.xml):
      Standard XML structure containing image metadata and bounding box coordinates (absolute pixel values); a sketch of the YOLO-to-VOC coordinate conversion is shown below.
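
    To illustrate how the two annotation formats relate, the following minimal sketch (filenames and paths are placeholders, not taken from the dataset) reads a YOLO label file and converts each normalized box to the absolute pixel corners that a Pascal VOC file stores:

    from PIL import Image

    def yolo_to_voc_boxes(label_path, image_path):
        """Convert normalized YOLO boxes to absolute (xmin, ymin, xmax, ymax) pixel boxes."""
        width, height = Image.open(image_path).size
        boxes = []
        with open(label_path) as f:
            for line in f:
                class_id, cx, cy, w, h = line.split()
                cx, w = float(cx) * width, float(w) * width
                cy, h = float(cy) * height, float(h) * height
                boxes.append((int(class_id),
                              round(cx - w / 2), round(cy - h / 2),
                              round(cx + w / 2), round(cy + h / 2)))
        return boxes

    # Placeholder filenames following the folder structure above:
    # print(yolo_to_voc_boxes("synthetic_labels/00001.txt", "synthetic_images/00001.png"))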

    🖼️ Image Samples

    Each image contains random Khmer words placed naturally over backgrounds, with different font styles, sizes, and visual effects.
    The dataset was carefully generated to simulate real-world challenges like:

    • Different lighting conditions
    • Different text sizes
    • Motion blur and color variations

    🧠 Use Cases

    • Train YOLOv5, YOLOv8, EfficientDet, and other object detection models.
    • Fine-tune OCR (Optical Character Recognition) systems for Khmer language.
    • Research on low-resource language computer vision tasks.
    • Data augmentation for scene text detection.

    ⚙️ How It Was Generated

    1. A random real-world background or random color is chosen.
    2. Random Khmer words are selected from a large cleaned text file.
    3. Words are rendered with random font, size, color, spacing, and position.
    4. Image effects like motion blur and color jitter are randomly applied.
    5. Bounding boxes are automatically generated for each word (a toy rendering sketch follows below).
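
    The custom generation tool is not included in this listing. Purely as a toy illustration of steps 3 and 5 above (the font path, word, canvas size, and colour ranges are placeholder assumptions, not the dataset's actual settings), Pillow can render a word and recover its bounding box like this:

    import random
    from PIL import Image, ImageDraw, ImageFont

    def render_word(word, font_path, canvas=(640, 640)):
        """Render one word at a random position and return the image plus its pixel bounding box."""
        background = tuple(random.randint(0, 255) for _ in range(3))
        img = Image.new("RGB", canvas, color=background)
        draw = ImageDraw.Draw(img)
        font = ImageFont.truetype(font_path, size=random.randint(24, 72))
        xy = (random.randint(0, canvas[0] // 2), random.randint(0, canvas[1] // 2))
        draw.text(xy, word, font=font, fill=(0, 0, 0))
        bbox = draw.textbbox(xy, word, font=font)  # becomes the word's detection label
        return img, bbox

    # Placeholder font and word; a real run would pick from the 250+ Khmer fonts and the cleaned corpus.
    # img, bbox = render_word("ខ្មែរ", "fonts/KhmerOS.ttf")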

    🧹 Data Cleaning

    • Words were sourced from a cleaned Khmer corpus to avoid duplicates and garbage data.
    • Fonts were tested to make sure they render Khmer characters properly.

    📢 Important Notes

    • This dataset is synthetic. While it simulates real-world conditions, it may not fully replace real-world labeled data for final model evaluation.
    • All labels assume one class only (i.e., "word" = class_id 0).

    ❤️ Credits

    📈 Future Updates

    We plan to release:

    • Datasets with rotated bounding boxes for detecting skewed text.
    • More realistic mixing of real-world backgrounds and synthetic text.
    • Advanced distortions (e.g., handwriting-like simulation).

    Stay tuned!

    📜 License

    This project is licensed under the MIT license.

    Please credit the original authors when using this data and provide a link to this dataset.

    ✉️ Contact

    If you have any questions or want to collaborate, feel free to reach out:

  4. Data archive for paper "Copula-based synthetic data augmentation for...

    • zenodo.org
    zip
    Updated Mar 15, 2022
    Cite
    David Meyer; David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5150327
    Explore at:
    zip
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Meyer
    Description

    Overview

    This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

    For the Python tool used to generate synthetic data, please refer to Synthia.

    Requirements

    Although PBS is not a strict requirement, it is required to run all helper scripts included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify PBS parameters at the top of submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

    Usage

    To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

    qsub hpc/fit.sh

    then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

    qsub hpc/stats.sh
    qsub hpc/ml_control.sh
    qsub hpc/ml_synth.sh

    Finally, to plot all artifacts included in the paper use:

    qsub hpc/plot.sh

    Licence

    Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.

  5. replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Apr 21, 2023
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849595
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Imperial College London
    The Pocket Dimension, Munich
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the pose-estimation experiments in the replicAnt - generating annotated images of animals in complex environments using Unreal Engine manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt peerj.com/articles/11155/. All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform served as representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.

    For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded with 55 fps, and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.

    The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.

    Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. 46 key points in 805 and 200 frames for the platform and handheld case, respectively, were subsequently hand-annotated using the DLC annotation GUI.

    Synthetic data

    We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  6. pythonic-function-calling

    • huggingface.co
    Updated Feb 14, 2025
    Cite
    Dria (2025). pythonic-function-calling [Dataset]. https://huggingface.co/datasets/driaforall/pythonic-function-calling
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 14, 2025
    Dataset authored and provided by
    Dria
    Description

    Pythonic Function Calling Dataset

    This dataset contains synthetic data used for training the Pythonic function calling models Dria-Agent-a-3B and Dria-Agent-a-7B. Dria is a Python framework for generating synthetic data on globally connected edge devices with 50+ models. See the network here.

      Dataset Summary
    

    The dataset includes various examples of function calling scenarios, ranging from simple to complex multi-turn interactions. It was generated synthetically using the… See the full description on the dataset page: https://huggingface.co/datasets/driaforall/pythonic-function-calling.

  7. Data from: Domain-adaptive Data Synthesis for Large-scale Supermarket...

    • zenodo.org
    zip
    Updated Apr 5, 2024
    Cite
    Julian Strohmayer; Julian Strohmayer; Martin Kampel; Martin Kampel (2024). Domain-adaptive Data Synthesis for Large-scale Supermarket Product Recognition [Dataset]. http://doi.org/10.5281/zenodo.7750242
    Explore at:
    zip
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julian Strohmayer; Martin Kampel
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition

    This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].

    Data Synthesis Pipeline:

    We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline (pipeline.zip), accompanied by the FastCUT models used for synthetic-to-real domain translation (models.zip). For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated either by executing load.py from within Blender or by running it from a command-line terminal as a background process.
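
    As a purely illustrative sketch of the conventions described above (the example GTINs and dimensions are the ones quoted in the description, not an actual assortment), an assortment list and matching product images could be prepared as follows:

    import csv
    from pathlib import Path

    # Placeholder assortment rows in the expected [c, w, d, h] format (dimensions in mm).
    assortment = [
        ["4004218143128", 140, 70, 160],
        ["9120050882171", 60, 60, 120],
    ]

    products = Path("products")
    (products / "assortment").mkdir(parents=True, exist_ok=True)
    (products / "img").mkdir(parents=True, exist_ok=True)

    with open(products / "assortment" / "assortment.csv", "w", newline="") as f:
        csv.writer(f).writerows(assortment)

    # The pipeline expects one product image per class label, named c.png.
    for c, *_dims in assortment:
        image = products / "img" / f"{c}.png"
        if not image.exists():
            print(f"missing product image: {image}")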

    Datasets:

    • SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SGI3k - Synthetic GroZi-3.2k (SGI3k) dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprised of 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.
    • SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels for all product instances.

    Table 1: Dataset characteristics.

    Dataset | #images | #products | #instances | labels                        | translation
    SG3k    | 10,000  | 3,234     | 851,801    | bounding box & generic class¹ | none
    SG3kt   | 10,000  | 3,234     | 851,801    | bounding box & generic class¹ | GroZi-3.2k
    SGI3k   | 10,000  | 1,063     | 838,696    | bounding box & generic class² | none
    SGI3kt  | 10,000  | 1,063     | 838,696    | bounding box & generic class² | GroZi-3.2k
    SPS8k   | 16,224  | 8,112     | 1,981,967  | bounding box & GTIN           | none
    SPS8kt  | 16,224  | 8,112     | 1,981,967  | bounding box & GTIN           | SKU110k

    Sample Format

    A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].

    ¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).

    ²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.

    Download and Use
    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.

    BibTeX citation:

    @inproceedings{strohmayer2023domain,
     title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition},
     author={Strohmayer, Julian and Kampel, Martin},
     booktitle={International Conference on Computer Analysis of Images and Patterns},
     pages={239--250},
     year={2023},
     organization={Springer}
    }
  8. replicAnt - Plum2023 - 3D Models - Unreal Engine 5

    • data-staging.niaid.nih.gov
    • nde-dev.biothings.io
    Updated Apr 20, 2023
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - 3D Models - Unreal Engine 5 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7849058
    Explore at:
    Dataset updated
    Apr 20, 2023
    Dataset provided by
    Imperial College London
    The Pocket Dimension, Munich
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the 3D models used to generate all synthetic data presented in the replicAnt - generating annotated images of animals in complex environments using Unreal Engine manuscript. The models have been generated with the open-source photogrammetry platform scAnt peerj.com/articles/11155/ and have been pre-processed and converted into Unreal Engine 5 compatible .uasset files, to be used with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  9. replicAnt - Plum2023 - Segmentation Datasets and Trained Models

    • data-staging.niaid.nih.gov
    Updated Apr 21, 2023
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Segmentation Datasets and Trained Models [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7849569
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Imperial College London
    The Pocket Dimension, Munich
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the semantic and instance segmentation experiments in the replicAnt - generating annotated images of animals in complex environments using Unreal Engine manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt peerj.com/articles/11155/. All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Semantic and instance segmentation is only rarely used for non-human animals, partially due to the laborious process of curating sufficiently large annotated datasets. replicAnt can produce pixel-perfect segmentation maps with minimal manual effort. In order to assess the quality of the segmentations inferred by networks trained with these maps, semi-quantitative verification was conducted using a set of macro-photographs of Leptoglossus zonatus (Dallas, 1852) and Leptoglossus phyllopus (Linnaeus, 1767), provided by Prof. Christine Miller (University of Florida) and Royal Tyler (Bugwood.org). For further qualitative assessment of instance segmentation, we used laboratory footage and field photographs of Atta vollenweideri provided by Prof. Flavio Roces. More extensive quantitative validation was infeasible, due to the considerable effort involved in hand-annotating larger datasets on a per-pixel basis.

    Synthetic data

    We generated two synthetic datasets from a single 3D scanned Leptoglossus zonatus (Dallas, 1852) specimen: one using the default pipeline, and one with additional plant assets, spawned by three dedicated scatterers. The plant assets were taken from the Quixel library and include 20 grass and 11 fern and shrub assets. Two dedicated grass scatterers were configured to spawn between 10,000 and 100,000 instances; the fern and shrub scatterer spawned between 500 and 10,000 instances. A total of 10,000 samples were generated for each sub-dataset, leading to a combined dataset comprising 20,000 image render and ID passes. The addition of plant assets was necessary, as many of the macro-photographs also contained truncated plant stems or similar fragments, which networks trained on the default data struggled to distinguish from insect body segments. The ability to simply supplement the asset library underlines one of the main strengths of replicAnt: training data can be tailored to specific use cases with minimal effort.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  10. Instella-GSM8K-synthetic

    • huggingface.co
    Updated Nov 18, 2025
    Cite
    AMD (2025). Instella-GSM8K-synthetic [Dataset]. https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 18, 2025
    Dataset authored and provided by
    AMD
    License

    https://choosealicense.com/licenses/other/

    Description

    Instella-GSM8K-synthetic

    The Instella-GSM8K-synthetic dataset was used in the second-stage pre-training of the Instella-3B model, which was trained on top of the Instella-3B-Stage1 model. This synthetic dataset was generated using the training set of the GSM8K dataset, where we first used Qwen2.5-72B-Instruct to:

    • Abstract numerical values as function parameters and generate a Python program to solve the math question.
    • Identify and replace numerical values in the existing question with… See the full description on the dataset page: https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic.

  11. Customer Information Simluation Dataset

    • kaggle.com
    zip
    Updated Oct 15, 2025
    Cite
    Nidarshana P S (2025). Customer Information Simluation Dataset [Dataset]. https://www.kaggle.com/datasets/psnidarshana/customer-information-simluation-dataset
    Explore at:
    zip (4635 bytes)
    Dataset updated
    Oct 15, 2025
    Authors
    Nidarshana P S
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains synthetic customer profile information created using the Python Faker library. It is designed to help students interested in machine learning, model building, visualization, and the use of Python libraries to create datasets that do not involve any personal information.

    Data features include: name, email ID, phone number, location, and profession.

    Use cases: data cleansing and preprocessing practice; exploratory data analysis; machine learning and model training.

  12. OpenDataGen-factuality-en-v0.1

    • huggingface.co
    Updated Feb 9, 2024
    Cite
    Thomas (2024). OpenDataGen-factuality-en-v0.1 [Dataset]. https://huggingface.co/datasets/thoddnn/OpenDataGen-factuality-en-v0.1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 9, 2024
    Authors
    Thomas
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This synthetic dataset was generated using the Open DataGen Python library. (https://github.com/thoddnn/open-datagen)

      Methodology:
    

    • Retrieve random article content from the HuggingFace Wikipedia English dataset.
    • Construct a Chain of Thought (CoT) to generate a Multiple Choice Question (MCQ).
    • Use a Large Language Model (LLM) to score the results, then filter them.

    All these steps are prompted in the 'template.json' file located in the specified code folder. Code:… See the full description on the dataset page: https://huggingface.co/datasets/thoddnn/OpenDataGen-factuality-en-v0.1.

  13. 6DOF pose estimation - synthetically generated dataset using BlenderProc

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Jul 11, 2025
    Cite
    Divyam Sheth (2025). 6DOF pose estimation - synthetically generated dataset using BlenderProc [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hj5
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Divyam Sheth
    Time period covered
    Jan 1, 2023
    Description

    Accurate and robust 6DOF (Six Degrees of Freedom) pose estimation is a critical task in various fields, including computer vision, robotics, and augmented reality. This research paper presents a novel approach to enhance the accuracy and reliability of 6DOF pose estimation by introducing a robust method for generating synthetic data and leveraging the ease of multi-class training using the generated dataset. The proposed method tackles the challenge of insufficient real-world annotated data by creating a large and diverse synthetic dataset that accurately mimics real-world scenarios. The proposed method only requires a CAD model of the object and there is no limit to the number of unique data that can be generated. Furthermore, a multi-class training strategy that harnesses the synthetic dataset's diversity is proposed and presented. This approach mitigates class imbalance issues and significantly boosts accuracy across varied object classes and poses. Experimental results underscore th...

    This dataset has been synthetically generated using 3D software like Blender and APIs like BlenderProc.

    Data Repository README

    This repository contains data organized into a structured format. The data consists of three main folders and two files, each serving a specific purpose. The data contains two folders - Cat and Hand.

    Cat Dataset: 63492 labeled data with images, masks, and poses.

    Hand Dataset: 42418 labeled data with images, masks, and poses.

    Usage: The dataset is ready for use by simply extracting the contents of the zip file, whether for training in a segmentation task or a pose estimation task.

    To view .npy files you will need to use Python with the numpy package installed. In Python use the following commands.

    import numpy
    data = numpy.load('file.npy')
    print(data)

    What free/open software is appropriate for viewing the .ply files?
    These files can be opened using any 3D modeling software like Blender, Meshlab, etc.

    Camera Intrinsics Matrix Format:

    Fx  0   px
    0   Fy  py
    0   0   1

    Below is an overview of the data organization:

    Folder Structure

    1. Rgb:
      • This ...
  14. A dataset of 1500-word stories generated by gpt-4o-mini for 236...

    • dataverse.no
    • dataverse.azure.uit.no
    Updated May 28, 2025
    Cite
    Jill Walker Rettberg; Jill Walker Rettberg; Hermann Wigers; Hermann Wigers (2025). A dataset of 1500-word stories generated by gpt-4o-mini for 236 nationalities [Dataset]. http://doi.org/10.18710/VM2K4O
    Explore at:
    text/comma-separated-values (18583), txt (19740), zip (42408986)
    Dataset updated
    May 28, 2025
    Dataset provided by
    DataverseNO
    Authors
    Jill Walker Rettberg; Hermann Wigers
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Dataset funded by
    European Research Council
    Research Council of Norway
    Description

    We created a dataset of stories generated by OpenAI’s gpt-4o-mini, using a Python script to construct prompts that were sent to the OpenAI API. We used Statistics Norway’s list of 252 countries, added demonyms for each country (for example, Norwegian for Norway), and removed countries without demonyms, leaving us with 236 countries. Our base prompt was “Write a 1500 word potential {demonym} story”, and we generated 50 stories for each country. The scripts used to generate the data, and additional scripts for analysis, are available at the GitHub repository https://github.com/MachineVisionUiB/GPT_stories
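
    The actual scripts are in the GitHub repository linked above; the following is only a hedged sketch of the prompt-construction loop described here (the model name and base prompt come from the description, while the demonym list, output paths, and client settings are placeholders):

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    demonyms = ["Norwegian", "Kenyan", "Peruvian"]  # placeholder; the study used 236 demonyms
    STORIES_PER_COUNTRY = 50

    for demonym in demonyms:
        prompt = f"Write a 1500 word potential {demonym} story"  # base prompt from the description
        for i in range(STORIES_PER_COUNTRY):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            with open(f"stories/{demonym}_{i:02d}.txt", "w", encoding="utf-8") as f:
                f.write(response.choices[0].message.content)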

  15. MatSeg: Material State Segmentation Dataset and Benchmark

    • zenodo.org
    zip
    Updated May 22, 2025
    Cite
    Zenodo (2025). MatSeg: Material State Segmentation Dataset and Benchmark [Dataset]. http://doi.org/10.5281/zenodo.11331618
    Explore at:
    zip
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    MatSeg Dataset and benchmark for zero-shot material state segmentation.

    The MatSeg Benchmark, containing 1220 real-world images and their annotations, is available in MatSeg_Benchmark.zip; the file contains documentation and Python readers.

    The MatSeg dataset, containing synthetic images with infused natural image patterns, is available as MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).

    MatSeg3D_part_*.zip: contains synthetic 3D scenes

    MatSeg2D_part_*.zip: contains synthetic 2D scenes

    Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip

    Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip

    The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072

    Additional permanent sources for downloading the dataset and metadata: 1, 2

    Evaluation scripts for the Benchmark are now available at:

    https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX

    Description

    Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.

    Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.

    The MatSeg dataset focuses on zero-shot segmentation of materials and their states, meaning identifying the region of an image belonging to a specific material type or state, without previous knowledge of or training on the material type, states, or environment.

    The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.

    Benchmark

    The benchmark contains 1220 real-world images with a wide range of material states and settings: for example, food states (cooked/burned), plants (infected/dry), rocks/soil (minerals/sediment), construction/metals (rusted, worn), and liquids (foam/sediment), among many other states, without being limited to a fixed set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of, or pretraining on, the material or setting. The focus is on materials with complex scattered boundaries and gradual transitions (like the level of wetness of a surface).

    Evaluation scripts for the Benchmark are now available at: 1 and 2.

    Synthetic Dataset

    The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data is infused with patterns, materials, and textures automatically extracted from real images, allowing it to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.

    License

    This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.

    The MatSeg 2D and 3D synthetic data were generated using the Open Images dataset, which is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets, which carry a GNU license.

    Example Usage:

    An example of training and evaluation code for a network trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2. This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.

    Paper:

    More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".

    Croissant metadata and additional sources for downloading the dataset are available at 1,2

  16. Data from: Synthetic Datasets for Numeric Uncertainty Quantification

    • figshare.com
    zip
    Updated Aug 28, 2021
    Cite
    Hussain Mohammed Kabir (2021). Synthetic Datasets for Numeric Uncertainty Quantification [Dataset]. http://doi.org/10.6084/m9.figshare.16528650.v1
    Explore at:
    zip
    Dataset updated
    Aug 28, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hussain Mohammed Kabir
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Synthetic Datasets for Numeric Uncertainty Quantification

    The Source of the Dataset with Generation Script

    We generate these synthetic datasets with the help of the following Python script on Kaggle: https://www.kaggle.com/dipuk0506/toy-dataset-for-regression-and-uq

    How to Use the Datasets

    Train Shallow NNs: The following notebook presents how to train shallow NNs: https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets. Version-N of the notebook applies a shallow NN to Data-N.

    Train RVFL: The following notebook presents how to train Random Vector Functional Link (RVFL) networks: https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets. Version-N of the notebook applies an RVFL network to Data-N.

  17. verifiable-pythonic-function-calling-lite

    • huggingface.co
    Updated Feb 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dria (2025). verifiable-pythonic-function-calling-lite [Dataset]. https://huggingface.co/datasets/driaforall/verifiable-pythonic-function-calling-lite
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 14, 2025
    Dataset authored and provided by
    Dria
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Verifiable Pythonic Function Calling Lite

    This dataset is a subset of the pythonic function calling dataset used for training the Pythonic function calling models Dria-Agent-a-3B and Dria-Agent-a-7B. Dria is a Python framework for generating synthetic data on globally connected edge devices with 50+ models. See the network here.

      Dataset Summary
    

    The dataset includes various examples of function calling scenarios, ranging from simple to complex multi-turn interactions. It… See the full description on the dataset page: https://huggingface.co/datasets/driaforall/verifiable-pythonic-function-calling-lite.

  18. Regression datasets with different noise levels

    • kaggle.com
    zip
    Updated Jun 28, 2024
    Cite
    Igor Vladimirovich Lapshin (2024). Regression datasets with different noise levels [Dataset]. https://www.kaggle.com/datasets/igorvlapshin/regression-datasets-with-different-noise-levels/data
    Explore at:
    zip (6628 bytes)
    Dataset updated
    Jun 28, 2024
    Authors
    Igor Vladimirovich Lapshin
    Description

    Despite the huge number of datasets that can be found on various websites, it is sometimes useful to have data with different levels of noise to set up your model. Python is an excellent tool for creating such datasets. Several samples are presented here. In the Code section, I've uploaded a notebook that shows the process of creating, validating, and saving a dataset.
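
    The notebook is not shown in this listing; here is a minimal sketch of the general idea (not the author's code; the target function, sample size, and noise levels are arbitrary) using NumPy and pandas:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    x = rng.uniform(-3, 3, size=500)
    y_clean = np.sin(x) + 0.5 * x            # placeholder ground-truth function

    # One CSV per noise level; noise is Gaussian with the given standard deviation.
    for sigma in (0.0, 0.1, 0.5, 1.0):
        y = y_clean + rng.normal(0.0, sigma, size=x.shape)
        pd.DataFrame({"x": x, "y": y}).to_csv(f"regression_noise_{sigma}.csv", index=False)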

  19. Data archive for paper "Machine Learning Emulation of 3D Cloud Radiative...

    • zenodo.org
    zip
    Updated Mar 15, 2022
    Cite
    David Meyer; David Meyer (2022). Data archive for paper "Machine Learning Emulation of 3D Cloud Radiative Effects" [Dataset]. http://doi.org/10.5281/zenodo.4625414
    Explore at:
    zip
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Meyer
    Description

    Overview

    This is the data archive for the paper "Machine Learning Emulation of 3D Cloud Radiative Effects". It contains the paper’s data archive with all model outputs (see the results folder) as well as the Singularity image.

    For the development repository (i.e. only containing source files without generated artefacts), please see https://github.com/dmey/ml-3d-cre.

    For the Python tool to generate the synthetic data, please refer to the Synthia repository.

    Requirements

    Although PBS is not a strict requirement, it is required to run all helper scripts included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify PBS parameters at the top of submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=24:00:00).

    Initialization

    Deflate the data archive with:

    ./init.sh
    

    Build the Singularity image with:

    singularity build --remote tools/singularity/image.sif tools/singularity/image.def
    

    Compile ecRad with Singularity:

    ./tools/singularity/compile_ecrad.sh
    

    Usage

    To reproduce the results as described in the paper, run the following commands from the hpc folder:

    qsub -v JOB_NAME=mlp_synthia ./submit_grid_search_synthia.sh
    qsub -v JOB_NAME=mlp_default ./submit_grid_search_default.sh
    qsub submit_benchmark.sh
    

    then, to plot stats and identify notebooks run:

    qsub submit_stats.sh
    

    Local development

    For local development, notebooks can be run independently. To install the required dependencies, run the following through Anaconda: conda env create -f environment.yml. Then, to activate the environment, use conda activate radiation. For ecRad, the system dependencies are listed in tools/singularity/image.def and can be installed by running tools/singularity/compile_ecrad.sh.

    Licence

    Paper code released under the MIT license. Data released under CC BY 4.0. ecRad released under the Apache 2.0 license.

  20. replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Apr 21, 2023
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7849416
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Imperial College London
    The Pocket Dimension, Munich
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the replicAnt - generating annotated images of animals in complex environments using Unreal Engine manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt peerj.com/articles/11155/. All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

    The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

    Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.

    Synthetic data generation

    Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.

    A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.

    Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

    Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied from 10/1 to 1/100.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
