100+ datasets found
  1. Creating_simple_Sintetic_dataset

    • kaggle.com
    zip
    Updated Jan 20, 2025
    Cite
    Lala Ibadullayeva (2025). Creating_simple_Sintetic_dataset [Dataset]. https://www.kaggle.com/datasets/lalaibadullayeva/creating-simple-sintetic-dataset
    Explore at:
    Available download formats: zip (476698 bytes)
    Dataset updated
    Jan 20, 2025
    Authors
    Lala Ibadullayeva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

    Overview: This dataset contains three distinct fake datasets generated using the Faker and Mimesis libraries. These libraries are commonly used for generating realistic-looking synthetic data for testing, prototyping, and data science projects. The datasets were created to simulate real-world scenarios while ensuring no sensitive or private information is included.

    Data Generation Process: The data creation process is documented in the accompanying notebook, Creating_simple_Sintetic_data.ipynb. This notebook showcases the step-by-step procedure for generating synthetic datasets with customizable structures and fields using the Faker and Mimesis libraries.
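
    The notebook itself is not reproduced in this listing. As a minimal sketch of the kind of Faker/Mimesis generation it describes (field names, row count, and output file are illustrative assumptions, not taken from the dataset):

    # Illustrative sketch only; field names, row count, and output file are assumptions.
    import csv
    from faker import Faker
    from mimesis import Person

    fake = Faker()
    person = Person()

    rows = [{
        "name": fake.name(),                              # Faker: realistic-looking full name
        "email": person.email(),                          # Mimesis: synthetic email address
        "address": fake.address().replace("\n", ", "),    # Faker: one-line postal address
        "occupation": person.occupation(),                # Mimesis: job title
    } for _ in range(100)]

    with open("synthetic_people.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)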

    File Contents:

    • Datasets: CSV files containing the three synthetic datasets.
    • Notebook: Creating_simple_Sintetic_data.ipynb, detailing the data generation process and the code used to create these datasets.

  2. Synthetic Data for Khmer Word Detection

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Chanveasna ENG (2025). Synthetic Data for Khmer Word Detection [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/synthetic-data-for-khmer-word-detection
    Explore at:
    Available download formats: zip (8863660119 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Chanveasna ENG
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic Data for Khmer Word Detection

    This dataset contains 10,000 synthetic images and corresponding bounding box labels for training object detection models to detect Khmer words.

    The dataset is generated using a custom tool designed to create diverse and realistic training data for computer vision tasks, especially where real annotated data is scarce.

    ✨ Highlights

    • 100,000 images (.png) with random backgrounds and styles.
    • Bounding boxes provided in YOLO (.txt) and Pascal VOC (.xml) formats.
    • 50+ real background images + unlimited random background colors.
    • 250+ different Khmer fonts.
    • Randomized effects: brightness, contrast, blur, color jitter, and more.
    • Wide variety of text sizes, positions, and layouts.

    📂 Folder Structure

    /
    ├── synthetic_images/   # Synthetic images (.png)
    ├── synthetic_labels/   # YOLO format labels (.txt)
    ├── synthetic_xml_labels/ # Pascal VOC format labels (.xml)
    

    Each image has corresponding .txt and .xml files with the same filename.

    📏 Annotation Formats

    • YOLO Format (.txt):
      Each line represents a word, with format: class_id center_x center_y width height. All values are normalized between 0 and 1.
      Example: 0 0.235 0.051 0.144 0.081

    • Pascal VOC Format (.xml):
      Standard XML structure containing image metadata and bounding box coordinates (absolute pixel values).
      (An illustrative Pascal VOC sample and a parsing sketch for both formats are shown below.)
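
    A minimal reading sketch for both formats (the YOLO line is the example above; the image size and the Pascal VOC sample are assumptions following the standard schemas, not files from this dataset):

    # Illustrative sketch; the image size and the VOC sample below are assumptions.
    import xml.etree.ElementTree as ET

    IMG_W, IMG_H = 640, 640   # assumed image size for the YOLO conversion

    def yolo_to_pixels(line, img_w, img_h):
        # Convert one normalized YOLO line to (class_id, xmin, ymin, xmax, ymax).
        cls, cx, cy, w, h = line.split()
        cx, cy, w, h = float(cx) * img_w, float(cy) * img_h, float(w) * img_w, float(h) * img_h
        return int(cls), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    print(yolo_to_pixels("0 0.235 0.051 0.144 0.081", IMG_W, IMG_H))

    SAMPLE_VOC = """<annotation>
      <size><width>640</width><height>640</height><depth>3</depth></size>
      <object>
        <name>word</name>
        <bndbox><xmin>120</xmin><ymin>15</ymin><xmax>212</xmax><ymax>47</ymax></bndbox>
      </object>
    </annotation>"""

    root = ET.fromstring(SAMPLE_VOC)
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        print(obj.findtext("name"),
              [int(bb.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")])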

    🖼️ Image Samples

    Each image contains random Khmer words placed naturally over backgrounds, with different font styles, sizes, and visual effects.
    The dataset was carefully generated to simulate real-world challenges like:

    • Different lighting conditions
    • Different text sizes
    • Motion blur and color variations

    🧠 Use Cases

    • Train YOLOv5, YOLOv8, EfficientDet, and other object detection models.
    • Fine-tune OCR (Optical Character Recognition) systems for Khmer language.
    • Research on low-resource language computer vision tasks.
    • Data augmentation for scene text detection.

    ⚙️ How It Was Generated

    1. A random real-world background or random color is chosen.
    2. Random Khmer words are selected from a large cleaned text file.
    3. Words are rendered with random font, size, color, spacing, and position.
    4. Image effects like motion blur and color jitter are randomly applied.
    5. Bounding boxes are automatically generated for each word.
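
    The generation tool itself is not included in this listing; the following is only an illustrative sketch of the five steps above (word list, font paths, image size, and output names are placeholder assumptions):

    # Illustrative sketch of the steps above; words, fonts, and sizes are placeholders.
    # Note: correct Khmer shaping in Pillow requires a build with libraqm support.
    import random
    from PIL import Image, ImageDraw, ImageFont, ImageFilter

    WORDS = ["សួស្តី", "ពិភពលោក"]             # placeholder Khmer words (step 2)
    FONT_PATHS = ["fonts/KhmerOS.ttf"]        # placeholder font file

    def make_sample(out_stem, img_w=640, img_h=640):
        # Step 1: random background colour (a real background photo could be pasted instead).
        img = Image.new("RGB", (img_w, img_h), tuple(random.randint(0, 255) for _ in range(3)))
        draw = ImageDraw.Draw(img)
        labels = []
        for _ in range(random.randint(1, 5)):
            # Steps 2-3: random word, font, size, colour, and position.
            word = random.choice(WORDS)
            font = ImageFont.truetype(random.choice(FONT_PATHS), random.randint(20, 80))
            x, y = random.randint(0, img_w - 200), random.randint(0, img_h - 100)
            draw.text((x, y), word, font=font, fill=tuple(random.randint(0, 255) for _ in range(3)))
            # Step 5: bounding box of the rendered text, converted to normalized YOLO values.
            x0, y0, x1, y1 = draw.textbbox((x, y), word, font=font)
            cx, cy = (x0 + x1) / 2 / img_w, (y0 + y1) / 2 / img_h
            labels.append(f"0 {cx:.4f} {cy:.4f} {(x1 - x0) / img_w:.4f} {(y1 - y0) / img_h:.4f}")
        # Step 4: a simple random effect.
        img = img.filter(ImageFilter.GaussianBlur(random.uniform(0, 1.5)))
        img.save(f"{out_stem}.png")
        with open(f"{out_stem}.txt", "w") as f:
            f.write("\n".join(labels))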

    🧹 Data Cleaning

    • Words were sourced from a cleaned Khmer corpus to avoid duplicates and garbage data.
    • Fonts were tested to make sure they render Khmer characters properly.

    📢 Important Notes

    • This dataset is synthetic. While it simulates real-world conditions, it may not fully replace real-world labeled data for final model evaluation.
    • All labels assume one class only (i.e., "word" = class_id 0).

    ❤️ Credits

    📈 Future Updates

    We plan to release:

    • Datasets with rotated bounding boxes for detecting skewed text.
    • More realistic mixing of real-world backgrounds and synthetic text.
    • Advanced distortions (e.g., handwriting-like simulation).

    Stay tuned!

    📜 License

    This project is licensed under MIT license.

    Please credit the original authors when using this data and provide a link to this dataset.

    ✉️ Contact

    If you have any questions or want to collaborate, feel free to reach out:

  3. text-to-python-synthetic

    • huggingface.co
    Cite
    AI Data Advice, text-to-python-synthetic [Dataset]. https://huggingface.co/datasets/AI-Data-Advice-Comp/text-to-python-synthetic
    Explore at:
    Dataset authored and provided by
    AI Data Advice
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    AI-Data-Advice-Comp/text-to-python-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. Mimesis Fake dataset

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    Abhik Dhar (2023). Mimesis Fake dataset [Dataset]. https://www.kaggle.com/datasets/abhikdhar/mimesis-fake-dataset
    Explore at:
    Available download formats: zip (1074507 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    Abhik Dhar
    Description

    This dataset provides unique features that can be used for data analytics education and model experimentation. It is a synthetic dataset created artificially with the help of the Mimesis module in Python.

  5. SDNist v1.3: Temporal Map Challenge Environment

    • datasets.ai
    • gimi9.com
    • +2more
    Updated Jan 24, 2022
    + more versions
    Cite
    National Institute of Standards and Technology (2022). SDNist v1.3: Temporal Map Challenge Environment [Dataset]. https://datasets.ai/datasets/sdnist-benchmark-data-and-evaluation-tools-for-data-synthesizers
    Explore at:
    Available download formats
    Dataset updated
    Jan 24, 2022
    Dataset authored and provided by
    National Institute of Standards and Technology
    Description

    SDNist (v1.3) is a set of benchmark data and metrics for the evaluation of synthetic data generators on structured tabular data. This version (1.3) reproduces the challenge environment from Sprints 2 and 3 of the Temporal Map Challenge. These benchmarks are distributed as a simple open-source Python package to allow standardized and reproducible comparison of synthetic generator models on real-world data and use cases. These data and metrics were developed for and vetted through the NIST PSCR Differential Privacy Temporal Map Challenge, where the evaluation tools, k-marginal and Higher Order Conjunction, proved effective in distinguishing competing models in the competition environment. SDNist is available via pip (pip install sdnist==1.2.8, for Python >= 3.6) or from the USNIST GitHub repository. The sdnist Python module will download data from NIST as necessary; users are not required to download data manually.

  6. python-code-generation-synthetic

    • huggingface.co
    Updated Oct 4, 2025
    Cite
    V S Vishwas (2025). python-code-generation-synthetic [Dataset]. https://huggingface.co/datasets/v-i-s-h-w-a-s/python-code-generation-synthetic
    Explore at:
    Dataset updated
    Oct 4, 2025
    Authors
    V S Vishwas
    Description

    v-i-s-h-w-a-s/python-code-generation-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Data from: Domain-adaptive Data Synthesis for Large-scale Supermarket...

    • zenodo.org
    zip
    Updated Apr 5, 2024
    Cite
    Julian Strohmayer; Julian Strohmayer; Martin Kampel; Martin Kampel (2024). Domain-adaptive Data Synthesis for Large-scale Supermarket Product Recognition [Dataset]. http://doi.org/10.5281/zenodo.7750242
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Julian Strohmayer; Julian Strohmayer; Martin Kampel; Martin Kampel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition

    This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].

    Data Synthesis Pipeline:

    We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline (pipeline.zip), accompanied by the FastCUT models used for synthetic-to-real domain translation (models.zip). For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated by executing load.py either from within Blender or from a command-line terminal as a background process.

    Datasets:

    • SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SGI3k - Synthetic GroZi-3.2k (SGI3k) dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprised of 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.
    • SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels for all product instances.

    Table 1: Dataset characteristics.

    Dataset  #images  #products  #instances  Labels                          Translation
    SG3k     10,000   3,234      851,801     bounding box & generic class¹   none
    SG3kt    10,000   3,234      851,801     bounding box & generic class¹   GroZi-3.2k
    SGI3k    10,000   1,063      838,696     bounding box & generic class²   none
    SGI3kt   10,000   1,063      838,696     bounding box & generic class²   GroZi-3.2k
    SPS8k    16,224   8,112      1,981,967   bounding box & GTIN             none
    SPS8kt   16,224   8,112      1,981,967   bounding box & GTIN             SKU110k

    Sample Format

    A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].

    ¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).

    ²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.
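
    A small sketch of the pseudo-GTIN convention described above (the helper names are hypothetical, not part of the dataset):

    # Hypothetical helpers illustrating the i0000j pseudo-GTIN convention.
    def make_pseudo_gtin(category_i: int, image_j: int) -> str:
        # GroZi-3.2k food category number i (1-27) + "0000" + product image index j
        return f"{category_i}0000{image_j}"

    def split_pseudo_gtin(label: str) -> tuple[int, int]:
        # Assumes the literal "0000" separator is unambiguous for the given indices.
        i, j = label.split("0000", 1)
        return int(i), int(j)

    assert make_pseudo_gtin(13, 97) == "13000097"    # example from the description above
    assert split_pseudo_gtin("13000097") == (13, 97)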

    Download and Use
    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.

    BibTeX citation:

    @inproceedings{strohmayer2023domain,
     title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition},
     author={Strohmayer, Julian and Kampel, Martin},
     booktitle={International Conference on Computer Analysis of Images and Patterns},
     pages={239--250},
     year={2023},
     organization={Springer}
    }
  8. Customer Information Simluation Dataset

    • kaggle.com
    zip
    Updated Oct 15, 2025
    Cite
    Nidarshana P S (2025). Customer Information Simluation Dataset [Dataset]. https://www.kaggle.com/datasets/psnidarshana/customer-information-simluation-dataset
    Explore at:
    Available download formats: zip (4635 bytes)
    Dataset updated
    Oct 15, 2025
    Authors
    Nidarshana P S
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains synthetic customer profile information created using the Python Faker library. It is designed to help students interested in machine learning, model building, visualization, and the use of Python libraries for creating datasets that do not involve any personal information.

    Data features include: Name, Email ID, Phone number, Location, Profession.

    Use cases:
    • Data cleansing and preprocessing practice
    • Exploratory data analysis
    • Machine learning and model training

  9. Data archive for paper "Copula-based synthetic data augmentation for...

    • zenodo.org
    zip
    Updated Mar 15, 2022
    Cite
    David Meyer; David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5150327
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    David Meyer; David Meyer
    Description

    Overview

    This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

    For the Python tool used to generate synthetic data, please refer to Synthia.

    Requirements

    Although PBS is not a strict requirement, it is required to run all helper scripts included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify the PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

    Usage

    To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

    qsub hpc/fit.sh

    then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

    qsub hpc/stats.sh
    qsub hpc/ml_control.sh
    qsub hpc/ml_synth.sh

    Finally, to plot all artifacts included in the paper use:

    qsub hpc/plot.sh

    Licence

    Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.

  10. CK4Gen, High Utility Synthetic Survival Datasets

    • figshare.com
    zip
    Updated Nov 5, 2024
    Cite
    Nicholas Kuo (2024). CK4Gen, High Utility Synthetic Survival Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27611388.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:
    This repository provides high-utility synthetic survival datasets generated using the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and educational settings. Each dataset is based on a carefully curated ground truth dataset, processed with standardised variable definitions and analytical approaches, ensuring a consistent baseline for survival analysis.

    Description:
    The repository includes synthetic versions of four widely utilised and publicly accessible survival analysis datasets, each anchored in foundational studies and aligned with established ground truth variations to support robust clinical research and training.

    • GBSG2: Based on Schumacher et al. [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the lifelines package [2], formatted to match the descriptions in Sauerbrei et al. [3], which we treat as the ground truth.
    • ACTG320: Based on Hammer et al. [4]. The study investigates the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic dataset is derived from a variation of the ACTG320 dataset available in the sksurv package [5], which we treat as the ground truth dataset.
    • WHAS500: Based on Goldberg et al. [6]. The study follows 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic data replicates a variation from the sksurv package, which we treat as the ground truth dataset.
    • FLChain: Based on Dispenzieri et al. [7]. The study assesses the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is based on a variation available in the sksurv package, which we treat as the ground truth dataset.

    Notes:
    Please find an in-depth discussion of these datasets, as well as their generation process, in our paper: https://arxiv.org/abs/2410.16872
    Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." arXiv preprint arXiv:2410.16872 (2024).

    References:
    [1] Schumacher, et al. "Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group", Journal of Clinical Oncology, 1994.
    [2] Davidson-Pilon, "lifelines: Survival Analysis in Python", Journal of Open Source Software, 2019.
    [3] Sauerbrei, et al. "Modelling the effects of standard prognostic factors in node-positive breast cancer", British Journal of Cancer, 1999.
    [4] Hammer, et al. "A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less", New England Journal of Medicine, 1997.
    [5] Pölsterl, "scikit-survival: A library for time-to-event analysis built on top of scikit-learn", Journal of Machine Learning Research, 2020.
    [6] Goldberg, et al. "Incidence and case fatality rates of acute myocardial infarction (1975-1984): the Worcester Heart Attack Study", American Heart Journal, 1988.
    [7] Dispenzieri, et al. "Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population", Mayo Clinic Proceedings, 2012.
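
    The synthetic CSVs themselves are contained in the figshare archive. Purely as an orientation sketch, the ground-truth variations referenced above can be loaded from the packages named in the description (the loader functions below are those packages' public APIs, but column layouts may differ from the curated versions used by CK4Gen):

    # Sketch: load the ground-truth variations referenced above; column layouts may
    # differ from the curated versions CK4Gen actually uses.
    from lifelines.datasets import load_gbsg2                 # GBSG2 variation [2]
    from sksurv.datasets import load_whas500, load_flchain    # WHAS500 / FLChain variations [5]

    gbsg2 = load_gbsg2()                   # pandas DataFrame with time and censoring columns
    whas500_X, whas500_y = load_whas500()  # features plus a structured (event, time) array
    flchain_X, flchain_y = load_flchain()

    print(gbsg2.shape, whas500_X.shape, flchain_X.shape)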

  11. LLM Prompt Recovery - Synthetic Datastore

    • kaggle.com
    zip
    Updated Feb 29, 2024
    Cite
    Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
    Explore at:
    Available download formats: zip (988448 bytes)
    Dataset updated
    Feb 29, 2024
    Authors
    Darien Schettler
    License

    https://www.licenses.ai/ai-licenses

    Description

    High Level Description

    This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.

    Contributors

    Please go upvote these other datasets as my work is not possible without them

    First Dataset - 1000 Examples From @thedrcat

    Update 1 - February 29, 2024

    The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

    The file below is the file Darek created with two additional columns appended. The first is the output of Gemma 7B-IT (raw, based on the instructions below; vs. the 2B-IT output that Darek used), and the second is the same output with the 'Sure... blah blah' preamble sentence removed.

    I generated things using the following setup:

    # I used a vLLM server to host Gemma 7B on paperspace (A100)
    
    # Step 1 - Install vLLM
    >>> pip install vllm
    
    # Step 2 - Authenticate HuggingFace CLI (for model weights)
    >>> huggingface-cli login --token
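
    # Hedged illustration (not part of the original setup): querying such a vLLM server
    # through its OpenAI-compatible /v1/completions endpoint. Host, port, model name,
    # and the Gemma prompt formatting are assumptions.
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "google/gemma-7b-it",
            "prompt": "<start_of_turn>user\nRewrite this text ...<end_of_turn>\n<start_of_turn>model\n",
            "max_tokens": 512,
            "temperature": 0.7,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["text"])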
    
  12. dummy_health_data

    • huggingface.co
    Updated May 29, 2025
    Cite
    Mudumbai Vraja Kishore (2025). dummy_health_data [Dataset]. https://huggingface.co/datasets/vrajakishore/dummy_health_data
    Explore at:
    Dataset updated
    May 29, 2025
    Authors
    Mudumbai Vraja Kishore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset

      Overview
    

    This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.

      Data Generation
    

    The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.

  13. synthetic-energy-data

    • kaggle.com
    zip
    Updated Mar 16, 2025
    Cite
    Solomon Matthews (2025). synthetic-energy-data [Dataset]. https://www.kaggle.com/datasets/solomonmatthews/synthetic-energy-data/data
    Explore at:
    Available download formats: zip (432063 bytes)
    Dataset updated
    Mar 16, 2025
    Authors
    Solomon Matthews
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This synthetic dataset simulates energy consumption patterns and user behavior for 10,000 fictional smart households. Designed for privacy-conscious research, it mirrors real-world trends in energy usage, household demographics, and weather correlations while avoiding sensitive or identifiable information.

    Synthetic Data: Programmatically generated using Python’s Faker, Pandas, and statistical models.

    Real-World Relevance: Patterns align with benchmarks from the IEA and Indian Census.

    Use Cases: Ideal for regression, clustering, and time-series forecasting tasks.
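
    As a minimal sketch of the kind of generation pipeline described (Faker for household attributes plus a simple statistical consumption model), with column names and distributions assumed rather than taken from this dataset:

    # Assumed schema and distributions; illustrative only.
    import numpy as np
    import pandas as pd
    from faker import Faker

    fake = Faker("en_IN")
    rng = np.random.default_rng(42)

    n = 1000   # the real dataset simulates 10,000 households
    households = pd.DataFrame({
        "household_id": range(n),
        "city": [fake.city() for _ in range(n)],
        "occupants": rng.integers(1, 7, n),
        "has_ac": rng.random(n) < 0.4,
    })

    # Simple statistical model: base load scales with occupants, air conditioning adds a
    # temperature-dependent term, and Gaussian noise is added on top.
    temp_c = rng.normal(30, 5, n)
    households["daily_kwh"] = (
        2.0 + 1.5 * households["occupants"]
        + households["has_ac"] * np.clip(temp_c - 24, 0, None) * 0.6
        + rng.normal(0, 1.0, n)
    ).clip(lower=0.5)

    print(households.head())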

  14. replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Apr 21, 2023
    + more versions
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849595
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    The Pocket Dimension, Munich
    Imperial College London
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the semantic and instance segmentation experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data has been generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform served as representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.

    For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded with 55 fps, and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.

    The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.

    Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. 46 key points in 805 and 200 frames for the platform and handheld case, respectively, were subsequently hand-annotated using the DLC annotation GUI.

    Synthetic data

    We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  15. synthetic-privacy

    • huggingface.co
    Updated Nov 6, 2025
    Cite
    Kristian Billeskov (2025). synthetic-privacy [Dataset]. https://huggingface.co/datasets/kbillesk/synthetic-privacy
    Explore at:
    Dataset updated
    Nov 6, 2025
    Authors
    Kristian Billeskov
    License

    https://choosealicense.com/licenses/cc/

    Description

    We have created a test data set for assessing the speed and efficiency of processing unstructured data. We have based the data set partly on demo data, which is intended to demonstrate the quality of our classification and synthetic data, providing a realistic sample for testing speed and efficiency. The test data set has been sampled from the following sources:

    Most textual/tabular data has been generated synthetically with the Faker Python library. Images and some .pdf files have been sourced from… See the full description on the dataset page: https://huggingface.co/datasets/kbillesk/synthetic-privacy.

  16. SPIDER - Synthetic Person Information Dataset for Entity Resolution

    • figshare.com
    Updated Jul 24, 2025
    + more versions
    Cite
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur (2025). SPIDER - Synthetic Person Information Dataset for Entity Resolution [Dataset]. http://doi.org/10.6084/m9.figshare.29595599.v2
    Explore at:
    Available download formats: text/x-script.python
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    figshare
    Authors
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPIDER (Synthetic Person Information Dataset for Entity Resolution) offers researchers ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset is aimed at person-level fields that are typical in customer data. Because it is hard to source real-world person-level data due to Personally Identifiable Information (PII) concerns, very few synthetic datasets are available publicly, and the existing ones come with the limitations of small volume and missing core person-level fields. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address, and date of birth. Using the Python Faker library, 40,000 unique synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Duplicate records are linked to their original base record and the rule used to generate them through the is_duplicate_of and duplication_rule fields.

    Duplicate rules:
    • Duplicate record with a variation in email address
    • Duplicate record with a variation in email address
    • Duplicate record with last name variation
    • Duplicate record with first name variation
    • Duplicate record with a nickname
    • Duplicate record with near exact spelling
    • Duplicate record with only same email and name

    Output format:
    The dataset is presented in both JSON and CSV formats for use in data processing and machine learning tools.

    Data regeneration:
    The project includes the Python script used for generating the 50,000 person records. The script can be expanded to include additional duplicate rules, fuzzy names, geographical name variations, and volume adjustments.

    Files included:
    • spider_dataset_20250714_035016.csv
    • spider_dataset_20250714_035016.json
    • spider_readme.md
    • DataDescriptions
    • pythoncodeV1.py
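
    The actual generation script (pythoncodeV1.py) ships with the dataset; the snippet below only sketches the general approach of one base record plus an email-variation duplicate carrying the linking fields, with all field values and the rule name assumed:

    # Sketch of the general approach; values and rule names are assumptions,
    # not taken from pythoncodeV1.py.
    import copy
    import uuid
    from faker import Faker

    fake = Faker()

    def base_record():
        return {
            "record_id": str(uuid.uuid4()),
            "first_name": fake.first_name(),
            "last_name": fake.last_name(),
            "email": fake.email(),
            "phone": fake.phone_number(),
            "address": fake.address().replace("\n", ", "),
            "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
            "is_duplicate_of": None,
            "duplication_rule": None,
        }

    def email_variation_duplicate(record):
        dup = copy.deepcopy(record)
        dup["record_id"] = str(uuid.uuid4())
        user, domain = dup["email"].split("@")
        dup["email"] = f"{user}{fake.random_int(1, 99)}@{domain}"   # small email variation
        dup["is_duplicate_of"] = record["record_id"]
        dup["duplication_rule"] = "email_variation"
        return dup

    base = base_record()
    print(base)
    print(email_variation_duplicate(base))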

  17. Simple Synthetic Fruits Image Dataset (Apples, Bananas, Oranges)

    • data.mendeley.com
    Updated Sep 11, 2025
    Cite
    MD SOHAG HOSSAIN (2025). Simple Synthetic Fruits Image Dataset (Apples, Bananas, Oranges) [Dataset]. http://doi.org/10.17632/s2gkwbgwz4.1
    Explore at:
    Dataset updated
    Sep 11, 2025
    Authors
    MD SOHAG HOSSAIN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 36 synthetic fruit images generated using the Python PIL library. It includes three categories of fruits: Apple, Banana, and Orange, with 12 images per class. Each image has a resolution of 224×224 pixels in RGB PNG format and is properly labeled.

    The dataset is primarily designed for educational and research purposes, including:
    • Multi-class image classification tasks
    • Introductory computer vision practice
    • Demonstration of dataset creation and publishing on Mendeley Data

    File Structure:
    ├── apple/  → 12 images
    ├── banana/ → 12 images
    └── orange/ → 12 images

    Key Features:
    • 3 fruit categories (apple, banana, orange)
    • 36 images in total
    • 224×224 pixels, RGB, PNG format
    • Synthetic illustrations (not real photographs)
    • Suitable for classification tasks, teaching, and dataset publishing demonstrations

    ... License: CC BY 4.0

    Keywords: Fruits, Image Classification, Computer Vision, Synthetic Dataset, Machine Learning

  18. Data from: ESAT: Environmental Source Apportionment Toolkit Python package

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 29, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). ESAT: Environmental Source Apportionment Toolkit Python package [Dataset]. https://catalog.data.gov/dataset/esat-environmental-source-apportionment-toolkit-python-package
    Explore at:
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    United States Environmental Protection Agency: http://www.epa.gov/
    Description

    The Environmental Source Apportionment Toolkit (ESAT) is an open-source software package that provides API and CLI functionality to create source apportionment workflows specifically targeting environmental datasets. Source apportionment in environmental science is the process of mathematically estimating the profiles and contributions of multiple sources in some dataset, and in the case of ESAT, while considering data uncertainty. There are many potential use cases for source apportionment in environmental science research, such as in the fields of air quality, water quality, and potentially many others. The ESAT toolkit is written in Python and Rust, and uses common packages such as numpy, scipy, and pandas for data processing. The source apportionment algorithms provided in ESAT include two variants of non-negative matrix factorization (NMF), both of which are written in Rust and contained within the Python package. A collection of data processing and visualization features are included for data and model analytics. The ESAT package includes a synthetic data generator and comparison tools to evaluate ESAT model outputs.
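
    ESAT's own API is not shown in this listing. Purely as a conceptual illustration of the underlying NMF idea, using scikit-learn rather than ESAT's Rust-backed solvers, source profiles and contributions can be recovered from a synthetic mixture roughly like this:

    # Conceptual illustration with scikit-learn's NMF, not ESAT's own API.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    n_samples, n_features, n_sources = 200, 15, 3

    # Synthetic "true" source contributions (W) and profiles (H), plus non-negative noise.
    W_true = rng.gamma(2.0, 1.0, (n_samples, n_sources))
    H_true = rng.gamma(2.0, 1.0, (n_sources, n_features))
    X = W_true @ H_true + rng.normal(0, 0.05, (n_samples, n_features)).clip(min=0)

    model = NMF(n_components=n_sources, init="nndsvda", max_iter=1000, random_state=0)
    W_est = model.fit_transform(X)   # estimated source contributions
    H_est = model.components_        # estimated source profiles

    print("reconstruction error:", model.reconstruction_err_)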

  19. pythonic-function-calling

    • huggingface.co
    Updated Feb 14, 2025
    Cite
    Dria (2025). pythonic-function-calling [Dataset]. https://huggingface.co/datasets/driaforall/pythonic-function-calling
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 14, 2025
    Dataset authored and provided by
    Dria
    Description

    Pythonic Function Calling Dataset

    This dataset contains synthetic data used for training the Pythonic function calling models Dria-Agent-a-3B and Dria-Agent-a-7B. Dria is a Python framework for generating synthetic data on globally connected edge devices with 50+ models. See the network here.

      Dataset Summary
    

    The dataset includes various examples of function calling scenarios, ranging from simple to complex multi-turn interactions. It was generated synthetically using the… See the full description on the dataset page: https://huggingface.co/datasets/driaforall/pythonic-function-calling.

  20. Robot Control Gestures (RoCoG)

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Aug 26, 2020
    Cite
    Celso de Melo; Brandon Rothrock; Prudhvi Gurram; Oytun Ulutan; B.S. Manjunath (2020). Robot Control Gestures (RoCoG) [Dataset]. http://doi.org/10.25349/D9PP5J
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2020
    Dataset provided by
    Dryad
    Authors
    Celso de Melo; Brandon Rothrock; Prudhvi Gurram; Oytun Ulutan; B.S. Manjunath
    Time period covered
    Aug 21, 2020
    Description

    The following files are available with the dataset:

    rocog_s00.zip, ..., rocog_s12.zip (26.2 GB): Raw videos for the human subjects performing the gestures and annotations

    rocog_human_frames.zip, ..., rocog_human_frames.z02 (18.7 GB): Frames for human data used for training and testing. Each folder also has annotations for gesture (label.bin), orientation (orientation.bin), and the number of times the gesture is repeated (repetitions.bin)

    rocog_synth_frames.zip, ..., rocog_synth_frames.z09 (~85.0 GB): Frames for synthetic data used for training and testing. Each folder also has annotations for gesture (label.bin), orientation (orientation.bin), and the number of times the gesture is repeated (repetitions.bin)

    The labels are saved into Python binary struct arrays. Each file contains one entry per frame in the corresponding directory. Here's Python sample code to open these files:

    import glob
    import os
    import struct

    frames_dir = 'FemaleCivilian\10_Advance_11_1_2019_1...
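
    Reading such per-frame binary label files might look like the following hedged sketch, assuming one little-endian 32-bit integer per frame (the actual element type is not documented here, and the directory name is a placeholder):

    # Hedged sketch: assumes one little-endian 32-bit integer per frame, which is
    # not confirmed by the dataset description; the directory name is a placeholder.
    import glob
    import os
    import struct

    def read_bin(path, fmt="<i"):
        # struct.iter_unpack walks the file in fixed-size records of struct.calcsize(fmt) bytes.
        with open(path, "rb") as f:
            return [v[0] for v in struct.iter_unpack(fmt, f.read())]

    frames_dir = "FemaleCivilian"   # placeholder top-level directory
    for bin_path in glob.glob(os.path.join(frames_dir, "**", "label.bin"), recursive=True):
        labels = read_bin(bin_path)
        print(bin_path, len(labels), "frames")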
