MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Description
Overview: This dataset contains three distinct fake datasets generated using the Faker and Mimesis libraries. These libraries are commonly used for generating realistic-looking synthetic data for testing, prototyping, and data science projects. The datasets were created to simulate real-world scenarios while ensuring no sensitive or private information is included.
Data Generation Process: The data creation process is documented in the accompanying notebook, Creating_simple_Sintetic_data.ipynb. This notebook showcases the step-by-step procedure for generating synthetic datasets with customizable structures and fields using the Faker and Mimesis libraries.
File Contents:
Datasets: CSV files containing the three synthetic datasets. Notebook: Creating_simple_Sintetic_data.ipynb detailing the data generation process and the code used to create these datasets.
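As a hedged illustration of how Faker and Mimesis are typically combined for this kind of generation (a minimal sketch, not the notebook's actual code; the field names are assumptions):

import csv
from faker import Faker
from mimesis import Person

fake = Faker()
person = Person()

# Generate a small table of synthetic people (illustrative fields only)
rows = []
for _ in range(100):
    rows.append({
        "name": fake.name(),                            # Faker: realistic full name
        "email": person.email(),                        # Mimesis: synthetic email address
        "address": fake.address().replace("\n", ", "),  # Faker: one-line postal address
        "occupation": person.occupation(),              # Mimesis: job title
    })

with open("synthetic_people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)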
https://www.licenses.ai/ai-licenses
This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.
Please go upvote these other datasets, as my work would not be possible without them.
Update 1 - February 29, 2024
The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv
The file below is the file Darek created, with two additional columns appended. The first is the raw output of Gemma 7B-IT (generated per the instructions below, vs. the 2B-IT that Darek used), and the second is the same output with the leading 'Sure... blah blah' sentence removed.
I generated things using the following setup:
# I used a vLLM server to host Gemma 7B-IT on Paperspace (A100)
# Step 1 - Install vLLM
$ pip install vllm
# Step 2 - Authenticate the Hugging Face CLI (for the model weights)
$ huggingface-cli login --token <YOUR_HF_TOKEN>
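The description stops after authentication. As a hedged sketch of the remaining steps (an assumption, not the author's exact commands), one would launch vLLM's OpenAI-compatible server and query it:

# Step 3 (assumed) - Serve Gemma 7B-IT with vLLM's OpenAI-compatible server
$ python -m vllm.entrypoints.openai.api_server --model google/gemma-7b-it
# Step 4 (assumed) - Query the completions endpoint; prompt and max_tokens are placeholders
$ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "google/gemma-7b-it", "prompt": "Rewrite the following text: ...", "max_tokens": 256}'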
v-i-s-h-w-a-s/python-code-generation-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge
For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution; a naive baseline sketch follows the data specifications below. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
Flat Data
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical

Sequential Data
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical
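As a minimal illustration of the task shape (a naive marginal-sampling baseline, not a competitive entry; file names are assumptions), one could start from something like:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
real = pd.read_csv("flat-training.csv")  # assumed name of the released flat-data file

synthetic = pd.DataFrame(index=range(len(real)))
for col in real.columns:
    if real[col].dtype.kind in "if":
        # Numeric column: resample observed values with a little jitter
        vals = real[col].dropna().to_numpy()
        synthetic[col] = rng.choice(vals, size=len(real)) + rng.normal(0.0, 0.05 * vals.std(), len(real))
    else:
        # Categorical column: sample from the empirical frequency table
        freqs = real[col].value_counts(normalize=True)
        synthetic[col] = rng.choice(freqs.index.to_numpy(), size=len(real), p=freqs.to_numpy())

synthetic.to_csv("submission.csv", index=False)

Note that this ignores cross-column and cross-record structure, which is exactly what the challenge scores; a real entry would train a proper generative model (e.g. via synthcity) instead.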
If you use this dataset in your research, please cite:
@dataset{mostlyaiprize,
author = {MOSTLY AI},
title = {MOSTLY AI Prize Dataset},
year = {2025},
url = {https://www.mostlyaiprize.com/},
}
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
About the Dataset
This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions spanning 1st Jan 2019 - 31st Dec 2020. It covers credit cards of 1000 customers transacting with a pool of 800 merchants.
Source of Simulation
This was generated using the Sparkov Data Generation tool (GitHub) created by Brandon Harris. The simulation was run for the duration 1 Jan 2019 to 31 Dec 2020. The files were combined and converted into a standard format.
Information about the Simulator
I do not own the simulator. I used the one built by Brandon Harris, and to understand how it works I went through a few portions of the code. This is what I understood from what I read:
The simulator has pre-defined lists of merchants, customers, and transaction categories. Using a Python library called "faker", and the numbers of customers and merchants that you specify for the simulation, an intermediate list is created.
After this, depending on the profile you choose, e.g. "adults_2550_female_rural.json" (which means simulating adult females in the age range 25-50 who are from rural areas), the transactions are created. For this profile (see "Sparkov | Github | adults_2550_female_rural.json"), parameter value ranges are defined in terms of minimum and maximum transactions per day, the distribution of transactions across days of the week, and normal-distribution properties (mean, standard deviation) for amounts in various categories. Using these measures of distribution, the transactions are generated using faker.
What I did was generate transactions across all profiles and then merged them together to create a more realistic representation of simulated transactions.
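As a hedged illustration of the mechanism described above (a sketch under assumed parameter values, not Sparkov's actual code):

import numpy as np
from faker import Faker

fake = Faker()
rng = np.random.default_rng()

# Illustrative per-category amount distributions (mean, std), as a profile might define them
categories = {"grocery": (80.0, 25.0), "gas_transport": (55.0, 15.0), "entertainment": (40.0, 30.0)}

transactions = []
for _ in range(10):  # e.g. a handful of transactions for one simulated customer
    cat = rng.choice(list(categories))
    mean, std = categories[cat]
    transactions.append({
        "timestamp": fake.date_time_between(start_date="-1y", end_date="now"),
        "merchant": fake.company(),
        "category": cat,
        "amount": round(max(1.0, rng.normal(mean, std)), 2),  # clamp to a positive amount
    })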
Acknowledgements - Brandon Harris for his amazing work in creating this easy-to-use simulation tool for creating fraud transaction datasets.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Synthetic Healthcare Dataset
Overview
This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.
Data Generation
The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
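As a hedged sketch of what Faker-based generation of such records can look like (the field names are assumptions based on the description, not the dataset's actual schema):

import random
from faker import Faker

fake = Faker()

# A few illustrative patient records with healthcare-flavoured fields
patients = [{
    "name": fake.name(),
    "date_of_birth": fake.date_of_birth(minimum_age=0, maximum_age=95).isoformat(),
    "blood_type": random.choice(["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]),
    "admission_date": fake.date_this_decade().isoformat(),
    "hospital": fake.company(),
} for _ in range(5)]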
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
SPIDER - Synthetic Person Information Dataset for Entity Resolution offers researchers ready-to-use data for benchmarking duplicate-detection and entity-resolution algorithms. The dataset focuses on the person-level fields that are typical in customer data. As real-world person-level data is hard to source due to Personally Identifiable Information (PII) concerns, very few synthetic datasets are publicly available, and the existing ones suffer from small volume and missing core person-level fields. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address, and dob. Using the Python Faker library, 40,000 unique synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Each duplicate record is linked to its original base record and to the rule used to generate it through the is_duplicate_of and duplication_rule fields.

Duplicate Rules
- Duplicate record with a variation in email address
- Duplicate record with a variation in email address
- Duplicate record with last name variation
- Duplicate record with first name variation
- Duplicate record with a nickname
- Duplicate record with near exact spelling
- Duplicate record with only same email and name

Output Format
The dataset is presented in both JSON and CSV formats for use in data processing and machine learning tools.

Data Regeneration
The project includes the Python script used for generating the 50,000 person records. The script can be extended to cover additional duplicate rules, fuzzy names, geographical name variations, and volume adjustments.

Files Included
- spider_dataset_20250714_035016.csv
- spider_dataset_20250714_035016.json
- spider_readme.md
- DataDescriptions
- pythoncodeV1.py
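As a hedged sketch of the base-plus-duplicate scheme described above (illustrative only; the transformation shown is an assumed variant of the email-variation rule, not the project's actual script):

import uuid
from faker import Faker

fake = Faker()

def base_record():
    # One synthetic base person record with SPIDER's core attributes
    return {
        "id": str(uuid.uuid4()),
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "address": fake.address().replace("\n", ", "),
        "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
    }

def email_variation_duplicate(rec):
    # One illustrative transformation rule: vary the email's local part
    dup = dict(rec)
    local, domain = rec["email"].split("@")
    dup["email"] = f"{local}.{fake.random_int(1, 99)}@{domain}"
    dup["is_duplicate_of"] = rec["id"]
    dup["duplication_rule"] = "email_variation"
    dup["id"] = str(uuid.uuid4())
    return dup

record = base_record()
duplicate = email_variation_duplicate(record)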
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The MatSim Dataset and benchmark
Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.
MatSim: a synthetic dataset, a benchmark, and a method for computer-vision-based recognition of similarities and transitions between materials and textures, focused on identifying any material under any conditions using one or a few examples (one-shot learning).
Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper.
MatSim_object_train_split_1,2,3.zip: contain a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.
MatSim_Vessels_Train_1,2,3.zip: contain a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.
*Note: these are subsets of the dataset; the full dataset can be found at:
https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX
or
https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF
Code:
Up-to-date code for generating the dataset, for reading and evaluation, and for the trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net
Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset; this code might be old, and up-to-date code can be found at the URL above.
Net_Code_And_Trained_Model.zip: contains reference neural-net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or to test the model with the benchmark. Note: the code in the ZIP file is not up to date and contains some bugs; for the latest version, see the URL above.
Further documentation can be found inside the zip files or in the paper.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Exploring the creation of a unique dataset of synthetic influencer profiles using AI technologies, including OpenAI's GPT-3.5.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Synthetic Datasets for Numeric Uncertainty Quantification

The Source of Dataset with Generation Script
We generated these synthetic datasets with the following Python script on Kaggle: https://www.kaggle.com/dipuk0506/toy-dataset-for-regression-and-uq

How to Use Datasets

Train Shallow NNs
The following notebook presents how to train shallow NNs: https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies a shallow NN to Data-N.

Train RVFL
The following notebook presents how to train Random Vector Functional Link (RVFL) networks: https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies an RVFL network to Data-N.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
- C program implementing the method described in the paper
- GCC makefile for compiling the C program
- Example data for use with the C program
- Python program for generating synthetic test data
- Instructions for use of the other files
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript replicAnt - generating annotated images of animals in complex environments using Unreal Engine. Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory and field datasets, respectively: each visible individual was assigned a constant-size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender add-on aided hand-annotation: the add-on is a semi-automated multi-animal tracker which leverages Blender's internal contrast-based motion tracker, but also includes track refinement options and CSV export functionality. Comprehensive documentation of this tool, and Jupyter notebooks for track visualisation and benchmarking, are provided on the replicAnt and BlenderMotionExport GitHub repositories.
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varies from 10/1 to 1/100.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This dataset was created synthetically with the Python package faker. It is intended for practicing the deduplication of databases.
unique_data.csv is the main data frame, without duplicates; everything starts here. The other files (01_duplicate*, 02_duplicate*, etc.) hold only duplicates of unique_data.csv entries. You can mix unique_data.csv with one of the duplicate CSVs, or with parts of one, to obtain a dataset with duplicate values on which to practice your deduplication skills.
One provided augmentation replaces a random fraction (50%) of cells in the data frame with np.nan; the columns ['company', 'name', 'uuid4'] are excluded from this augmentation. A sketch of this idea follows below.
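A minimal sketch of how such an augmentation could be implemented with pandas and NumPy (illustrative, not the dataset's actual generation code):

import numpy as np
import pandas as pd

def mask_random_cells(df, frac=0.5, exclude=("company", "name", "uuid4")):
    """Replace a random fraction of cells with np.nan, skipping excluded columns."""
    out = df.copy()
    for col in out.columns:
        if col in exclude:
            continue  # these columns stay fully populated
        mask = np.random.rand(len(out)) < frac
        out.loc[mask, col] = np.nan
    return out

# Usage: augmented = mask_random_cells(pd.read_csv("unique_data.csv"))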
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Glaive-code-assistant
Glaive-code-assistant is a dataset of ~140k code problems and solutions generated using Glaive's synthetic data generation platform. The data is intended to make models act as code assistants, and is therefore structured in a QA format where the questions are worded similarly to how real users ask code-related questions. Roughly 60% of the samples are Python. To report any problems or suggestions with the data, join the Glaive Discord.
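For reference, a minimal way to pull such a dataset with the Hugging Face datasets library; the dataset id and split name below are assumptions, not stated above:

from datasets import load_dataset

# Dataset id is an assumption; check the Glaive page on the Hugging Face Hub
ds = load_dataset("glaiveai/glaive-code-assistant", split="train")
print(ds[0])  # one QA-style record: a user-worded question and its solution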
https://creativecommons.org/publicdomain/zero/1.0/
Data collection is perhaps the most crucial part of any machine learning model: without it being done properly, not enough information is present for the model to learn from the patterns leading to one output or another. Data collection is however a very complex endeavor, time-consuming due to the volume of data that needs to be acquired and annotated. Annotation is an especially problematic step, due to its difficulty, length, and vulnerability to human error and inaccuracies when annotating complex data.
With high processing power becoming ever more accessible, synthetic dataset generation is becoming a viable option when looking to generate large volumes of accurately annotated data. With the help of photorealistic renderers, it is for example possible now to generate immense amounts of data, annotated with pixel-perfect precision and whose content is virtually indistinguishable from real-world pictures.
As an exercise in synthetic dataset generation, the data offered here was generated using the Python API of Blender, with the images rendered through the Cycles ray-tracing engine. It represents plausible pictures of a chessboard and its pieces. The goal is, from those pictures and their annotations, to build a model capable of recognizing the pieces, as well as their positions on the board.
The dataset contains a large number of synthetic, randomly generated images representing pictures of a chessboard, taken at an angle overlooking the board and its pieces. Each image is associated with a .json file containing its annotations. The naming convention is that each render is associated with a number X, and the image and annotations associated with that render are named X.jpg and X.json, respectively.
The data has been generated using the Python scripts and .blend file present in this repository; a hedged sketch of the general approach follows below. The chessboard and piece models used for those renders are not provided with the code.
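A sketch of how such renders can be scripted through Blender's Python API (bpy); this is illustrative, not the repository's actual script, and the camera jitter, resolution, paths, and annotation fields are assumptions:

import json
import random
import bpy  # only available inside Blender's bundled Python interpreter

scene = bpy.context.scene
scene.render.engine = 'CYCLES'                    # render through the Cycles engine
scene.render.resolution_x = 1024                  # output resolution (assumed values)
scene.render.resolution_y = 1024
scene.render.image_settings.file_format = 'JPEG'

for i in range(10):
    # Jitter the camera slightly so every render is unique (illustrative values)
    scene.camera.location.x += random.uniform(-0.05, 0.05)
    scene.camera.location.y += random.uniform(-0.05, 0.05)

    scene.render.filepath = f"//renders/{i}.jpg"
    bpy.ops.render.render(write_still=True)

    # Write the matching annotation file next to the render (fields are assumptions)
    with open(bpy.path.abspath(f"//renders/{i}.json"), "w") as f:
        json.dump({"render_id": i, "pieces": []}, f)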
Data characteristics:
No distinction has been hard-built between training, validation, and testing data, and is left completely up to the users. A proposed pipeline for the extraction, recognition, and placement of chess pieces is proposed in a notebook added with this dataset.
I would like to express my gratitude for the efforts of the Blender Foundation and all its participants, for their incredible open-source tool which once again has allowed me to conduct interesting projects with great ease.
Two interesting papers on the generation and use of synthetic data, which have inspired me to conduct this project :
Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt (2021). Fake It Till You Make It: Face Analysis in the Wild Using Synthetic Data Alone. https://arxiv.org/abs/2109.15102
Salehe Erfanian Ebadi, You-Cyuan Jhang, Alex Zook (2021). PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision. https://arxiv.org/abs/2112.09290
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 36 synthetic fruit images generated using the Python PIL library. It includes three categories of fruits: Apple, Banana, and Orange, with 12 images per class. Each image has a resolution of 224×224 pixels in RGB PNG format and is properly labeled.
The dataset is primarily designed for educational and research purposes, including:
- Multi-class image classification tasks
- Introductory computer vision practice
- Demonstration of dataset creation and publishing on Mendeley Data
File Structure:
├── apple/  → 12 images
├── banana/ → 12 images
└── orange/ → 12 images
Key Features:
- 3 fruit categories (apple, banana, orange)
- 36 images in total
- 224×224 pixels, RGB, PNG format
- Synthetic illustrations (not real photographs)
- Suitable for classification tasks, teaching, and dataset publishing demonstrations
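A minimal sketch of how such synthetic fruit illustrations can be drawn with PIL (an assumption about the method, not the dataset's actual script):

import os
from PIL import Image, ImageDraw

os.makedirs("apple", exist_ok=True)

# Draw a simple 224x224 synthetic "apple": a red disc with a short stem
img = Image.new("RGB", (224, 224), "white")
draw = ImageDraw.Draw(img)
draw.ellipse((48, 72, 176, 200), fill=(200, 30, 30))   # fruit body
draw.rectangle((108, 48, 116, 80), fill=(90, 60, 20))  # stem
img.save("apple/apple_01.png")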
... License: CC BY 4.0
Keywords: Fruits, Image Classification, Computer Vision, Synthetic Dataset, Machine Learning
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
These data files contain the source code for dataset creation & model learning (Joint-Space-SCA.zip) and the collected synthetic dataset of free & collided postures for the humanoid robot iCub (raw_binary_data.zip). Follow the Readme.MD files to launch the code if needed.
Corresponding Git repo: https://github.com/epfl-lasa/Joint-Space-SCA
Augmented Texas 7000-bus synthetic grid
An augmented version of the synthetic Texas 7k dataset published by Texas A&M University. The system has been populated with high-resolution distributed photovoltaic (PV) generation, comprising 4,499 PV plants of varying sizes with associated time series for 1 year of operation. This high-resolution dataset was produced from publicly available data and is free of CEII. Details on the procedure followed to generate the PV dataset can be found in the Open COG Grid Project Year 1 Report (Chapter 6). The technical data of the system is provided using the (open) CTM specification for easy accessibility from Python without additional packages (the data can be loaded as a dictionary). The time series for demand and PV production are provided as an HDF5 file, also loadable with standard open-source tools. We additionally provide example scripts for parsing the data in Python. Prepared by LLNL under Contract DE-AC52-07NA27344. LLNL control number: LLNL-DATA-2001833.
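A hedged sketch of loading both artifacts (the filenames and HDF5 layout are assumptions; the CTM data is JSON loadable as a dictionary and the time series are HDF5, per the description above):

import json
import h5py  # pip install h5py

# Load the CTM network description as a plain Python dictionary (assumed filename)
with open("texas7k_ctm.json") as f:
    grid = json.load(f)

# Open the demand and PV time-series HDF5 file (assumed filename)
with h5py.File("texas7k_timeseries.h5", "r") as ts:
    print(list(ts.keys()))  # inspect the available datasets first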
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
3D-DST-models
As part of our data release for 3D-DST, we present aligned CAD models for all 1000 classes in ImageNet-1k. See wufeim/DST3D for synthetic data generation with 3D annotations using the CAD models provided here. Besides the .csv file visualized in the dataset viewer above, we also provide a Python script (models_3d_dst.py) to help integrate with other Python modules.
Fields
For each CAD model, there are seven fields:
synset: synset associated with each… See the full description on the dataset page: https://huggingface.co/datasets/ccvl/3D-DST-models.
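A minimal sketch of loading the table with the Hugging Face datasets library (the split name is an assumption):

from datasets import load_dataset

ds = load_dataset("ccvl/3D-DST-models", split="train")  # repo id from the URL above
print(ds[0])  # e.g. inspect the synset and the other fields of the first CAD model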