68 datasets found

Creating_simple_Sintetic_dataset
kaggle.com
zip
Updated Jan 20, 2025
Cite
Lala Ibadullayeva (2025). Creating_simple_Sintetic_dataset [Dataset]. https://www.kaggle.com/datasets/lalaibadullayeva/creating-simple-sintetic-dataset
Explore at:
Available download formats: zip (476698 bytes)
Dataset updated
Jan 20, 2025
Authors
Lala Ibadullayeva
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Description

Overview: This dataset contains three distinct fake datasets generated using the Faker and Mimesis libraries. These libraries are commonly used for generating realistic-looking synthetic data for testing, prototyping, and data science projects. The datasets were created to simulate real-world scenarios while ensuring no sensitive or private information is included.

Data Generation Process: The data creation process is documented in the accompanying notebook, Creating_simple_Sintetic_data.ipynb. This notebook showcases the step-by-step procedure for generating synthetic datasets with customizable structures and fields using the Faker and Mimesis libraries.

File Contents:

Datasets: CSV files containing the three synthetic datasets.

Notebook: Creating_simple_Sintetic_data.ipynb, detailing the data generation process and the code used to create these datasets.
Synthetic Data for Khmer Word Detection
kaggle.com
zip
Updated Oct 12, 2025
Cite
Chanveasna ENG (2025). Synthetic Data for Khmer Word Detection [Dataset]. https://www.kaggle.com/datasets/veasnaecevilsna/synthetic-data-for-khmer-word-detection
Explore at:
Available download formats: zip (8863660119 bytes)
Dataset updated
Oct 12, 2025
Authors
Chanveasna ENG
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Synthetic Data for Khmer Word Detection

This dataset contains 10,000 synthetic images and corresponding bounding box labels for training object detection models to detect Khmer words.

The dataset is generated using a custom tool designed to create diverse and realistic training data for computer vision tasks, especially where real annotated data is scarce.

✨ Highlights

100,000 images (.png) with random backgrounds and styles.

Bounding boxes provided in YOLO (.txt) and Pascal VOC (.xml) formats.

50+ real background images + unlimited random background colors.

250+ different Khmer fonts.

Randomized effects: brightness, contrast, blur, color jitter, and more.

Wide variety of text sizes, positions, and layouts.

📂 Folder Structure

/
├── synthetic_images/       # Synthetic images (.png)
├── synthetic_labels/       # YOLO format labels (.txt)
├── synthetic_xml_labels/   # Pascal VOC format labels (.xml)

Each image has corresponding .txt and .xml files with the same filename.
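
Because every image shares its filename with its label files, samples can be matched by file stem. The following is a minimal sketch, assuming the three folder names listed above; the function name `pair_samples` is hypothetical and not part of the dataset's tooling.

```python
from pathlib import Path


def pair_samples(root):
    """Yield (image, yolo_label, voc_label) paths that share a filename stem."""
    root = Path(root)
    for image in sorted((root / "synthetic_images").glob("*.png")):
        yolo = root / "synthetic_labels" / f"{image.stem}.txt"
        voc = root / "synthetic_xml_labels" / f"{image.stem}.xml"
        # Skip samples whose labels are missing rather than failing later.
        if yolo.exists() and voc.exists():
            yield image, yolo, voc
```

Skipping incomplete samples is a design choice for this sketch; a stricter loader could raise an error instead.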

📏 Annotation Formats

YOLO Format (.txt):
Each line represents a word, with format: class_id center_x center_y width height
All values are normalized between 0 and 1.
Example: 0 0.235 0.051 0.144 0.081
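
To draw or inspect a box, the normalized YOLO values have to be scaled back to pixels. The sketch below shows one way to do that conversion; the helper name `yolo_to_pixel_box` and the 1000 x 1000 image size in the usage line are assumptions for illustration, since the listing does not state the image dimensions.

```python
def yolo_to_pixel_box(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, xmin, ymin, xmax, ymax) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * image_width, float(w) * image_width
    cy, h = float(cy) * image_height, float(h) * image_height
    # YOLO stores the box centre; corners are half a width/height away.
    return int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


# Hypothetical image size; substitute the real width and height of the image.
box = yolo_to_pixel_box("0 0.235 0.051 0.144 0.081", 1000, 1000)
```

The corner coordinates returned here are the same quantities that the Pascal VOC files store as absolute pixel values.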

Pascal VOC Format (.xml):
Standard XML structure containing image metadata and bounding box coordinates (absolute pixel values).

🖼️ Image Samples

Each image contains random Khmer words placed naturally over backgrounds, with different font styles, sizes, and visual effects.
The dataset was carefully generated to simulate real-world challenges like:

Different lighting conditions

Different text sizes

Motion blur and color variations

🧠 Use Cases

Train YOLOv5, YOLOv8, EfficientDet, and other object detection models.

Fine-tune OCR (Optical Character Recognition) systems for Khmer language.

Research on low-resource language computer vision tasks.

Data augmentation for scene text detection.

⚙️ How It Was Generated

A random real-world background or random color is chosen.

Random Khmer words are selected from a large cleaned text file.

Words are rendered with random font, size, color, spacing, and position.

Image effects like motion blur and color jitter are randomly applied.

Bounding boxes are automatically generated for each word.

🧹 Data Cleaning

Words were sourced from a cleaned Khmer corpus to avoid duplicates and garbage data.

Fonts were tested to make sure they render Khmer characters properly.

📢 Important Notes

This dataset is synthetic. While it simulates real-world conditions, it may not fully replace real-world labeled data for final model evaluation.

All labels assume one class only (i.e., "word" = class_id 0).

❤️ Credits

Khmer text data collected from open-source projects such as khmer-text-data.

Khmer fonts collected from free and open repositories like Khmer Fonts Project.

📈 Future Updates

We plan to release:

Datasets with rotated bounding boxes for detecting skewed text.

More realistic mixing of real-world backgrounds and synthetic text.

Advanced distortions (e.g., handwriting-like simulation).

Stay tuned!

📜 License

This project is licensed under MIT license.

Please credit the original authors when using this data and provide a link to this dataset.

✉️ Contact

If you have any questions or want to collaborate, feel free to reach out:

📧 Email: veasnaec@gmail.com

🌐 GitHub: Chanveasna ENG
LLM Prompt Recovery - Synthetic Datastore
kaggle.com
zip
Updated Feb 29, 2024
Cite
Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
Explore at:
Available download formats: zip (988448 bytes)
Dataset updated
Feb 29, 2024
Authors
Darien Schettler
License
https://www.licenses.ai/ai-licenses
Description
High Level Description

This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.

Contributors

Please go upvote these other datasets, as my work would not be possible without them:

thedrcat's dataset - LLM Prompt Recovery Data

TBD

First Dataset - 1000 Examples From @thedrcat

Update 1 - February 29, 2024

The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

The file below is the file Darek created with two additional columns appended. The first is the raw output of Gemma 7B-IT (based on the instructions below, vs. the 2B-IT that Darek used), and the second is the same output with the 'Sure... blah blah' sentence removed.
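
The exact rule used to drop the 'Sure...' sentence is not given here. As a rough sketch, assuming the preamble is a single leading line that starts with "Sure" and ends at the first line break, the cleaned column could be derived like this (the name `strip_preamble` is hypothetical):

```python
import re

# Assumption: the preamble is a leading line that starts with "Sure" and
# ends at the first line break; the author's exact rule is not given.
_PREAMBLE = re.compile(r"^\s*Sure\b[^\n]*\n+", flags=re.IGNORECASE)


def strip_preamble(text):
    """Drop a leading 'Sure, ...' line from a model response, if present."""
    return _PREAMBLE.sub("", text, count=1)
```

Responses that do not open with such a line are returned unchanged.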

I generated things using the following setup:

# I used a vLLM server to host Gemma 7B on paperspace (A100)
# Step 1 - Install vLLM
>>> pip install vllm
# Step 2 - Authenticate HuggingFace CLI (for model weights)
>>> huggingface-cli login --token
Data archive for paper "Copula-based synthetic data augmentation for...
zenodo.org
zip
Updated Mar 15, 2022
Cite
David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5150327
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5150327
Dataset updated
Mar 15, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
David Meyer
Description
Overview

This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

For the Python tool used to generate synthetic data, please refer to Synthia.

Requirements

Singularity >= 3

Portable Batch System (PBS) job scheduler*

A current high-performance computer (e.g. ~32 CPUs @ 2,500 MHz with 64 GB of RAM)

*Although PBS is not a strict requirement, it is required to run all helper scripts included in this repository. Please note that, depending on your specific system settings and resource availability, you may need to modify the PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

Usage

To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

qsub hpc/fit.sh

then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

qsub hpc/stats.sh
qsub hpc/ml_control.sh
qsub hpc/ml_synth.sh

Finally, to plot all artifacts included in the paper use:

qsub hpc/plot.sh

Licence

Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.
replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Apr 21, 2023
Cite
Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849595
Explore at:
Dataset updated
Apr 21, 2023
Dataset provided by
The Pocket Dimension, Munich
Imperial College London
Authors
Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for semantic and instance segmentation experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data has been generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

Abstract:

Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

Benchmark data

Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform served as representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.

For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded with 55 fps, and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.

The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.

Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. 46 key points in 805 and 200 frames for the platform and handheld case, respectively, were subsequently hand-annotated using the DLC annotation GUI.

Synthetic data

We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.

Funding

This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
pythonic-function-calling
huggingface.co
Updated Feb 14, 2025
Cite
Dria (2025). pythonic-function-calling [Dataset]. https://huggingface.co/datasets/driaforall/pythonic-function-calling
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 14, 2025
Dataset authored and provided by
Dria
Description
Pythonic Function Calling Dataset

This dataset contains synthetic data used for training the Pythonic function calling models Dria-Agent-a-3B and Dria-Agent-a-7B. Dria is a Python framework to generate synthetic data on globally connected edge devices with 50+ models.

Dataset Summary

The dataset includes various examples of function calling scenarios, ranging from simple to complex multi-turn interactions. It was generated synthetically using the… See the full description on the dataset page: https://huggingface.co/datasets/driaforall/pythonic-function-calling.

Data from: Domain-adaptive Data Synthesis for Large-scale Supermarket...

zenodo.org

zip

Updated Apr 5, 2024

Cite

Julian Strohmayer; Martin Kampel (2024). Domain-adaptive Data Synthesis for Large-scale Supermarket Product Recognition [Dataset]. http://doi.org/10.5281/zenodo.7750242

Explore at:

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.7750242

Dataset updated

Apr 5, 2024

Dataset provided by

Zenodo: http://zenodo.org/

Authors

Julian Strohmayer; Martin Kampel

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition

This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].

Data Synthesis Pipeline:

We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline pipeline.zip, accompanied by the FastCUT models used for synthetic-to-real domain translation models.zip. For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated by either executing load.py from within Blender or within a command-line terminal as a background process.
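
Before starting a render, it can help to confirm that every entry in the assortment list has a matching product image. The sketch below is not part of the published pipeline; it assumes assortment.csv is a plain comma-separated file with one c, w, d, h row per product and no header, and the function name `load_assortment` is hypothetical.

```python
import csv
from pathlib import Path


def load_assortment(csv_path, image_dir):
    """Read rows of (c, w, d, h) and report class labels without a c.png image."""
    products, missing = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # tolerate blank lines
            label = row[0].strip()
            # Packaging dimensions in mm: width, depth, height.
            width, depth, height = (float(value) for value in row[1:4])
            products.append((label, width, depth, height))
            if not (Path(image_dir) / f"{label}.png").exists():
                missing.append(label)
    return products, missing
```

A call such as `load_assortment("products/assortment/assortment.csv", "products/img")` would return the parsed rows together with any class labels that lack an image.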

Datasets:

SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
SGI3k - Synthetic GroZi-3.2k (SGI3k) dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprised of 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.
SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels are provided for all product instances.

Table 1: Dataset characteristics.

Dataset	#images	#products	#instances	labels	translation
SG3k	10,000	3,234	851,801	bounding box & generic class¹	none
SG3kt	10,000	3,234	851,801	bounding box & generic class¹	GroZi-3.2k
SGI3k	10,000	1,063	838,696	bounding box & generic class²	none
SGI3kt	10,000	1,063	838,696	bounding box & generic class²	GroZi-3.2k
SPS8k	16,224	8,112	1,981,967	bounding box & GTIN	none
SPS8kt	16,224	8,112	1,981,967	bounding box & GTIN	SKU110k

Sample Format

A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].

¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).

²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.
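
The pseudo-GTIN convention in footnote ¹ is plain string concatenation of the category number, four zeros, and the image index. A small sketch of that rule follows; the function name `pseudo_gtin` is hypothetical.

```python
def pseudo_gtin(category, image_index):
    """Build an SG3k-style class label from category i (1-27) and image index j."""
    if not 1 <= category <= 27:
        raise ValueError("GroZi-3.2k food categories run from 1 to 27")
    return f"{category}0000{image_index}"


label = pseudo_gtin(13, 97)  # "13000097", matching the example in the footnote
```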

Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.

BibTeX citation:

@inproceedings{strohmayer2023domain,
 title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition},
 author={Strohmayer, Julian and Kampel, Martin},
 booktitle={International Conference on Computer Analysis of Images and Patterns},
 pages={239--250},
 year={2023},
 organization={Springer}
}

MOSTLY AI Prize Data
kaggle.com
zip
Updated May 16, 2025
Cite
ivonaK (2025). MOSTLY AI Prize Data [Dataset]. https://www.kaggle.com/datasets/ivonav/mostly-ai-prize-data/code
Explore at:
Available download formats: zip (9871594 bytes)
Dataset updated
May 16, 2025
Authors
ivonaK
License
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Competition

Generate the BEST tabular synthetic data and win 100,000 USD in cash.

Competition runs for 50 days: May 14 - July 3, 2025.

MOSTLY AI Prize

This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge

For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
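
One way to read the closeness requirement is as a nearest-neighbour comparison: for each synthetic record, check whether its nearest released training record is closer than its nearest holdout record. The sketch below illustrates that idea for purely numeric rows with Euclidean distance. It is an assumption made for illustration, not the competition's scoring code, and both function names are hypothetical.

```python
import math


def nearest_distance(record, reference_rows):
    """Euclidean distance from one record to its nearest row in reference_rows."""
    return min(math.dist(record, row) for row in reference_rows)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic rows strictly nearer to training than to holdout."""
    closer = sum(
        nearest_distance(row, training) < nearest_distance(row, holdout)
        for row in synthetic
    )
    return closer / len(synthetic)
```

Under this reading, a share near 0.5 would suggest the synthetic rows sit no closer to the released data than to unseen data, while a share well above 0.5 would hint that the generator has memorised training records.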

Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.

Timeline

Submissions open: May 14, 2025, 15:30 UTC

Submission credits: 3 per calendar week (+bonus)

Submissions close: July 3, 2025, 23:59 UTC

Evaluation of Leaders: July 3 - July 9

Winners announced: on July 9 🏆

Datasets

Flat Data
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical

Sequential Data
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical

Evaluation

CSV submissions are parsed using pandas.read_csv() and checked for expected structure & size

Evaluated using the Synthetic Data Quality Assurance toolkit

Compared against the released training set and a hidden holdout set (same size, non-overlapping, from the same source)
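
The structure and size check can be approximated locally before spending a submission credit. The sketch below uses Python's standard csv module as a stand-in for pandas.read_csv() so that it runs without extra dependencies; the function name `check_submission` and the specific checks are assumptions, not the organisers' validation code.

```python
import csv


def check_submission(csv_path, expected_columns, expected_rows):
    """Return a list of structure problems; an empty list means the file passed."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    problems = []
    if header != list(expected_columns):
        problems.append("column names or order differ from the original")
    if row_count != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {row_count}")
    return problems
```

For the flat challenge, `expected_columns` would be the header of the released training file and `expected_rows` would be 100,000.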

Submission

MOSTLY AI Prize

Citation

If you use this dataset in your research, please cite:

@dataset{mostlyaiprize,
 author = {MOSTLY AI},
 title = {MOSTLY AI Prize Dataset},
 year = {2025},
 url = {https://www.mostlyaiprize.com/},
}
python_plagiarism_code_dataset
huggingface.co
Cite
nop, python_plagiarism_code_dataset [Dataset]. https://huggingface.co/datasets/nop12/python_plagiarism_code_dataset
Explore at:
Authors
nop
Description
Python Plagiarism Code Dataset

Overview

This dataset contains pairs of Python code samples with varying degrees of similarity, designed for training and evaluating plagiarism detection systems. The dataset was created using Large Language Models (LLMs) to generate synthetic code variations at different transformation levels, simulating real-world plagiarism scenarios in an academic context.

Purpose

The dataset addresses the limitations of existing code… See the full description on the dataset page: https://huggingface.co/datasets/nop12/python_plagiarism_code_dataset.
OpenDataGen-factuality-en-v0.1
huggingface.co
Updated Feb 9, 2024
Cite
Thomas (2024). OpenDataGen-factuality-en-v0.1 [Dataset]. https://huggingface.co/datasets/thoddnn/OpenDataGen-factuality-en-v0.1
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 9, 2024
Authors
Thomas
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This synthetic dataset was generated using the Open DataGen Python library. (https://github.com/thoddnn/open-datagen)

Methodology:

Retrieve random article content from the HuggingFace Wikipedia English dataset.

Construct a Chain of Thought (CoT) to generate a Multiple Choice Question (MCQ).

Use a Large Language Model (LLM) to score the results, then filter them.

All these steps are prompted in the 'template.json' file located in the specified code folder. Code:… See the full description on the dataset page: https://huggingface.co/datasets/thoddnn/OpenDataGen-factuality-en-v0.1.
verifiable-pythonic-function-calling-lite
huggingface.co
Updated Feb 14, 2025
Cite
Dria (2025). verifiable-pythonic-function-calling-lite [Dataset]. https://huggingface.co/datasets/driaforall/verifiable-pythonic-function-calling-lite
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 14, 2025
Dataset authored and provided by
Dria
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Verifiable Pythonic Function Calling Lite

This dataset is a subset of the Pythonic function calling dataset that is used for training the Pythonic function calling models Dria-Agent-a-3B and Dria-Agent-a-7B. Dria is a Python framework to generate synthetic data on globally connected edge devices with 50+ models.

Dataset Summary

The dataset includes various examples of function calling scenarios, ranging from simple to complex multi-turn interactions. It… See the full description on the dataset page: https://huggingface.co/datasets/driaforall/verifiable-pythonic-function-calling-lite.
Data from: Synthetic Datasets for Numeric Uncertainty Quantification
figshare.com
zip
Updated Aug 28, 2021
Cite
Hussain Mohammed Kabir (2021). Synthetic Datasets for Numeric Uncertainty Quantification [Dataset]. http://doi.org/10.6084/m9.figshare.16528650.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.16528650.v1
Dataset updated
Aug 28, 2021
Dataset provided by
Figshare: http://figshare.com/
Authors
Hussain Mohammed Kabir
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synthetic Datasets for Numeric Uncertainty Quantification

The Source of Dataset with Generation Script

We generate these synthetic datasets with the help of the following Python script on Kaggle.
https://www.kaggle.com/dipuk0506/toy-dataset-for-regression-and-uq

How to Use Datasets

Train Shallow NNs

The following notebook presents how to train shallow NNs.
https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies a shallow NN to Data-N.

Train RVFL

The following notebook presents how to train Random Vector Functional Link (RVFL) networks.
https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies an RVFL network to Data-N.
MatSeg: Material State Segmentation Dataset and Benchmark
zenodo.org
zip
Updated May 22, 2025
Cite
Zenodo (2025). MatSeg: Material State Segmentation Dataset and Benchmark [Dataset]. http://doi.org/10.5281/zenodo.11331618
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.11331618
Dataset updated
May 22, 2025
Dataset provided by
Zenodo: http://zenodo.org/
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
MatSeg Dataset and benchmark for zero-shot material state segmentation.

The MatSeg Benchmark, containing 1220 real-world images and their annotations, is available at MatSeg_Benchmark.zip; the file contains documentation and Python readers.

The MatSeg dataset, containing synthetic images infused with natural image patterns, is available at MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).

MatSeg3D_part_*.zip: contains synthetic 3D scenes

MatSeg2D_part_*.zip: contains synthetic 2D scenes

Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip

Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip

The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072

Additional permanent sources for downloading the dataset and metadata: 1, 2

Evaluation scripts for the Benchmark are now available at:

https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX

Description

Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.

Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.

The MatSeg dataset focuses on zero-shot segmentation of materials and their states, meaning identifying the region of an image belonging to a specific material type or state, without previous knowledge of or training on the material type, states, or environment.

The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.

Benchmark

The benchmark contains 1220 real-world images with a wide range of material states and settings, for example: food states (cooked/burned...), plants (infected/dry), rocks/soil (minerals/sediment), construction/metals (rusted, worn), liquids (foam/sediment), and many other states, without being limited to a set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of or pretraining on the material or setting. The focus is on materials with complex scattered boundaries and gradual transitions (like the level of wetness of a surface).

Evaluation scripts for the Benchmark are now available at: 1 and 2.

Synthetic Dataset

The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data is infused with patterns, materials, and textures automatically extracted from real images, allowing it to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.

License

This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.

The MatSeg 2D and 3D synthetic datasets were generated using the Open Images dataset, which is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets with a GNU license.

Example Usage:

Example training and evaluation code for a net trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2

This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.

Paper:

More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".

Croissant metadata and additional sources for downloading the dataset are available at 1,2
replicAnt - Plum2023 - Segmentation Datasets and Trained Models
data-staging.niaid.nih.gov
Updated Apr 21, 2023
Cite
Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Segmentation Datasets and Trained Models [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7849569
Explore at:
Dataset updated
Apr 21, 2023
Dataset provided by
The Pocket Dimension, Munich
Imperial College London
Authors
Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the semantic and instance segmentation experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

Abstract:

Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

Benchmark data

Semantic and instance segmentation is used only rarely in non-human animals, partially due to the laborious process of curating sufficiently large annotated datasets. replicAnt can produce pixel-perfect segmentation maps with minimal manual effort. In order to assess the quality of the segmentations inferred by networks trained with these maps, semi-quantitative verification was conducted using a set of macro-photographs of Leptoglossus zonatus (Dallas, 1852) and Leptoglossus phyllopus (Linnaeus, 1767), provided by Prof. Christine Miller (University of Florida) and Royal Tyler (Bugwood.org). For further qualitative assessment of instance segmentation, we used laboratory footage and field photographs of Atta vollenweideri provided by Prof. Flavio Roces. More extensive quantitative validation was infeasible, due to the considerable effort involved in hand-annotating larger datasets on a per-pixel basis.

Synthetic data

We generated two synthetic datasets from a single 3D scanned Leptoglossus zonatus (Dallas, 1852) specimen: one using the default pipeline, and one with additional plant assets, spawned by three dedicated scatterers. The plant assets were taken from the Quixel library and include 20 grass and 11 fern and shrub assets. Two dedicated grass scatterers were configured to spawn between 10,000 and 100,000 instances; the fern and shrub scatterer spawned between 500 and 10,000 instances. A total of 10,000 samples were generated for each sub-dataset, leading to a combined dataset comprising 20,000 image render and ID passes. The addition of plant assets was necessary, as many of the macro-photographs also contained truncated plant stems or similar fragments, which networks trained on the default data struggled to distinguish from insect body segments. The ability to simply supplement the asset library underlines one of the main strengths of replicAnt: training data can be tailored to specific use cases with minimal effort.

Funding

This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data from: TextbooksAreAllYouNeed
huggingface.co
Updated Aug 11, 2024
Cite
SebastianB (2024). TextbooksAreAllYouNeed [Dataset]. https://huggingface.co/datasets/SebastianBodza/TextbooksAreAllYouNeed
Explore at:
Dataset updated
Aug 11, 2024
Authors
SebastianB
Description
Creating high quality synthetic Datasets:

Python Textbook with hands-on experience and Code-Exercises -> 42,491 words, 285,786 characters
Test-Driven development with Python -> 66,126 words, 478,070 characters
Torch in Python Textbook with hands-on experience and Code-Exercises -> 60,149 words, 473,343 characters

Todo:

[programming language] hands-on experience and Code-Exercises
Test-driven development with [programming language] hands-on experience and Code-Exercises
[special lib]…

See the full description on the dataset page: https://huggingface.co/datasets/SebastianBodza/TextbooksAreAllYouNeed.
Data archive for paper "Machine Learning Emulation of 3D Cloud Radiative...
zenodo.org
zip
Updated Mar 15, 2022
Cite
David Meyer (2022). Data archive for paper "Machine Learning Emulation of 3D Cloud Radiative Effects" [Dataset]. http://doi.org/10.5281/zenodo.4625414
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4625414
Dataset updated
Mar 15, 2022
Dataset provided by
Zenodo http://zenodo.org/
Authors
David Meyer
Description
Overview

This is the data archive for the paper "Machine Learning Emulation of 3D Cloud Radiative Effects". It contains all model outputs (see the results folder) as well as the Singularity image.

For the development repository (i.e. only containing source files without generated artefacts), please see https://github.com/dmey/ml-3d-cre.

For the Python tool to generate the synthetic data, please refer to the Synthia repository.

Requirements

Linux

Singularity >= 3

Portable Batch System (PBS) job scheduler*

*Although PBS is not a strict requirement, it is required to run the helper scripts included in this repository. Please note that, depending on your specific system settings and resource availability, you may need to modify PBS parameters at the top of the submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=24:00:00).

Initialization

Deflate the data archive with:

./init.sh

Build the Singularity image with:

singularity build --remote tools/singularity/image.sif tools/singularity/image.def

Compile ecRad with Singularity:

./tools/singularity/compile_ecrad.sh

Usage

To reproduce the results as described in the paper, run the following commands from the hpc folder:

qsub -v JOB_NAME=mlp_synthia ./submit_grid_search_synthia.sh
qsub -v JOB_NAME=mlp_default ./submit_grid_search_default.sh
qsub submit_benchmark.sh

Then, to plot stats and identify notebooks, run:

qsub submit_stats.sh
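The submission sequence above could also be assembled from Python, for example to log or dry-run the jobs before handing them to the scheduler. The following is only a sketch: the job names and script names are copied from the commands above, while the function names, the dry_run flag and the assumption that the scripts are launched from the hpc folder via subprocess are illustrative and not part of the archive.

```python
import shlex
import subprocess

# Job names and scripts are taken from the commands listed above.
# Everything else (function names, dry_run flag) is hypothetical.
GRID_SEARCH_JOBS = [
    ("mlp_synthia", "./submit_grid_search_synthia.sh"),
    ("mlp_default", "./submit_grid_search_default.sh"),
]


def build_commands():
    """Return the three qsub command lines as argument lists."""
    commands = [
        ["qsub", "-v", f"JOB_NAME={name}", script]
        for name, script in GRID_SEARCH_JOBS
    ]
    commands.append(["qsub", "submit_benchmark.sh"])
    return commands


def submit_all(dry_run=True, cwd="hpc"):
    """Print each command; call qsub only when dry_run is False."""
    for cmd in build_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # Assumes qsub is on PATH and cwd points at the hpc folder.
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    submit_all(dry_run=True)
```

With dry_run left at True the helper only prints the commands, so it can be tried on a machine without PBS.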

Local development

For local development, notebooks can be run independently. To install the required dependencies with Anaconda, run conda env create -f environment.yml. Then activate the environment with conda activate radiation. For ecRad, the system dependencies are listed in tools/singularity/image.def, and ecRad can be compiled with tools/singularity/compile_ecrad.sh.

Licence

Paper code released under the MIT license. Data released under CC BY 4.0. ecRad released under the Apache 2.0 license.
Data and Code for: Hybrid Modelling of Chemical Processes - A Unified...
data.mendeley.com
Updated Aug 11, 2025
Cite
Raymoon Hwang (2025). Data and Code for: Hybrid Modelling of Chemical Processes - A Unified Framework [Dataset]. http://doi.org/10.17632/3v72vcdkyy.2
Explore at:
Unique identifier
https://doi.org/10.17632/3v72vcdkyy.2
Dataset updated
Aug 11, 2025
Authors
Raymoon Hwang
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains the Python source code and synthetic data required to reproduce the results in the paper, "Hybrid Modelling of Chemical Processes: A Unified Framework Based on Deductive, Inductive, and Abductive Inference."

The project implements a layered hybrid modeling framework for a batch polymerization reactor, combining physics-based models with data-driven methods. The framework is composed of three distinct layers:
- Deductive Layer (Tp): Enforces first-principles mass and energy balances to simulate the physical dynamics of the reactor.
- Inductive Layer (Tm): An LSTM-based neural network that learns the unknown reaction kinetics from process data.
- Abductive Layer (Ta): A feedforward neural network that functions as a soft sensor to infer latent (unmeasured) variables such as molecular weight, viscosity, and branching index.

The dataset includes all necessary Python scripts to generate synthetic data, define and train the neural network models, run the integrated hybrid simulation, and visualize the results. The framework is built using Python with libraries including PyTorch, SciPy, and Scikit-learn.
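To make the layering concrete, here is a minimal, dependency-free sketch of how the three layers might be composed. It is not the authors' code: the first-order rate law, its constants and the linear soft sensor are hypothetical stand-ins for the LSTM and feedforward networks, and the reactor is treated as isothermal, so the energy balance is omitted.

```python
from typing import Callable, List, Tuple


def kinetics_placeholder(monomer: float, temperature: float) -> float:
    """Stand-in for the inductive layer (an LSTM in the paper).

    Hypothetical first-order rate law with made-up constants.
    """
    rate_constant = 1e-3 * (temperature / 350.0)
    return rate_constant * monomer


def soft_sensor_placeholder(conversion: float) -> float:
    """Stand-in for the abductive layer (a feedforward net in the paper).

    Hypothetical linear map from conversion to a latent quality variable.
    """
    return 1.0 + 4.0 * conversion


def simulate(
    monomer0: float,
    temperature: float,
    dt: float,
    steps: int,
    rate_fn: Callable[[float, float], float] = kinetics_placeholder,
    sensor_fn: Callable[[float], float] = soft_sensor_placeholder,
) -> List[Tuple[float, float, float]]:
    """Deductive layer: explicit-Euler monomer mass balance dM/dt = -r.

    Returns (time, monomer, inferred latent variable) for every step.
    """
    monomer = monomer0
    history = []
    for step in range(steps + 1):
        conversion = 1.0 - monomer / monomer0
        history.append((step * dt, monomer, sensor_fn(conversion)))
        # The balance is enforced here; the rate comes from the learned part.
        monomer = max(monomer - dt * rate_fn(monomer, temperature), 0.0)
    return history


trajectory = simulate(monomer0=2.0, temperature=350.0, dt=1.0, steps=100)
```

The design point is that the physics loop never needs to know how rate_fn or sensor_fn are implemented, so trained networks can be swapped in for the placeholders.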
glaive-code-assistant
huggingface.co
Updated Sep 22, 2023
Cite
Glaive AI (2023). glaive-code-assistant [Dataset]. https://huggingface.co/datasets/glaiveai/glaive-code-assistant
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 22, 2023
Dataset authored and provided by
Glaive AI
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Glaive-code-assistant

Glaive-code-assistant is a dataset of ~140k code problems and solutions generated using Glaive’s synthetic data generation platform. The data is intended to be used to make models act as code assistants, so it is structured in a QA format where the questions are worded similarly to how real users ask code-related questions. About 60% of the samples are Python. To report any problems or suggestions in the data, join the Glaive Discord.
Data product and code for: Spatiotemporal Distribution of Dissolved...
zenodo.org
nc, zip
Updated Dec 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tobias Ehmen; Neill Mackay; Andrew Watson (2024). Data product and code for: Spatiotemporal Distribution of Dissolved Inorganic Carbon in the Global Ocean Interior - Reconstructed through Machine Learning [Dataset]. http://doi.org/10.5281/zenodo.14575969
Explore at:
Available download formats: nc, zip
Unique identifier
https://doi.org/10.5281/zenodo.14575969
Dataset updated
Dec 30, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Tobias Ehmen; Neill Mackay; Andrew Watson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data product and code for: Ehmen et al.: Spatiotemporal Distribution of Dissolved Inorganic Carbon in the Global Ocean Interior - Reconstructed through Machine Learning

Note that due to the data limit on Zenodo only a compressed version of the ensemble mean is uploaded here (compressed_DIC_mean_15fold_ensemble_aveRMSE7.46_0.15TTcasts_1990-2023.nc). Individual ensemble members can be generated through the weight and scaler files found in weights_and_scalers_DIC_paper.zip and the code "ResNet_DIC_loading_past_prediction_2024-12-28.py" (see description below).

EN4_thickness_GEBCO.nc contains the scaling factors used in "plot_carbon_inventory_for_ensemble_2024-01-27.py" (see description below).

DIC_paper_code_Ehmen_et_al.zip contains the Python code used to generate products and figures.

Prerequisites: Python with the modules tensorflow, shap, xarray, pandas and scipy installed. Plots additionally use matplotlib, cartopy, seaborn, statsmodels, gsw and cmocean.

The main scripts used to generate reconstructions are “ResNet_DIC_2024-12-28.py” (for new training runs) and “ResNet_DIC_loading_past_prediction_2024-12-28.py” (for already trained past weight and scaler files). Usage:

Assign the correct directories in the function “create_directories” according to your own system. You won’t need the same if-statements for individual platforms and computers.

Download the most recent version of GLODAP and store it in the directory chosen in “create_directories”. Check if the filename is the same as used in “import_GLODAP_dataset”. Unless the GLODAP creators change their naming system of the columns, newer versions can be used instead of GLODAPv2.2023

Download the HOT, BATS and Drake Passage time series and ensure the filenames are the same as in “import_time_series_data”. Store them in the time series directory chosen in “create_directories”. This is optional and the time series prediction can be commented out.

Download EN4 analysis files for the years you want and store them in the EN4 analysis directory chosen in “create_directories”. For the reconstruction to be created from all available EN4 analysis files, the variable prediction_to_file needs to be True, otherwise only a single time slice will be predicted (but not saved) for testing and plotting.

If you want to generate reconstructions from pre-trained models, make sure the “scalers” and “weight_files” subdirectories are correctly stored in the “training” directory defined in “create_directories”.

Store the synthetic dataset of ECCO-Darwin values at GLODAP locations in the directory chosen in “create_directories”. For predicting the full model fields, ECCO-Darwin needs to be in a csv-style format (for use in pandas dataframes), i.e. the multi-dimensional data needs to be flattened. Store these altered csv-style files in the directory chosen in “create_directories”.
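The flattening mentioned in the last step can be illustrated with a small standard-library sketch. The column names (depth, lat, lon, DIC), the nested-list layout and the numbers are hypothetical, since the exact columns expected by the scripts are not stated here. For real gridded files, xarray's to_dataframe method is the more usual route.

```python
import csv
import io
from itertools import product


def flatten_field(values, depths, lats, lons, name="DIC"):
    """Yield one row per grid cell from a nested [depth][lat][lon] list."""
    for (k, depth), (j, lat), (i, lon) in product(
        enumerate(depths), enumerate(lats), enumerate(lons)
    ):
        yield {"depth": depth, "lat": lat, "lon": lon, name: values[k][j][i]}


def to_csv_text(rows, fieldnames):
    """Serialise rows into CSV text that pandas.read_csv could load."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Tiny made-up field: 2 depths x 2 latitudes x 3 longitudes.
depths, lats, lons = [5.0, 15.0], [-10.0, 10.0], [0.0, 120.0, 240.0]
values = [
    [[2000 + 100 * k + 10 * j + i for i in range(3)] for j in range(2)]
    for k in range(2)
]
rows = list(flatten_field(values, depths, lats, lons))
csv_text = to_csv_text(rows, ["depth", "lat", "lon", "DIC"])
```

Each grid cell becomes one row carrying its coordinates, which is the long (tidy) layout that a pandas dataframe works with directly.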

Once a reconstruction has been generated the following scripts found in the subdirectory “working_with_finished_reconstructions” can be used:

ensemble_create_mean_and_std_2023-11-27.py: this creates an ensemble mean from ideally 15 ensemble members (the number can be adjusted; if fewer reconstruction files are found than this number, it is adjusted automatically). For DIC it also calculates the uncertainty following the method by Keppler et al. 2023.

plot_carbon_inventory_for_ensemble_2024-01-27.py: plots the carbon inventory change for DIC from both the ensemble mean and the individual ensemble members. The most important settings are the defaults. Other options include plotting the seasonal change; the remaining options are not supported in this version, as they require additional files not supplied here.

depth_slices_and_zonal_means_full_prediction_2024-07-05.py: creates several world maps for individual depths and zonal means for the Indian, Atlantic and Pacific Ocean.

Hovmoeller_plots_from_predictions_2024-05-02.py: generates simplified Hovmöller plots from individual reconstructions.

DIC_comparison_with_other_products_2024-06-27: interpolates and compares this product with climatologies and products from other studies. These need to be downloaded first. Products can be excluded if they are removed from the list “files_to_compare”.
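As a rough illustration of the ensemble step described for ensemble_create_mean_and_std above, the sketch below computes a point-wise mean and spread and falls back to the number of members actually available. It is not the published script: the function name is hypothetical, the spread is a plain population standard deviation, and the uncertainty method by Keppler et al. 2023 is not reproduced.

```python
from statistics import mean, pstdev


def ensemble_mean_and_std(members, expected=15):
    """Point-wise mean and spread over the members that are available.

    members: list of equally long sequences, one per ensemble member.
    expected: target ensemble size; fewer members are used if fewer exist.
    """
    if not members:
        raise ValueError("no ensemble members found")
    n_used = min(len(members), expected)  # adjust to what is available
    used = members[:n_used]
    # zip(*used) groups the values of all members at the same grid point.
    means = [mean(point) for point in zip(*used)]
    stds = [pstdev(point) for point in zip(*used)]
    return means, stds, n_used


# Three made-up members with two grid points each.
demo_members = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
demo_means, demo_stds, demo_n = ensemble_mean_and_std(demo_members)
```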
Data from: ESAT: Environmental Source Apportionment Toolkit Python package
catalog.data.gov
s.cnmilf.com
Updated Nov 29, 2024
U.S. EPA Office of Research and Development (ORD) (2024). ESAT: Environmental Source Apportionment Toolkit Python package [Dataset]. https://catalog.data.gov/dataset/esat-environmental-source-apportionment-toolkit-python-package
Explore at:
Dataset updated
Nov 29, 2024
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
The Environmental Source Apportionment Toolkit (ESAT) is an open-source software package that provides API and CLI functionality to create source apportionment workflows specifically targeting environmental datasets. Source apportionment in environmental science is the process of mathematically estimating the profiles and contributions of multiple sources in a dataset and, in the case of ESAT, doing so while considering data uncertainty. There are many potential use cases for source apportionment in environmental science research, such as in the fields of air quality, water quality and potentially many others. The ESAT toolkit is written in Python and Rust, and uses common packages such as numpy, scipy and pandas for data processing. The source apportionment algorithms provided in ESAT include two variants of non-negative matrix factorization (NMF), both of which are written in Rust and contained within the Python package. A collection of data processing and visualization features are included for data and model analytics. The ESAT package includes a synthetic data generator and comparison tools to evaluate ESAT model outputs.
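For readers unfamiliar with uncertainty-aware NMF, the following numpy sketch shows the general idea with a generic weighted multiplicative-update scheme. It does not use ESAT's API and is not claimed to match either of its two algorithms. The function name, the variable names and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def weighted_nmf(V, sigma, k, iterations=200, seed=0, eps=1e-12):
    """Factorise V ~ W @ H with non-negative W and H.

    Each residual is weighted by 1 / sigma**2, so uncertain measurements
    influence the fit less. Returns W (contributions), H (profiles) and
    the weighted squared error after every iteration.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    U = 1.0 / np.square(sigma)  # per-element weights from uncertainties
    losses = []
    for _ in range(iterations):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ (U * V)) / (W.T @ (U * (W @ H)) + eps)
        W *= ((U * V) @ H.T) / ((U * (W @ H)) @ H.T + eps)
        losses.append(float(np.sum(U * np.square(V - W @ H))))
    return W, H, losses


# Made-up example: 3 hidden sources, 50 samples, 8 measured species.
rng = np.random.default_rng(1)
true_W, true_H = rng.random((50, 3)), rng.random((3, 8))
V = true_W @ true_H
sigma = 0.05 * V + 0.01  # hypothetical uncertainty model
W, H, losses = weighted_nmf(V, sigma, k=3)
```

Here the rows of H play the role of source profiles and the columns of W the per-sample contributions of each source.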
