Dataset Overview
This dataset was created as part of the Uplimit course "Synthetic Data Generation for Fine-Tuning." It represents a cleaned and filtered subset of a real instruction-tuning dataset, intended for experimentation with supervised fine-tuning (SFT). The goal was to produce a high-quality synthetic dataset by curating and filtering real-world instruction-response data using semantic deduplication and automated quality scoring.
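As an illustration (not the course's exact pipeline), embedding-based semantic deduplication can be sketched as follows, assuming the sentence-transformers library, the all-MiniLM-L6-v2 model, and a 0.9 cosine-similarity threshold; all three are assumptions for this sketch:

```python
from sentence_transformers import SentenceTransformer

# Minimal sketch of semantic deduplication: embed each instruction, then greedily
# keep only items whose cosine similarity to all previously kept items is < 0.9.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
texts = [
    "Explain what overfitting is.",
    "Explain the concept of overfitting.",  # near-duplicate of the first
    "Write a haiku about the sea.",
]
emb = model.encode(texts, normalize_embeddings=True)
sims = emb @ emb.T  # cosine similarities, since embeddings are unit-normalized

keep = []
for i in range(len(texts)):
    if all(sims[i, j] < 0.9 for j in keep):
        keep.append(i)
deduped = [texts[i] for i in keep]
print(deduped)  # the near-duplicate instruction is dropped
```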
Data Preparation Steps… See the full description on the dataset page: https://huggingface.co/datasets/mimartin1234/uplimit-synthetic-data-week-1-filtered.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CAI-Synthetic Model
Overview
The CAI-Synthetic Model is a large language model designed to understand and respond to complex questions. This model has been fine-tuned on a synthetic dataset from Mostly AI, allowing it to engage in a variety of contexts with reliable responses. It is designed to perform well in diverse scenarios.
Base Model and Fine-Tuning
Base Model: Google/Gemma-7b
Fine-Tuning Adapter: LoRA Adapter
Synthetic Dataset: Mostly AI Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/InnerI/CAI-synthetic-10k.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created for project 1 of the Uplimit course Synthetic Data Generation for Fine-tuning AI Models. The inspiration comes from wanting a model that can handle all debates about which basketball player is the greatest of all time (LeBron). The dataset was generated from a compiled list of facts about LeBron James gathered with ChatGPT's Deep Research, followed by two distinct distilabel pipelines and some quality analysis and filtering. The entire process can be… See the full description on the dataset page: https://huggingface.co/datasets/egillv/uplimit-synthetic-data-week-1-filtered.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fine-tuning set of organic chemistry synthetic procedures for gpt-4o-mini-2024-07-18. 31 records, taken from the 2024 USPTO dataset and sampled from 2024 academic literature. Objective: scrape relevant information for reactants, reagents, solvents, and product. Filetype: jsonl. Trained tokens: 61,065. Epochs: 3. Batch size: 1. Training loss: 0.0258. LR multiplier: 1.8.
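For context, a record in the chat-style JSONL format accepted by gpt-4o-mini fine-tuning looks roughly like the sketch below; the system prompt, the example procedure, and the output schema are invented placeholders, not drawn from this dataset:

```python
import json

# One hypothetical training record: the assistant turn holds the structured
# extraction the model should learn to produce.
record = {
    "messages": [
        {"role": "system", "content": "Extract reactants, reagents, solvents and product from the procedure."},
        {"role": "user", "content": "The aldehyde (1.0 g) was dissolved in THF and treated with NaBH4 at 0 C."},
        {"role": "assistant", "content": '{"reactants": ["aldehyde"], "reagents": ["NaBH4"], "solvents": ["THF"], "product": "alcohol"}'},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```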
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.
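A minimal sketch of the teacher-side generation step, assuming the transformers library and hardware capable of serving Llama-3.1-70B-Instruct; the prompt wording is illustrative, not the exact prompt used in the study:

```python
from transformers import pipeline

# Hypothetical distillation step: ask the teacher model to produce a
# question-answer pair with supporting text for a clinical note.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-70B-Instruct")
note = "Patient admitted with shortness of breath; started on furosemide 40 mg IV."  # invented example note
prompt = (
    "Read the clinical note below and produce one question about it, the answer, "
    f"and the supporting text span.\n\nNote: {note}\n"
)
out = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
```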
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset yields strong results when used to train historical OCR models. This resource can be used both for training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git. This is a sample, but please contact us if you would like access to the whole dataset.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- A synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (see the sketch after this list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
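A hedged sketch of two of the decoding strategies listed above, using the Hugging Face generate() API; the model choice and parameter values are assumptions for illustration, not our competition configs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
inputs = tok("Write an essay about car-free cities.", return_tensors="pt")

# Contrastive search: deterministic decoding controlled by penalty_alpha and top_k.
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=300)

# High-temperature sampling with a large top-k, plus typical_p and suppress_tokens.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.4,
    top_k=500,
    typical_p=0.9,
    suppress_tokens=[tok.eos_token_id],  # block EOS so the essay runs to max_new_tokens
    max_new_tokens=300,
)
print(tok.decode(sampled[0], skip_special_tokens=True))
```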
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included. The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings. The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.
About the dataset:
- XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes
- CSV file with prompts and responses from prompt engineering
- CSV files with prompts and responses from synthetic dataset generation
- CSV file with results from human evaluation
- TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation
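A minimal sketch for iterating over the XML notes with the Python standard library; the directory name is a placeholder, and since the element schema is not described here, all text content is collected generically:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Collect the plain text of every note, one XML file per note (assumed layout).
notes = []
for path in Path("synthetic_notes").glob("**/*.xml"):  # placeholder directory
    root = ET.parse(path).getroot()
    notes.append(" ".join(t.strip() for t in root.itertext() if t.strip()))
print(f"Loaded {len(notes)} notes")
```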
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
MicroRNAs (miRNAs), small RNA molecules of 20–24 nt, have many features that make them useful tools for gene expression regulation: small size, flexible design, target predictability, and action at a late stage of the gene expression pipeline. In addition, their role in fine-tuning gene expression can be harnessed to increase robustness of synthetic gene networks. In this work, we apply a synthetic biology approach to characterize miRNA-mediated gene expression regulation in the unicellular green alga Chlamydomonas reinhardtii. This characterization is then used to build tools based on miRNAs, such as synthetic miRNAs, miRNA-responsive 3′UTRs, miRNA decoys, and self-regulatory loops. These tools will facilitate the engineering of gene expression for new applications and improved traits in this alga.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing the dynamics of information diffusion cascades and accurately predicting their behavior holds significant importance in various applications. In this paper, we concentrate specifically on a recently introduced contrastive cascade graph learning framework, for the task of predicting cascade popularity. This framework follows a pre-training and fine-tuning paradigm to address cascade prediction tasks. In a previous study, the transferability of pre-trained models within the contrastive cascade graph learning framework was examined solely between two social media datasets. However, in our present study, we comprehensively evaluate the transferability of pre-trained models across 13 real datasets and six synthetic datasets. We construct several pre-trained models using real cascades and synthetic cascades generated by the independent cascade model and the Profile model. Then, we fine-tune these pre-trained models on real cascade datasets and evaluate their prediction accuracy based on the mean squared logarithmic error. The main findings derived from our results are as follows. (1) The pre-trained models exhibit transferability across diverse types of real datasets in different domains, encompassing different languages, social media platforms, and diffusion time scales. (2) Synthetic cascade data prove effective for pre-training purposes. The pre-trained models constructed with synthetic cascade data demonstrate comparable effectiveness to those constructed using real data. (3) Synthetic cascade data prove beneficial for fine-tuning the contrastive cascade graph learning models and training other state-of-the-art popularity prediction models. Models trained using a combination of real and synthetic cascades yield significantly lower mean squared logarithmic error compared to those trained solely on real cascades. Our findings affirm the effectiveness of synthetic cascade data in enhancing the accuracy of cascade popularity prediction.
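For reference, the mean squared logarithmic error used for evaluation can be computed as in this short NumPy sketch:

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean of (log(1+y) - log(1+y_hat))^2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Illustrative values: true vs. predicted cascade sizes.
print(msle([10, 100, 1000], [12, 80, 1500]))
```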
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.
Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.
| Column Name | Description |
|---|---|
| Patient_ID | Unique ID for each patient (e.g., PID000001) |
| Age | Age of the patient (in years) |
| Gender | Gender of the patient (Male/Female) |
| Chest_Pain | Presence of chest pain (Yes/No) |
| Cough_Severity | Severity of cough (scale: 0-9) |
| Breathlessness | Severity of breathlessness (scale: 0-4) |
| Fatigue | Level of fatigue experienced (scale: 0-9) |
| Weight_Loss | Weight loss (in kg) |
| Fever | Level of fever (Mild, Moderate, High) |
| Night_Sweats | Whether night sweats are present (Yes/No) |
| Sputum_Production | Level of sputum production (Low, Medium, High) |
| Blood_in_Sputum | Presence of blood in sputum (Yes/No) |
| Smoking_History | Smoking status (Never, Former, Current) |
| Previous_TB_History | Previous tuberculosis history (Yes/No) |
| Class | Target variable indicating the condition (Normal, Tuberculosis) |
The dataset was generated using Python with the following libraries:
- Pandas: To create and save the dataset as a CSV file
- NumPy: To generate random numbers and simulate realistic data
- Random Seed: Set to ensure reproducibility
The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
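A minimal sketch of this kind of generation process with Pandas and NumPy, using the 70-30 class split described above; the seed value, the value ranges, and the output filename are assumptions, and only a few of the columns are shown:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility (assumed value)
n = 20_000

df = pd.DataFrame({
    "Patient_ID": [f"PID{i:06d}" for i in range(1, n + 1)],
    "Age": rng.integers(1, 91, n),                      # assumed age range
    "Gender": rng.choice(["Male", "Female"], n),
    "Cough_Severity": rng.integers(0, 10, n),           # scale 0-9
    "Class": rng.choice(["Normal", "Tuberculosis"], n, p=[0.7, 0.3]),  # 70-30 split
})
df.to_csv("synthetic_tb_xray.csv", index=False)         # placeholder filename
```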
This dataset is intended for:
- Machine Learning and Deep Learning classification tasks
- Data exploration and feature analysis
- Model evaluation and comparison
- Educational and research purposes
This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.
This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.
{"references": ["Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv. https://doi.org/10.48550/arXiv.2112.10752", "Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2022). DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. arXiv. https://doi.org/10.48550/arXiv.2208.12242"]} Geo Fossils-I is a synthetic dataset of fossil images that can be a pioneer in solving the limited availability of Image Classification and Object Detection on 2D images from geological outcrops. The dataset consists of six different fossil types found in geological outcrops, with 200 images per class, for a total of 1200 fossil images.
A synthetic dataset from an agent simulation for planner LLM fine-tuning. See Planner fine-tuning on synthetic agent trajectories and bot-with-plan for further details.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DESCRIPTION: This audio dataset serves as supplementary material for the DCASE2024 Challenge Task 3: Audio and Audiovisual Sound Event Localization and Detection with Distance Estimation. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated using the same process as the one used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.
The SELD task setup in DCASE2024 is based on spatial recordings of real scenes, captured in the STARS23 dataset. Since the task setup allows use of external data, these synthetic mixtures serve as additional training material for the baseline model. For more details on the task setup, please refer to the task description.
Note that the generator code and the collection of room responses used to spatialize the sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see:
Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.
SPECIFICATIONS:
13 target sound classes (see task description for details)
The sound event samples are sourced from the FSD50K dataset, based on the affinity of the labels in that dataset to the target classes. The selection consisted of distinguishing which labels in FSD50K corresponded to the target ones, then selecting samples that were tagged with only those labels and that additionally had an annotator rating of Present and Predominant (see FSD50K for more details). The list of the selected files is included here.
1200 1-minute long spatial recordings
Sampling rate of 24 kHz
Two 4-channel recording formats, first-order Ambisonics (FOA) and tetrahedral microphone array (MIC)
Spatial events spatialized in 9 unique rooms, using measured RIRs for the two formats
Maximum polyphony of 3 (with possible same-class events overlapping)
Even though the whole set is used for training the baseline without distinction between the mixtures, we have included a separation into training and testing splits, in case one needs to test performance purely under those synthetic conditions (for example, for comparisons with training on mixed synthetic-real data, fine-tuning on real data, or training on real data only).
The training split, indicated as fold1 in the dataset, contains 900 recordings spatialized in 6 rooms (150 recordings/room) and is based on samples from the development set of FSD50K.
The testing split, indicated as fold2 in the dataset, contains 300 recordings spatialized in 3 rooms (100 recordings/room) and is based on samples from the evaluation set of FSD50K.
Common metadata files for both formats are provided. For the file naming and the metadata format, refer to the task setup.
DOWNLOAD INSTRUCTIONS:
Download the zip files and use your preferred compression tool to unzip these split zip files. To extract a split zip archive (named .zip, .z01, .z02, ...), you could use, for example, the following syntax in a Linux or macOS terminal:
Combine the split archive to a single archive:
zip -s 0 split.zip --out single.zip
Extract the single archive using unzip:
unzip single.zip
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a composition of six toxic or hateful synthetic datasets based on the datasets published by:
"Large scale crowdsourcing and characterization of twitter abusive behavior"
"Hate Speech Dataset from a White Supremacy Forum"
"Automated hate speech detection and the problem of offensive language"
"Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter"
"Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments"
"Don't patronize me! An annotated dataset with patronizing and condescending language towards vulnerable communities"
All data is generated by a separate GPT-3 Curie model fine-tuned on one label of the dataset. The data is not filtered and likely needs to be processed before being useful (see the sketch below).
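A hedged example of the kind of basic post-processing this might involve, assuming the data loads into a pandas DataFrame with a text column; the filenames and column name are assumptions:

```python
import pandas as pd

df = pd.read_csv("toxic_synthetic.csv")          # placeholder filename
df["text"] = df["text"].str.strip()              # 'text' column name is assumed
df = df[df["text"].str.len().between(10, 2000)]  # drop degenerate generations
df = df.drop_duplicates(subset="text")           # remove exact duplicates
df.to_csv("toxic_synthetic_clean.csv", index=False)
```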
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Synthetic biology and metabolic engineering experiments frequently require the fine-tuning of gene expression to balance and optimize protein levels of regulators or metabolic enzymes. A key concept of synthetic biology is the development of modular parts that can be used in different contexts. Here, we have applied a computational multifactor design approach to generate de novo synthetic core promoters and 5′ untranslated regions (UTRs) for yeast cells. In contrast to upstream cis-regulatory modules (CRMs), core promoters are typically not subject to specific regulation, making them ideal engineering targets for gene expression fine-tuning. 112 synthetic core promoter sequences were designed on the basis of the sequence/function relationship of natural core promoters, nucleosome occupancy, and the presence of short motifs. The synthetic core promoters were fused to the Pichia pastoris AOX1 CRM, and the resulting activity spanned more than a 200-fold range (0.3% to 70.6% of the wild-type AOX1 level). The top ten synthetic core promoters with highest activity were fused to six additional CRMs (three in P. pastoris and three in Saccharomyces cerevisiae). Inducible CRM constructs showed significantly higher activity than constitutive CRMs, reaching up to 176% of natural core promoters. Comparing the activity of the same synthetic core promoters fused to different CRMs revealed high correlations only for CRMs within the same organism. These data suggest that modularity is maintained to some extent but only within the same organism. Due to the conserved role of eukaryotic core promoters, this rational design concept may be transferred to other organisms as a generic engineering tool.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SYNTHETIC-1: Two Million Crowdsourced Reasoning Traces from Deepseek-R1
SYNTHETIC-1 is a reasoning dataset obtained from Deepseek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges or symbolic mathematics verifiers. This is the raw version of the dataset, without any filtering for correctness. Filtered datasets specifically for fine-tuning, as well as our 7B model, can be found in our SYNTHETIC-1 Collection on Hugging Face. The dataset consists of the… See the full description on the dataset page: https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A curated collection of ultra-high-resolution Ferrari car images, scraped from WSupercars.com and neatly organized by model. This dataset is ideal for machine learning, computer vision, and creative applications such as wallpaper generators, AR design tools, and synthetic data modeling. All images are native 3840×2160 resolution, perfect for both research and visual content creation.
Educational and research use only: all images are copyright of their respective owners.
Folder: ferrari_images/
- Subfolders by car model (e.g., f80, 812, sf90)
- Each folder contains multiple ultra-HD wallpapers (3840×2160)
- Car Model Classification: train AI to recognize different Ferrari models
- Vision Tasks: use for super-resolution, enhancement, detection, and segmentation
- Generative Models: ideal input for GANs, diffusion models, or neural style transfer
- Wallpaper & Web Apps: populate high-quality visual content for websites or mobile platforms
- Fine-Tuning Vision Models: compatible with CNNs, ViTs, and transformer architectures
- Self-Supervised Learning: leverage unlabeled images for contrastive training methods
- Game/Simulation Prototyping: use as visual references or placeholders in 3D environments
- AR & Design Tools: integrate into automotive mockups, design UIs, or creative workflows
- This release includes only Ferrari vehicle images
- All images are native UHD (3840×2160), with no duplicates or downscaled versions
- Novitec-tuned models are included both in the novitec/ folder and within their respective model folders (e.g., 296/, sf90/) for convenience
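A minimal loading sketch with torchvision, assuming the ferrari_images/ layout described above (one subfolder per model); the resize target is an illustrative choice:

```python
from torchvision import datasets, transforms

# ImageFolder treats each model subfolder as one class label.
tfm = transforms.Compose([
    transforms.Resize((216, 384)),  # downscale the 3840x2160 wallpapers (assumed size)
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("ferrari_images", transform=tfm)
print(ds.classes)  # model folder names, e.g. ['296', '812', 'f80', 'novitec', 'sf90']
```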
This pre-registration outlines the replication of two health-related randomized controlled trials (RCTs) originally conducted in Ghana, using LLM-generated AI subjects. Our primary hypotheses focus on the similarity of treatment effects between AI and human samples, as well as on the relative effectiveness of different LLM configurations, varying by model size and fine-tuning strategy. The objective of the study is to establish the viability of synthetic RCTs as a cost-effective and scalable tool for social science research, with a focus on the Global South, by leveraging cutting-edge LLM techniques to simulate human behaviour. Prior to this pre-registration, we explored various model sizes, prompt-engineering strategies, fine-tuning approaches, and model evaluation, with the objective of informing key methodological choices.