Dataset Overview
This dataset was created as part of the Uplimit course "Synthetic Data Generation for Fine-Tuning." It represents a cleaned and filtered subset of a real instruction-tuning dataset, intended for experimentation with supervised fine-tuning (SFT). The goal was to produce a high-quality synthetic dataset by curating and filtering real-world instruction-response data using semantic deduplication and automated quality scoring.
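As an illustration (not the course's exact pipeline), embedding-based semantic deduplication can be sketched as follows, assuming the sentence-transformers library, the all-MiniLM-L6-v2 model, and a 0.9 cosine-similarity threshold; all three are assumptions for this sketch:

```python
from sentence_transformers import SentenceTransformer

# Minimal sketch of semantic deduplication: embed each instruction, then greedily
# keep only items whose cosine similarity to all previously kept items is < 0.9.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
texts = [
    "Explain what overfitting is.",
    "Explain the concept of overfitting.",  # near-duplicate of the first
    "Write a haiku about the sea.",
]
emb = model.encode(texts, normalize_embeddings=True)
sims = emb @ emb.T  # cosine similarities, since embeddings are unit-normalized

keep = []
for i in range(len(texts)):
    if all(sims[i, j] < 0.9 for j in keep):
        keep.append(i)
deduped = [texts[i] for i in keep]
print(deduped)  # the near-duplicate instruction is dropped
```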
Data Preparation Steps… See the full description on the dataset page: https://huggingface.co/datasets/mimartin1234/uplimit-synthetic-data-week-1-filtered.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CAI-Synthetic Model
Overview
The CAI-Synthetic Model is a large language model designed to understand and respond to complex questions. This model has been fine-tuned on a synthetic dataset from Mostly AI, allowing it to engage in a variety of contexts with reliable responses. It is designed to perform well in diverse scenarios.
Base Model and Fine-Tuning
Base Model: Google/Gemma-7b
Fine-Tuning Adapter: LoRA Adapter
Synthetic Dataset: Mostly AI Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/InnerI/CAI-synthetic-10k.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created for project 1 of the Uplimit course Synthetic Data Generation for Fine-tuning AI Models. The inspiration comes from wanting a model that can handle all debates about which basketball player is the greatest of all time (LeBron). The dataset was generated from a compiled list of facts about LeBron James gathered with ChatGPT's Deep Research, followed by two distinct distilabel pipelines and some quality analysis and filtering. The entire process can be… See the full description on the dataset page: https://huggingface.co/datasets/egillv/uplimit-synthetic-data-week-1-filtered.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fine-tuning set of organic chemistry synthetic procedures for gpt-4o-mini-2024-07-18. 31 records, taken from the 2024 USPTO dataset and sampled from 2024 academic literature. Objective: scrape relevant information for reactants, reagents, solvents, and product. Filetype: jsonl. Trained tokens: 61,065. Epochs: 3. Batch size: 1. Training loss: 0.0258. LR multiplier: 1.8.
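For context, a record in the chat-style JSONL format accepted by gpt-4o-mini fine-tuning looks roughly like the sketch below; the system prompt, the example procedure, and the output schema are invented placeholders, not drawn from this dataset:

```python
import json

# One hypothetical training record: the assistant turn holds the structured
# extraction the model should learn to produce.
record = {
    "messages": [
        {"role": "system", "content": "Extract reactants, reagents, solvents and product from the procedure."},
        {"role": "user", "content": "The aldehyde (1.0 g) was dissolved in THF and treated with NaBH4 at 0 C."},
        {"role": "assistant", "content": '{"reactants": ["aldehyde"], "reagents": ["NaBH4"], "solvents": ["THF"], "product": "alcohol"}'},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```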
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.
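A minimal sketch of the teacher-side generation step, assuming the transformers library and hardware capable of serving Llama-3.1-70B-Instruct; the prompt wording is illustrative, not the exact prompt used in the study:

```python
from transformers import pipeline

# Hypothetical distillation step: ask the teacher model to produce a
# question-answer pair with supporting text for a clinical note.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-70B-Instruct")
note = "Patient admitted with shortness of breath; started on furosemide 40 mg IV."  # invented example note
prompt = (
    "Read the clinical note below and produce one question about it, the answer, "
    f"and the supporting text span.\n\nNote: {note}\n"
)
out = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
```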
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset yields strong results when used to train historical OCR models. This resource can be used both for training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git. This is a sample, but please contact us if you would like access to the whole dataset.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- A synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (see the sketch after this list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
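A hedged sketch of two of the decoding strategies listed above, using the Hugging Face generate() API; the model choice and parameter values are assumptions for illustration, not our competition configs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
inputs = tok("Write an essay about car-free cities.", return_tensors="pt")

# Contrastive search: deterministic decoding controlled by penalty_alpha and top_k.
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=300)

# High-temperature sampling with a large top-k, plus typical_p and suppress_tokens.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.4,
    top_k=500,
    typical_p=0.9,
    suppress_tokens=[tok.eos_token_id],  # block EOS so the essay runs to max_new_tokens
    max_new_tokens=300,
)
print(tok.decode(sampled[0], skip_special_tokens=True))
```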
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included. The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings. The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.
About the dataset:
- XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes
- CSV file with prompts and responses from prompt engineering
- CSV files with prompts and responses from synthetic dataset generation
- CSV file with results from human evaluation
- TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation
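A minimal sketch for iterating over the XML notes with the Python standard library; the directory name is a placeholder, and since the element schema is not described here, all text content is collected generically:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Collect the plain text of every note, one XML file per note (assumed layout).
notes = []
for path in Path("synthetic_notes").glob("**/*.xml"):  # placeholder directory
    root = ET.parse(path).getroot()
    notes.append(" ".join(t.strip() for t in root.itertext() if t.strip()))
print(f"Loaded {len(notes)} notes")
```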
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
MicroRNAs (miRNAs), small RNA molecules of 20–24 nt, have many features that make them useful tools for gene expression regulation: small size, flexible design, target predictability, and action at a late stage of the gene expression pipeline. In addition, their role in fine-tuning gene expression can be harnessed to increase robustness of synthetic gene networks. In this work, we apply a synthetic biology approach to characterize miRNA-mediated gene expression regulation in the unicellular green alga Chlamydomonas reinhardtii. This characterization is then used to build tools based on miRNAs, such as synthetic miRNAs, miRNA-responsive 3′UTRs, miRNA decoys, and self-regulatory loops. These tools will facilitate the engineering of gene expression for new applications and improved traits in this alga.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing the dynamics of information diffusion cascades and accurately predicting their behavior holds significant importance in various applications. In this paper, we concentrate specifically on a recently introduced contrastive cascade graph learning framework, for the task of predicting cascade popularity. This framework follows a pre-training and fine-tuning paradigm to address cascade prediction tasks. In a previous study, the transferability of pre-trained models within the contrastive cascade graph learning framework was examined solely between two social media datasets. However, in our present study, we comprehensively evaluate the transferability of pre-trained models across 13 real datasets and six synthetic datasets. We construct several pre-trained models using real cascades and synthetic cascades generated by the independent cascade model and the Profile model. Then, we fine-tune these pre-trained models on real cascade datasets and evaluate their prediction accuracy based on the mean squared logarithmic error. The main findings derived from our results are as follows. (1) The pre-trained models exhibit transferability across diverse types of real datasets in different domains, encompassing different languages, social media platforms, and diffusion time scales. (2) Synthetic cascade data prove effective for pre-training purposes. The pre-trained models constructed with synthetic cascade data demonstrate comparable effectiveness to those constructed using real data. (3) Synthetic cascade data prove beneficial for fine-tuning the contrastive cascade graph learning models and training other state-of-the-art popularity prediction models. Models trained using a combination of real and synthetic cascades yield significantly lower mean squared logarithmic error compared to those trained solely on real cascades. Our findings affirm the effectiveness of synthetic cascade data in enhancing the accuracy of cascade popularity prediction.
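For reference, the mean squared logarithmic error used for evaluation can be computed as in this short NumPy sketch:

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean of (log(1+y) - log(1+y_hat))^2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Illustrative values: true vs. predicted cascade sizes.
print(msle([10, 100, 1000], [12, 80, 1500]))
```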
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.
Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.
| Column Name | Description |
|---|---|
| Patient_ID | Unique ID for each patient (e.g., PID000001) |
| Age | Age of the patient (in years) |
| Gender | Gender of the patient (Male/Female) |
| Chest_Pain | Presence of chest pain (Yes/No) |
| Cough_Severity | Severity of cough (scale: 0-9) |
| Breathlessness | Severity of breathlessness (scale: 0-4) |
| Fatigue | Level of fatigue experienced (scale: 0-9) |
| Weight_Loss | Weight loss (in kg) |
| Fever | Level of fever (Mild, Moderate, High) |
| Night_Sweats | Whether night sweats are present (Yes/No) |
| Sputum_Production | Level of sputum production (Low, Medium, High) |
| Blood_in_Sputum | Presence of blood in sputum (Yes/No) |
| Smoking_History | Smoking status (Never, Former, Current) |
| Previous_TB_History | Previous tuberculosis history (Yes/No) |
| Class | Target variable indicating the condition (Normal, Tuberculosis) |
The dataset was generated using Python with the following libraries:
- Pandas: To create and save the dataset as a CSV file
- NumPy: To generate random numbers and simulate realistic data
- Random Seed: Set to ensure reproducibility
The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
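A minimal sketch of this kind of generation process with Pandas and NumPy, using the 70-30 class split described above; the seed value, the value ranges, and the output filename are assumptions, and only a few of the columns are shown:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility (assumed value)
n = 20_000

df = pd.DataFrame({
    "Patient_ID": [f"PID{i:06d}" for i in range(1, n + 1)],
    "Age": rng.integers(1, 91, n),                      # assumed age range
    "Gender": rng.choice(["Male", "Female"], n),
    "Cough_Severity": rng.integers(0, 10, n),           # scale 0-9
    "Class": rng.choice(["Normal", "Tuberculosis"], n, p=[0.7, 0.3]),  # 70-30 split
})
df.to_csv("synthetic_tb_xray.csv", index=False)         # placeholder filename
```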
This dataset is intended for:
- Machine Learning and Deep Learning classification tasks
- Data exploration and feature analysis
- Model evaluation and comparison
- Educational and research purposes
This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.
This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.
{"references": ["Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv. https://doi.org/10.48550/arXiv.2112.10752", "Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2022). DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. arXiv. https://doi.org/10.48550/arXiv.2208.12242"]} Geo Fossils-I is a synthetic dataset of fossil images that can be a pioneer in solving the limited availability of Image Classification and Object Detection on 2D images from geological outcrops. The dataset consists of six different fossil types found in geological outcrops, with 200 images per class, for a total of 1200 fossil images.
A synthetic dataset from an agent simulation for planner LLM fine-tuning. See Planner fine-tuning on synthetic agent trajectories and bot-with-plan for further details.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DESCRIPTION: This audio dataset serves as supplementary material for the DCASE2024 Challenge Task 3: Audio and Audiovisual Sound Event Localization and Detection with Distance Estimation. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated using the same process as the one used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.
The SELD task setup in DCASE2024 is based on spatial recordings of real scenes, captured in the STARS23 dataset. Since the task setup allows use of external data, these synthetic mixtures serve as additional training material for the baseline model. For more details on the task setup, please refer to the task description.
Note that the generator code and the collection of room responses used to spatialize the sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see:
Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.
SPECIFICATIONS:
13 target sound classes (see task description for details)
The sound event samples are sourced from the FSD50K dataset, based on the affinity of the labels in that dataset to the target classes. The selection consisted of distinguishing which labels in FSD50K corresponded to the target ones, then selecting samples that were tagged with only those labels and that additionally had an annotator rating of Present and Predominant (see FSD50K for more details). The list of the selected files is included here.
1200 1-minute long spatial recordings
Sampling rate of 24 kHz
Two 4-channel recording formats, first-order Ambisonics (FOA) and tetrahedral microphone array (MIC)
Spatial events spatialized in 9 unique rooms, using measured RIRs for the two formats
Maximum polyphony of 3 (with possible same-class events overlapping)
Even though the whole set is used for training the baseline without distinction between the mixtures, we have included a separation into training and testing splits, in case one needs to test performance purely under those synthetic conditions (for example, for comparisons with training on mixed synthetic-real data, fine-tuning on real data, or training on real data only).
The training split, indicated as fold1 in the dataset, contains 900 recordings spatialized in 6 rooms (150 recordings/room) and is based on samples from the development set of FSD50K.
The testing split, indicated as fold2 in the dataset, contains 300 recordings spatialized in 3 rooms (100 recordings/room) and is based on samples from the evaluation set of FSD50K.
Common metadata files for both formats are provided. For the file naming and the metadata format, refer to the task setup.
DOWNLOAD INSTRUCTIONS:
Download the zip files and use your preferred compression tool to unzip these split zip files. To extract a split zip archive (named .zip, .z01, .z02, ...), you could use, for example, the following syntax in a Linux or macOS terminal:
Combine the split archive to a single archive:
zip -s 0 split.zip --out single.zip
Extract the single archive using unzip:
unzip single.zip
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a composition of six toxic or hateful synthetic datasets based on the datasets published by:
"Large scale crowdsourcing and characterization of twitter abusive behavior"
"Hate Speech Dataset from a White Supremacy Forum"
"Automated hate speech detection and the problem of offensive language"
"Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter"
"Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments"
"Don't patronize me! An annotated dataset with patronizing and condescending language towards vulnerable communities"
All data is generated by a separate GPT-3 Curie model fine-tuned on one label of the dataset. The data is not filtered and likely needs to be processed before being useful (see the sketch below).
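A hedged example of the kind of basic post-processing this might involve, assuming the data loads into a pandas DataFrame with a text column; the filenames and column name are assumptions:

```python
import pandas as pd

df = pd.read_csv("toxic_synthetic.csv")          # placeholder filename
df["text"] = df["text"].str.strip()              # 'text' column name is assumed
df = df[df["text"].str.len().between(10, 2000)]  # drop degenerate generations
df = df.drop_duplicates(subset="text")           # remove exact duplicates
df.to_csv("toxic_synthetic_clean.csv", index=False)
```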
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Synthetic biology and metabolic engineering experiments frequently require the fine-tuning of gene expression to balance and optimize protein levels of regulators or metabolic enzymes. A key concept of synthetic biology is the development of modular parts that can be used in different contexts. Here, we have applied a computational multifactor design approach to generate de novo synthetic core promoters and 5′ untranslated regions (UTRs) for yeast cells. In contrast to upstream cis-regulatory modules (CRMs), core promoters are typically not subject to specific regulation, making them ideal engineering targets for gene expression fine-tuning. 112 synthetic core promoter sequences were designed on the basis of the sequence/function relationship of natural core promoters, nucleosome occupancy, and the presence of short motifs. The synthetic core promoters were fused to the Pichia pastoris AOX1 CRM, and the resulting activity spanned more than a 200-fold range (0.3% to 70.6% of the wild-type AOX1 level). The top ten synthetic core promoters with highest activity were fused to six additional CRMs (three in P. pastoris and three in Saccharomyces cerevisiae). Inducible CRM constructs showed significantly higher activity than constitutive CRMs, reaching up to 176% of natural core promoters. Comparing the activity of the same synthetic core promoters fused to different CRMs revealed high correlations only for CRMs within the same organism. These data suggest that modularity is maintained to some extent but only within the same organism. Due to the conserved role of eukaryotic core promoters, this rational design concept may be transferred to other organisms as a generic engineering tool.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SYNTHETIC-1: Two Million Crowdsourced Reasoning Traces from Deepseek-R1
SYNTHETIC-1 is a reasoning dataset obtained from Deepseek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges or symbolic mathematics verifiers. This is the raw version of the dataset, without any filtering for correctness. Filtered datasets specifically for fine-tuning, as well as our 7B model, can be found in our SYNTHETIC-1 Collection on Hugging Face. The dataset consists of the… See the full description on the dataset page: https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A curated collection of ultra-high-resolution Ferrari car images, scraped from WSupercars.com and neatly organized by model. This dataset is ideal for machine learning, computer vision, and creative applications such as wallpaper generators, AR design tools, and synthetic data modeling. All images are native 3840×2160 resolution, perfect for both research and visual content creation.
Educational and research use only: all images are copyright of their respective owners.
Folder: ferrari_images/
- Subfolders by car model (e.g., f80, 812, sf90)
- Each folder contains multiple ultra-HD wallpapers (3840×2160)
- Car Model Classification: train AI to recognize different Ferrari models
- Vision Tasks: use for super-resolution, enhancement, detection, and segmentation
- Generative Models: ideal input for GANs, diffusion models, or neural style transfer
- Wallpaper & Web Apps: populate high-quality visual content for websites or mobile platforms
- Fine-Tuning Vision Models: compatible with CNNs, ViTs, and transformer architectures
- Self-Supervised Learning: leverage unlabeled images for contrastive training methods
- Game/Simulation Prototyping: use as visual references or placeholders in 3D environments
- AR & Design Tools: integrate into automotive mockups, design UIs, or creative workflows
- This release includes only Ferrari vehicle images
- All images are native UHD (3840×2160), with no duplicates or downscaled versions
- Novitec-tuned models are included both in the novitec/ folder and within their respective model folders (e.g., 296/, sf90/) for convenience
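A minimal loading sketch with torchvision, assuming the ferrari_images/ layout described above (one subfolder per model); the resize target is an illustrative choice:

```python
from torchvision import datasets, transforms

# ImageFolder treats each model subfolder as one class label.
tfm = transforms.Compose([
    transforms.Resize((216, 384)),  # downscale the 3840x2160 wallpapers (assumed size)
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("ferrari_images", transform=tfm)
print(ds.classes)  # model folder names, e.g. ['296', '812', 'f80', 'novitec', 'sf90']
```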
This pre-registration outlines the replication of two health-related randomized controlled trials (RCTs) originally conducted in Ghana, using LLM-generated AI subjects. Our primary hypotheses focus on the similarity of treatment effects between AI and human samples, as well as on the relative effectiveness of different LLM configurations, varying by model size and fine-tuning strategy. The objective of the study is to establish the viability of synthetic RCTs as a cost-effective and scalable tool for social science research, with a focus on the Global South, by leveraging cutting-edge LLM techniques to simulate human behaviour. Prior to this pre-registration, we explored various model sizes, prompt-engineering strategies, fine-tuning approaches, and model evaluation, with the objective of informing key methodological choices.