97 datasets found
  1. uplimit-synthetic-data-week-1-filtered

    • huggingface.co
    Updated Apr 4, 2025
    Cite
    Michael Martin (2025). uplimit-synthetic-data-week-1-filtered [Dataset]. https://huggingface.co/datasets/mimartin1234/uplimit-synthetic-data-week-1-filtered
    Explore at:
    Dataset updated
    Apr 4, 2025
    Authors
    Michael Martin
    Description

    📘 Dataset Overview

    This dataset was created as part of the Uplimit course "Synthetic Data Generation for Fine-Tuning." It represents a cleaned and filtered subset of a real instruction-tuning dataset, intended for experimentation with supervised fine-tuning (SFT). The goal was to produce a high-quality synthetic dataset by curating and filtering real-world instruction-response data using semantic deduplication and automated quality scoring.

      🔄 Data Preparation Steps… See the full description on the dataset page: https://huggingface.co/datasets/mimartin1234/uplimit-synthetic-data-week-1-filtered.
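
    The semantic-deduplication step described above can be approximated with sentence embeddings and a cosine-similarity cutoff. A minimal sketch, assuming the all-MiniLM-L6-v2 embedding model and a 0.9 threshold (both illustrative; the dataset card does not specify them):

    # Hedged sketch: embedding-based near-duplicate filtering for SFT records.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    records = [
        {"instruction": "Explain overfitting.", "response": "Overfitting occurs when ..."},
        {"instruction": "What is overfitting?", "response": "A model memorizes noise ..."},
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    emb = model.encode([r["instruction"] for r in records], normalize_embeddings=True)

    kept = []
    for i in range(len(records)):
        # Keep a record only if it is not too similar to any already-kept one.
        if all(float(np.dot(emb[i], emb[j])) < 0.9 for j in kept):  # 0.9 is an assumed threshold
            kept.append(i)

    filtered = [records[i] for i in kept]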
    
  2. CAI-synthetic-10k

    • huggingface.co
    Updated Apr 27, 2024
    Cite
    Inner I Network (2024). CAI-synthetic-10k [Dataset]. https://huggingface.co/datasets/InnerI/CAI-synthetic-10k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2024
    Authors
    Inner I Network
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CAI-Synthetic Model

      Overview
    

    The CAI-Synthetic Model is a large language model designed to understand and respond to complex questions. This model has been fine-tuned on a synthetic dataset from Mostly AI, allowing it to engage in a variety of contexts with reliable responses. It is designed to perform well in diverse scenarios.

      Base Model and Fine-Tuning
    

    Base Model: Google/Gemma-7b

    Fine-Tuning Adapter: LoRA Adapter

    Synthetic Dataset: Mostly AI Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/InnerI/CAI-synthetic-10k.
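
    The card pairs a Gemma-7b base with a LoRA adapter; a minimal PEFT sketch of attaching such an adapter (the rank, alpha, and target modules are assumptions, not values from the card):

    # Hedged sketch: wrapping the base model with a LoRA adapter via PEFT.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
    config = LoraConfig(
        r=16,                                  # assumed rank
        lora_alpha=32,                         # assumed scaling factor
        target_modules=["q_proj", "v_proj"],   # assumed attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only the adapter weights train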

  3. uplimit-synthetic-data-week-1-filtered

    • huggingface.co
    Cite
    Egill Vignisson, uplimit-synthetic-data-week-1-filtered [Dataset]. https://huggingface.co/datasets/egillv/uplimit-synthetic-data-week-1-filtered
    Explore at:
    Authors
    Egill Vignisson
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was created for project 1 of the Uplimit course Synthetic Data Generation for Fine-tuning AI Models. The inspiration comes from wanting a model that can handle all debates about which basketball player is the greatest of all time (LeBron). The dataset was generated from a list of facts about LeBron James compiled with ChatGPT's Deep Research, followed by two distinct distilabel pipelines and some quality analysis and filtering. The entire process can be… See the full description on the dataset page: https://huggingface.co/datasets/egillv/uplimit-synthetic-data-week-1-filtered.
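
    For readers unfamiliar with distilabel, a minimal sketch of the shape such a pipeline takes (distilabel 1.x API; the pipeline name, seed data, and generation model are illustrative assumptions, not the author's setup):

    # Hedged sketch: a single-step distilabel text-generation pipeline.
    from distilabel.llms import OpenAILLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import TextGeneration

    with Pipeline(name="lebron-goat-qa") as pipeline:  # hypothetical name
        facts = LoadDataFromDicts(
            data=[{"instruction": "Summarize LeBron James's 2016 NBA Finals performance."}]
        )
        generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))  # assumed model
        facts >> generate

    distiset = pipeline.run()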

  4. Fine-tuning set organic chemistry synthetic procedures for AI

    • figshare.com
    json
    Updated Sep 7, 2024
    Cite
    Rik van der Lingen (2024). Fine-tuning set organic chemistry synthetic procedures for AI [Dataset]. http://doi.org/10.6084/m9.figshare.26964226.v1
    Explore at:
    Available download formats: json
    Dataset updated
    Sep 7, 2024
    Dataset provided by
    figshare
    Authors
    Rik van der Lingen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fine-tuning set of organic chemistry synthetic procedures for gpt-4o-mini-2024-07-18. 31 records, taken from the 2024 USPTO dataset and sampled from 2024 academic literature. Objective: scrape relevant information for reactants, reagents, solvents and product. Filetype: jsonl. Trained tokens: 61,065. Epochs: 3. Batch size: 1. Training loss: 0.0258. LR multiplier: 1.8.
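
    For context, OpenAI chat fine-tuning expects one JSON object per line in the chat-messages format; a sketch of writing one such record (the message contents are invented placeholders, not rows from this set):

    # Hedged sketch: appending one chat-format record to a fine-tuning JSONL file.
    import json

    record = {
        "messages": [
            {"role": "system", "content": "Extract reactants, reagents, solvents and product."},
            {"role": "user", "content": "To a solution of the aryl bromide in DCM was added ..."},
            {"role": "assistant", "content": "reactants: aryl bromide; reagents: ...; solvents: DCM; product: ..."},
        ]
    }

    with open("train.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")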

  5. MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions

    • physionet.org
    Updated Apr 22, 2025
    Cite
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones (2025). MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions [Dataset]. http://doi.org/10.13026/30k0-av04
    Explore at:
    Dataset updated
    Apr 22, 2025
    Authors
    Elizabeth Woo; Michael Craig Burkhart; Emily Alsentzer; Brett Beaulieu-Jones
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.
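
    A sketch of turning such question-answer pairs into chat-format SFT examples for a smaller student model (the note/question/answer field names are assumptions made for illustration; consult the dataset for the actual schema):

    # Hedged sketch: teacher-generated QA pairs -> chat-format fine-tuning examples.
    def to_sft_example(pair: dict) -> dict:
        return {
            "messages": [
                {"role": "system", "content": "Answer using only the clinical note."},
                {"role": "user", "content": f"Note:\n{pair['note']}\n\nQuestion: {pair['question']}"},
                {"role": "assistant", "content": pair["answer"]},
            ]
        }

    example = to_sft_example({
        "note": "Patient with type 2 diabetes, maintained on metformin.",
        "question": "Is the patient taking metformin?",
        "answer": "Yes.",
    })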

  6. Synthetic dataset for multi-script text line recognition

    • zenodo.org
    application/gzip
    Updated Feb 9, 2025
    Cite
    SVEN NAJEM-MEYER (2025). Synthetic dataset for multi-script text line recognition [Dataset]. http://doi.org/10.5281/zenodo.14840349
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Feb 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    SVEN NAJEM-MEYER
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often constrain researchers to fine-tune or train new OCR models tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset supports strong results when used to train historical OCR models. This resource can be used both for training and testing purposes, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git. This is a sample, but please contact us if you would like access to the whole dataset.
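
    The render-then-degrade idea can be sketched in a few lines of PIL (the font file, blur radius, and noise range are assumptions; the generator linked on the authors' Git is the authoritative implementation):

    # Hedged sketch: render a polytonic Greek line, then degrade it artificially.
    from PIL import Image, ImageDraw, ImageFont, ImageFilter
    import numpy as np

    def synth_line(text, font_path="GFSDidot.ttf"):  # assumed font file
        font = ImageFont.truetype(font_path, 32)
        img = Image.new("L", (1200, 48), color=255)              # white line canvas
        ImageDraw.Draw(img).text((10, 4), text, font=font, fill=0)
        img = img.filter(ImageFilter.GaussianBlur(radius=0.8))   # simulated blur
        arr = np.array(img, dtype=np.int16)
        arr = arr + np.random.randint(-40, 40, arr.shape)        # simulated noise
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    line = synth_line("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος")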

  7. LLM - Detect AI Datamix

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity in the data; several of them map onto standard decoding parameters, as sketched below. Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature & large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
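
    A minimal sketch of how two of these strategies look as Hugging Face generate() arguments (all parameter values are assumptions; the team's actual configs are not reproduced here):

    # Hedged sketch: contrastive search and high-temperature sampling via transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"  # one of the listed open-source families
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tok("Write an essay on phones in school.", return_tensors="pt")

    # Contrastive search: penalty_alpha together with a small top_k.
    out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=300)

    # High-temperature sampling with large top_k, plus typical_p and suppress_tokens.
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.4,   # assumed value
        top_k=500,         # assumed value
        typical_p=0.95,    # assumed value
        suppress_tokens=[tok.eos_token_id],  # example: forbid early termination
        max_new_tokens=300,
    )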

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data; a small augmentation sketch follows. We mainly used a combination of the following augmentations on a random subset of essays:
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back translation
    • Random capitalization
    • Sentence swapping
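
    A minimal sketch of character-level augmentations of this kind (illustrative only; not the team's code):

    # Hedged sketch: random deletion/insertion/capitalization, applied per character.
    import random

    def augment(text, p=0.05):
        out = []
        for c in text:
            r = random.random()
            if r < p / 3:
                continue                                   # deletion
            if r < 2 * p / 3:
                out.append(c + random.choice("aeiorst"))   # insertion
                continue
            if r < p:
                out.append(c.swapcase())                   # random capitalization
                continue
            out.append(c)
        return "".join(out)

    print(augment("The students wrote their essays by hand."))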

  8. Dataset of synthetic clinical notes in European Portuguese generated using...

    • rdm.inesctec.pt
    Updated Jun 26, 2025
    Cite
    (2025). Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data [Dataset]. https://rdm.inesctec.pt/dataset/cs-2025-005
    Explore at:
    Dataset updated
    Jun 26, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included. The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings. The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.

    About the dataset:
    • XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes
    • CSV file with prompts and responses from prompt engineering
    • CSV files with prompts and responses from synthetic dataset generation
    • CSV file with results from human evaluation
    • TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation

  9. Data from: miRNA-Mediated Regulation of Synthetic Gene Circuits in the Green...

    • figshare.com
    xlsx
    Updated May 30, 2023
    Cite
    Francisco J. Navarro; David C. Baulcombe (2023). miRNA-Mediated Regulation of Synthetic Gene Circuits in the Green Alga Chlamydomonas reinhardtii [Dataset]. http://doi.org/10.1021/acssynbio.8b00393.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Francisco J. Navarro; David C. Baulcombe
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    MicroRNAs (miRNAs), small RNA molecules of 20–24 nts, have many features that make them useful tools for gene expression regulation: small size, flexible design, target predictability, and action at a late stage of the gene expression pipeline. In addition, their role in fine-tuning gene expression can be harnessed to increase robustness of synthetic gene networks. In this work, we apply a synthetic biology approach to characterize miRNA-mediated gene expression regulation in the unicellular green alga Chlamydomonas reinhardtii. This characterization is then used to build tools based on miRNAs, such as synthetic miRNAs, miRNA-responsive 3′UTRs, miRNA decoys, and self-regulatory loops. These tools will facilitate the engineering of gene expression for new applications and improved traits in this alga.

  10. Statistics of social networks.

    • plos.figshare.com
    xls
    Updated Oct 16, 2023
    + more versions
    Cite
    Daiki Suzuki; Sho Tsugawa; Keiichiro Tsukamoto; Shintaro Igari (2023). Statistics of social networks. [Dataset]. http://doi.org/10.1371/journal.pone.0293032.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Daiki Suzuki; Sho Tsugawa; Keiichiro Tsukamoto; Shintaro Igari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing the dynamics of information diffusion cascades and accurately predicting their behavior holds significant importance in various applications. In this paper, we concentrate specifically on a recently introduced contrastive cascade graph learning framework, for the task of predicting cascade popularity. This framework follows a pre-training and fine-tuning paradigm to address cascade prediction tasks. In a previous study, the transferability of pre-trained models within the contrastive cascade graph learning framework was examined solely between two social media datasets. However, in our present study, we comprehensively evaluate the transferability of pre-trained models across 13 real datasets and six synthetic datasets. We construct several pre-trained models using real cascades and synthetic cascades generated by the independent cascade model and the Profile model. Then, we fine-tune these pre-trained models on real cascade datasets and evaluate their prediction accuracy based on the mean squared logarithmic error. The main findings derived from our results are as follows. (1) The pre-trained models exhibit transferability across diverse types of real datasets in different domains, encompassing different languages, social media platforms, and diffusion time scales. (2) Synthetic cascade data prove effective for pre-training purposes. The pre-trained models constructed with synthetic cascade data demonstrate comparable effectiveness to those constructed using real data. (3) Synthetic cascade data prove beneficial for fine-tuning the contrastive cascade graph learning models and training other state-of-the-art popularity prediction models. Models trained using a combination of real and synthetic cascades yield significantly lower mean squared logarithmic error compared to those trained solely on real cascades. Our findings affirm the effectiveness of synthetic cascade data in enhancing the accuracy of cascade popularity prediction.
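
    The evaluation metric is easy to restate in code; a small sketch with invented numbers:

    # Mean squared logarithmic error (MSLE), as used to score popularity predictions.
    import numpy as np

    def msle(y_true, y_pred):
        return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

    # Illustrative cascade sizes only; not values from the paper.
    print(msle(np.array([10, 250, 3000]), np.array([12, 180, 2600])))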

  11. Tuberculosis X-Ray Dataset (Synthetic)

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Arif Miah (2025). Tuberculosis X-Ray Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/miadul/tuberculosis-x-ray-dataset-synthetic/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ๐Ÿ“ Dataset Summary

    This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.

    💡 Context

    Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.

    ๐Ÿ—ƒ๏ธ Dataset Details

    • Number of Rows: 20,000
    • Number of Columns: 15
    • File Format: CSV
    • Resolution: Simulated patient data, not real X-ray images
    • Size: Approximately 10 MB

    ๐Ÿท๏ธ Columns and Descriptions

    Patient_ID: Unique ID for each patient (e.g., PID000001)
    Age: Age of the patient (in years)
    Gender: Gender of the patient (Male/Female)
    Chest_Pain: Presence of chest pain (Yes/No)
    Cough_Severity: Severity of cough (scale: 0-9)
    Breathlessness: Severity of breathlessness (scale: 0-4)
    Fatigue: Level of fatigue experienced (scale: 0-9)
    Weight_Loss: Weight loss (in kg)
    Fever: Level of fever (Mild, Moderate, High)
    Night_Sweats: Whether night sweats are present (Yes/No)
    Sputum_Production: Level of sputum production (Low, Medium, High)
    Blood_in_Sputum: Presence of blood in sputum (Yes/No)
    Smoking_History: Smoking status (Never, Former, Current)
    Previous_TB_History: Previous tuberculosis history (Yes/No)
    Class: Target variable indicating the condition (Normal, Tuberculosis)

    ๐Ÿ” Data Generation Process

    The dataset was generated using Python with the following libraries:
    - Pandas: To create and save the dataset as a CSV file
    - NumPy: To generate random numbers and simulate realistic data
    - Random Seed: Set to ensure reproducibility

    The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
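
    A sketch of how such a table can be produced with the listed libraries (the column subset and distributions shown are assumptions; only the fixed seed, row count, and 70-30 split are stated above):

    # Hedged sketch: seeded synthetic tabular data with a 70-30 class split.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    n = 20_000

    df = pd.DataFrame({
        "Patient_ID": [f"PID{i:06d}" for i in range(1, n + 1)],
        "Age": rng.integers(1, 90, n),
        "Cough_Severity": rng.integers(0, 10, n),
        "Class": rng.choice(["Normal", "Tuberculosis"], size=n, p=[0.7, 0.3]),
    })
    df.to_csv("tuberculosis_xray_synthetic.csv", index=False)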

    🔧 Usage

    This dataset is intended for:
    - Machine Learning and Deep Learning classification tasks
    - Data exploration and feature analysis
    - Model evaluation and comparison
    - Educational and research purposes

    📊 Potential Applications

    1. Tuberculosis Detection Models: Train CNNs or other classification algorithms to detect TB.
    2. Healthcare Research: Analyze the correlation between symptoms and TB outcomes.
    3. Data Visualization: Perform EDA to uncover patterns and insights.
    4. Model Benchmarking: Compare various algorithms for TB detection.

    📑 License

    This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.

    🙌 Acknowledgments

    This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.

  12. Geo Fossils-I Dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Jan 6, 2023
    Cite
    Athanasios Nathanail (2023). Geo Fossils-I Dataset [Dataset]. http://doi.org/10.5281/zenodo.7510741
    Explore at:
    Dataset updated
    Jan 6, 2023
    Authors
    Athanasios Nathanail
    Description

    {"references": ["Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv. https://doi.org/10.48550/arXiv.2112.10752", "Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2022). DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. arXiv. https://doi.org/10.48550/arXiv.2208.12242"]} Geo Fossils-I is a synthetic dataset of fossil images that can be a pioneer in solving the limited availability of Image Classification and Object Detection on 2D images from geological outcrops. The dataset consists of six different fossil types found in geological outcrops, with 200 images per class, for a total of 1200 fossil images.

  13. gba-trajectories

    • huggingface.co
    Updated Jan 9, 2025
    Cite
    Martin Krasser (2025). gba-trajectories [Dataset]. https://huggingface.co/datasets/krasserm/gba-trajectories
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 9, 2025
    Authors
    Martin Krasser
    Description

    A synthetic dataset from an agent simulation for planner LLM fine-tuning. See Planner fine-tuning on synthetic agent trajectories and bot-with-plan for further details.

  14. [DCASE2024 Task 3] Synthetic SELD mixtures for baseline training

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 6, 2024
    + more versions
    Cite
    Krause, Daniel Aleksander (2024). [DCASE2024 Task 3] Synthetic SELD mixtures for baseline training [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10932240
    Explore at:
    Dataset updated
    Apr 6, 2024
    Dataset provided by
    Politis, Archontis
    Krause, Daniel Aleksander
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DESCRIPTION: This audio dataset serves as supplementary material for the DCASE2024 Challenge Task 3: Audio and Audiovisual Sound Event Localization and Detection with Distance Estimation. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated using the same process as the one used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.

    The SELD task setup in DCASE2024 is based on spatial recordings of real scenes, captured in the STARS23 dataset. Since the task setup allows use of external data, these synthetic mixtures serve as additional training material for the baseline model. For more details on the task setup, please refer to the task description.

    Note that the generator code and the collection of room responses used to spatialize sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see:

    Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2021), Barcelona, Spain.

    available here.

    SPECIFICATIONS:

    13 target sound classes (see task description for details)

    The sound event samples are sourced from the FSD50K dataset, based on the affinity of the labels in that dataset to the target classes. The selection involved distinguishing which labels in FSD50K corresponded to the target ones, then choosing samples that were tagged with only those labels and that additionally had an annotator rating of Present and Predominant (see FSD50K for more details). The list of the selected files is included here.

    1200 1-minute long spatial recordings

    Sampling rate of 24kHz

    Two 4-channel recording formats, first-order Ambisonics (FOA) and tetrahedral microphone array (MIC)

    Spatial events spatialized in 9 unique rooms, using measured RIRs for the two formats

    Maximum polyphony of 3 (with possible same-class events overlapping)

    Even though the whole set is used for training of the baseline without distinction between the mixtures, we have included a separation into a training and testing split, in case one needs to test performance purely on those synthetic conditions (for example for comparisons with training on mixed synthetic-real data, fine-tuning on real data, or training on real data only).

    The training split is indicated as fold1 in the dataset, contains 900 recordings spatialized on 6 rooms (150 recordings/room) and it is based on samples from the development set of FSD50K.

    The testing split is indicated as fold2 in the dataset, contains 300 recordings spatialized on 3 rooms (100 recordings/room) and it is based on samples from the evaluation set of FSD50K.

    Common metadata files for both formats are provided. For the file naming and the metadata format, refer to the task setup.

    DOWNLOAD INSTRUCTIONS:

    Download the zip files and use your preferred compression tool to extract these split zip files. To extract a split zip archive (named as .zip, .z01, .z02, ...), you could use, for example, the following syntax in a Linux or OSX terminal:

    Combine the split archive to a single archive:

    zip -s 0 split.zip --out single.zip

    Extract the single archive using unzip:

    unzip single.zip

  15. GPT-3 Curie generated synthetic datasets based on the datasets: Founta,...

    • zenodo.org
    • data.niaid.nih.gov
    csv, tsv
    Updated Apr 24, 2025
    Cite
    Maximilian Schmidhuber (2025). GPT-3 Curie generated synthetic datasets based on the datasets: Founta, Stormfront, HatEval 2019, Davidson, GermEval 2021, SemEval 2022 Task 4 [Dataset]. http://doi.org/10.5281/zenodo.10022788
    Explore at:
    Available download formats: tsv, csv
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schmidhuber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 19, 2023
    Description

    This dataset is a composition of six toxic or hateful synthetic datasets based on the following published datasets:

    "Large scale crowdsourcing and characterization of twitter abusive behavior"

    "Hate Speech Dataset from a White Supremacy Forum"

    "Automated hate speech detection and the problem of offensive language"

    "Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter"

    "Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments"

    "Don't patronize me! An annotated dataset with patronizing and condescending language towards vulnerable communities"

    Each subset was generated by a separate GPT-3 Curie model fine-tuned on one label of the corresponding source dataset. The data is not filtered and likely needs to be processed before being useful.

  16. Data from: Synthetic Core Promoters as Universal Parts for Fine-Tuning...

    • acs.figshare.com
    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Rui M. C. Portela; Thomas Vogl; Claudia Kniely; Jasmin E. Fischer; Rui Oliveira; Anton Glieder (2023). Synthetic Core Promoters as Universal Parts for Fine-Tuning Expression in Different Yeast Species [Dataset]. http://doi.org/10.1021/acssynbio.6b00178.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Rui M. C. Portela; Thomas Vogl; Claudia Kniely; Jasmin E. Fischer; Rui Oliveira; Anton Glieder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Synthetic biology and metabolic engineering experiments frequently require the fine-tuning of gene expression to balance and optimize protein levels of regulators or metabolic enzymes. A key concept of synthetic biology is the development of modular parts that can be used in different contexts. Here, we have applied a computational multifactor design approach to generate de novo synthetic core promoters and 5′ untranslated regions (UTRs) for yeast cells. In contrast to upstream cis-regulatory modules (CRMs), core promoters are typically not subject to specific regulation, making them ideal engineering targets for gene expression fine-tuning. 112 synthetic core promoter sequences were designed on the basis of the sequence/function relationship of natural core promoters, nucleosome occupancy and the presence of short motifs. The synthetic core promoters were fused to the Pichia pastoris AOX1 CRM, and the resulting activity spanned more than a 200-fold range (0.3% to 70.6% of the wild type AOX1 level). The top-ten synthetic core promoters with highest activity were fused to six additional CRMs (three in P. pastoris and three in Saccharomyces cerevisiae). Inducible CRM constructs showed significantly higher activity than constitutive CRMs, reaching up to 176% of natural core promoters. Comparing the activity of the same synthetic core promoters fused to different CRMs revealed high correlations only for CRMs within the same organism. These data suggest that modularity is maintained to some extent but only within the same organism. Due to the conserved role of eukaryotic core promoters, this rational design concept may be transferred to other organisms as a generic engineering tool.

  17. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 21, 2023
    Cite
    Peter Nutter; Mika Senghaas; Ludek Cizinsky (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. http://doi.org/10.5281/zenodo.10413068
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peter Nutter; Mika Senghaas; Ludek Cizinsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    • LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
    • Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics (a metric sketch follows this list).
    • Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
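
    A small sketch of the reported metric, macro F1 over multi-label predictions (toy arrays with three labels; sklearn's f1_score handles the multi-label indicator format directly):

    # Macro F1 for multi-label website classification (illustrative arrays only).
    import numpy as np
    from sklearn.metrics import f1_score

    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
    print(f1_score(y_true, y_pred, average="macro"))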

    Dataset Composition:

    • curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
    • curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    • Fine-tuning and advancing Homepage2Vec or similar website classification models
    • Research on LLM-generated datasets for text classification tasks
    • Exploration of multilingual website classification

    Additional Information:

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  18. SYNTHETIC-1

    • huggingface.co
    Updated Jul 14, 2025
    + more versions
    Cite
    Prime Intellect (2025). SYNTHETIC-1 [Dataset]. https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 14, 2025
    Dataset authored and provided by
    Prime Intellect
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SYNTHETIC-1: Two Million Crowdsourced Reasoning Traces from Deepseek-R1

    SYNTHETIC-1 is a reasoning dataset obtained from Deepseek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges or symbolic mathematics verifiers. This is the raw version of the dataset, without any filtering for correctness; filtered datasets specifically for fine-tuning, as well as our 7B model, can be found in our 🤗 SYNTHETIC-1 Collection. The dataset consists of the… See the full description on the dataset page: https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1.
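
    Given the dataset's size, streaming access is the practical way to inspect it; a sketch (the split name is an assumption, so check the dataset card):

    # Hedged sketch: stream SYNTHETIC-1 instead of downloading all traces at once.
    from datasets import load_dataset

    ds = load_dataset("PrimeIntellect/SYNTHETIC-1", split="train", streaming=True)  # assumed split
    print(next(iter(ds)))  # peek at one reasoning trace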

  19. Ferrari Images Dataset (2025)

    • kaggle.com
    Updated Jul 6, 2025
    Cite
    Urvish Ahir (2025). Ferrari Images Dataset (2025) [Dataset]. https://www.kaggle.com/datasets/urvishahir/ferrari-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Urvish Ahir
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    ๐ŸŽ๏ธ Ferrari Image Dataset (3840ร—2160 UHD)

    A curated collection of ultra-high-resolution Ferrari car images, scraped from WSupercars.com and neatly organized by model. This dataset is suited to machine learning, computer vision, and creative applications such as wallpaper generators, AR design tools, and synthetic data modeling. All images are native 3840×2160 resolution, making them well suited to both research and visual content creation.

    📌 Educational and research use only. All images are copyright of their respective owners.

    ๐Ÿ“ Dataset Overview :

    Folder: ferrari_images/
    Subfolders by car model (e.g., f80, 812, sf90)
    Each folder contains multiple ultra-HD wallpapers (3840×2160)
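
    Because the model subfolders double as class labels, this layout plugs straight into torchvision's ImageFolder; a sketch (the resize target is an assumption):

    # Hedged sketch: load the model-per-folder layout for classification experiments.
    from torchvision import datasets, transforms

    tfm = transforms.Compose([
        transforms.Resize((216, 384)),  # downscale the 3840x2160 originals
        transforms.ToTensor(),
    ])
    ferrari = datasets.ImageFolder("ferrari_images/", transform=tfm)
    print(ferrari.classes)  # subfolder names, e.g. ['296', '812', 'f80', 'sf90']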

    Use Cases:

    • Car Model Classification – Train AI to recognize different Ferrari models
    • Vision Tasks – Use for super-resolution, enhancement, detection, and segmentation
    • Generative Models – Ideal input for GANs, diffusion models, or neural style transfer
    • Wallpaper & Web Apps – Populate high-quality visual content for websites or mobile platforms
    • Fine-Tuning Vision Models – Compatible with CNNs, ViTs, and transformer architectures
    • Self-Supervised Learning – Leverage unlabeled images for contrastive training methods
    • Game/Simulation Prototyping – Use as visual references or placeholders in 3D environments
    • AR & Design Tools – Integrate into automotive mockups, design UIs, or creative workflows

    Notes:

    • This release includes only Ferrari vehicle images
    • All images are native UHD (3840×2160), with no duplicates or downscaled versions
    • Novitec-tuned models are included both in the novitec/ folder and within their respective model folders (e.g., 296/, sf90/) for convenience.
  20. Artificially Intelligent RCT

    • osf.io
    Updated Dec 18, 2024
    Cite
    Raymond Duch; Tommaso Batistoni; Raymond Low; Benjamin Manning (2024). Artificially Intelligent RCT [Dataset]. https://osf.io/9k8j7
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Center For Open Science
    Authors
    Raymond Duch; Tommaso Batistoni; Raymond Low; Benjamin Manning
    Description

    This pre-registration outlines the replication of two health-related randomized controlled trials (RCTs) originally conducted in Ghana, using LLM-generated AI subjects. Our primary hypotheses focus on the similarity of treatment effects between AI and human samples, as well as on the relative effectiveness of different LLM configurations, varying by model size and fine-tuning strategy. By leveraging cutting-edge LLM techniques to simulate human behaviour, the objective of the study is to establish the viability of synthetic RCTs as a cost-effective and scalable tool for social science research, with a focus on the Global South. Prior to this pre-registration, we conducted an exploration of various model sizes, strategies for prompt engineering, fine-tuning, and model evaluation, with the objective of informing some key methodological choices.
