21 datasets found
  1. h

    gemma-function-calling

    • huggingface.co
    Updated Oct 15, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dinushi Jayasinghe (2022). gemma-function-calling [Dataset]. https://huggingface.co/datasets/dushj98/gemma-function-calling
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 15, 2022
    Authors
    Dinushi Jayasinghe
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    👉🏽 Important

    This dataset is adapted from hypervariance/function-calling-sharegpt to fine-tune the Google gemma-2-2b-it model for function calling.

      🔀 Changes Made
    

    Merged consecutive "GPT" responses into single responses (affected 8.49% of examples, 7372 out of 86864). Updated role names: "system" → Removed (function usage instructions moved to separate column) "human" → "user" "gpt" → "assistant" "function_response" → Unchanged

    Changed message keys from ["from"… See the full description on the dataset page: https://huggingface.co/datasets/dushj98/gemma-function-calling.

  2. h

    databricks-mini

    • huggingface.co
    Updated Feb 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shrinivasan Sankar (2024). databricks-mini [Dataset]. https://huggingface.co/datasets/ai-bites/databricks-mini
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 27, 2024
    Authors
    Shrinivasan Sankar
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a subset of the databricks 15k dataset databricks/databricks-dolly-15k used for finetuning Google's Gemma model google/gemma-2b. This version has only those records without context to match the dataset used in the fine-tuning Keras example from Google.

  3. h

    MNAs-FLAN-T5-Large

    • huggingface.co
    Updated Apr 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alex Ge (2025). MNAs-FLAN-T5-Large [Dataset]. https://huggingface.co/datasets/alexge233/MNAs-FLAN-T5-Large
    Explore at:
    Dataset updated
    Apr 16, 2025
    Authors
    Alex Ge
    Description

    Trying to Finetune using LORA a LLM model using Huggingface on Acquisitions

    Currently we use Llama 3.3 Instruct 70b model it takes anywhere from 1-5 seconds for parallel 3-RAG prompt inference it uses a REST API from huggingface

    Can we fine tune a smaller model, such as Google's Gemma 3 on our small data?

  4. h

    TinyMarkdown-Instruct-PT

    • huggingface.co
    Updated Mar 5, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vitor Augusto Machado Jorge (2025). TinyMarkdown-Instruct-PT [Dataset]. https://huggingface.co/datasets/VAMJ-0042/TinyMarkdown-Instruct-PT
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 5, 2025
    Authors
    Vitor Augusto Machado Jorge
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Markdown Fine-Tuning Datasets (English & PT-BR)

      Overview
    

    These datasets are designed to fine-tune Large Language Models (LLMs) like Gemma to generate structured Markdown-formatted responses. The datasets contain instruction-response pairs, ensuring the model learns how to output Markdown elements correctly.

      Datasets
    
    
    
    
    
      1. English Markdown Dataset
    

    Available on Hugging Face: TinyMarkdown-Instruct-EN Size: Large-scale dataset with structured Markdown… See the full description on the dataset page: https://huggingface.co/datasets/VAMJ-0042/TinyMarkdown-Instruct-PT.

  5. h

    ms-marco-en-bge-gemma

    • huggingface.co
    Updated Apr 28, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    LightOn AI (2025). ms-marco-en-bge-gemma [Dataset]. https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma
    Explore at:
    Dataset updated
    Apr 28, 2025
    Dataset authored and provided by
    LightOn AI
    Description

    ms-marco-en-bge

    This dataset contains the MS MARCO dataset with negatives mined using ColBERT and then scored by bge-reranker-v2-gemma. It can be used to train a retrieval model using knowledge distillation, for example using PyLate.

      knowledge distillation
    

    To fine-tune a model using knowledge distillation loss we will need three distinct file:

    Datasetsfrom datasets import load_dataset

    train = load_dataset( "lightonai/ms-marco-en-gemma", "train"… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma.

  6. h

    Guilherme34_uncensor

    • huggingface.co
    Updated Apr 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    huihui.ai (2025). Guilherme34_uncensor [Dataset]. https://huggingface.co/datasets/huihui-ai/Guilherme34_uncensor
    Explore at:
    Dataset updated
    Apr 9, 2025
    Authors
    huihui.ai
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    huihui-ai/Guilherme34_uncensor

    This dataset is a copy of Guilherme34/uncensor This dataset is used for fine-tuning of huihui-ai/gemma-3-1b-it-abliterated, please refer to GRPO with Unsloth.

      Usage
    

    from datasets import Dataset import json

    Define the system prompt that instructs the model to use a specific format

    SYSTEM_PROMPT = """ Respond in the following format: """

    def get_harmful_questions(split="train"… See the full description on the dataset page: https://huggingface.co/datasets/huihui-ai/Guilherme34_uncensor.

  7. h

    fincen_all_questions_5versions

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shijun Ju, fincen_all_questions_5versions [Dataset]. https://huggingface.co/datasets/shijunju/fincen_all_questions_5versions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Shijun Ju
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    About

    These question-answer pairs are created using published pdf documents at fincen.gov.

    Each question has 5 paraphased versions differentiated by column "question_version" (the first versions (No. 4) are at the end of the datafile). The data is used to fine-tune Gemma-2b and Gemma-7b listed here shijunju/gemma_7b_finRisk_r10_4VersionQ shijunju/gemma_7b_finRisk_r6_4VersionQ shijunju/gemma_7b_finRisk_r6_3VersionQ shijunju/gemma_2b_finRisk

    Number of rows: 14,550

    Author: Shijun… See the full description on the dataset page: https://huggingface.co/datasets/shijunju/fincen_all_questions_5versions.

  8. h

    CAI-synthetic-10k

    • huggingface.co
    Updated Apr 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Inner I Network (2024). CAI-synthetic-10k [Dataset]. https://huggingface.co/datasets/InnerI/CAI-synthetic-10k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 27, 2024
    Authors
    Inner I Network
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CAI-Synthetic Model

      Overview
    

    The CAI-Synthetic Model is a large language model designed to understand and respond to complex questions. This model has been fine-tuned on a synthetic dataset from Mostly AI, allowing it to engage in a variety of contexts with reliable responses. It is designed to perform well in diverse scenarios.

      Base Model and Fine-Tuning
    

    Base Model: Google/Gemma-7b

    Fine-Tuning Adapter: LoRA Adapter

    Synthetic Dataset: Mostly AI Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/InnerI/CAI-synthetic-10k.

  9. h

    khmer_question_answer

    • huggingface.co
    Updated Oct 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bread (2024). khmer_question_answer [Dataset]. http://doi.org/10.57967/hf/3476
    Explore at:
    Dataset updated
    Oct 9, 2024
    Authors
    Bread
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The data collected from https://www.khsearch.com/ related to the general question-answering examination. It used to train fine-tuned models from many LLMs, including LlaMa, Qwen, Mistral, and Gemma. Under the research title "Fine-tuning for Question Answering in Low-Resource Languages: A Case Study on Khmer" conducted at ViLa Lab, Institute of Technology of Cambodia, Phnom Penh.

    Lab Info: https://www.facebook.com/vilalabitc Paper will availble on online soon

  10. h

    youtube-titles

    • huggingface.co
    Updated Jun 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adam Lucek (2024). youtube-titles [Dataset]. https://huggingface.co/datasets/AdamLucek/youtube-titles
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 30, 2024
    Authors
    Adam Lucek
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    YouTube
    Description

    Youtube Title & Descriptions Dataset

      About
    

    4941 videos across 50 YouTube Channels List of sampled channels here

      Splits:
    

    Train: 4199 Validation: 493 Test: 249

    Data was shuffled and sampled evenly from all channels to create splits. Additionally, has a column ready to go for gemma-2-9b-it fine tuning formatting! Potentially more model formats to come.

      About the Data:
    
    
    
    
      Label
      Description
    
    
      channel_name
      The… See the full description on the dataset page: https://huggingface.co/datasets/AdamLucek/youtube-titles.
    
  11. h

    Thinker-JSON

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Minchan, Thinker-JSON [Dataset]. https://huggingface.co/datasets/minchyeom/Thinker-JSON
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Minchan
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Use for whatever you want. Made to replicate the thought traces of OpenAI's o1, I'll release RL datasets including DPO soon enough. For fine-tuning smaller models such as Google's google/gemma-2-2b-it with this dataset, I recommend fine-tuning for 2-3 epochs, the loss will be at around 1.6 at the beginning, and 1.3 by the end of the training job with learning rate of 2e-6. Suggested system prompt: Always respond in strict JSON format with a reasoning_steps array and a response field. Each… See the full description on the dataset page: https://huggingface.co/datasets/minchyeom/Thinker-JSON.

  12. h

    RPRevamped-Small

    • huggingface.co
    Updated Apr 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bhargav Raj (2025). RPRevamped-Small [Dataset]. https://huggingface.co/datasets/TechPowerB/RPRevamped-Small
    Explore at:
    Dataset updated
    Apr 26, 2025
    Authors
    Bhargav Raj
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RPRevamped-Small-v1.0

      Dataset Description
    

    RPRevamped is a synthetic dataset generated by various numbers of models. It is very diverse and is recommended if you are fine-tuning a roleplay model. This is the Small version with Medium and Tiny version currently in work. Github: RPRevamped GitHub Here are the models used in creation of this dataset: DeepSeek-V3-0324 Gemini-2.0-Flash-Thinking-Exp-01-21 DeepSeek-R1 Gemma-3-27B-it Gemma-3-12B-it Qwen2.5-VL-72B-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/TechPowerB/RPRevamped-Small.

  13. h

    fake_news_gonzaloA_gemma_ft

    • huggingface.co
    Updated Sep 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shahaf (2024). fake_news_gonzaloA_gemma_ft [Dataset]. https://huggingface.co/datasets/shahafvl/fake_news_gonzaloA_gemma_ft
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 12, 2024
    Authors
    Shahaf
    Description

    Model Uses: shahafvl/gemma-2-2b-fake-news Attention: Fine-tuned base model. Dataset Used: GonzaloA/fake_news

  14. h

    syntetisk-dialog-opsummering-raw

    • huggingface.co
    Updated Dec 28, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kasper Groes Albin Ludvigsen (2024). syntetisk-dialog-opsummering-raw [Dataset]. https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 28, 2024
    Authors
    Kasper Groes Albin Ludvigsen
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Thanks to NVIDIA and Arrow Denmark for sponsoring the compute needed to generate this dataset

    This dataset conists of 1,000,000 synthetic dialogs in Danish and a summary of each dialog generated with google/gemma-2-27b-it The purpose of the dataset is to fine tune small language models to make dialog summaries, but with minor adjustments it may also be used 1) to train an LLM to restore/improve speaker diarization, 2) to train a classifier for classifying dialogs into the topic, 3)… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw.

  15. h

    synthetic-dialog-summaries-processed

    • huggingface.co
    Updated Dec 31, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kasper Groes Albin Ludvigsen (2024). synthetic-dialog-summaries-processed [Dataset]. https://huggingface.co/datasets/ThatsGroes/synthetic-dialog-summaries-processed
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2024
    Authors
    Kasper Groes Albin Ludvigsen
    Description

    Thanks to Nvida, Arrow and Danish Data Science Community for sponsoring the creation of this dataset

    This is a dataset consisting of 1,000,000 un-diarized dialogs and their corresponding summaries. The data was generated with gemma-2-27b-it. The dataset is intended to be used to fine tune smaller models to summarize dialog. The dialogs are meant to resemble transcriptions of dialog made with models such as Whisper which do not diarize the dialog out of the box. The "messages" column… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-dialog-summaries-processed.

  16. h

    khmer-instruct-dataset

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    An, khmer-instruct-dataset [Dataset]. https://huggingface.co/datasets/Pisethan/khmer-instruct-dataset
    Explore at:
    Authors
    An
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Khmer Instruct Dataset

    This dataset contains 500 instruction-response pairs in Khmer for fine-tuning language models such as Mistral or Gemma.It is designed to support chatbot use cases, educational tools, and instruction-following agents in the Khmer language.

      Format
    

    Each entry in the dataset is a JSON object with:

    prompt: Instruction or question in Khmer response: Expected model output in Khmer

    The file is stored in .jsonl (JSON Lines) format, making it compatible… See the full description on the dataset page: https://huggingface.co/datasets/Pisethan/khmer-instruct-dataset.

  17. h

    smart-home-energy-gemma3

    • huggingface.co
    Updated May 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Epitech (2025). smart-home-energy-gemma3 [Dataset]. https://huggingface.co/datasets/Epitech/smart-home-energy-gemma3
    Explore at:
    Dataset updated
    May 31, 2025
    Dataset authored and provided by
    Epitech
    Description

    Smart Home Energy Optimization Dataset (Bilingual - FR/EN)

    This dataset is designed for fine-tuning lightweight language models (e.g., Gemma 1B) for local energy assistant use cases in smart homes.It contains synthetic instruction-response pairs in both French and English, ideal for on-device LLMs running on resource-constrained environments like a Raspberry Pi 4 (4GB RAM).

      💡 Use Case
    

    Smart Home Energy Assistant:An on-device assistant that helps users reduce energy… See the full description on the dataset page: https://huggingface.co/datasets/Epitech/smart-home-energy-gemma3.

  18. h

    Augmented-Bilingual_Turkish_TR-EN

    • huggingface.co
    Updated Jun 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Batuhan S (2025). Augmented-Bilingual_Turkish_TR-EN [Dataset]. https://huggingface.co/datasets/Ba2han/Augmented-Bilingual_Turkish_TR-EN
    Explore at:
    Dataset updated
    Jun 28, 2025
    Authors
    Batuhan S
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Türkiye
    Description

    Original Dataset: ambrosfitz/10k_wiki_summary This dataset was created with a fine-tuned version of Gemma-3-4B. English input and Turkish system messages (as seen in the dataset) were used to create Turkish rows. The dataset is lightly cleaned to remove possible refusals, direct references to the text and more. (Rows with phrases like Bu makalede, bu metinde, bu iddia... were all dropped) Please credit my account if you use this dataset, thank you!

  19. h

    86k_DUTCH_conversational

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adnane Acudad, 86k_DUTCH_conversational [Dataset]. https://huggingface.co/datasets/aacudad/86k_DUTCH_conversational
    Explore at:
    Authors
    Adnane Acudad
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🧠 Dutch Instruction Dataset (Generated with Gemini & OpenAI)

    This dataset was generated using Gemini and OpenAI's API, and is intended for general-purpose Dutch language model training, instruction tuning, and experimentation. Feel free to use it for your own projects or use-cases.If you do, I’d really appreciate it if you could reference or tag me — thanks! 🙌

      🚀 Used in DUTCHGPT
    

    This dataset has been used to train DUTCHGPT — a fine-tuned version of Gemma and LLaMA… See the full description on the dataset page: https://huggingface.co/datasets/aacudad/86k_DUTCH_conversational.

  20. h

    5K_DUTCH_LEGAL_SUMMARY

    • huggingface.co
    Updated Apr 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adnane Acudad (2025). 5K_DUTCH_LEGAL_SUMMARY [Dataset]. https://huggingface.co/datasets/aacudad/5K_DUTCH_LEGAL_SUMMARY
    Explore at:
    Dataset updated
    Apr 3, 2025
    Authors
    Adnane Acudad
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    ⚖️ Dutch Legal Case Dataset (Summarized with Gemini)

    This dataset consists of 5,000 Dutch legal cases sourced from rechtspraak.nl.Each case includes:

    The original legal text A summary generated by Gemini

    The dataset is designed to support long-context training tasks such as legal reasoning and summarization.

      🚀 Used in DUTCHGPT
    

    This dataset has been used to train DUTCHGPT — a fine-tuned version of Gemma and LLaMA optimized for Dutch. Explore the model here:👉… See the full description on the dataset page: https://huggingface.co/datasets/aacudad/5K_DUTCH_LEGAL_SUMMARY.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Dinushi Jayasinghe (2022). gemma-function-calling [Dataset]. https://huggingface.co/datasets/dushj98/gemma-function-calling

gemma-function-calling

Gemma Function Calling

dushj98/gemma-function-calling

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 15, 2022
Authors
Dinushi Jayasinghe
License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

👉🏽 Important

This dataset is adapted from hypervariance/function-calling-sharegpt to fine-tune the Google gemma-2-2b-it model for function calling.

  🔀 Changes Made

Merged consecutive "GPT" responses into single responses (affected 8.49% of examples, 7372 out of 86864). Updated role names: "system" → Removed (function usage instructions moved to separate column) "human" → "user" "gpt" → "assistant" "function_response" → Unchanged

Changed message keys from ["from"… See the full description on the dataset page: https://huggingface.co/datasets/dushj98/gemma-function-calling.

Search
Clear search
Close search
Google apps
Main menu