Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
👉🏽 Important
This dataset is adapted from hypervariance/function-calling-sharegpt to fine-tune the Google gemma-2-2b-it model for function calling.
🔀 Changes Made
Merged consecutive "gpt" responses into single responses (affected 8.49% of examples, 7,372 out of 86,864). Updated role names:
"system" → removed (function usage instructions moved to a separate column)
"human" → "user"
"gpt" → "assistant"
"function_response" → unchanged
Changed message keys from ["from"… See the full description on the dataset page: https://huggingface.co/datasets/dushj98/gemma-function-calling.
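As a rough illustration of the remapping described above, a minimal sketch (the field and key names are assumptions, not the actual conversion script):

```python
# Hypothetical sketch of the described ShareGPT-style conversion.
def convert_conversation(conversation):
    """Map the old role names to the new ones and merge consecutive assistant turns."""
    role_map = {"human": "user", "gpt": "assistant", "function_response": "function_response"}
    messages = []
    for turn in conversation:
        if turn["from"] == "system":
            continue  # function usage instructions are moved to a separate column
        role = role_map.get(turn["from"], turn["from"])
        # Merge consecutive "gpt" responses into a single assistant message.
        if messages and role == "assistant" and messages[-1]["role"] == "assistant":
            messages[-1]["content"] += "\n" + turn["value"]
        else:
            messages.append({"role": role, "content": turn["value"]})
    return messages
```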
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a subset of the databricks 15k dataset databricks/databricks-dolly-15k used for finetuning Google's Gemma model google/gemma-2b. This version has only those records without context to match the dataset used in the fine-tuning Keras example from Google.
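A minimal sketch of how such a no-context subset could be produced with the datasets library (the filtering criterion is an assumption based on the description above):

```python
from datasets import load_dataset

# Load the full Dolly 15k dataset and keep only records with an empty context field,
# mirroring the subset used in the Keras Gemma fine-tuning example.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
no_context = dolly.filter(lambda row: not row["context"].strip())
print(len(dolly), len(no_context))
```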
Trying to fine-tune an LLM with LoRA using Hugging Face on acquisitions data
Currently we use the Llama 3.3 Instruct 70B model; it takes anywhere from 1-5 seconds for parallel 3-RAG prompt inference, and it uses a REST API from Hugging Face.
Can we fine-tune a smaller model, such as Google's Gemma 3, on our small data?
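For reference, a minimal LoRA setup sketch with Hugging Face's peft library (the model checkpoint, target modules, and hyperparameters are illustrative assumptions, not a validated recipe for this use case):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2-2b-it"  # assumed smaller model; swap in whichever checkpoint you can access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach a small LoRA adapter so only a fraction of the weights are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```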
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Markdown Fine-Tuning Datasets (English & PT-BR)
Overview
These datasets are designed to fine-tune Large Language Models (LLMs) like Gemma to generate structured Markdown-formatted responses. The datasets contain instruction-response pairs, ensuring the model learns how to output Markdown elements correctly.
Datasets
1. English Markdown Dataset
Available on Hugging Face: TinyMarkdown-Instruct-EN
Size: Large-scale dataset with structured Markdown… See the full description on the dataset page: https://huggingface.co/datasets/VAMJ-0042/TinyMarkdown-Instruct-PT.
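An illustrative example of the kind of instruction-response pair these datasets contain (the field names and content here are hypothetical, not taken from the dataset card):

```python
# Hypothetical record: the response teaches the model to emit Markdown elements.
example = {
    "instruction": "List two popular Python web frameworks as a Markdown bulleted list.",
    "response": "- **Django**\n- **Flask**",
}
```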
ms-marco-en-bge
This dataset contains the MS MARCO dataset with negatives mined using ColBERT and then scored by bge-reranker-v2-gemma. It can be used to train a retrieval model using knowledge distillation, for example using PyLate.
knowledge distillation
To fine-tune a model using a knowledge distillation loss, we will need three distinct files:
Datasets
from datasets import load_dataset
train = load_dataset( "lightonai/ms-marco-en-gemma", "train"… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma.
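A minimal sketch of loading the three pieces with the datasets library (the configuration names "train", "queries", and "documents" and the availability of a train split are assumptions based on the description above):

```python
from datasets import load_dataset

# Knowledge distillation typically needs three pieces: teacher scores, queries, and documents.
dataset_id = "lightonai/ms-marco-en-bge-gemma"  # identifier taken from the dataset page URL above
scores = load_dataset(dataset_id, "train", split="train")
queries = load_dataset(dataset_id, "queries", split="train")
documents = load_dataset(dataset_id, "documents", split="train")
```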
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
huihui-ai/Guilherme34_uncensor
This dataset is a copy of Guilherme34/uncensor. It is used for fine-tuning huihui-ai/gemma-3-1b-it-abliterated; please refer to GRPO with Unsloth.
Usage
from datasets import Dataset
import json
SYSTEM_PROMPT = """ Respond in the following format: """
def get_harmful_questions(split="train"… See the full description on the dataset page: https://huggingface.co/datasets/huihui-ai/Guilherme34_uncensor.
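A hedged sketch of what a loader like get_harmful_questions might look like (the original implementation is truncated above; the file layout, keys, and chat structure below are assumptions):

```python
from datasets import Dataset
import json

def get_harmful_questions(split="train"):
    """Hypothetical loader: read a JSONL file of questions for the given split
    and wrap it as a Hugging Face Dataset of chat-style prompts."""
    rows = []
    with open(f"harmful_questions_{split}.jsonl", encoding="utf-8") as f:  # assumed file name
        for line in f:
            item = json.loads(line)
            rows.append({
                "prompt": [
                    # SYSTEM_PROMPT is the constant defined in the usage snippet above.
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": item["question"]},
                ],
            })
    return Dataset.from_list(rows)
```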
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About
These question-answer pairs were created using published PDF documents at fincen.gov.
Each question has 5 paraphrased versions, differentiated by the "question_version" column (the first versions (No. 4) are at the end of the datafile). The data is used to fine-tune the Gemma-2b and Gemma-7b models listed here:
shijunju/gemma_7b_finRisk_r10_4VersionQ
shijunju/gemma_7b_finRisk_r6_4VersionQ
shijunju/gemma_7b_finRisk_r6_3VersionQ
shijunju/gemma_2b_finRisk
Number of rows: 14,550
Author: Shijun… See the full description on the dataset page: https://huggingface.co/datasets/shijunju/fincen_all_questions_5versions.
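A minimal sketch of selecting a single paraphrase version with the datasets library (the "question_version" column comes from the description above; the split name and the version value used for filtering are assumptions):

```python
from datasets import load_dataset

# Load all question-answer pairs and keep only one paraphrase version per question.
ds = load_dataset("shijunju/fincen_all_questions_5versions", split="train")
one_version = ds.filter(lambda row: row["question_version"] == 0)  # assumed version label
print(len(ds), len(one_version))
```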
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CAI-Synthetic Model
Overview
The CAI-Synthetic Model is a large language model designed to understand and respond to complex questions. It has been fine-tuned on a synthetic dataset from Mostly AI, allowing it to respond reliably across a variety of contexts and perform well in diverse scenarios.
Base Model and Fine-Tuning
Base Model: Google/Gemma-7b
Fine-Tuning Adapter: LoRA Adapter
Synthetic Dataset: Mostly AI Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/InnerI/CAI-synthetic-10k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The data was collected from https://www.khsearch.com/ and relates to the general question-answering examination. It was used to train fine-tuned models from many LLMs, including Llama, Qwen, Mistral, and Gemma, under the research title "Fine-tuning for Question Answering in Low-Resource Languages: A Case Study on Khmer", conducted at ViLa Lab, Institute of Technology of Cambodia, Phnom Penh.
Lab Info: https://www.facebook.com/vilalabitc. The paper will be available online soon.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Youtube Title & Descriptions Dataset
About
4,941 videos across 50 YouTube channels. A list of the sampled channels is available here.
Splits:
Train: 4,199
Validation: 493
Test: 249
Data was shuffled and sampled evenly from all channels to create the splits. Additionally, the dataset has a column ready to go for gemma-2-9b-it fine-tuning formatting (a formatting sketch follows below)! Potentially more model formats to come.
About the Data:
Label | Description
channel_name | The… See the full description on the dataset page: https://huggingface.co/datasets/AdamLucek/youtube-titles.
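As an illustration of the gemma-2-9b-it formatting mentioned above, a minimal sketch using the tokenizer's chat template (the column names and the title-from-description task framing are assumptions, not the dataset's actual formatting column):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

def format_example(row):
    # Hypothetical framing: ask the model to write a title for a video description.
    messages = [
        {"role": "user", "content": f"Write a YouTube title for this description:\n{row['description']}"},
        {"role": "assistant", "content": row["title"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
```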
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Use it for whatever you want. It was made to replicate the thought traces of OpenAI's o1; I'll release RL datasets, including DPO, soon enough. For fine-tuning smaller models such as Google's google/gemma-2-2b-it with this dataset, I recommend training for 2-3 epochs with a learning rate of 2e-6; the loss will be at around 1.6 at the beginning and around 1.3 by the end of the training job. Suggested system prompt: Always respond in strict JSON format with a reasoning_steps array and a response field. Each… See the full description on the dataset page: https://huggingface.co/datasets/minchyeom/Thinker-JSON.
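An illustrative target output in the suggested format (only the reasoning_steps array and response field come from the description above; the content itself is hypothetical):

```python
import json

# Hypothetical assistant output matching the suggested system prompt.
target = {
    "reasoning_steps": [
        "Identify what the user is asking for.",
        "Recall the relevant fact.",
        "Compose a concise answer.",
    ],
    "response": "Paris is the capital of France.",
}
print(json.dumps(target, indent=2))
```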
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RPRevamped-Small-v1.0
Dataset Description
RPRevamped is a synthetic dataset generated by a variety of models. It is very diverse and is recommended if you are fine-tuning a roleplay model. This is the Small version, with Medium and Tiny versions currently in the works. Github: RPRevamped GitHub. Here are the models used in the creation of this dataset: DeepSeek-V3-0324 Gemini-2.0-Flash-Thinking-Exp-01-21 DeepSeek-R1 Gemma-3-27B-it Gemma-3-12B-it Qwen2.5-VL-72B-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/TechPowerB/RPRevamped-Small.
Model Uses: shahafvl/gemma-2-2b-fake-news
Attention: Fine-tuned base model.
Dataset Used: GonzaloA/fake_news
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to NVIDIA and Arrow Denmark for sponsoring the compute needed to generate this dataset
This dataset consists of 1,000,000 synthetic dialogs in Danish and a summary of each dialog, generated with google/gemma-2-27b-it. The purpose of the dataset is to fine-tune small language models to make dialog summaries, but with minor adjustments it may also be used 1) to train an LLM to restore/improve speaker diarization, 2) to train a classifier for classifying dialogs by topic, 3)… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw.
Thanks to NVIDIA, Arrow, and the Danish Data Science Community for sponsoring the creation of this dataset
This is a dataset consisting of 1,000,000 un-diarized dialogs and their corresponding summaries. The data was generated with gemma-2-27b-it. The dataset is intended to be used to fine-tune smaller models to summarize dialog. The dialogs are meant to resemble transcriptions of dialog made with models such as Whisper, which do not diarize the dialog out of the box. The "messages" column… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-dialog-summaries-processed.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Khmer Instruct Dataset
This dataset contains 500 instruction-response pairs in Khmer for fine-tuning language models such as Mistral or Gemma. It is designed to support chatbot use cases, educational tools, and instruction-following agents in the Khmer language.
Format
Each entry in the dataset is a JSON object with:
prompt: Instruction or question in Khmer
response: Expected model output in Khmer
The file is stored in .jsonl (JSON Lines) format, making it compatible… See the full description on the dataset page: https://huggingface.co/datasets/Pisethan/khmer-instruct-dataset.
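A minimal sketch of reading such a .jsonl file (the file name is an assumption; the prompt and response keys come from the description above):

```python
import json

# Hypothetical file name; each line is one {"prompt": ..., "response": ...} object in Khmer.
with open("khmer_instruct.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

print(pairs[0]["prompt"], "->", pairs[0]["response"])
```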
Smart Home Energy Optimization Dataset (Bilingual - FR/EN)
This dataset is designed for fine-tuning lightweight language models (e.g., Gemma 1B) for local energy assistant use cases in smart homes. It contains synthetic instruction-response pairs in both French and English, ideal for on-device LLMs running in resource-constrained environments like a Raspberry Pi 4 (4GB RAM).
💡 Use Case
Smart Home Energy Assistant: An on-device assistant that helps users reduce energy… See the full description on the dataset page: https://huggingface.co/datasets/Epitech/smart-home-energy-gemma3.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Original Dataset: ambrosfitz/10k_wiki_summary
This dataset was created with a fine-tuned version of Gemma-3-4B. English input and Turkish system messages (as seen in the dataset) were used to create the Turkish rows. The dataset is lightly cleaned to remove possible refusals, direct references to the text, and more. (Rows with phrases like "Bu makalede", "bu metinde", "bu iddia"... were all dropped.) Please credit my account if you use this dataset, thank you!
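A rough sketch of the kind of phrase-based filtering described above (the column name and exact phrase list are assumptions, not the actual cleaning script):

```python
from datasets import Dataset

# Hypothetical cleaning pass: drop rows whose Turkish text opens with a direct
# reference to the source article ("Bu makalede", "bu metinde", "bu iddia", ...).
BANNED_PREFIXES = ("bu makalede", "bu metinde", "bu iddia")

def is_clean(row):
    return not row["summary_tr"].strip().lower().startswith(BANNED_PREFIXES)  # assumed column name

def clean(ds: Dataset) -> Dataset:
    return ds.filter(is_clean)
```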
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🧠 Dutch Instruction Dataset (Generated with Gemini & OpenAI)
This dataset was generated using Gemini and OpenAI's API, and is intended for general-purpose Dutch language model training, instruction tuning, and experimentation. Feel free to use it for your own projects or use cases. If you do, I'd really appreciate it if you could reference or tag me — thanks! 🙌
🚀 Used in DUTCHGPT
This dataset has been used to train DUTCHGPT — a fine-tuned version of Gemma and LLaMA… See the full description on the dataset page: https://huggingface.co/datasets/aacudad/86k_DUTCH_conversational.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
⚖️ Dutch Legal Case Dataset (Summarized with Gemini)
This dataset consists of 5,000 Dutch legal cases sourced from rechtspraak.nl. Each case includes:
The original legal text
A summary generated by Gemini
The dataset is designed to support long-context training tasks such as legal reasoning and summarization.
🚀 Used in DUTCHGPT
This dataset has been used to train DUTCHGPT — a fine-tuned version of Gemma and LLaMA optimized for Dutch. Explore the model here:👉… See the full description on the dataset page: https://huggingface.co/datasets/aacudad/5K_DUTCH_LEGAL_SUMMARY.