Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
👉🏽 Important
This dataset is adapted from hypervariance/function-calling-sharegpt to fine-tune the Google gemma-2-2b-it model for function calling.
🔀 Changes Made
Merged consecutive "gpt" responses into single responses (affected 8.49% of examples, 7,372 out of 86,864). Updated role names:
"system" → removed (function usage instructions moved to a separate column)
"human" → "user"
"gpt" → "assistant"
"function_response" → unchanged
Changed message keys from ["from"… See the full description on the dataset page: https://huggingface.co/datasets/dushj98/gemma-function-calling.
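As a rough illustration of the remapping described above, a minimal sketch (the field and key names are assumptions, not the actual conversion script):

```python
# Hypothetical sketch of the described ShareGPT-style conversion.
def convert_conversation(conversation):
    """Map the old role names to the new ones and merge consecutive assistant turns."""
    role_map = {"human": "user", "gpt": "assistant", "function_response": "function_response"}
    messages = []
    for turn in conversation:
        if turn["from"] == "system":
            continue  # function usage instructions are moved to a separate column
        role = role_map.get(turn["from"], turn["from"])
        # Merge consecutive "gpt" responses into a single assistant message.
        if messages and role == "assistant" and messages[-1]["role"] == "assistant":
            messages[-1]["content"] += "\n" + turn["value"]
        else:
            messages.append({"role": role, "content": turn["value"]})
    return messages
```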
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a subset of the databricks 15k dataset databricks/databricks-dolly-15k used for finetuning Google's Gemma model google/gemma-2b. This version has only those records without context to match the dataset used in the fine-tuning Keras example from Google.
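A minimal sketch of how such a no-context subset could be produced with the datasets library (the filtering criterion is an assumption based on the description above):

```python
from datasets import load_dataset

# Load the full Dolly 15k dataset and keep only records with an empty context field,
# mirroring the subset used in the Keras Gemma fine-tuning example.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
no_context = dolly.filter(lambda row: not row["context"].strip())
print(len(dolly), len(no_context))
```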
Trying to fine-tune an LLM with LoRA using Hugging Face on acquisitions data
Currently we use the Llama 3.3 Instruct 70B model; it takes anywhere from 1-5 seconds for parallel 3-RAG prompt inference, and it uses a REST API from Hugging Face.
Can we fine-tune a smaller model, such as Google's Gemma 3, on our small data?
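For reference, a minimal LoRA setup sketch with Hugging Face's peft library (the model checkpoint, target modules, and hyperparameters are illustrative assumptions, not a validated recipe for this use case):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2-2b-it"  # assumed smaller model; swap in whichever checkpoint you can access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach a small LoRA adapter so only a fraction of the weights are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```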
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Markdown Fine-Tuning Datasets (English & PT-BR)
Overview
These datasets are designed to fine-tune Large Language Models (LLMs) like Gemma to generate structured Markdown-formatted responses. The datasets contain instruction-response pairs, ensuring the model learns how to output Markdown elements correctly.
Datasets
1. English Markdown Dataset
Available on Hugging Face: TinyMarkdown-Instruct-EN
Size: Large-scale dataset with structured Markdown… See the full description on the dataset page: https://huggingface.co/datasets/VAMJ-0042/TinyMarkdown-Instruct-PT.
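An illustrative example of the kind of instruction-response pair these datasets contain (the field names and content here are hypothetical, not taken from the dataset card):

```python
# Hypothetical record: the response teaches the model to emit Markdown elements.
example = {
    "instruction": "List two popular Python web frameworks as a Markdown bulleted list.",
    "response": "- **Django**\n- **Flask**",
}
```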
ms-marco-en-bge
This dataset contains the MS MARCO dataset with negatives mined using ColBERT and then scored by bge-reranker-v2-gemma. It can be used to train a retrieval model using knowledge distillation, for example using PyLate.
knowledge distillation
To fine-tune a model using a knowledge distillation loss, we will need three distinct files:
Datasets
from datasets import load_dataset
train = load_dataset( "lightonai/ms-marco-en-gemma", "train"… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma.
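A minimal sketch of loading the three pieces with the datasets library (the configuration names "train", "queries", and "documents" and the availability of a train split are assumptions based on the description above):

```python
from datasets import load_dataset

# Knowledge distillation typically needs three pieces: teacher scores, queries, and documents.
dataset_id = "lightonai/ms-marco-en-bge-gemma"  # identifier taken from the dataset page URL above
scores = load_dataset(dataset_id, "train", split="train")
queries = load_dataset(dataset_id, "queries", split="train")
documents = load_dataset(dataset_id, "documents", split="train")
```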
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
huihui-ai/Guilherme34_uncensor
This dataset is a copy of Guilherme34/uncensor. It is used for fine-tuning huihui-ai/gemma-3-1b-it-abliterated; please refer to GRPO with Unsloth.
Usage
from datasets import Dataset
import json
SYSTEM_PROMPT = """ Respond in the following format: """
def get_harmful_questions(split="train"… See the full description on the dataset page: https://huggingface.co/datasets/huihui-ai/Guilherme34_uncensor.
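A hedged sketch of what a loader like get_harmful_questions might look like (the original implementation is truncated above; the file layout, keys, and chat structure below are assumptions):

```python
from datasets import Dataset
import json

def get_harmful_questions(split="train"):
    """Hypothetical loader: read a JSONL file of questions for the given split
    and wrap it as a Hugging Face Dataset of chat-style prompts."""
    rows = []
    with open(f"harmful_questions_{split}.jsonl", encoding="utf-8") as f:  # assumed file name
        for line in f:
            item = json.loads(line)
            rows.append({
                "prompt": [
                    # SYSTEM_PROMPT is the constant defined in the usage snippet above.
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": item["question"]},
                ],
            })
    return Dataset.from_list(rows)
```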
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About
These question-answer pairs were created using published PDF documents at fincen.gov.
Each question has 5 paraphrased versions, differentiated by the "question_version" column (the first versions (No. 4) are at the end of the datafile). The data is used to fine-tune the Gemma-2b and Gemma-7b models listed here:
shijunju/gemma_7b_finRisk_r10_4VersionQ
shijunju/gemma_7b_finRisk_r6_4VersionQ
shijunju/gemma_7b_finRisk_r6_3VersionQ
shijunju/gemma_2b_finRisk
Number of rows: 14,550
Author: Shijun… See the full description on the dataset page: https://huggingface.co/datasets/shijunju/fincen_all_questions_5versions.
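A minimal sketch of selecting a single paraphrase version with the datasets library (the "question_version" column comes from the description above; the split name and the version value used for filtering are assumptions):

```python
from datasets import load_dataset

# Load all question-answer pairs and keep only one paraphrase version per question.
ds = load_dataset("shijunju/fincen_all_questions_5versions", split="train")
one_version = ds.filter(lambda row: row["question_version"] == 0)  # assumed version label
print(len(ds), len(one_version))
```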
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CAI-Synthetic Model
Overview
The CAI-Synthetic Model is a large language model designed to understand and respond to complex questions. It has been fine-tuned on a synthetic dataset from Mostly AI, allowing it to respond reliably across a variety of contexts and perform well in diverse scenarios.
Base Model and Fine-Tuning
Base Model: Google/Gemma-7b
Fine-Tuning Adapter: LoRA Adapter
Synthetic Dataset: Mostly AI Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/InnerI/CAI-synthetic-10k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The data was collected from https://www.khsearch.com/ and relates to the general question-answering examination. It was used to train fine-tuned models from many LLMs, including Llama, Qwen, Mistral, and Gemma, under the research title "Fine-tuning for Question Answering in Low-Resource Languages: A Case Study on Khmer", conducted at ViLa Lab, Institute of Technology of Cambodia, Phnom Penh.
Lab Info: https://www.facebook.com/vilalabitc. The paper will be available online soon.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Youtube Title & Descriptions Dataset
About
4,941 videos across 50 YouTube channels. A list of the sampled channels is available here.
Splits:
Train: 4,199
Validation: 493
Test: 249
Data was shuffled and sampled evenly from all channels to create the splits. Additionally, the dataset has a column ready to go for gemma-2-9b-it fine-tuning formatting (a formatting sketch follows below)! Potentially more model formats to come.
About the Data:
Label | Description
channel_name | The… See the full description on the dataset page: https://huggingface.co/datasets/AdamLucek/youtube-titles.
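As an illustration of the gemma-2-9b-it formatting mentioned above, a minimal sketch using the tokenizer's chat template (the column names and the title-from-description task framing are assumptions, not the dataset's actual formatting column):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

def format_example(row):
    # Hypothetical framing: ask the model to write a title for a video description.
    messages = [
        {"role": "user", "content": f"Write a YouTube title for this description:\n{row['description']}"},
        {"role": "assistant", "content": row["title"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
```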
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Use it for whatever you want. It was made to replicate the thought traces of OpenAI's o1; I'll release RL datasets, including DPO, soon enough. For fine-tuning smaller models such as Google's google/gemma-2-2b-it with this dataset, I recommend training for 2-3 epochs with a learning rate of 2e-6; the loss will be at around 1.6 at the beginning and around 1.3 by the end of the training job. Suggested system prompt: Always respond in strict JSON format with a reasoning_steps array and a response field. Each… See the full description on the dataset page: https://huggingface.co/datasets/minchyeom/Thinker-JSON.
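An illustrative target output in the suggested format (only the reasoning_steps array and response field come from the description above; the content itself is hypothetical):

```python
import json

# Hypothetical assistant output matching the suggested system prompt.
target = {
    "reasoning_steps": [
        "Identify what the user is asking for.",
        "Recall the relevant fact.",
        "Compose a concise answer.",
    ],
    "response": "Paris is the capital of France.",
}
print(json.dumps(target, indent=2))
```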
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RPRevamped-Small-v1.0
Dataset Description
RPRevamped is a synthetic dataset generated by a variety of models. It is very diverse and is recommended if you are fine-tuning a roleplay model. This is the Small version, with Medium and Tiny versions currently in the works. Github: RPRevamped GitHub. Here are the models used in the creation of this dataset: DeepSeek-V3-0324 Gemini-2.0-Flash-Thinking-Exp-01-21 DeepSeek-R1 Gemma-3-27B-it Gemma-3-12B-it Qwen2.5-VL-72B-Instruct… See the full description on the dataset page: https://huggingface.co/datasets/TechPowerB/RPRevamped-Small.
Model Uses: shahafvl/gemma-2-2b-fake-news
Attention: Fine-tuned base model.
Dataset Used: GonzaloA/fake_news
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Thanks to NVIDIA and Arrow Denmark for sponsoring the compute needed to generate this dataset
This dataset consists of 1,000,000 synthetic dialogs in Danish and a summary of each dialog, generated with google/gemma-2-27b-it. The purpose of the dataset is to fine-tune small language models to make dialog summaries, but with minor adjustments it may also be used 1) to train an LLM to restore/improve speaker diarization, 2) to train a classifier for classifying dialogs by topic, 3)… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw.
Thanks to NVIDIA, Arrow, and the Danish Data Science Community for sponsoring the creation of this dataset
This is a dataset consisting of 1,000,000 un-diarized dialogs and their corresponding summaries. The data was generated with gemma-2-27b-it. The dataset is intended to be used to fine-tune smaller models to summarize dialog. The dialogs are meant to resemble transcriptions of dialog made with models such as Whisper, which do not diarize the dialog out of the box. The "messages" column… See the full description on the dataset page: https://huggingface.co/datasets/ThatsGroes/synthetic-dialog-summaries-processed.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Khmer Instruct Dataset
This dataset contains 500 instruction-response pairs in Khmer for fine-tuning language models such as Mistral or Gemma. It is designed to support chatbot use cases, educational tools, and instruction-following agents in the Khmer language.
Format
Each entry in the dataset is a JSON object with:
prompt: Instruction or question in Khmer
response: Expected model output in Khmer
The file is stored in .jsonl (JSON Lines) format, making it compatible… See the full description on the dataset page: https://huggingface.co/datasets/Pisethan/khmer-instruct-dataset.
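A minimal sketch of reading such a .jsonl file (the file name is an assumption; the prompt and response keys come from the description above):

```python
import json

# Hypothetical file name; each line is one {"prompt": ..., "response": ...} object in Khmer.
with open("khmer_instruct.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

print(pairs[0]["prompt"], "->", pairs[0]["response"])
```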
Smart Home Energy Optimization Dataset (Bilingual - FR/EN)
This dataset is designed for fine-tuning lightweight language models (e.g., Gemma 1B) for local energy assistant use cases in smart homes. It contains synthetic instruction-response pairs in both French and English, ideal for on-device LLMs running in resource-constrained environments like a Raspberry Pi 4 (4GB RAM).
💡 Use Case
Smart Home Energy Assistant: An on-device assistant that helps users reduce energy… See the full description on the dataset page: https://huggingface.co/datasets/Epitech/smart-home-energy-gemma3.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Original Dataset: ambrosfitz/10k_wiki_summary
This dataset was created with a fine-tuned version of Gemma-3-4B. English input and Turkish system messages (as seen in the dataset) were used to create the Turkish rows. The dataset is lightly cleaned to remove possible refusals, direct references to the text, and more. (Rows with phrases like "Bu makalede", "bu metinde", "bu iddia"... were all dropped.) Please credit my account if you use this dataset, thank you!
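A rough sketch of the kind of phrase-based filtering described above (the column name and exact phrase list are assumptions, not the actual cleaning script):

```python
from datasets import Dataset

# Hypothetical cleaning pass: drop rows whose Turkish text opens with a direct
# reference to the source article ("Bu makalede", "bu metinde", "bu iddia", ...).
BANNED_PREFIXES = ("bu makalede", "bu metinde", "bu iddia")

def is_clean(row):
    return not row["summary_tr"].strip().lower().startswith(BANNED_PREFIXES)  # assumed column name

def clean(ds: Dataset) -> Dataset:
    return ds.filter(is_clean)
```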
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🧠 Dutch Instruction Dataset (Generated with Gemini & OpenAI)
This dataset was generated using Gemini and OpenAI's API, and is intended for general-purpose Dutch language model training, instruction tuning, and experimentation. Feel free to use it for your own projects or use cases. If you do, I'd really appreciate it if you could reference or tag me — thanks! 🙌
🚀 Used in DUTCHGPT
This dataset has been used to train DUTCHGPT — a fine-tuned version of Gemma and LLaMA… See the full description on the dataset page: https://huggingface.co/datasets/aacudad/86k_DUTCH_conversational.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
⚖️ Dutch Legal Case Dataset (Summarized with Gemini)
This dataset consists of 5,000 Dutch legal cases sourced from rechtspraak.nl. Each case includes:
The original legal text
A summary generated by Gemini
The dataset is designed to support long-context training tasks such as legal reasoning and summarization.
🚀 Used in DUTCHGPT
This dataset has been used to train DUTCHGPT — a fine-tuned version of Gemma and LLaMA optimized for Dutch. Explore the model here:👉… See the full description on the dataset page: https://huggingface.co/datasets/aacudad/5K_DUTCH_LEGAL_SUMMARY.