SmolTalk
Dataset description
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details are available in our paper: https://arxiv.org/abs/2502.02737. During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models trained on proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.
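SFT datasets like SmolTalk store each sample as a list of chat turns. A minimal sketch of working with that format, assuming the common `messages`/`role`/`content` schema (the sample below is invented for illustration; real data would come from `load_dataset("HuggingFaceTB/smoltalk")`):

```python
# Sketch of the chat format used by SFT datasets such as SmolTalk: each sample
# holds a "messages" list of {"role", "content"} turns. The sample here is
# hypothetical, not taken from the dataset.

def to_chatml(messages):
    """Render a messages list as ChatML-style text (one common SFT template)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

sample = {
    "messages": [
        {"role": "user", "content": "What is SFT?"},
        {"role": "assistant", "content": "Supervised finetuning on instruction data."},
    ]
}

print(to_chatml(sample["messages"]))
```

The exact chat template depends on the model being finetuned; ChatML is shown only as one widely used choice.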
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Smol-SmolTalk
This is a subset of the SmolTalk dataset adapted for smol models with fewer than 1B parameters. We used it to build SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. We perform SFT on this dataset and then DPO on UltraFeedback. Compared to SmolTalk:
The conversations from Smol-Magpie-Ultra are shorter in this dataset. We include less task-specific data compared to SmolTalk (e.g. no function calling and fewer rewriting and summarization examples) since these smaller models have… See the full description on the dataset page: https://huggingface.co/datasets/vanek-epfl/MNLP_M2_mcqa_dataset.
HuggingFaceTB/smoltalk-multilingual8-Qwen3-32B-main-gen dataset hosted on Hugging Face and contributed by the HF Datasets community
ZeroAgency/HuggingFaceTB-smoltalk-all dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ReactiveAI/Smol-Smoltalk Interaction SFT
Derived from HuggingFaceTB/smol-smoltalk. Made for Interaction Supervised Fine-Tuning of Reactive Transformer Proof-of-Concept models, especially RxT-Alpha (more info soon).
Dataset Details
Dataset Description
Reactive Transformers process only single interactions in real time and use Short-Term Memory to store information from previous interactions. Before the model is able to use its… See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT.
ketchup123/smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ReactiveAI/Smol-Smoltalk-Mini Interaction SFT
Derived from HuggingFaceTB/smol-smoltalk (using 25% of the train and test splits). Made for Interaction Supervised Fine-Tuning of Reactive Transformer Proof-of-Concept models, especially RxT-Alpha-Mini (more info soon).
The full version is available at ReactiveAI/smol-smoltalk-Interaction-SFT.
Dataset Details
Dataset Description
Reactive Transformers process only single interactions in… See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-mini-Interaction-SFT.
Delta-Vector/Hydrus-SmolTalk-2-IF-MT dataset hosted on Hugging Face and contributed by the HF Datasets community
aladinDJ/smoltalk-annotated-full dataset hosted on Hugging Face and contributed by the HF Datasets community
horus-ai-labs/smoltalk-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
nosuchjihyun/smol-smoltalk-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
ketchup123/smoltalk-annotated-full-backup dataset hosted on Hugging Face and contributed by the HF Datasets community
jaxon3062/smoltalk-gemma3-chat-512 dataset hosted on Hugging Face and contributed by the HF Datasets community
salhernandez/homer-simpson-smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community
```
field_messages: messages
message_field_role: role
message_field_content: content
```
Delta-Vector/Hydrus-Smoltalk-Subset dataset hosted on Hugging Face and contributed by the HF Datasets community
salhernandez/homer-simpson-smoltalk-everyday-conversations dataset hosted on Hugging Face and contributed by the HF Datasets community
HuggingFaceTB smoltalk DolphinLabeled
Part of the DolphinLabeled series of datasets
Presented by Eric Hartford and Cognitive Computations
The purpose of this dataset is to enable filtering of the HuggingFaceTB/smoltalk dataset. The original dataset is HuggingFaceTB/smoltalk. I have modified the dataset using two scripts.
dedupe.py - removes rows with identical final message content
label.py - adds a "flags" column containing the following boolean values: "refusal": … See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/HuggingFaceTB_smoltalk-DolphinLabeled.
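A labeled flags column like this makes filtering a one-liner. A hypothetical sketch of dropping flagged rows, where only the "refusal" flag name comes from the card and the example rows are invented for illustration:

```python
# Hypothetical rows in a DolphinLabeled-style layout: conversation plus a
# "flags" dict of booleans. Only the "refusal" flag name is from the card.
rows = [
    {"messages": [{"role": "user", "content": "hi"}], "flags": {"refusal": False}},
    {"messages": [{"role": "user", "content": "do X"}], "flags": {"refusal": True}},
]

# Keep only rows that were not flagged as refusals.
kept = [row for row in rows if not row["flags"]["refusal"]]
print(len(kept))
```

With the real dataset, the same predicate could be passed to `Dataset.filter` from the `datasets` library.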
SmolTalk2
Dataset description
This dataset contains three subsets (Mid, SFT, Preference) that correspond to the three phases of post-training for SmolLM3-3B. You can find more details on how we used the data in each stage in the SmolLM3 blog post. The specific weight of each subset is available in the training recipe in the SmolLM repository. You can load a dataset with `from datasets import load_dataset`.
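A minimal loading sketch, assuming the standard `datasets` API; the exact config strings for the three subsets are an assumption here, so check the dataset page before use:

```python
# The three phases named on the card; treating them as config names is an
# assumption -- verify against the dataset page.
SUBSETS = ["Mid", "SFT", "Preference"]

def load_subset(name: str):
    """Load one SmolTalk2 subset by name."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset: {name!r}")
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("HuggingFaceTB/smoltalk2", name)
```

For example, `load_subset("SFT")` would fetch the supervised-finetuning split, while an unknown name fails fast with a `ValueError`.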
FluxiIA/smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community