SmolTalk
Dataset description
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details are available in our paper: https://arxiv.org/abs/2502.02737. During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models trained on proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.
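SFT datasets like SmolTalk store each sample as a list of chat turns. A minimal sketch of working with that format, assuming the common `messages`/`role`/`content` schema (the sample below is invented for illustration; real data would come from `load_dataset("HuggingFaceTB/smoltalk")`):

```python
# Sketch of the chat format used by SFT datasets such as SmolTalk: each sample
# holds a "messages" list of {"role", "content"} turns. The sample here is
# hypothetical, not taken from the dataset.

def to_chatml(messages):
    """Render a messages list as ChatML-style text (one common SFT template)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

sample = {
    "messages": [
        {"role": "user", "content": "What is SFT?"},
        {"role": "assistant", "content": "Supervised finetuning on instruction data."},
    ]
}

print(to_chatml(sample["messages"]))
```

The exact chat template depends on the model being finetuned; ChatML is shown only as one widely used choice.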
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Smol-SmolTalk
This is a subset of the SmolTalk dataset adapted for smol models with fewer than 1B parameters. We used it to build SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. We perform SFT on this dataset and then DPO on UltraFeedback. Compared to SmolTalk:
The conversations from Smol-Magpie-Ultra are shorter in this dataset. We include less task-specific data compared to SmolTalk (e.g. no function calling and fewer rewriting and summarization examples) since these smaller models have… See the full description on the dataset page: https://huggingface.co/datasets/vanek-epfl/MNLP_M2_mcqa_dataset.
HuggingFaceTB/smoltalk-multilingual8-Qwen3-32B-main-gen dataset hosted on Hugging Face and contributed by the HF Datasets community
ZeroAgency/HuggingFaceTB-smoltalk-all dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ReactiveAI/Smol-Smoltalk Interaction SFT
Derived from HuggingFaceTB/smol-smoltalk. Made for Interaction Supervised Fine-Tuning of Reactive Transformer Proof-of-Concept models, especially RxT-Alpha (more info soon).
Dataset Details
Dataset Description
Reactive Transformers process only single interactions in real time and use Short-Term Memory to store information from previous interactions. Before the model is able to use its… See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT.
ketchup123/smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ReactiveAI/Smol-Smoltalk-Mini Interaction SFT
Derived from HuggingFaceTB/smol-smoltalk (using 25% of the train and test splits). Made for Interaction Supervised Fine-Tuning of Reactive Transformer Proof-of-Concept models, especially RxT-Alpha-Mini (more info soon).
The full version is available at ReactiveAI/smol-smoltalk-Interaction-SFT.
Dataset Details
Dataset Description
Reactive Transformers process only single interactions in… See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-mini-Interaction-SFT.
Delta-Vector/Hydrus-SmolTalk-2-IF-MT dataset hosted on Hugging Face and contributed by the HF Datasets community
aladinDJ/smoltalk-annotated-full dataset hosted on Hugging Face and contributed by the HF Datasets community
horus-ai-labs/smoltalk-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
nosuchjihyun/smol-smoltalk-chat dataset hosted on Hugging Face and contributed by the HF Datasets community
ketchup123/smoltalk-annotated-full-backup dataset hosted on Hugging Face and contributed by the HF Datasets community
jaxon3062/smoltalk-gemma3-chat-512 dataset hosted on Hugging Face and contributed by the HF Datasets community
salhernandez/homer-simpson-smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community
```
field_messages: messages
message_field_role: role
message_field_content: content
```
Delta-Vector/Hydrus-Smoltalk-Subset dataset hosted on Hugging Face and contributed by the HF Datasets community
salhernandez/homer-simpson-smoltalk-everyday-conversations dataset hosted on Hugging Face and contributed by the HF Datasets community
HuggingFaceTB smoltalk DolphinLabeled
Part of the DolphinLabeled series of datasets
Presented by Eric Hartford and Cognitive Computations
The purpose of this dataset is to enable filtering of the HuggingFaceTB/smoltalk dataset. The original dataset is HuggingFaceTB/smoltalk. I have modified the dataset using two scripts.
dedupe.py - removes rows with identical final message content
label.py - adds a "flags" column containing the following boolean values: "refusal": … See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/HuggingFaceTB_smoltalk-DolphinLabeled.
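A labeled flags column like this makes filtering a one-liner. A hypothetical sketch of dropping flagged rows, where only the "refusal" flag name comes from the card and the example rows are invented for illustration:

```python
# Hypothetical rows in a DolphinLabeled-style layout: conversation plus a
# "flags" dict of booleans. Only the "refusal" flag name is from the card.
rows = [
    {"messages": [{"role": "user", "content": "hi"}], "flags": {"refusal": False}},
    {"messages": [{"role": "user", "content": "do X"}], "flags": {"refusal": True}},
]

# Keep only rows that were not flagged as refusals.
kept = [row for row in rows if not row["flags"]["refusal"]]
print(len(kept))
```

With the real dataset, the same predicate could be passed to `Dataset.filter` from the `datasets` library.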
SmolTalk2
Dataset description
This dataset contains three subsets (Mid, SFT, Preference) that correspond to the three phases of post-training for SmolLM3-3B. You can find more details on how we used the data in each stage in the SmolLM3 blog post. The specific weight of each subset is available in the training recipe in the SmolLM repository. You can load a dataset with `from datasets import load_dataset`.
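A minimal loading sketch, assuming the standard `datasets` API; the exact config strings for the three subsets are an assumption here, so check the dataset page before use:

```python
# The three phases named on the card; treating them as config names is an
# assumption -- verify against the dataset page.
SUBSETS = ["Mid", "SFT", "Preference"]

def load_subset(name: str):
    """Load one SmolTalk2 subset by name."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset: {name!r}")
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("HuggingFaceTB/smoltalk2", name)
```

For example, `load_subset("SFT")` would fetch the supervised-finetuning split, while an unknown name fails fast with a `ValueError`.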
FluxiIA/smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community