35 datasets found
  1. smoltalk

    • huggingface.co
    Updated Nov 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2024). smoltalk [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smoltalk
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    Description

    SmolTalk

      Dataset description
    

    This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build SmolLM2-Instruct family of models and contains 1M samples. More details in our paper https://arxiv.org/abs/2502.02737 During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasetsโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.

  2. h

    MNLP_M2_mcqa_dataset

    • huggingface.co
    Updated Jun 1, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pavlov (2025). MNLP_M2_mcqa_dataset [Dataset]. https://huggingface.co/datasets/vanek-epfl/MNLP_M2_mcqa_dataset
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Pavlov
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Smol-SmalTalk

    This is a subset of SmolTalk dataset adapted for smol models with less than 1B parameters. We used it to build SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. We do SFT on this dataset and then DPO on UltraFeedback. Compared to SmolTalk:

    The conversations from Smol-Magpie-Ultra are shorter in this dataset We include less task specific data compared to SmolTalk (e.g no function calling and less rewriting and summarization examples) since these smaller models haveโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/vanek-epfl/MNLP_M2_mcqa_dataset.

  3. smoltalk-multilingual8-Qwen3-32B-main-gen

    • huggingface.co
    Updated Jul 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2025). smoltalk-multilingual8-Qwen3-32B-main-gen [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smoltalk-multilingual8-Qwen3-32B-main-gen
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    Description

    HuggingFaceTB/smoltalk-multilingual8-Qwen3-32B-main-gen dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. h

    HuggingFaceTB-smoltalk-all

    • huggingface.co
    Updated Jul 27, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ZeroAgency (2025). HuggingFaceTB-smoltalk-all [Dataset]. https://huggingface.co/datasets/ZeroAgency/HuggingFaceTB-smoltalk-all
    Explore at:
    Dataset updated
    Jul 27, 2025
    Dataset authored and provided by
    ZeroAgency
    Description

    ZeroAgency/HuggingFaceTB-smoltalk-all dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. h

    smol-smoltalk-Interaction-SFT

    • huggingface.co
    Updated Aug 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Reactive AI (2025). smol-smoltalk-Interaction-SFT [Dataset]. https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT
    Explore at:
    Dataset updated
    Aug 7, 2025
    Dataset authored and provided by
    Reactive AI
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for ReactiveAI/Smol-Smoltalk Interaction SFT

    Derived from HuggingFaceTB/smol-smoltalk. Made for Interaction Supervised Fine-Tuning of Reactive Transformer Proof-of-Concept models, especially RxT-Alpha (more info soon).

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    Reactive Transformers are processing only the single interactions in real-time and using Short-Term Memory to store information from previous interactions. Before the model is able to use it'sโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT.

  6. h

    smoltalk

    • huggingface.co
    Updated Apr 23, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John M. (2025). smoltalk [Dataset]. https://huggingface.co/datasets/ketchup123/smoltalk
    Explore at:
    Dataset updated
    Apr 23, 2025
    Authors
    John M.
    Description

    ketchup123/smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. h

    smol-smoltalk-mini-Interaction-SFT

    • huggingface.co
    Updated Aug 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Reactive AI (2025). smol-smoltalk-mini-Interaction-SFT [Dataset]. https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-mini-Interaction-SFT
    Explore at:
    Dataset updated
    Aug 7, 2025
    Dataset authored and provided by
    Reactive AI
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for ReactiveAI/Smol-Smoltalk-Mini Interaction SFT

    Derived from HuggingFaceTB/smol-smoltalk (used 25% of train & test splits). Made for Interaction Supervised Fine-Tuning of Reactive Transformer Proof-of-Concept models, especially RxT-Alpha-Mini (more info soon).

    Full version available in ReactiveAI/smol-smoltalk-Interaction-SFT

      Dataset Details
    
    
    
    
    
    
    
      Dataset Description
    

    Reactive Transformers are processing only the single interactions inโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-mini-Interaction-SFT.

  8. h

    Hydrus-SmolTalk-2-IF-MT

    • huggingface.co
    Updated Jul 22, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mango (2025). Hydrus-SmolTalk-2-IF-MT [Dataset]. https://huggingface.co/datasets/Delta-Vector/Hydrus-SmolTalk-2-IF-MT
    Explore at:
    Dataset updated
    Jul 22, 2025
    Authors
    Mango
    Description

    Delta-Vector/Hydrus-SmolTalk-2-IF-MT dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. h

    smoltalk-annotated-full

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aladin D., smoltalk-annotated-full [Dataset]. https://huggingface.co/datasets/aladinDJ/smoltalk-annotated-full
    Explore at:
    Authors
    Aladin D.
    Description

    aladinDJ/smoltalk-annotated-full dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. h

    smoltalk-filtered

    • huggingface.co
    Updated Jan 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Horus AI Labs (2025). smoltalk-filtered [Dataset]. https://huggingface.co/datasets/horus-ai-labs/smoltalk-filtered
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 11, 2025
    Authors
    Horus AI Labs
    Description

    horus-ai-labs/smoltalk-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. h

    smol-smoltalk-chat

    • huggingface.co
    Updated May 29, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    JiHyun Kang (2025). smol-smoltalk-chat [Dataset]. https://huggingface.co/datasets/nosuchjihyun/smol-smoltalk-chat
    Explore at:
    Dataset updated
    May 29, 2025
    Authors
    JiHyun Kang
    Description

    nosuchjihyun/smol-smoltalk-chat dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. h

    smoltalk-annotated-full-backup

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John M., smoltalk-annotated-full-backup [Dataset]. https://huggingface.co/datasets/ketchup123/smoltalk-annotated-full-backup
    Explore at:
    Authors
    John M.
    Description

    ketchup123/smoltalk-annotated-full-backup dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. h

    smoltalk-gemma3-chat-512

    • huggingface.co
    Updated Aug 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jason Lin (2025). smoltalk-gemma3-chat-512 [Dataset]. https://huggingface.co/datasets/jaxon3062/smoltalk-gemma3-chat-512
    Explore at:
    Dataset updated
    Aug 31, 2025
    Authors
    Jason Lin
    Description

    jaxon3062/smoltalk-gemma3-chat-512 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. h

    homer-simpson-smoltalk

    • huggingface.co
    Updated Sep 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hernandez (2025). homer-simpson-smoltalk [Dataset]. https://huggingface.co/datasets/salhernandez/homer-simpson-smoltalk
    Explore at:
    Dataset updated
    Sep 27, 2025
    Authors
    Hernandez
    Description

    salhernandez/homer-simpson-smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. h

    smoltalk-easy

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Emma, smoltalk-easy [Dataset]. https://huggingface.co/datasets/Emm9625/smoltalk-easy
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Emma
    Description

    field_messages: messages message_field_role: role message_field_content: content```

  16. h

    Hydrus-Smoltalk-Subset

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mango, Hydrus-Smoltalk-Subset [Dataset]. https://huggingface.co/datasets/Delta-Vector/Hydrus-Smoltalk-Subset
    Explore at:
    Authors
    Mango
    Description

    Delta-Vector/Hydrus-Smoltalk-Subset dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. h

    homer-simpson-smoltalk-everyday-conversations

    • huggingface.co
    Updated Sep 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hernandez (2025). homer-simpson-smoltalk-everyday-conversations [Dataset]. https://huggingface.co/datasets/salhernandez/homer-simpson-smoltalk-everyday-conversations
    Explore at:
    Dataset updated
    Sep 27, 2025
    Authors
    Hernandez
    Description

    salhernandez/homer-simpson-smoltalk-everyday-conversations dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. h

    HuggingFaceTB_smoltalk-DolphinLabeled

    • huggingface.co
    Updated Jan 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Quixi AI (2025). HuggingFaceTB_smoltalk-DolphinLabeled [Dataset]. https://huggingface.co/datasets/QuixiAI/HuggingFaceTB_smoltalk-DolphinLabeled
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset authored and provided by
    Quixi AI
    Description

    HuggingFaceTB smoltalk DolphinLabeled

      Part of the DolphinLabeled series of datasets
    
    
    
    
    
      Presented by Eric Hartford and Cognitive Computations
    

    The purpose of this dataset is to enable filtering of HuggingFaceTB/smoltalk dataset. The original dataset is HuggingFaceTB/smoltalk I have modified the dataset using two scripts.

    dedupe.py - removes rows with identical final message content label.py - adds a "flags" column containing the following boolean values: "refusal":โ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/HuggingFaceTB_smoltalk-DolphinLabeled.

  19. smoltalk2

    • huggingface.co
    Updated Jul 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2025). smoltalk2 [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smoltalk2
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    Description

    SmolTalk2

      Dataset description
    

    This dataset contains three subsets (Mid, SFT, Preference) that correspond to the three phases of Post-Training for SmolLM3-3B. You can find more details in our blog post about how we used the data in each of the stages SmolLM3. The specific weight of each subset is available in the training recipe in SmolLM's repository. You can load a dataset using from datasets import load_dataset

    To load the train split of a specific subset, such asโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk2.

  20. h

    smoltalk

    • huggingface.co
    Updated Jun 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fluxi IA (2025). smoltalk [Dataset]. https://huggingface.co/datasets/FluxiIA/smoltalk
    Explore at:
    Dataset updated
    Jun 22, 2025
    Authors
    Fluxi IA
    Description

    FluxiIA/smoltalk dataset hosted on Hugging Face and contributed by the HF Datasets community

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Hugging Face Smol Models Research (2024). smoltalk [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smoltalk
Organization logo

smoltalk

SmolTalk

HuggingFaceTB/smoltalk

Explore at:
23 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 21, 2024
Dataset provided by
Hugging Facehttps://huggingface.co/
Authors
Hugging Face Smol Models Research
Description

SmolTalk

  Dataset description

This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build SmolLM2-Instruct family of models and contains 1M samples. More details in our paper https://arxiv.org/abs/2502.02737 During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasetsโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.

Search
Clear search
Close search
Google apps
Main menu