100+ datasets found
  1. smol

    • huggingface.co
    Updated Mar 28, 2025
    Cite
    Google (2025). smol [Dataset]. https://huggingface.co/datasets/google/smol
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMOL

    SMOL (Set for Maximal Overall Leverage) is a collection of professional translations into 221 Low-Resource Languages, for the purpose of training translation models, and otherwise increasing the representations of said languages in NLP and technology. Please read the SMOL Paper and the GATITOS Paper for a much more thorough description! There are four resources in this directory:

    SmolDoc: document-level translations into 100 languages
    SmolSent: sentence-level translations into… See the full description on the dataset page: https://huggingface.co/datasets/google/smol.

  2. MNLP_M2_rag_dataset

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Ivan Ushakov (2025). MNLP_M2_rag_dataset [Dataset]. https://huggingface.co/datasets/ushakov15/MNLP_M2_rag_dataset
    Dataset updated
    Jun 1, 2025
    Authors
    Ivan Ushakov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Smol-SmalTalk

    This is a subset of SmolTalk dataset adapted for smol models with less than 1B parameters. We used it to build SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. We do SFT on this dataset and then DPO on UltraFeedback. Compared to SmolTalk:

    The conversations from Smol-Magpie-Ultra are shorter in this dataset.
    We include less task-specific data compared to SmolTalk (e.g., no function calling and fewer rewriting and summarization examples) since these smaller models have… See the full description on the dataset page: https://huggingface.co/datasets/ushakov15/MNLP_M2_rag_dataset.

  3. huggingface-smol-course-preference-tuning-dataset

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    JingjunXu (2025). huggingface-smol-course-preference-tuning-dataset [Dataset]. https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset
    Dataset updated
    Jun 24, 2025
    Authors
    JingjunXu
    Description

    Dataset Card for huggingface-smol-course-preference-tuning-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI:

    distilabel pipeline run --config "https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset.

  4. huggingface-smol-course-instruction-tuning-dataset

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Yi Liu (2025). huggingface-smol-course-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/yiliu051016/huggingface-smol-course-instruction-tuning-dataset
    Dataset updated
    Jul 27, 2025
    Authors
    Yi Liu
    Description

    yiliu051016/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. huggingface-smol-course-preference-tuning-dataset

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    aspirina765 (2025). huggingface-smol-course-preference-tuning-dataset [Dataset]. https://huggingface.co/datasets/aspirina765/huggingface-smol-course-preference-tuning-dataset
    Dataset updated
    Apr 26, 2025
    Authors
    aspirina765
    Description

    aspirina765/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. huggingface-smol-course-preference-tuning-dataset

    • huggingface.co
    Updated Mar 30, 2025
    Cite
    wilka wilkin (2025). huggingface-smol-course-preference-tuning-dataset [Dataset]. https://huggingface.co/datasets/wilka/huggingface-smol-course-preference-tuning-dataset
    Dataset updated
    Mar 30, 2025
    Authors
    wilka wilkin
    Description

    wilka/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. smoltalk

    • huggingface.co
    Updated Nov 21, 2024
    Cite
    Hugging Face Smol Models Research (2024). smoltalk [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smoltalk
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    SmolTalk

      Dataset description
    

    This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details are in our paper (https://arxiv.org/abs/2502.02737). During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.

  8. huggingface-smol-course-instruction-tuning-dataset

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Callum (2025). huggingface-smol-course-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/CallumLongenecker-Aristocrat/huggingface-smol-course-instruction-tuning-dataset
    Dataset updated
    Apr 3, 2025
    Authors
    Callum
    Description

    CallumLongenecker-Aristocrat/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets

      Cosmopedia v2

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

  10. crush-smol-merged

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    Will Lin (2025). crush-smol-merged [Dataset]. https://huggingface.co/datasets/wlsaidhi/crush-smol-merged
    Dataset updated
    Jun 17, 2025
    Authors
    Will Lin
    Description

    crush-smol created out of https://huggingface.co/datasets/bigdata-pw/crush (crush_smol.parquet). Captions were generated with Qwen2VL.

    generate_captions.py

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    import torch
    import os
    from pathlib import Path
    from huggingface_hub import snapshot_download
    from torchvision import io

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct",
        device_map="auto",
        torch_dtype=torch.bfloat16… See the full description on the dataset page: https://huggingface.co/datasets/wlsaidhi/crush-smol-merged.

  11. huggingface-smol-course-preference-tuning-dataset

    • huggingface.co
    Updated Jan 14, 2025
    Cite
    tom (2025). huggingface-smol-course-preference-tuning-dataset [Dataset]. https://huggingface.co/datasets/atomkevich/huggingface-smol-course-preference-tuning-dataset
    Dataset updated
    Jan 14, 2025
    Authors
    tom
    Description

    atomkevich/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. huggingface-smol-course-preference-tuning-dataset

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    Master MUTSC (2025). huggingface-smol-course-preference-tuning-dataset [Dataset]. https://huggingface.co/datasets/MUTSC/huggingface-smol-course-preference-tuning-dataset
    Dataset updated
    Apr 16, 2025
    Authors
    Master MUTSC
    Description

    MUTSC/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. huggingface-smol-course-instruction-tuning-dataset

    • huggingface.co
    Updated Jun 2, 2025
    Cite
    Alfons Futterer (2025). huggingface-smol-course-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/NanoMatriX/huggingface-smol-course-instruction-tuning-dataset
    Dataset updated
    Jun 2, 2025
    Authors
    Alfons Futterer
    Description

    NanoMatriX/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. huggingface-smol-course-preference-tuning-dataset

    • huggingface.co
    Cite
    Chernigov Aleksey, huggingface-smol-course-preference-tuning-dataset [Dataset]. https://huggingface.co/datasets/Tookies/huggingface-smol-course-preference-tuning-dataset
    Authors
    Chernigov Aleksey
    Description

    Tookies/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. huggingface-smol-course-Vikhr-dataset

    • huggingface.co
    Updated Jan 21, 2025
    Cite
    AlexseyElygin (2025). huggingface-smol-course-Vikhr-dataset [Dataset]. https://huggingface.co/datasets/AlekseyElygin/huggingface-smol-course-Vikhr-dataset
    Dataset updated
    Jan 21, 2025
    Authors
    AlexseyElygin
    Description

    AlekseyElygin/huggingface-smol-course-Vikhr-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. the-stack-smol

    • huggingface.co
    Updated Nov 14, 2022
    Cite
    BigCode (2022). the-stack-smol [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-smol
    Dataset updated
    Nov 14, 2022
    Dataset authored and provided by
    BigCode
    Description

    Dataset Description

    A small subset (~0.1%) of the-stack dataset; each programming language has 10,000 random samples drawn from the original dataset. The dataset has 2.6GB of text (code).

      Languages
    

    The dataset contains 30 programming languages: "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust"… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-smol.
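
    The figures quoted in this description can be cross-checked with a little arithmetic. The sketch below assumes every one of the 30 languages contributes exactly 10,000 samples and that "2.6GB" means 2.6 × 10^9 bytes; neither assumption is confirmed by the description, and the function names are illustrative only.

```python
# Figures quoted in the the-stack-smol description.
NUM_LANGUAGES = 30              # "The dataset contains 30 programming languages"
SAMPLES_PER_LANGUAGE = 10_000   # "each programming language has 10,000 random samples"
TOTAL_SIZE_GB = 2.6             # "The dataset has 2.6GB of text (code)"

# Assumption (not stated in the description): 1 GB = 10**9 bytes.
BYTES_PER_GB = 10**9


def total_samples(num_languages: int, samples_per_language: int) -> int:
    """Total sample count if every language has the same number of samples."""
    return num_languages * samples_per_language


def mean_bytes_per_sample(total_gb: float, n_samples: int) -> float:
    """Average size of one sample in bytes, under the decimal-GB assumption."""
    return total_gb * BYTES_PER_GB / n_samples


if __name__ == "__main__":
    n = total_samples(NUM_LANGUAGES, SAMPLES_PER_LANGUAGE)
    avg = mean_bytes_per_sample(TOTAL_SIZE_GB, n)
    print(f"total samples: {n}")
    print(f"mean bytes per sample: {avg:.0f}")
```

    Under these assumptions the subset holds 300,000 samples averaging a little under 9 kB each; the real per-language sizes will differ.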

  17. huggingface-smol-course-instruction-tuning-dataset

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Raphael Fakhri (2025). huggingface-smol-course-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/taxiraph/huggingface-smol-course-instruction-tuning-dataset
    Dataset updated
    Apr 3, 2025
    Authors
    Raphael Fakhri
    Description

    taxiraph/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. Data from: cosmopedia

    • huggingface.co
    Updated Feb 20, 2024
    Cite
    Hugging Face Smol Models Research (2024). cosmopedia [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cosmopedia v0.1

    Image generated by DALL-E, the prompt was generated by Mixtral-8x7B-Instruct-v0.1
    

    Note: Cosmopedia v0.2 is available at smollm-corpus.

    User: What do you think "Cosmopedia" could mean? Hint: in our case it's not related to cosmology.

    Mixtral-8x7B-Instruct-v0.1: A possible meaning for "Cosmopedia" could be an encyclopedia or collection of information about different cultures, societies, and topics from around the world, emphasizing diversity and global… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.

  19. instruct-data-basics-smollm-H4

    • huggingface.co
    Updated Aug 17, 2024
    Cite
    Hugging Face Smol Models Research (2024). instruct-data-basics-smollm-H4 [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/instruct-data-basics-smollm-H4
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    A dataset of basic instructions and answers for training SmolLM-Instruct models; it includes answers to greetings and questions such as "Who are you". This dataset was included in the training of SmolLM-Instruct v0.2, but we didn't notice that it had an impact on model generations. We recommend using this larger, generic dataset of multi-turn everyday conversations: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k

  20. crush-smol

    • huggingface.co
    Updated Jun 6, 2025
    Cite
    Zhang Peiyuan (2025). crush-smol [Dataset]. https://huggingface.co/datasets/PY007/crush-smol
    Dataset updated
    Jun 6, 2025
    Authors
    Zhang Peiyuan
    Description

    PY007/crush-smol dataset hosted on Hugging Face and contributed by the HF Datasets community
