Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMOL
SMOL (Set for Maximal Overall Leverage) is a collection of professional translations into 221 low-resource languages, for the purpose of training translation models and otherwise increasing the representation of these languages in NLP and technology. Please read the SMOL Paper and the GATITOS Paper for a much more thorough description! There are four resources in this directory:
SmolDoc: document-level translations into 100 languages SmolSent: sentence-level translations into… See the full description on the dataset page: https://huggingface.co/datasets/google/smol.
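A minimal sketch for loading one SMOL subset with the Hugging Face datasets library is shown below; the config name is a placeholder assumption, since the real subset names (per resource and language pair) are listed on the dataset page.

from datasets import load_dataset

# Hedged sketch: "smolsent__en_xx" is a hypothetical config name used only for
# illustration; check https://huggingface.co/datasets/google/smol for the real
# per-resource / per-language-pair subset names.
ds = load_dataset("google/smol", "smolsent__en_xx", split="train")
print(ds[0])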
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Smol-SmolTalk
This is a subset of the SmolTalk dataset adapted for smol models with fewer than 1B parameters. We used it to build SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. We do SFT on this dataset and then DPO on UltraFeedback. Compared to SmolTalk:
The conversations from Smol-Magpie-Ultra are shorter in this dataset. We include less task-specific data compared to SmolTalk (e.g. no function calling and fewer rewriting and summarization examples) since these smaller models have… See the full description on the dataset page: https://huggingface.co/datasets/ushakov15/MNLP_M2_rag_dataset.
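For orientation, a minimal sketch of the SFT stage described above using trl's SFTTrainer (the DPO stage on UltraFeedback would follow with trl's DPOTrainer); the base checkpoint and training arguments are placeholder assumptions, not the exact SmolLM2 recipe, and the repo id is the one linked in this listing.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hedged sketch of the SFT stage; the base model and hyperparameters below are
# illustrative only, not the recipe used for SmolLM2-Instruct.
train_ds = load_dataset("ushakov15/MNLP_M2_rag_dataset", split="train")
trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",  # placeholder base checkpoint
    train_dataset=train_ds,
    args=SFTConfig(output_dir="smol-sft"),
)
trainer.train()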
Dataset Card for huggingface-smol-course-preference-tuning-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it with the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset.
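To inspect the generated data without re-running the pipeline, a small hedged sketch is shown below; the column layout depends on the pipeline.yaml, so nothing about the schema is assumed here.

from datasets import load_dataset

# Load the distilabel-generated preference dataset and inspect its schema.
ds = load_dataset("Tina-xxxx/huggingface-smol-course-preference-tuning-dataset", split="train")
print(ds.column_names)
print(ds[0])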
yiliu051016/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
aspirina765/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
wilka/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
SmolTalk
Dataset description
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details in our paper: https://arxiv.org/abs/2502.02737. During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to models trained on proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.
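Since the dataset contains 1M samples, streaming a few records is a convenient way to inspect it; the "all" config name below is an assumption, so check the dataset page for the available configs.

from datasets import load_dataset

# Hedged sketch: stream a handful of SmolTalk samples instead of downloading
# the full 1M-sample dataset. The config name is an assumption.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample)
    if i == 2:
        break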
CallumLongenecker-Aristocrat/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
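Given the corpus size, streaming is the practical way to sample it; the "cosmopedia-v2" subset name below is taken from the description above, but treat the exact config names as something to confirm on the dataset page.

from datasets import load_dataset

# Hedged sketch: stream a few documents from the Cosmopedia v2 subset rather
# than downloading the full pre-training corpus.
ds = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", streaming=True)
print(next(iter(ds)))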
crush-smol was created from https://huggingface.co/datasets/bigdata-pw/crush (crush_smol.parquet). Captions were generated with Qwen2-VL.
generate_captions.py
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
import os
from pathlib import Path
from huggingface_hub import snapshot_download
from torchvision import io

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16…
See the full description on the dataset page: https://huggingface.co/datasets/wlsaidhi/crush-smol-merged.
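The snippet above is cut off by the card preview. For orientation, a hedged sketch of the kind of captioning step it leads into, following the usual Qwen2-VL processor/generate pattern; the prompt wording and image handling here are illustrative assumptions, not the exact script used to build crush-smol.

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def caption_image(image_path):
    # Build a single-image chat prompt and generate a caption with Qwen2-VL;
    # assumes `model` from the truncated snippet above.
    image = Image.open(image_path)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]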
atomkevich/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
MUTSC/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
NanoMatriX/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Tookies/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
AlekseyElygin/huggingface-smol-course-Vikhr-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
A small subset (~0.1%) of the-stack dataset, with 10,000 random samples per programming language drawn from the original dataset. The dataset has 2.6 GB of text (code).
Languages
The dataset contains 30 programming languages: "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust"… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-smol.
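A minimal sketch for loading a single language split with the datasets library; the data_dir="data/python" layout is an assumption about how the per-language subsets are organized, so confirm the exact paths on the dataset page.

from datasets import load_dataset

# Hedged sketch: load only the Python subset (10,000 samples) rather than the
# whole 2.6 GB dataset. The data_dir layout is an assumption.
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
print(ds.column_names)
print(ds[0])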
taxiraph/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cosmopedia v0.1
Image generated by DALL-E; the prompt was generated by Mixtral-8x7B-Instruct-v0.1.
Note: Cosmopedia v0.2 is available at smollm-corpus.
User: What do you think "Cosmopedia" could mean? Hint: in our case it's not related to cosmology.
Mixtral-8x7B-Instruct-v0.1: A possible meaning for "Cosmopedia" could be an encyclopedia or collection of information about different cultures, societies, and topics from around the world, emphasizing diversity and global… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
Dataset of basic instructions and answers for SmolLM-Instruct model training: it includes answers to greetings and questions such as "Who are you?". This dataset was included in the training of SmolLM-Instruct v0.2, but we didn't notice that it had an impact on model generations. We recommend using this larger, more generic dataset of multi-turn everyday conversations instead: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k
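If you want to follow that recommendation, the linked dataset can be pulled in the same way as the others in this listing; split names are not assumed here, so the full DatasetDict is printed.

from datasets import load_dataset

# Load the recommended multi-turn everyday-conversations dataset and show its splits.
ds = load_dataset("HuggingFaceTB/everyday-conversations-llama3.1-2k")
print(ds)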
PY007/crush-smol dataset hosted on Hugging Face and contributed by the HF Datasets community