Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMOL
SMOL (Set for Maximal Overall Leverage) is a collection of professional translations into 221 low-resource languages, for the purpose of training translation models and otherwise increasing the representation of these languages in NLP and technology. Please read the SMOL Paper and the GATITOS Paper for a much more thorough description! There are four resources in this directory:
SmolDoc: document-level translations into 100 languages SmolSent: sentence-level translations into… See the full description on the dataset page: https://huggingface.co/datasets/google/smol.
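A minimal sketch for loading one SMOL subset with the Hugging Face datasets library is shown below; the config name is a placeholder assumption, since the real subset names (per resource and language pair) are listed on the dataset page.

from datasets import load_dataset

# Hedged sketch: "smolsent__en_xx" is a hypothetical config name used only for
# illustration; check https://huggingface.co/datasets/google/smol for the real
# per-resource / per-language-pair subset names.
ds = load_dataset("google/smol", "smolsent__en_xx", split="train")
print(ds[0])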
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Smol-SmolTalk
This is a subset of the SmolTalk dataset adapted for smol models with fewer than 1B parameters. We used it to build SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. We do SFT on this dataset and then DPO on UltraFeedback. Compared to SmolTalk:
The conversations from Smol-Magpie-Ultra are shorter in this dataset. We include less task-specific data compared to SmolTalk (e.g. no function calling and fewer rewriting and summarization examples) since these smaller models have… See the full description on the dataset page: https://huggingface.co/datasets/ushakov15/MNLP_M2_rag_dataset.
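For orientation, a minimal sketch of the SFT stage described above using trl's SFTTrainer (the DPO stage on UltraFeedback would follow with trl's DPOTrainer); the base checkpoint and training arguments are placeholder assumptions, not the exact SmolLM2 recipe, and the repo id is the one linked in this listing.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hedged sketch of the SFT stage; the base model and hyperparameters below are
# illustrative only, not the recipe used for SmolLM2-Instruct.
train_ds = load_dataset("ushakov15/MNLP_M2_rag_dataset", split="train")
trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",  # placeholder base checkpoint
    train_dataset=train_ds,
    args=SFTConfig(output_dir="smol-sft"),
)
trainer.train()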
Dataset Card for huggingface-smol-course-preference-tuning-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it with the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/Tina-xxxx/huggingface-smol-course-preference-tuning-dataset.
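To inspect the generated data without re-running the pipeline, a small hedged sketch is shown below; the column layout depends on the pipeline.yaml, so nothing about the schema is assumed here.

from datasets import load_dataset

# Load the distilabel-generated preference dataset and inspect its schema.
ds = load_dataset("Tina-xxxx/huggingface-smol-course-preference-tuning-dataset", split="train")
print(ds.column_names)
print(ds[0])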
yiliu051016/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
aspirina765/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
wilka/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
SmolTalk
Dataset description
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples. More details in our paper: https://arxiv.org/abs/2502.02737. During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to models trained on proprietary instruction datasets. To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.
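Since the dataset contains 1M samples, streaming a few records is a convenient way to inspect it; the "all" config name below is an assumption, so check the dataset page for the available configs.

from datasets import load_dataset

# Hedged sketch: stream a handful of SmolTalk samples instead of downloading
# the full 1M-sample dataset. The config name is an assumption.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample)
    if i == 2:
        break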
CallumLongenecker-Aristocrat/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
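Given the corpus size, streaming is the practical way to sample it; the "cosmopedia-v2" subset name below is taken from the description above, but treat the exact config names as something to confirm on the dataset page.

from datasets import load_dataset

# Hedged sketch: stream a few documents from the Cosmopedia v2 subset rather
# than downloading the full pre-training corpus.
ds = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", streaming=True)
print(next(iter(ds)))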
crush-smol was created from https://huggingface.co/datasets/bigdata-pw/crush (crush_smol.parquet). Captions were generated with Qwen2-VL.
generate_captions.py
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
import os
from pathlib import Path
from huggingface_hub import snapshot_download
from torchvision import io

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16…
See the full description on the dataset page: https://huggingface.co/datasets/wlsaidhi/crush-smol-merged.
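The snippet above is cut off by the card preview. For orientation, a hedged sketch of the kind of captioning step it leads into, following the usual Qwen2-VL processor/generate pattern; the prompt wording and image handling here are illustrative assumptions, not the exact script used to build crush-smol.

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def caption_image(image_path):
    # Build a single-image chat prompt and generate a caption with Qwen2-VL;
    # assumes `model` from the truncated snippet above.
    image = Image.open(image_path)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]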
atomkevich/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
MUTSC/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
NanoMatriX/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Tookies/huggingface-smol-course-preference-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
AlekseyElygin/huggingface-smol-course-Vikhr-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
A small subset (~0.1%) of the-stack dataset, with 10,000 random samples per programming language drawn from the original dataset. The dataset has 2.6 GB of text (code).
Languages
The dataset contains 30 programming languages: "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust"… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-smol.
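A minimal sketch for loading a single language split with the datasets library; the data_dir="data/python" layout is an assumption about how the per-language subsets are organized, so confirm the exact paths on the dataset page.

from datasets import load_dataset

# Hedged sketch: load only the Python subset (10,000 samples) rather than the
# whole 2.6 GB dataset. The data_dir layout is an assumption.
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
print(ds.column_names)
print(ds[0])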
taxiraph/huggingface-smol-course-instruction-tuning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cosmopedia v0.1
Image generated by DALL-E; the prompt was generated by Mixtral-8x7B-Instruct-v0.1.
Note: Cosmopedia v0.2 is available at smollm-corpus.
User: What do you think "Cosmopedia" could mean? Hint: in our case it's not related to cosmology.
Mixtral-8x7B-Instruct-v0.1: A possible meaning for "Cosmopedia" could be an encyclopedia or collection of information about different cultures, societies, and topics from around the world, emphasizing diversity and global… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
Dataset of basic instructions and answers for SmolLM-Instruct model training: it includes answers to greetings and questions such as "Who are you?". This dataset was included in the training of SmolLM-Instruct v0.2, but we didn't notice that it had an impact on model generations. We recommend using this larger, more generic dataset of multi-turn everyday conversations instead: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k
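If you want to follow that recommendation, the linked dataset can be pulled in the same way as the others in this listing; split names are not assumed here, so the full DatasetDict is printed.

from datasets import load_dataset

# Load the recommended multi-turn everyday-conversations dataset and show its splits.
ds = load_dataset("HuggingFaceTB/everyday-conversations-llama3.1-2k")
print(ds)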
PY007/crush-smol dataset hosted on Hugging Face and contributed by the HF Datasets community