2 datasets found

TinyStories
huggingface.co
opendatalab.com
Updated May 16, 2023
Cite
Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 16, 2023
Authors
Ronen Eldan
License
https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
A dataset of short stories, synthetically generated by GPT-3.5 and GPT-4, that use only a small vocabulary. It is described in the paper at https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Hugging Face, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
TinyStories-tokenized-10k
huggingface.co
Updated Jun 12, 2025
Cite
Kabir Bakhshaei (2025). TinyStories-tokenized-10k [Dataset]. https://huggingface.co/datasets/KabirBakhshaei/TinyStories-tokenized-10k
Dataset updated
Jun 12, 2025
Authors
Kabir Bakhshaei
License
https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
TinyStories-tokenized-10k

This repository provides a pre-tokenized version of the TinyStories dataset, prepared using a custom Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 10,000 tokens. This preprocessing was performed to facilitate efficient training of compact language models while maintaining high-quality language modeling performance.
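To make the idea of a BPE tokenizer with a fixed vocabulary size concrete, here is a minimal, illustrative sketch of BPE training in plain Python. It is not the repository's tokenizer, whose details are not given here; the function names (`train_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the toy corpus are all assumptions made for illustration. The real dataset uses a vocabulary of 10,000 tokens, while the toy run below stops at 20.

```python
from collections import Counter

END = "</w>"  # assumed end-of-word marker, a common BPE convention


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or no pairs remain."""
    # Each word starts as a sequence of characters plus the end-of-word marker.
    words = Counter(tuple(w) + (END,) for w in corpus.split())
    vocab = {sym for word in words for sym in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        merged_words = Counter()
        for word, freq in words.items():
            merged_words[merge_pair(word, best)] += freq
        words = merged_words
    return merges, vocab


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical); a real run would use the TinyStories text.
merges, vocab = train_bpe("low low lower lowest newest newest widest", vocab_size=20)
tokens = encode("lowest", merges)
```

Joining the tokens returned by `encode` always reproduces the original word plus the end marker, so the encoding is lossless at the word level. A larger vocabulary yields fewer tokens per word at the cost of a bigger embedding table.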

📦 Dataset Description

Source: roneneldan/TinyStories
Tokenizer: Custom BPE tokenizer…
See the full description on the dataset page: https://huggingface.co/datasets/KabirBakhshaei/TinyStories-tokenized-10k.

383 scholarly articles cite the TinyStories dataset (roneneldan/TinyStories), according to Google Scholar.
