https://choosealicense.com/licenses/cc0-1.0/
An open-source replication of the WebText dataset from OpenAI.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
10K slice of OpenWebText, an open-source replication of the WebText dataset from OpenAI. This small subset contains the first 10K records of the original dataset and was created for testing.
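For quick experiments, both the full dataset and the 10K test slice can be pulled with the Hugging Face `datasets` library. The following is a minimal sketch: the repo IDs ("Skylion007/openwebtext", inferred from the tokenizer entries below, and "stas/openwebtext-10k", inferred from the slice credited to stas further down) are assumptions, so substitute the IDs your index actually points to.

```python
from datasets import load_dataset

# Full dataset (~8M documents) -- stream to avoid downloading everything.
# Repo ID assumed from the "Skylion007-openwebtext" entries in this listing.
full = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

# Small 10K slice for quick tests (assumed repo ID).
small = load_dataset("stas/openwebtext-10k", split="train")

print(next(iter(full))["text"][:200])  # peek at the first document
print(len(small))                      # expected: 10000
```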
OpenWebText dataset (the open-source replication of OpenAI's WebText dataset that was used to train GPT-2), tokenized for Llama 3.2 models. Useful for accelerated training and testing of sparse autoencoders. Context size: 128; not shuffled.
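A pre-tokenized, fixed-context dataset like this one might be batched for sparse-autoencoder runs as sketched below. The repo ID and the "input_ids" column name are assumptions about the tokenized format, not confirmed by this listing.

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader

# Hypothetical repo ID -- replace with the actual tokenized dataset.
ds = load_dataset("your-org/openwebtext-llama3.2-tokenized", split="train")
ds = ds.with_format("torch")  # return columns as torch tensors

# No shuffling, matching the "not shuffled" note in the description.
loader = DataLoader(ds, batch_size=32)

batch = next(iter(loader))
# Each row should be a fixed-length sequence of 128 tokens, per the
# context size stated above ("input_ids" column name is an assumption).
assert batch["input_ids"].shape[1] == 128
```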
olivercareyncl/openwebtext.json dataset hosted on Hugging Face and contributed by the HF Datasets community
olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
chanind/openwebtext-gemma-1024-abbrv-2B dataset hosted on Hugging Face and contributed by the HF Datasets community
xiangchensong/Skylion007-openwebtext-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
pccl-org/Skylion007-openwebtext-tokenizer-gpt2-64 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by tuggypetu
The first 10K elements of The Pile, useful for debugging models trained on it. See the Hugging Face page for the full Pile for more info. Inspired by stas' great resource doing the same for OpenWebText.
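Loading this slice for a quick debugging pass might look like the sketch below; the repo ID "NeelNanda/pile-10k" is an assumption, so confirm it against the Hugging Face page mentioned above.

```python
from datasets import load_dataset

# Assumed repo ID for the 10K Pile slice.
pile_10k = load_dataset("NeelNanda/pile-10k", split="train")

print(len(pile_10k))              # expected: 10000
print(pile_10k[0]["text"][:200])  # inspect the first record
```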