An open-source replication of the WebText dataset from OpenAI.
This is a small subset representing the first 10K records from the original dataset - created for testing.
The full 8M-record dataset is at https://huggingface.co/datasets/openwebtext
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for "openwebtext"
Dataset Summary
An open-source replication of the WebText dataset from OpenAI that was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
plain_text
Size of downloaded dataset files: 13.51 GB. Size of the… See the full description on the dataset page: https://huggingface.co/datasets/dylanebert/openwebtext.
olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "openwebtext-100k"
More Information Needed
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This is the dataset processed using Andrej Karpathy's script: https://github.com/karpathy/nanoGPT/tree/master/data/openwebtext. The original dataset is https://huggingface.co/datasets/Skylion007/openwebtext, which now requires a datasets library version < 3 to download.
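A hedged sketch of downloading the original corpus before running the prepare script (the version pin follows from the note above; adjust it to your environment):
# Requires an older datasets library, e.g. installed with: pip install "datasets<3"
from datasets import load_dataset

ds = load_dataset("Skylion007/openwebtext", split="train", num_proc=8)
print(ds)  # a single "text" column, one document per row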
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A subset of OpenWebText, an open-source recreation of OpenAI's internal WebText corpus. This subset contains ~2 million documents, mainly in English, scraped from the web. The text is highly unstructured and not necessarily clean.
apollo-research/Skylion007-openwebtext-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The OpenWebText dataset, preprocessed and formatted as TFRecords for efficient use with TPUs. This format optimizes data loading and processing for large-scale language modeling tasks. Ideal for training transformer models on Google Colab with TPUs.
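A minimal sketch of consuming such TFRecords with tf.data (the file pattern and the single "text" feature are assumptions; check the actual record spec on the dataset page):
import tensorflow as tf

# Assumed layout: one serialized tf.train.Example per record with a "text" feature.
feature_spec = {"text": tf.io.FixedLenFeature([], tf.string)}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

files = tf.data.Dataset.list_files("openwebtext-*.tfrecord")
ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(64)
      .prefetch(tf.data.AUTOTUNE))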
This dataset was created by Florian 0627
This dataset was created by Himon Sarkar
I am Ismail Ben Alla, a computer science engineer with a keen interest in advancing natural language processing and deep learning. As part of a comprehensive project to develop GPT-2 from the ground up, I undertook the task of preprocessing a substantial portion of the OpenWebText dataset.
The dataset preprocessing involved tokenization with the GPT-2 tokenizer, resulting in two distinct sets.
For additional insights, updates, and access to the dataset, please refer to the dataset page.
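As a hedged illustration of the kind of GPT-2 tokenization described above (the sample text and end-of-text handling are illustrative assumptions, not the author's exact pipeline):
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize_document(text):
    # Append the end-of-text token so document boundaries survive concatenation.
    return tokenizer.encode(text) + [tokenizer.eos_token_id]

ids = tokenize_document("OpenWebText is an open replication of WebText.")
print(len(ids), ids[:10])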
chanind/openwebtext-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was made by slightly modifying the code from the GitHub repository linked below. It is part of the OpenWebText movement, in which people are trying to replicate the data that GPT-2 (a text-generation model once said to be too dangerous to publish) was trained on. This is only roughly a quarter of the entire data, but as Vanya Cohen stated in the Discussion section, the full dataset is available at https://skylion007.github.io/OpenWebTextCorpus/.
This dataset contains an id column identifying each text entry: ids go up to 2019434 in the large data.db and up to 743 in the smaller .db. Each id corresponds to a text column holding a chunk of text.
Thanks to eukaryote31 for the GitHub repo: https://github.com/eukaryote31/openwebtext
OpenAI's GPT-2 model is said to have still underfit the 40 GB WebText corpus, so I'm wondering how much data would be just right.
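A minimal sketch of reading a few rows with Python's sqlite3, assuming the id/text schema described above (the table name is not given, so it is looked up first):
import sqlite3

conn = sqlite3.connect("data.db")
# Discover the table name instead of guessing it.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)

# Assuming the table exposes id and text columns as described above.
table = tables[0][0]
for row_id, text in conn.execute(f"SELECT id, text FROM {table} LIMIT 3"):
    print(row_id, text[:80])
conn.close()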
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
A subset of the Skylion007/openwebtext dataset consisting of 1 million tokenized samples in the Lance file format for fast, memory-efficient I/O.
The files were tokenized using the gpt2 tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's input directory, which Kaggle Kernels do not provide, and the dataset is too large to move to /kaggle/working. To use this dataset, download it with the Kaggle API or through this page, then move the unzipped files into a folder called openwebtext_1M.lance. Detailed snippets on how to download and use the dataset are below.
First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/openwebtext-1m
$ mkdir openwebtext_1M.lance/
$ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
$ rm openwebtext-1m.zip
Once this is done, you will find the dataset in the openwebtext_1M.lance/ folder. To load it and get a feel for the data, run the snippet below.
import lance

# Open the Lance dataset in place; this does not load the data into memory.
dataset = lance.dataset('openwebtext_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
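Building on that, a hedged sketch for peeking at the contents (column names are not documented here, so the schema is inspected rather than assumed):
import lance

dataset = lance.dataset('openwebtext_1M.lance/')
# Show which columns the Lance dataset actually exposes.
print(dataset.schema)
# Fetch a handful of rows by index without scanning the whole dataset.
print(dataset.take([0, 1, 2, 3, 4]))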
OpenWebTextCorpus tokenized for Gemma
This dataset is a pre-tokenized version of the Skylion007/openwebtext dataset using the gemma tokenizer. As such, this dataset follows the same licensing as the original openwebtext dataset. This pre-tokenization is done as a performance optimization for using the openwebtext dataset with a Gemma model (gemma-2b, gemma-2b-it, gemma-7b, gemma-7b-it). This dataset was created using SAELens, with the following settings:
context_size: 8192… See the full description on the dataset page: https://huggingface.co/datasets/chanind/openwebtext-gemma.
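A hedged sketch for loading the pre-tokenized data with the datasets library (streaming and the split name are assumptions; column names are inspected rather than assumed):
from datasets import load_dataset

# Stream instead of downloading the full pre-tokenized corpus up front.
ds = load_dataset("chanind/openwebtext-gemma", split="train", streaming=True)
first = next(iter(ds))
# Check which columns (e.g. token ids) the dataset exposes before using them.
print(list(first.keys()))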
A copy of OpenWebText, an open-source replication of the WebText corpus that OpenAI used to train GPT-2. It is a large corpus suited to training GPT-scale language models.
This dataset was created by tuggypetu
wangzn2001/openwebtext-10k dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Thijmen/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "openwebtext-tokenized-9b"
More Information Needed