Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the FineWeb-Edu 10BT subset tokenized with the GPT-2 tokenizer for pre-training a GPT-2 model. The data is divided into shards (.npy files); each training shard contains 2e8 tokens and the test shard contains roughly 1.5e8 tokens.
For the FineWeb version, please refer to fineweb-10BT-for-gpt2.
Each .npy file can be loaded with numpy.load('file_name.npy').
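A minimal sketch of inspecting one shard with NumPy and the Hugging Face GPT-2 tokenizer; the shard file name below is only an assumption, since the exact naming scheme is not given here:

import numpy as np
from transformers import GPT2TokenizerFast

# Load one shard of pre-tokenized data (file name is an assumption)
tokens = np.load('shard_000.npy')
print(tokens.shape, tokens.dtype)  # a flat array of GPT-2 token IDs

# Decode the first few hundred tokens back to text as a sanity check
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
print(tokenizer.decode(tokens[:200].tolist()))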
Training data of the model, detokenized in the exact order seen by the model. The training data is partitioned into 8 chunks (chunk-0 through chunk-7), based on the GPU rank that generated the data. Each chunk contains detokenized text files in JSON Lines format (.jsonl).
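A minimal sketch of reading one chunk; the file path and the assumption that each JSON record carries a "text" field are both guesses, since the exact layout is not documented here:

import json

# Stream a few records from one detokenized JSONL chunk (path and field name are assumptions)
with open('chunk-0/part-000.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.get('text', '')[:200])
        if i == 2:  # only peek at the first few records
            break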
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was extracted for pre-training GPT-2 to generate kids' stories in English and Tamil.
For English, the data was extracted from gutenberg.org.
Special thanks to mateibejan for the metadata for the Gutenberg books.
For the Tamil dataset, the data was extracted from Siruvarmalar, a very old and reliable source of kids' stories.
The code for the data extraction can be found at: github/picturebook.ai
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This is the OpenWebText dataset processed using Andrej Karpathy's nanoGPT script, https://github.com/karpathy/nanoGPT/tree/master/data/openwebtext. The original dataset is https://huggingface.co/datasets/Skylion007/openwebtext, which now requires a datasets library version < 3 to download.
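A minimal sketch of downloading the original corpus under that constraint; the version pin and the trust_remote_code flag are assumptions about what a current datasets 2.x release expects:

# pip install "datasets<3"
from datasets import load_dataset

# OpenWebText uses a loading script, which datasets >= 3 no longer supports
ds = load_dataset('Skylion007/openwebtext', split='train', trust_remote_code=True)
print(ds[0]['text'][:200])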
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains:
- 250K documents from the WebText test set
- For each GPT-2 model (trained on the WebText training set): 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation
FineWeb Dataset - GPT-2 Tokenized
This dataset contains preprocessed and tokenized FineWeb data using the GPT-2 tokenizer. It consists of multiple training folders containing the processed data. Dataset structure:
fineweb_train_000001 to fineweb_train_000005: Training folders
Other license: https://choosealicense.com/licenses/other/
Geonwoohong/pile-uncopyrighted-train-tokenized-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Aananda Giri.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a single JSON file containing label-value form fields generated using GPT-2. The data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's Supplementary Materials, and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt.
The data consists of groups of label-value pairs, each group with a "title" or topic (or null). Each group of label-value pairs was generated in a single GPT-2 generation, so the pairs "belong to the same form." The JSON structure is a list of tuples, where each tuple has the title (or null) as its first element and the group's list of label-value pairs as its second element. Each label-value pair is itself a tuple whose first element is the label and whose second element is the value or a list of values.
For example:
[ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
kbmurali/gpt2-qa-train-ds dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Geonwoohong/lambada-openai-train-tokenized-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Abdullah Meda.
This dataset was created by Abhishek Thakur.
Released under Data files © Original Authors.
GPT-2 pre-trained models and configurations.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This is Hugging Face's gpt2-large weights fine-tuned on the whole WritingPrompts training set (https://www.kaggle.com/ratthachat/writing-prompts),
which is preprocessed here: https://www.kaggle.com/ratthachat/writingprompts-combine-one-line-data-for-gpt2
The perplexity of gpt2-large on this dataset is 14.8, compared to 16.8 for gpt2-medium (see the reference kernel below) fine-tuned only on the small dataset.
If we fine-tune gpt2-large on the small validation dataset, we get a perplexity of 16.2 (tested locally).
The whole model was trained on a V100 machine for 20 hours; with fp16 training, the time should be roughly halved.
See the kernel https://www.kaggle.com/ratthachat/writingprompts-gpt2-lm-fine-tune
To skip the training step, change gpt2-medium to gpt2-large in that kernel; if you cannot load the weights normally, you can try:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config

# Rebuild the model from the saved config, then load the fine-tuned weights
config = GPT2Config.from_json_file('[path]/config.json')
model = GPT2LMHeadModel(config)
model.load_state_dict(torch.load('[path]/pytorch_model.bin', map_location='cpu'))
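Once the weights are loaded, generation works like any other transformers GPT-2 model; a short sketch continuing from the snippet above, with an arbitrary prompt and sampling settings:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model.eval()

prompt = 'A lighthouse keeper finds a message in a bottle.'
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))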
Other license: https://choosealicense.com/licenses/other/
The following dataset is constructed using entirely historical data up to the cutoff date "31-12-2012". The data comes from the WMT News dataset (https://data.statmt.org/news-crawl/en/) and Wikipedia. This dataset is the training dataset for a GPT-2-small-based model, and is available on Hugging Face at the following location: "TiMa/TiMaGPT2-2012". The dataset uses the same license as the WMT News dataset (https://data.statmt.org/news-crawl/README) as this is the less permissive license of the… See the full description on the dataset page: https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2012.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The sample-10BT version of the FineWeb dataset, tokenized using the gpt2 tokenizer and split into 100M-token binary shards. A shard is simply a 1D stream of np.uint16 values, the tokenized samples from the dataset stored contiguously. Each sample was prefixed with the <|endoftext|> special token before being tokenized. There are 103 training shards (under the train/ dir) and 1 validation shard (under val/).
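A minimal sketch of reading one shard back; the file name and the use of tiktoken are assumptions (any GPT-2 tokenizer would do):

import numpy as np
import tiktoken

# Each shard is a raw, contiguous stream of uint16 GPT-2 token IDs
tokens = np.fromfile('train/shard_000.bin', dtype=np.uint16)
print(f'{tokens.size:,} tokens in this shard')

# Decode a small slice to verify the contents; samples are separated by <|endoftext|>
enc = tiktoken.get_encoding('gpt2')
print(enc.decode(tokens[:200].tolist()))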
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example image: https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, has had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task that separates LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets (DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset)
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to add diversity and complexity to the data. Generated essays leveraged a combination of the following (a hedged decoding sketch follows the list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
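As an illustration of the decoding settings listed above, a hedged transformers sketch; the model, prompt, and parameter values are placeholders, not the exact configs used to build the datamix:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'gpt2'  # stand-in model; the datamix used much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer('Write an essay on the value of extracurricular activities.', return_tensors='pt')

# High temperature, large top-k, and typical_p sampling to diversify generations
sampled = model.generate(**inputs, do_sample=True, temperature=1.4, top_k=400,
                         typical_p=0.95, max_new_tokens=120)

# Contrastive search (penalty_alpha + small top_k) as an alternative decoding strategy
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=120)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))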
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back translation
- Random capitalization
- Sentence swapping
Other license: https://choosealicense.com/licenses/other/
The following dataset is constructed using entirely historical data up to the cutoff date "31-12-2014". The data comes from the WMT News dataset (https://data.statmt.org/news-crawl/en/) and Wikipedia. This dataset is the training dataset for a GPT-2-small-based model, and is available on Hugging Face at the following location: "TiMa/TiMaGPT2-2014". The dataset uses the same license as the WMT News dataset (https://data.statmt.org/news-crawl/README) as this is the less permissive license of the… See the full description on the dataset page: https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2014.
This dataset was created by Jamie Wang.