Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the FineWeb-Edu 10BT subset tokenized with the GPT-2 tokenizer for pre-training a GPT-2 model. The data is divided into shards (.npy files); each training shard contains 2e8 tokens and the test shard contains roughly 1.5e8 tokens.
For the FineWeb version, please refer to fineweb-10BT-for-gpt2.
Each .npy file can be loaded with numpy.load('file_name.npy').
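A minimal sketch of inspecting one shard with NumPy and the Hugging Face GPT-2 tokenizer; the shard file name below is only an assumption, since the exact naming scheme is not given here:

import numpy as np
from transformers import GPT2TokenizerFast

# Load one shard of pre-tokenized data (file name is an assumption)
tokens = np.load('shard_000.npy')
print(tokens.shape, tokens.dtype)  # a flat array of GPT-2 token IDs

# Decode the first few hundred tokens back to text as a sanity check
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
print(tokenizer.decode(tokens[:200].tolist()))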
Training data of the model, detokenized in the exact order seen by the model. The training data is partitioned into 8 chunks (chunk-0 through chunk-7), based on the GPU rank that generated the data. Each chunk contains detokenized text files in JSON Lines format (.jsonl).
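A minimal sketch of reading one chunk; the file path and the assumption that each JSON record carries a "text" field are both guesses, since the exact layout is not documented here:

import json

# Stream a few records from one detokenized JSONL chunk (path and field name are assumptions)
with open('chunk-0/part-000.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.get('text', '')[:200])
        if i == 2:  # only peek at the first few records
            break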
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was extracted for pre-training GPT-2 to generate kids' stories in English and Tamil.
For English, the data was extracted from gutenberg.org.
Special thanks to mateibejan for the metadata for the Gutenberg books.
For the Tamil dataset, the data was extracted from Siruvarmalar, a very old and reliable source of kids' stories.
The code for the data extraction can be found at: github/picturebook.ai
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This is the OpenWebText dataset processed using Andrej Karpathy's nanoGPT script, https://github.com/karpathy/nanoGPT/tree/master/data/openwebtext. The original dataset is https://huggingface.co/datasets/Skylion007/openwebtext, which now requires a datasets library version < 3 to download.
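A minimal sketch of downloading the original corpus under that constraint; the version pin and the trust_remote_code flag are assumptions about what a current datasets 2.x release expects:

# pip install "datasets<3"
from datasets import load_dataset

# OpenWebText uses a loading script, which datasets >= 3 no longer supports
ds = load_dataset('Skylion007/openwebtext', split='train', trust_remote_code=True)
print(ds[0]['text'][:200])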
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains:
- 250K documents from the WebText test set
- For each GPT-2 model (trained on the WebText training set): 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation
FineWeb Dataset - GPT-2 Tokenized
This dataset contains preprocessed and tokenized FineWeb data using the GPT-2 tokenizer. It consists of multiple training folders containing the processed data. Dataset structure:
fineweb_train_000001 to fineweb_train_000005: Training folders
Other license: https://choosealicense.com/licenses/other/
Geonwoohong/pile-uncopyrighted-train-tokenized-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Aananda Giri.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a single JSON file containing label-value form fields generated using GPT-2. The data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's Supplementary Materials, and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt.
The data consists of groups of label-value pairs, each group with a "title" or topic (or null). Each group of label-value pairs was generated in a single GPT-2 generation, so the pairs "belong to the same form." The JSON structure is a list of tuples, where each tuple has the title (or null) as its first element and the group's list of label-value pairs as its second element. Each label-value pair is itself a tuple whose first element is the label and whose second element is the value or a list of values.
For example:
[ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
kbmurali/gpt2-qa-train-ds dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Geonwoohong/lambada-openai-train-tokenized-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Abdullah Meda.
This dataset was created by Abhishek Thakur.
Released under Data files © Original Authors.
GPT-2 pre-trained models and configurations.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This is Hugging Face's gpt2-large weights fine-tuned on the whole WritingPrompts training set (https://www.kaggle.com/ratthachat/writing-prompts),
which is preprocessed here: https://www.kaggle.com/ratthachat/writingprompts-combine-one-line-data-for-gpt2
The perplexity of gpt2-large on this dataset is 14.8, compared to 16.8 for gpt2-medium (see the reference kernel below) fine-tuned only on the small dataset.
If we fine-tune gpt2-large on the small validation dataset, we get a perplexity of 16.2 (tested locally).
The whole model was trained on a V100 machine for 20 hours; with fp16 training, the time should be roughly halved.
See the kernel https://www.kaggle.com/ratthachat/writingprompts-gpt2-lm-fine-tune
To skip the training step, change gpt2-medium to gpt2-large in that kernel; if you cannot load the weights normally, you can try:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config

# Rebuild the model from the saved config, then load the fine-tuned weights
config = GPT2Config.from_json_file('[path]/config.json')
model = GPT2LMHeadModel(config)
model.load_state_dict(torch.load('[path]/pytorch_model.bin', map_location='cpu'))
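Once the weights are loaded, generation works like any other transformers GPT-2 model; a short sketch continuing from the snippet above, with an arbitrary prompt and sampling settings:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model.eval()

prompt = 'A lighthouse keeper finds a message in a bottle.'
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))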
Other license: https://choosealicense.com/licenses/other/
The following dataset is constructed using entirely historical data up to the cutoff date "31-12-2012". The data comes from the WMT News dataset (https://data.statmt.org/news-crawl/en/) and Wikipedia. This dataset is the training dataset for a GPT-2-small-based model, and is available on Hugging Face at the following location: "TiMa/TiMaGPT2-2012". The dataset uses the same license as the WMT News dataset (https://data.statmt.org/news-crawl/README) as this is the less permissive license of the… See the full description on the dataset page: https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2012.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The sample-10BT version of the FineWeb dataset, tokenized using the gpt2 tokenizer and split into 100M-token binary shards. A shard is simply a 1D stream of np.uint16 values, the tokenized samples from the dataset stored contiguously. Each sample was prefixed with the <|endoftext|> special token before being tokenized. There are 103 training shards (under the train/ dir) and 1 validation shard (under val/).
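A minimal sketch of reading one shard back; the file name and the use of tiktoken are assumptions (any GPT-2 tokenizer would do):

import numpy as np
import tiktoken

# Each shard is a raw, contiguous stream of uint16 GPT-2 token IDs
tokens = np.fromfile('train/shard_000.bin', dtype=np.uint16)
print(f'{tokens.size:,} tokens in this shard')

# Decode a small slice to verify the contents; samples are separated by <|endoftext|>
enc = tiktoken.get_encoding('gpt2')
print(enc.decode(tokens[:200].tolist()))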
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example image: https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, has had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task that separates LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets (DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset)
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to add diversity and complexity to the data. Generated essays leveraged a combination of the following (a hedged decoding sketch follows the list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
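As an illustration of the decoding settings listed above, a hedged transformers sketch; the model, prompt, and parameter values are placeholders, not the exact configs used to build the datamix:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'gpt2'  # stand-in model; the datamix used much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer('Write an essay on the value of extracurricular activities.', return_tensors='pt')

# High temperature, large top-k, and typical_p sampling to diversify generations
sampled = model.generate(**inputs, do_sample=True, temperature=1.4, top_k=400,
                         typical_p=0.95, max_new_tokens=120)

# Contrastive search (penalty_alpha + small top_k) as an alternative decoding strategy
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=120)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))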
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back translation
- Random capitalization
- Sentence swapping
Other license: https://choosealicense.com/licenses/other/
The following dataset is constructed using entirely historical data up to the cutoff date "31-12-2014". The data comes from the WMT News dataset (https://data.statmt.org/news-crawl/en/) and Wikipedia. This dataset is the training dataset for a GPT-2-small-based model, and is available on Hugging Face at the following location: "TiMa/TiMaGPT2-2014". The dataset uses the same license as the WMT News dataset (https://data.statmt.org/news-crawl/README) as this is the less permissive license of the… See the full description on the dataset page: https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2014.
This dataset was created by Jamie Wang.