37 datasets found
  1. fineweb-edu-10BT-for-gpt2

    • kaggle.com
    zip
    Updated Jul 20, 2024
    Cite
    Minh-Thien Nguyen (2024). fineweb-edu-10BT-for-gpt2 [Dataset]. https://www.kaggle.com/datasets/minhthiennguyen/fineweb-edu-10bt-for-gpt2
    Explore at:
    Available download formats: zip (13769081319 bytes)
    Dataset updated
    Jul 20, 2024
    Authors
    Minh-Thien Nguyen
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the Fineweb-Edu 10BT subset tokenized with the GPT2 tokenizer for pre-training a GPT2 model. The data is divided into shards (.npy files); each training shard contains 2e8 tokens, and the test shard contains roughly 1.5e8 tokens.

    For the Fineweb version, please refer to fineweb-10BT-for-gpt2.

    Each .npy file can be loaded with numpy.load('file_name.npy').
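
    For example, a minimal loading sketch (the shard file name below is hypothetical, not taken from the dataset listing):

    import numpy as np

    # Load one shard of GPT2 token ids into memory.
    tokens = np.load('edufineweb_train_000001.npy')
    print(tokens.shape, tokens.dtype)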

  2. gpt2-toksuite-detokenized

    • huggingface.co
    Updated Nov 18, 2025
    Cite
    TokSuite (2025). gpt2-toksuite-detokenized [Dataset]. https://huggingface.co/datasets/toksuite/gpt2-toksuite-detokenized
    Explore at:
    Dataset updated
    Nov 18, 2025
    Dataset authored and provided by
    TokSuite
    Description

    The model's training data, detokenized in the exact order seen by the model. The data is partitioned into 8 chunks (chunk-0 through chunk-7) based on the GPU rank that generated them. Each chunk contains detokenized text files in JSON Lines format (.jsonl).
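
    A minimal sketch for streaming one chunk; the file path and record field names are assumptions, not confirmed by the dataset card:

    import json

    # Read a .jsonl file line by line; each line is one JSON record.
    with open('chunk-0/part-000.jsonl', 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            # Inspect one record to confirm the actual field names.
            print(list(record.keys()))
            break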

  3. Data for English and Tamil Generative Models.

    • kaggle.com
    zip
    Updated Apr 8, 2023
    Cite
    Aneesh Aparajit (2023). Data for English and Tamil Generative Models. [Dataset]. https://www.kaggle.com/datasets/aneesh10/gpt2-pretraining-dataset-for-english-and-tamil/code
    Explore at:
    Available download formats: zip (2850559901 bytes)
    Dataset updated
    Apr 8, 2023
    Authors
    Aneesh Aparajit
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset was extracted for pre-training GPT2 to generate children's stories in English and Tamil.

    For the English data, the text was extracted from gutenberg.org.

    Special thanks to mateibejan for the metadata for the Gutenberg books.

    For the Tamil data, the text was extracted from Siruvarmalar, a long-established and reliable source of children's stories.

    The code for the data extraction can be found at: github/picturebook.ai

  4. OpenWebText-gpt2

    • kaggle.com
    zip
    Updated Jan 25, 2025
    Cite
    windmaple (2025). OpenWebText-gpt2 [Dataset]. https://www.kaggle.com/datasets/windmaple/openwebtext-gpt2
    Explore at:
    Available download formats: zip (12138662851 bytes)
    Dataset updated
    Jan 25, 2025
    Authors
    windmaple
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset was processed using Andrej Karpathy's nanoGPT script (https://github.com/karpathy/nanoGPT/tree/master/data/openwebtext). The original dataset is https://huggingface.co/datasets/Skylion007/openwebtext, which now requires datasets library version < 3 to download.
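
    A minimal download sketch under the version constraint noted above (assumes the Hugging Face datasets library; the split and field names follow the usual openwebtext layout and are worth verifying):

    # pip install "datasets<3"
    from datasets import load_dataset

    # Load the raw OpenWebText corpus.
    ds = load_dataset("Skylion007/openwebtext", split="train")
    print(ds[0]["text"][:200])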

  5. Gpt-2-output-dataset

    • opendatalab.com
    • kaggle.com
    zip
    Updated Jan 1, 2019
    Cite
    OpenAI (2019). Gpt-2-output-dataset [Dataset]. https://opendatalab.com/OpenDataLab/Gpt-2-output-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 1, 2019
    Dataset provided by
    OpenAI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains:
    • 250K documents from the WebText test set
    • For each GPT-2 model (trained on the WebText training set): 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation

  6. fineweb10B-gpt2-fluxentropy

    • huggingface.co
    Updated Dec 5, 2024
    Cite
    Sinatras (2024). fineweb10B-gpt2-fluxentropy [Dataset]. https://huggingface.co/datasets/sinatras/fineweb10B-gpt2-fluxentropy
    Explore at:
    Dataset updated
    Dec 5, 2024
    Authors
    Sinatras
    Description

    FineWeb Dataset - GPT-2 Tokenized

    This dataset contains FineWeb data preprocessed and tokenized with the GPT-2 tokenizer. It consists of multiple training folders containing the processed data. Dataset structure:

    fineweb_train_000001 to fineweb_train_000005: Training folders

  7. pile-uncopyrighted-train-tokenized-gpt2

    • huggingface.co
    Updated Nov 8, 2025
    Cite
    Geonwoo Hong (2025). pile-uncopyrighted-train-tokenized-gpt2 [Dataset]. https://huggingface.co/datasets/Geonwoohong/pile-uncopyrighted-train-tokenized-gpt2
    Explore at:
    Dataset updated
    Nov 8, 2025
    Authors
    Geonwoo Hong
    License

    Other (https://choosealicense.com/licenses/other/)

    Description

    The Geonwoohong/pile-uncopyrighted-train-tokenized-gpt2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  8. GPT2-Nepali

    • kaggle.com
    zip
    Updated Dec 16, 2024
    Cite
    Aananda Giri (2024). GPT2-Nepali [Dataset]. https://www.kaggle.com/datasets/aanandagiri/sebastian-gpt2
    Explore at:
    Available download formats: zip (5398572169 bytes)
    Dataset updated
    Dec 16, 2024
    Authors
    Aananda Giri
    Description

    Dataset

    This dataset was created by Aananda Giri

    Contents

  9. GPT-2 generated form fields

    • data.niaid.nih.gov
    Updated May 13, 2022
    Cite
    Brian Davis (2022). GPT-2 generated form fields [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6544100
    Explore at:
    Dataset updated
    May 13, 2022
    Dataset provided by
    Brigham Young University
    Authors
    Brian Davis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a single JSON file containing label-value form fields generated using GPT-2. The data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's supplementary materials, and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt.

    The data has groups of label-value pairs each with a "title" or topic (or null). Each label-value pair group was generated in a single GPT-2 generation and thus the pairs "belong to the same form." The json structure is a list of tuples, where each tuple has the title or null as the first element and the list of label-value pairs of the group as the second element. Each label-value pair is another tuple with the first element being the label and the second being the value or a list of values.

    For example:

    [ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]

  10. gpt2-qa-train-ds

    • huggingface.co
    Cite
    Murali Kashaboina, gpt2-qa-train-ds [Dataset]. https://huggingface.co/datasets/kbmurali/gpt2-qa-train-ds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Murali Kashaboina
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The kbmurali/gpt2-qa-train-ds dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  11. lambada-openai-train-tokenized-gpt2

    • huggingface.co
    Updated Oct 7, 2025
    Cite
    Geonwoo Hong (2025). lambada-openai-train-tokenized-gpt2 [Dataset]. https://huggingface.co/datasets/Geonwoohong/lambada-openai-train-tokenized-gpt2
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Geonwoo Hong
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Geonwoohong/lambada-openai-train-tokenized-gpt2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  12. gpt2-train-feats

    • kaggle.com
    zip
    Updated Jan 31, 2024
    Cite
    Abdullah Meda (2024). gpt2-train-feats [Dataset]. https://www.kaggle.com/datasets/abdullahmeda/gpt2-train-feats
    Explore at:
    Available download formats: zip (94715615 bytes)
    Dataset updated
    Jan 31, 2024
    Authors
    Abdullah Meda
    Description

    Dataset

    This dataset was created by Abdullah Meda

    Contents

  13. GPT2 Pretrained Models (Pytorch)

    • kaggle.com
    zip
    Updated Nov 12, 2019
    Cite
    Abhishek Thakur (2019). GPT2 Pretrained Models (Pytorch) [Dataset]. https://www.kaggle.com/abhishek/gpt2-pytorch
    Explore at:
    Available download formats: zip (10740392680 bytes)
    Dataset updated
    Nov 12, 2019
    Authors
    Abhishek Thakur
    Description

    Dataset

    This dataset was created by Abhishek Thakur

    Released under: Data files © Original Authors

    Contents

    GPT2 pre-trained models and configurations.

  14. GPT2-Large Weights Finetuned on WritingPrompts

    • kaggle.com
    zip
    Updated Apr 25, 2020
    Cite
    Neuron Engineer (2020). GPT2-Large Weights Finetuned on WritingPrompts [Dataset]. https://www.kaggle.com/ratthachat/gpt2-large-writingprompts
    Explore at:
    Available download formats: zip (2871197560 bytes)
    Dataset updated
    Apr 25, 2020
    Authors
    Neuron Engineer
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Content

    These are Hugging Face's gpt2-large weights fine-tuned on the whole WritingPrompts training set (https://www.kaggle.com/ratthachat/writing-prompts), which was preprocessed here: https://www.kaggle.com/ratthachat/writingprompts-combine-one-line-data-for-gpt2

    The perplexity of gpt2-large on this dataset is 14.8, compared to 16.8 for gpt2-medium (see the reference kernel below), which was fine-tuned only on the small dataset.

    Fine-tuning gpt2-large on the small validation dataset instead gives a perplexity of 16.2 (tested locally).

    Additional details

    Training ran on a V100 machine for 20 hours; using fp16 training should roughly halve that time.

    How to use:

    See the kernel https://www.kaggle.com/ratthachat/writingprompts-gpt2-lm-fine-tune

    Skip the training step and change gpt2-medium to gpt2-large. If you cannot load the weights normally, you can try:

    from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
    import torch

    # Rebuild the model from the saved config, then load the fine-tuned weights.
    config = GPT2Config.from_json_file('[path]/config.json')
    model = GPT2LMHeadModel(config)
    model.load_state_dict(torch.load('[path]/pytorch_model.bin'))
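
    Once the weights are loaded, a minimal generation sketch (the prompt format and sampling values are assumptions; the kernel linked above shows the exact usage):

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    inputs = tokenizer('[WP] A story prompt', return_tensors='pt')
    # Sample a continuation from the fine-tuned model.
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))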
    
  15. TiMaGPT2-2012

    • huggingface.co
    Updated Dec 31, 2012
    Cite
    TimeMachine (2012). TiMaGPT2-2012 [Dataset]. https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2012
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2012
    Dataset authored and provided by
    TimeMachine
    License

    Other (https://choosealicense.com/licenses/other/)

    Description

    The following dataset is constructed using entirely historical data up to the cutoff date "31-12-2012". The data comes from the WMT News dataset (https://data.statmt.org/news-crawl/en/) and Wikipedia. This dataset is the training dataset for a GPT-2-small-based model, and is available on Huggingface at the following location: "TiMa/TiMaGPT2-2012". The dataset uses the same license as the WMT News dataset (https://data.statmt.org/news-crawl/README) as this is the less permissive license of the… See the full description on the dataset page: https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2012.
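
    A minimal loading sketch with the Hugging Face datasets library (the repository id follows the dataset URL above; the split name is an assumption):

    from datasets import load_dataset

    # Load the 2012-cutoff training corpus.
    ds = load_dataset("Ti-Ma/TiMaGPT2-2012", split="train")
    print(ds)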

  16. fineweb-10b-gpt2

    • huggingface.co
    Cite
    Vlad Cociorva, fineweb-10b-gpt2 [Dataset]. https://huggingface.co/datasets/skyehigh/fineweb-10b-gpt2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Vlad Cociorva
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The sample-10BT version of the FineWeb dataset, tokenized with the gpt2 tokenizer and split into 100M-token binary shards. A shard is simply a 1D stream of np.uint16 values (the tokenized samples from the dataset, stored contiguously). Each sample was prefixed with the <|endoftext|> special token before tokenization. There are 103 training shards (under the train/ directory) and 1 validation shard (under val/).
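
    A minimal sketch for reading one shard (the shard file name and extension are assumptions):

    import numpy as np

    # Each shard is a raw, contiguous stream of uint16 token ids, so it can be read directly.
    tokens = np.fromfile('train/shard_000.bin', dtype=np.uint16)
    print(len(tokens), tokens[:10])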

  17. Open Poetry Vision Object Detection Dataset - 512x512

    • public.roboflow.com
    zip
    Updated Apr 7, 2022
    Cite
    Brad Dwyer (2022). Open Poetry Vision Object Detection Dataset - 512x512 [Dataset]. https://public.roboflow.com/object-detection/open-poetry-vision/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 7, 2022
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of text
    Description

    Overview

    The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.

    It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.

    Example image: https://i.imgur.com/sZT516a.png

    Use Cases

    A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.

    Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, has had success with this task.

    Using this Dataset

    Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Version 5 of this dataset (classes_all_text-raw-images) has all classes remapped to be labeled as "text." This was accomplished by using Modify Classes as a preprocessing step.

    Version 6 of this dataset (classes_all_text-augmented-FAST) has all classes remapped to be labeled as "text." and was trained with Roboflow's Fast Model.

    Version 7 of this dataset (classes_all_text-augmented-ACCURATE) has all classes remapped to be labeled as "text." and was trained with Roboflow's Accurate Model.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  18. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Explore at:
    Available download formats: zip (172818297 bytes)
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the datamix created by our team during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM generated text datasets
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data (see the sketch after this list). Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
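
    A hedged sketch of the kind of sampling configurations listed above, using the Hugging Face generate API (the base model and parameter values are illustrative, not the competition settings):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tokenizer("Write an essay about phones in school.", return_tensors="pt")

    # High temperature and a large top-k promote diverse, less predictable text.
    sampled = model.generate(**inputs, do_sample=True, temperature=1.4, top_k=500, max_new_tokens=200)

    # Contrastive search in transformers is enabled via penalty_alpha plus a small top_k.
    contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=200)

    print(tokenizer.decode(sampled[0], skip_special_tokens=True))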

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonym
    • Introduce obfuscations
    • Back translation
    • Random capitalization
    • Swap sentence

  19. TiMaGPT2-2014

    • huggingface.co
    Updated Dec 31, 2014
    Cite
    TimeMachine (2014). TiMaGPT2-2014 [Dataset]. https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2014
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2014
    Dataset authored and provided by
    TimeMachine
    License

    Other (https://choosealicense.com/licenses/other/)

    Description

    The following dataset is constructed using entirely historical data up to the cutoff date "31-12-2014". The data comes from the WMT News dataset (https://data.statmt.org/news-crawl/en/) and Wikipedia. This dataset is the training dataset for a GPT-2-small-based model, and is available on Huggingface at the following location: "TiMa/TiMaGPT2-2014". The dataset uses the same license as the WMT News dataset (https://data.statmt.org/news-crawl/README) as this is the less permissive license of the… See the full description on the dataset page: https://huggingface.co/datasets/Ti-Ma/TiMaGPT2-2014.

  20. Continue training GPT2 1 Ep

    • kaggle.com
    zip
    Updated Dec 22, 2021
    Cite
    Jamie Wang (2021). Continue training GPT2 1 Ep [Dataset]. https://www.kaggle.com/jiaminggogogo/continue-training-gpt2-1-ep
    Explore at:
    Available download formats: zip (10037607993 bytes)
    Dataset updated
    Dec 22, 2021
    Authors
    Jamie Wang
    Description

    Dataset

    This dataset was created by Jamie Wang

    Contents
