An open-source replication of the WebText dataset from OpenAI.
This is a small subset representing the first 10K records from the original dataset - created for testing.
The full 8M-record dataset is at https://huggingface.co/datasets/openwebtext
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An open-source replication of the WebText dataset from OpenAI. For more info please visit https://skylion007.github.io/OpenWebTextCorpus/
@misc{Gokaslan2019OpenWeb,
  title={OpenWebText Corpus},
  author={Aaron Gokaslan and Vanya Cohen},
  howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
  year={2019}
}
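As a quick sanity check, the corpus can be inspected with the Hugging Face datasets library. A minimal sketch, assuming the Skylion007/openwebtext Hub id, a single train split, and a "text" field (none of which are stated in the card beyond the URL above):

    from datasets import load_dataset

    # Stream rather than download: the full corpus is ~8M records.
    # Repo id, split name, and the "text" field are assumptions.
    ds = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

    for i, record in enumerate(ds):
        print(record["text"][:80])  # first 80 characters of each document
        if i == 2:
            break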
Dataset Card for "openwebtext-100k"
More information needed.
olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
No license specified: https://academictorrents.com/
Code by eukaryote31 and Joshua Peterson; scraped by Aaron Gokaslan and Vanya Cohen; tokenized by eukaryote31.
chanind/openwebtext-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
OpenWebTextCorpus tokenized for Gemma 2 with a context size of 128
This dataset is a pre-tokenized version of the Skylion007/openwebtext dataset using the gemma tokenizer. As such, this dataset follows the same licensing as the original openwebtext dataset. This pre-tokenization is done as a performance optimization for using the openwebtext dataset with a Gemma model (gemma-2b, gemma-2b-it, gemma-7b, gemma-7b-it). This dataset was created using SAELens, with the following settings:… See the full description on the dataset page: https://huggingface.co/datasets/Marlon154/openwebtext-gemma-2-context-128.
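The card elides the exact SAELens settings, but the general pre-tokenization step can be sketched with plain transformers. A minimal sketch, assuming the google/gemma-2b tokenizer id and the Skylion007/openwebtext source (the card names neither id explicitly); this is not the author's SAELens configuration:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    CONTEXT = 128  # context size taken from the dataset title

    # Tokenizer id is an assumption; the card only says "the gemma tokenizer".
    tok = AutoTokenizer.from_pretrained("google/gemma-2b")
    ds = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

    def fixed_length_chunks(stream):
        buf = []
        for rec in stream:
            buf.extend(tok(rec["text"])["input_ids"])
            while len(buf) >= CONTEXT:
                yield buf[:CONTEXT]  # one ready-to-train example
                buf = buf[CONTEXT:]

    print(len(next(fixed_length_chunks(ds))))  # -> 128

Concatenating documents into a flat token buffer and slicing it into fixed-size windows is what makes the pre-tokenized dataset a performance win: no padding and no per-example tokenization at training time.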
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
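Concretely, next-word prediction can be demonstrated in a few lines with a small causal LM. A minimal sketch using GPT-2 (chosen only for illustration; the card does not name a model):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (batch, seq_len, vocab_size)

    # The last position's logits score every candidate next token.
    print(tok.decode(int(logits[0, -1].argmax())))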
Thijmen/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains 25 Wikitext-103 LSTM models and 25 LSTM models trained on a 100-million-token subset of the OpenWebTextCorpus. Training/validation/test data is included with the Web models. By-epoch validation perplexity is given in the logs within each model's directory. Please write to me if you have any questions :)
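For reading those logs: perplexity is the exponential of the mean cross-entropy loss (in nats), so a logged validation loss converts directly. A minimal sketch with an illustrative loss value:

    import math

    val_loss = 4.2  # illustrative value; read the real one from a model's log
    print(math.exp(val_loss))  # ~66.7 validation perplexity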
wangzn2001/openwebtext-10k dataset hosted on Hugging Face and contributed by the HF Datasets community
No license specified: https://academictorrents.com/
Every outbound Reddit link from before Dec 31, 2018 with at least 3 karma. The list is filtered to remove image sites, non-scraper-friendly sites, and other media files (a sketch of the filter appears below).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
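A minimal sketch of that link filter, with illustrative blocklists (the card does not publish the actual domain lists or file-type rules):

    from urllib.parse import urlparse

    BLOCKED_DOMAINS = {"imgur.com", "i.redd.it", "youtube.com"}  # illustrative
    MEDIA_EXTENSIONS = (".jpg", ".png", ".gif", ".mp4", ".pdf")  # illustrative

    def keep(url: str, karma: int, year: int) -> bool:
        if karma < 3 or year > 2018:  # at least 3 karma, posted before 2019
            return False
        parsed = urlparse(url)
        domain = parsed.netloc.removeprefix("www.")  # Python 3.9+
        if domain in BLOCKED_DOMAINS:  # image hosts and other media sites
            return False
        return not parsed.path.lower().endswith(MEDIA_EXTENSIONS)

    print(keep("https://example.com/story.html", karma=5, year=2017))  # True
    print(keep("https://imgur.com/a/cat.jpg", karma=9, year=2017))     # False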
The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL).
60k of them are human-written, randomly selected from the OpenWebText dataset. These entries are collected from user-generated content posted to Reddit before 2019.
60k of them are ChatGPT's (gpt-3.5-turbo) paragraph-by-paragraph rephrasings of the human-written data.
60k of them are PaLM's (Pathways Language Model, text-bison-001) paragraph-by-paragraph rephrasings of the human-written data.
60k of them are LLaMA-7B's (Large Language Model Meta AI) paragraph-by-paragraph rephrasings of the human-written data.
60k of them are data adapted from the GPT-2 output dataset released by OpenAI (GPT2-XL).
olivercareyncl/openwebtext.json dataset hosted on Hugging Face and contributed by the HF Datasets community
gartland/openwebtext-cc-196K dataset hosted on Hugging Face and contributed by the HF Datasets community
andythebreaker/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
laxmanvijay24/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
gartland/openwebtext-cc-98K dataset hosted on Hugging Face and contributed by the HF Datasets community
BigScience BLOOM RAIL 1.0: https://choosealicense.com/licenses/bigscience-bloom-rail-1.0/
The first 10K elements of The Pile, useful for debugging models trained on it. See the Hugging Face page for the full Pile for more info. Inspired by stas' great resource doing the same for OpenWebText.
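Such a debug subset loads in seconds. A minimal sketch, assuming the NeelNanda/pile-10k Hub id (the card text above does not name the repo):

    from datasets import load_dataset

    # Repo id is an assumption; adjust to the actual debug subset you use.
    pile_10k = load_dataset("NeelNanda/pile-10k", split="train")
    print(len(pile_10k))       # expect 10_000 records
    print(pile_10k[0].keys())  # inspect the available fields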
Juliushanhanhan/openwebtext-1b-llama3-tokenized-cxt-1024 dataset hosted on Hugging Face and contributed by the HF Datasets community