An open-source replication of the WebText dataset from OpenAI.
This is a small subset representing the first 10K records from the original dataset - created for testing.
The full 8M-record dataset is at https://huggingface.co/datasets/openwebtext
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for "openwebtext"
Dataset Summary
An open-source replication of the WebText dataset from OpenAI that was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
plain_text
Size of downloaded dataset files: 13.51 GB. Size of the… See the full description on the dataset page: https://huggingface.co/datasets/dylanebert/openwebtext.
olivercareyncl/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "openwebtext-100k"
More Information Needed
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This is the dataset processed using Andrej Karpathy's script: https://github.com/karpathy/nanoGPT/tree/master/data/openwebtext. The original dataset is https://huggingface.co/datasets/Skylion007/openwebtext, which now requires a datasets library version < 3 to download.
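A hedged sketch of downloading the original corpus before running the prepare script (the version pin follows from the note above; adjust it to your environment):
# Requires an older datasets library, e.g. installed with: pip install "datasets<3"
from datasets import load_dataset

ds = load_dataset("Skylion007/openwebtext", split="train", num_proc=8)
print(ds)  # a single "text" column, one document per row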
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A subset of OpenWebText, an open-source recreation of OpenAI's internal WebText corpus. This subset contains ~2 million documents, mainly in English, scraped from the web. The text is highly unstructured and not necessarily clean.
apollo-research/Skylion007-openwebtext-tokenizer-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The OpenWebText dataset, preprocessed and formatted as TFRecords for efficient use with TPUs. This format optimizes data loading and processing for large-scale language modeling tasks. Ideal for training transformer models on Google Colab with TPUs.
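A minimal sketch of consuming such TFRecords with tf.data (the file pattern and the single "text" feature are assumptions; check the actual record spec on the dataset page):
import tensorflow as tf

# Assumed layout: one serialized tf.train.Example per record with a "text" feature.
feature_spec = {"text": tf.io.FixedLenFeature([], tf.string)}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

files = tf.data.Dataset.list_files("openwebtext-*.tfrecord")
ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(64)
      .prefetch(tf.data.AUTOTUNE))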
This dataset was created by Florian 0627
This dataset was created by Himon Sarkar
I am Ismail Ben Alla, a computer science engineer with a keen interest in advancing natural language processing and deep learning. As part of a comprehensive project to develop GPT-2 from the ground up, I undertook the task of preprocessing a substantial portion of the OpenWebText dataset.
The dataset preprocessing involved tokenization with the GPT-2 tokenizer, resulting in two distinct sets.
For additional insights, updates, and access to the dataset, please refer to the dataset page.
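As a hedged illustration of the kind of GPT-2 tokenization described above (the sample text and end-of-text handling are illustrative assumptions, not the author's exact pipeline):
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize_document(text):
    # Append the end-of-text token so document boundaries survive concatenation.
    return tokenizer.encode(text) + [tokenizer.eos_token_id]

ids = tokenize_document("OpenWebText is an open replication of WebText.")
print(len(ids), ids[:10])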
chanind/openwebtext-gpt2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was made by slightly modifying the code from the GitHub repository linked below. It is part of the OpenWebText movement, in which people are trying to replicate the data that GPT-2 (a text-generation model once said to be too dangerous to publish) was trained on. This is only roughly a quarter of the entire data, but as Vanya Cohen stated in the Discussion section, the full dataset is available at https://skylion007.github.io/OpenWebTextCorpus/.
This dataset contains an id column identifying each text entry: ids go up to 2019434 in the large data.db and up to 743 in the smaller .db. Each id corresponds to a text column holding a chunk of text.
Thanks to eukaryote31 for the GitHub repo: https://github.com/eukaryote31/openwebtext
OpenAI's GPT-2 model is said to have still underfit the 40 GB WebText corpus, so I'm wondering how much data would be just right.
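A minimal sketch of reading a few rows with Python's sqlite3, assuming the id/text schema described above (the table name is not given, so it is looked up first):
import sqlite3

conn = sqlite3.connect("data.db")
# Discover the table name instead of guessing it.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)

# Assuming the table exposes id and text columns as described above.
table = tables[0][0]
for row_id, text in conn.execute(f"SELECT id, text FROM {table} LIMIT 3"):
    print(row_id, text[:80])
conn.close()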
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
A subset of the Skylion007/openwebtext dataset consisting of 1 million tokenized samples in the Lance file format for fast, memory-efficient I/O.
The files were tokenized using the gpt2 tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's input directory, which Kaggle Kernels do not provide, and the dataset is too large to move to /kaggle/working. To use this dataset, download it with the Kaggle API or through this page, then move the unzipped files into a folder called openwebtext_1M.lance. Detailed snippets on how to download and use the dataset are below.
First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/openwebtext-1m
$ mkdir openwebtext_1M.lance/
$ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
$ rm openwebtext-1m.zip
Once this is done, you will find the dataset in the openwebtext_1M.lance/ folder. To load it and get a feel for the data, run the snippet below.
import lance

# Open the Lance dataset in place; this does not load the data into memory.
dataset = lance.dataset('openwebtext_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
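Building on that, a hedged sketch for peeking at the contents (column names are not documented here, so the schema is inspected rather than assumed):
import lance

dataset = lance.dataset('openwebtext_1M.lance/')
# Show which columns the Lance dataset actually exposes.
print(dataset.schema)
# Fetch a handful of rows by index without scanning the whole dataset.
print(dataset.take([0, 1, 2, 3, 4]))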
OpenWebTextCorpus tokenized for Gemma
This dataset is a pre-tokenized version of the Skylion007/openwebtext dataset using the gemma tokenizer. As such, this dataset follows the same licensing as the original openwebtext dataset. This pre-tokenization is done as a performance optimization for using the openwebtext dataset with a Gemma model (gemma-2b, gemma-2b-it, gemma-7b, gemma-7b-it). This dataset was created using SAELens, with the following settings:
context_size: 8192… See the full description on the dataset page: https://huggingface.co/datasets/chanind/openwebtext-gemma.
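A hedged sketch for loading the pre-tokenized data with the datasets library (streaming and the split name are assumptions; column names are inspected rather than assumed):
from datasets import load_dataset

# Stream instead of downloading the full pre-tokenized corpus up front.
ds = load_dataset("chanind/openwebtext-gemma", split="train", streaming=True)
first = next(iter(ds))
# Check which columns (e.g. token ids) the dataset exposes before using them.
print(list(first.keys()))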
A copy of OpenWebText, an open-source replication of the WebText corpus that OpenAI used to train GPT-2. It is a large corpus suited to training GPT-scale language models.
This dataset was created by tuggypetu
wangzn2001/openwebtext-10k dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Thijmen/openwebtext dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "openwebtext-tokenized-9b"
More Information Needed