8 datasets found

openwebtext2
huggingface.co
Updated Mar 31, 2025
Ziyin Zhang (2025). openwebtext2 [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/openwebtext2
Dataset updated
Mar 31, 2025
Authors
Ziyin Zhang
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
A cleaned version of OpenWebText2, produced by removing non-English, duplicated, copyrighted, and low-quality (too short, too many special characters, etc.) samples. This dataset has also been decontaminated with respect to the following benchmarks based on n-gram overlap:

GLUE (dev sets of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag (all dev sets); CoNLL 2003; BLiMP; MAIN; BoolQ (dev set); WinoGrande (dev set); ANLI (test set); ARC easy and challenge (test set)… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/openwebtext2.
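
The description above does not say how the n-gram overlap check is implemented. The following is a minimal Python sketch of one common approach, under stated assumptions: whitespace tokenization, lowercasing, and dropping any sample that shares at least one n-gram with a benchmark text. The function names and the choice of n are hypothetical and are not taken from the dataset page.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark text."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(samples: Iterable[str], benchmark_index: Set[NGram], n: int) -> List[str]:
    """Keep only the samples that share no n-gram with the benchmark index."""
    return [s for s in samples if ngrams(s, n).isdisjoint(benchmark_index)]


# Toy data; n=4 is an illustrative value, not the one used for this dataset.
benchmark = ["the quick brown fox jumps over the lazy dog"]
samples = [
    "I saw the quick brown fox jumps yesterday",
    "an unrelated sentence about training data",
]
index = build_benchmark_index(benchmark, n=4)
clean = decontaminate(samples, index, n=4)
```

In this toy run the first sample is dropped because it shares the 4-gram "the quick brown fox" with the benchmark text. A real pipeline would likely use a proper tokenizer and a larger n to limit false positives.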
openwebtext2-first-30-chunks-ablation-bilingual
huggingface.co
Updated Feb 8, 2024
Raimundo Becerra Parra (2024). openwebtext2-first-30-chunks-ablation-bilingual [Dataset]. https://huggingface.co/datasets/RaiBP/openwebtext2-first-30-chunks-ablation-bilingual
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 8, 2024
Authors
Raimundo Becerra Parra
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
RaiBP/openwebtext2-first-30-chunks-ablation-bilingual dataset hosted on Hugging Face and contributed by the HF Datasets community
dsir-pile-13m-filtered-for-openwebtext2
huggingface.co
Timaeus, dsir-pile-13m-filtered-for-openwebtext2 [Dataset]. https://huggingface.co/datasets/timaeus/dsir-pile-13m-filtered-for-openwebtext2
Dataset authored and provided by
Timaeus
Description
timaeus/dsir-pile-13m-filtered-for-openwebtext2 dataset hosted on Hugging Face and contributed by the HF Datasets community
dsir-pile-100k-filtered-for-OpenWebText2
huggingface.co
Updated May 6, 2016
Timaeus (2016). dsir-pile-100k-filtered-for-OpenWebText2 [Dataset]. https://huggingface.co/datasets/timaeus/dsir-pile-100k-filtered-for-OpenWebText2
Dataset updated
May 6, 2016
Dataset authored and provided by
Timaeus
Description
timaeus/dsir-pile-100k-filtered-for-OpenWebText2 dataset hosted on Hugging Face and contributed by the HF Datasets community
RaiBP-openwebtext2-first-30-chunks-lang-detect-raw-output
huggingface.co
Updated Mar 31, 2025
french-datasets (2025). RaiBP-openwebtext2-first-30-chunks-lang-detect-raw-output [Dataset]. https://huggingface.co/datasets/french-datasets/RaiBP-openwebtext2-first-30-chunks-lang-detect-raw-output
Dataset updated
Mar 31, 2025
Dataset authored and provided by
french-datasets
Description
This repository is empty; it was created to improve the search indexing of the dataset https://huggingface.co/datasets/RaiBP/openwebtext2-first-30-chunks-lang-detect-raw-output.
openwebtext
huggingface.co
opendatalab.com
+3 more
Updated Jul 17, 2023
Aaron Gokaslan (2023). openwebtext [Dataset]. https://huggingface.co/datasets/Skylion007/openwebtext
Dataset updated
Jul 17, 2023
Authors
Aaron Gokaslan
License
https://choosealicense.com/licenses/cc0-1.0/
Description
An open-source replication of the WebText dataset from OpenAI.
openwebmath
huggingface.co
Ziyin Zhang, openwebmath [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/openwebmath
Authors
Ziyin Zhang
Description
A cleaned, deduplicated, and decontaminated version of OpenWebMath.

Non-English and low-quality documents are removed, and the dataset is cross-deduplicated with OpenWebText2 and CC-News.

This dataset has been decontaminated with respect to the following benchmarks based on n-gram overlap:

GLUE (dev sets of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag (all dev sets); CoNLL 2003; BLiMP; MAIN; BoolQ (dev set); WinoGrande (dev set); ANLI (test set); ARC easy and… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/openwebmath.
finemath
huggingface.co
Updated Feb 8, 2014
Ziyin Zhang (2014). finemath [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/finemath
Dataset updated
Feb 8, 2014
Authors
Ziyin Zhang
License
https://choosealicense.com/licenses/odc-by/
Description
A cleaned, deduplicated, and decontaminated version of FineMath-3+. All non-English content is removed.

Deduplication

More than one million documents are removed during cross-deduplication with OpenWebText2, CC-News, and OpenWebMath.
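
The description does not state how cross-deduplication was performed. As a rough illustration only, the Python sketch below assumes exact-match deduplication on a hash of whitespace- and case-normalized text. The function names are hypothetical, and a production pipeline might instead use near-duplicate detection such as MinHash.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(text: str) -> str:
    """Hash a document after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cross_deduplicate(
    documents: Iterable[str],
    reference_corpora: Iterable[Iterable[str]],
) -> List[str]:
    """Drop documents already present in any reference corpus.

    Duplicates inside `documents` itself are dropped as well, keeping
    the first occurrence.
    """
    seen: Set[str] = set()
    for corpus in reference_corpora:
        for doc in corpus:
            seen.add(fingerprint(doc))

    kept: List[str] = []
    for doc in documents:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept


# Toy stand-ins for the reference corpora named in the description.
reference = [["a b c"], ["some news article"]]
docs = ["A  b c", "x y z", "X Y Z"]
deduplicated = cross_deduplicate(docs, reference)
```

Here "A  b c" is removed because it matches a reference document after normalization, and "X Y Z" is removed as a repeat of "x y z".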

Decontamination

4.8K documents are removed during decontamination with respect to the following benchmarks based on n-gram overlap:

GLUE (dev sets of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/finemath.
openwebtext2 (Geralt-Targaryen/openwebtext2): 205 scholarly articles cite this dataset.
