Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
A cleaned version of OpenWebText2, produced by removing non-English, duplicated, copyrighted, and low-quality (too short, too many special characters, etc.) samples. This dataset has also been decontaminated with respect to the following benchmarks based on n-gram overlap:
GLUE (dev sets of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag (all dev sets); CoNLL-2003; BLiMP; BoolQ (dev set); WinoGrande (dev set); ANLI (test set); ARC Easy and Challenge (test sets)… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/openwebtext2.
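The n-gram-overlap decontamination mentioned above can be sketched in a few lines. This is a minimal illustration, not the dataset authors' actual pipeline; the function names and the 13-gram window are assumptions for the example.

```python
def ngrams(text, n=13):
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document, benchmark_texts, n=13):
    """Flag a document if any of its n-grams also appears in a benchmark text."""
    doc_grams = ngrams(document, n)
    if not doc_grams:
        return False
    bench_grams = set()
    for text in benchmark_texts:
        bench_grams |= ngrams(text, n)
    return bool(doc_grams & bench_grams)
```

A real pipeline would normalize punctuation and stream benchmark n-grams from disk, but the core idea is the same: any shared n-gram above the chosen length marks the document for removal.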
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
RaiBP/openwebtext2-first-30-chunks-ablation-bilingual dataset hosted on Hugging Face and contributed by the HF Datasets community
timaeus/dsir-pile-13m-filtered-for-openwebtext2 dataset hosted on Hugging Face and contributed by the HF Datasets community
timaeus/dsir-pile-100k-filtered-for-OpenWebText2 dataset hosted on Hugging Face and contributed by the HF Datasets community
This repository is empty; it was created to improve the discoverability of the dataset https://huggingface.co/datasets/RaiBP/openwebtext2-first-30-chunks-lang-detect-raw-output.
CC0 1.0 License (https://choosealicense.com/licenses/cc0-1.0/)
An open-source replication of the WebText dataset from OpenAI.
A cleaned, deduplicated, and decontaminated version of OpenWebMath.
Non-English and low-quality documents are removed, and the corpus is cross-deduplicated with OpenWebText2 and CC-News.
This dataset has been decontaminated with respect to the following benchmarks based on n-gram overlap:
GLUE (dev sets of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag (all dev sets); CoNLL-2003; BLiMP; BoolQ (dev set); WinoGrande (dev set); ANLI (test set); ARC Easy and… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/openwebmath.
ODC-BY License (https://choosealicense.com/licenses/odc-by/)
A cleaned, deduplicated, and decontaminated version of FineMath-3+. All non-English content is removed.
Deduplication
More than one million documents are removed during cross-deduplication with OpenWebText2, CC-News, and OpenWebMath.
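Cross-deduplication of this kind is commonly implemented by hashing normalized document text and sharing the seen-hash set across corpora. The sketch below shows exact-duplicate removal only; it is an illustration, not the authors' pipeline, which may additionally use fuzzy methods such as MinHash.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents, seen_hashes=None):
    """Return documents whose normalized hash has not been seen before.

    Passing the same `seen_hashes` set when processing several corpora
    (e.g. OpenWebText2, CC-News, OpenWebMath) gives cross-corpus deduplication.
    """
    if seen_hashes is None:
        seen_hashes = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(doc)
    return kept
```

Reusing one `seen_hashes` set across calls is what makes duplicates of an earlier corpus disappear from a later one.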
Decontamination
4.8K documents are removed during decontamination with respect to the following benchmarks based on n-gram overlap:
GLUE (dev sets of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/finemath.