8 datasets found
  1. openwebtext2

    • huggingface.co
    Updated Mar 31, 2025
    Cite
    Ziyin Zhang (2025). openwebtext2 [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/openwebtext2
    Dataset updated
    Mar 31, 2025
    Authors
    Ziyin Zhang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    A cleaned version of OpenWebText2, produced by removing non-English, duplicated, copyrighted, and low-quality (too short, too many special characters, etc.) samples. This dataset has also been decontaminated with respect to the following benchmarks based on n-gram overlap:

    GLUE (dev set of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag (all dev set); CoNLL 2003; BLiMP; MAIN; BoolQ (dev set); WinoGrande (dev set); ANLI (test set); ARC easy and challenge (test set)… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/openwebtext2.
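The description says only that decontamination is "based on n-gram overlap"; it does not give the n-gram size, the tokenizer, or the removal threshold. As a rough illustration of the general idea, the sketch below drops any sample that shares at least one n-gram with a benchmark text. The whitespace tokenization, the lowercasing, the "one shared n-gram is enough" rule, and all function names are assumptions made for this example, not the dataset author's method.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    # For texts shorter than n tokens the range is empty, so the set is empty.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs in any benchmark text."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(samples: Iterable[str], index: Set[NGram], n: int) -> List[str]:
    """Keep only samples that share no n-gram with the benchmark index (assumed rule)."""
    return [s for s in samples if ngrams(s, n).isdisjoint(index)]


# Hypothetical data; n=3 is chosen only to keep the example small.
benchmark = ["the quick brown fox jumps"]
corpus = [
    "A story: the quick brown fox ran away",
    "nothing in common with the benchmark here",
]
clean = decontaminate(corpus, build_benchmark_index(benchmark, n=3), n=3)
```

In practice a much larger n is typically used so that common short phrases do not trigger removal; the value here is purely illustrative.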

  2. openwebtext2-first-30-chunks-ablation-bilingual

    • huggingface.co
    Updated Feb 8, 2024
    + more versions
    Cite
    Raimundo Becerra Parra (2024). openwebtext2-first-30-chunks-ablation-bilingual [Dataset]. https://huggingface.co/datasets/RaiBP/openwebtext2-first-30-chunks-ablation-bilingual
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2024
    Authors
    Raimundo Becerra Parra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    RaiBP/openwebtext2-first-30-chunks-ablation-bilingual dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. dsir-pile-13m-filtered-for-openwebtext2

    • huggingface.co
    + more versions
    Cite
    Timaeus, dsir-pile-13m-filtered-for-openwebtext2 [Dataset]. https://huggingface.co/datasets/timaeus/dsir-pile-13m-filtered-for-openwebtext2
    Explore at: Croissant
    Dataset authored and provided by
    Timaeus
    Description

    timaeus/dsir-pile-13m-filtered-for-openwebtext2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. dsir-pile-100k-filtered-for-OpenWebText2

    • huggingface.co
    Updated May 6, 2016
    Cite
    Timaeus (2016). dsir-pile-100k-filtered-for-OpenWebText2 [Dataset]. https://huggingface.co/datasets/timaeus/dsir-pile-100k-filtered-for-OpenWebText2
    Explore at: Croissant
    Dataset updated
    May 6, 2016
    Dataset authored and provided by
    Timaeus
    Description

    timaeus/dsir-pile-100k-filtered-for-OpenWebText2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. RaiBP-openwebtext2-first-30-chunks-lang-detect-raw-output

    • huggingface.co
    Updated Mar 31, 2025
    Cite
    french-datasets (2025). RaiBP-openwebtext2-first-30-chunks-lang-detect-raw-output [Dataset]. https://huggingface.co/datasets/french-datasets/RaiBP-openwebtext2-first-30-chunks-lang-detect-raw-output
    Dataset updated
    Mar 31, 2025
    Dataset authored and provided by
    french-datasets
    Description

    This repository is empty; it was created to improve the search indexing of the dataset https://huggingface.co/datasets/RaiBP/openwebtext2-first-30-chunks-lang-detect-raw-output.

  6. openwebtext

    • huggingface.co
    • opendatalab.com
    • +3 more
    Updated Jul 17, 2023
    + more versions
    Cite
    Aaron Gokaslan (2023). openwebtext [Dataset]. https://huggingface.co/datasets/Skylion007/openwebtext
    Dataset updated
    Jul 17, 2023
    Authors
    Aaron Gokaslan
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    An open-source replication of the WebText dataset from OpenAI.

  7. openwebmath

    • huggingface.co
    Cite
    Ziyin Zhang, openwebmath [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/openwebmath
    Explore at: Croissant
    Authors
    Ziyin Zhang
    Description

    A cleaned, deduplicated, and decontaminated version of OpenWebMath.

    Non-English and low-quality documents are removed, and the dataset is cross-deduplicated with OpenWebText2 and CC-News.

    This dataset has been decontaminated with respect to the following benchmarks based on n-gram overlap:

    GLUE (dev set of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag (all dev set); CoNLL 2003; BLiMP; MAIN; BoolQ (dev set); WinoGrande (dev set); ANLI (test set); ARC easy and… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/openwebmath.

  8. finemath

    • huggingface.co
    Updated Feb 8, 2014
    Cite
    Ziyin Zhang (2014). finemath [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/finemath
    Explore at: Croissant
    Dataset updated
    Feb 8, 2014
    Authors
    Ziyin Zhang
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    A cleaned, deduplicated, and decontaminated version of FineMath-3+. All non-English content is removed.

    Deduplication

    More than one million documents are removed during cross-deduplication with OpenWebText2, CC-News, and OpenWebMath.
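The description does not say how documents were matched during cross-deduplication (exact match, near-duplicate hashing, or something else). The sketch below shows the simplest variant under stated assumptions: a document is dropped if its lowercased, whitespace-normalized text has the same SHA-256 digest as any document in a reference corpus. The function names and the normalization are hypothetical choices for this example.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(text: str) -> str:
    """Hash the lowercased, whitespace-normalized text (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cross_deduplicate(
    target: Iterable[str],
    reference_corpora: Iterable[Iterable[str]],
) -> List[str]:
    """Drop target documents that exactly match any reference document."""
    seen: Set[str] = set()
    for corpus in reference_corpora:
        for doc in corpus:
            seen.add(fingerprint(doc))

    kept: List[str] = []
    for doc in target:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)  # also removes repeats inside the target itself
            kept.append(doc)
    return kept


# Hypothetical stand-ins for the reference corpora and the target corpus.
references = [["Hello   World"], ["Another doc"]]
target_docs = ["hello world", "Unique doc", "unique  doc"]
deduplicated = cross_deduplicate(target_docs, references)
```

Exact hashing only catches documents that are identical after normalization; near-duplicate methods such as MinHash would catch more, at higher cost.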

    Decontamination

    4.8K documents are removed during decontamination with respect to the following benchmarks based on n-gram overlap:

    GLUE (dev set of SST-2, CoLA, QQP, WNLI, RTE, QNLI, MNLI; test set of MRPC); SIQA, PIQA, QASC, CSQA, HellaSwag… See the full description on the dataset page: https://huggingface.co/datasets/Geralt-Targaryen/finemath.

Cite
Ziyin Zhang (2025). openwebtext2 [Dataset]. https://huggingface.co/datasets/Geralt-Targaryen/openwebtext2

openwebtext2

Geralt-Targaryen/openwebtext2

205 scholarly articles cite this dataset