8 datasets found
  1. CC100 English

    • kaggle.com
    Updated Nov 23, 2022
    Krish Baisoya (2022). CC100 English [Dataset]. https://www.kaggle.com/datasets/krishbaisoya/cc100-english/code
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Krish Baisoya
    Description

    Dataset

    This dataset was created by Krish Baisoya

    Contents

  2. filtered_cc100_25gb

    • huggingface.co
    Updated Mar 26, 2025
    Xingming Li (2025). filtered_cc100_25gb [Dataset]. https://huggingface.co/datasets/xmli/filtered_cc100_25gb
    Dataset updated
    Mar 26, 2025
    Authors
    Xingming Li
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description


    The CC100 dataset comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the open-source CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. This dataset loader implements streaming to iterate over the CC100 dataset. It applies strict filtering criteria to remove short, noisy, or repetitive sentences and keeps the language proportions similar to those used for XLM-R pre-training. The filtered CC100 dataset is ~25 GB.
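
The file layout described above (one paragraph per line, a blank line between documents) can be read lazily with a small generator. This is only a sketch under that stated layout; the function name `iter_documents` and the sample text are hypothetical and are not part of the dataset loader.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraph strings.

    Assumes one paragraph per line and a blank line between documents.
    """
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Hypothetical two-document sample in the described layout.
sample = "Doc one, para one.\nDoc one, para two.\n\nDoc two, para one.\n"
docs = list(iter_documents(sample.splitlines(keepends=True)))
```

Because the function accepts any iterable of lines, an open text file handle can be passed directly, so a large file is never held in memory all at once.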

  3. filtered_cc100_minimal

    • huggingface.co
    Updated Mar 23, 2025
    Xingming Li (2025). filtered_cc100_minimal [Dataset]. https://huggingface.co/datasets/xmli/filtered_cc100_minimal
    Dataset updated
    Mar 23, 2025
    Authors
    Xingming Li
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The CC100 dataset comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the open-source CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. This dataset loader implements streaming to iterate over the CC100 dataset. It applies strict filtering criteria to remove short, noisy, or repetitive sentences and keeps the language proportions similar to those used for XLM-R pre-training. The filtered CC100 dataset is ~25 GB.
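
The description does not say which filtering criteria are applied, so the following is only an illustration of what a short/noisy/repetitive check might look like. The function name `keep_sentence` and all three thresholds are assumptions made for this sketch, not the dataset author's actual rules.

```python
from collections import Counter


def keep_sentence(text: str,
                  min_words: int = 5,                # assumed threshold
                  min_alpha_ratio: float = 0.6,      # assumed threshold
                  max_top_word_share: float = 0.5    # assumed threshold
                  ) -> bool:
    """Return True if the sentence passes three illustrative checks."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short
    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / len(text) < min_alpha_ratio:
        return False  # noisy: mostly digits or symbols
    # Share of the sentence taken up by its most frequent word.
    top_count = Counter(w.lower() for w in words).most_common(1)[0][1]
    if top_count / len(words) > max_top_word_share:
        return False  # repetitive
    return True


candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "Too short",
    "buy buy buy buy buy now",
    "1234 5678 $$$ ### 9999 0000",
]
kept = [s for s in candidates if keep_sentence(s)]
```

Only the first candidate survives here; the others fail the length, repetition, and noise checks respectively. Real pipelines typically tune such thresholds per language, which this sketch does not attempt.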

  4. cc100-ko-390M-uncleaned

    • huggingface.co
    Updated Apr 4, 2023
    WOO HWAN PARK (2023). cc100-ko-390M-uncleaned [Dataset]. https://huggingface.co/datasets/richard-park/cc100-ko-390M-uncleaned
    Dataset updated
    Apr 4, 2023
    Authors
    WOO HWAN PARK
    Description

    richard-park/cc100-ko-390M-uncleaned dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. cc100-ko-only-5-of-5

    • huggingface.co
    Chang W Lee, cc100-ko-only-5-of-5 [Dataset]. https://huggingface.co/datasets/lcw99/cc100-ko-only-5-of-5
    Authors
    Chang W Lee
    Description

    lcw99/cc100-ko-only-5-of-5 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. cc-100-korean-processing

    • huggingface.co
    Updated Mar 18, 2025
    Haryeom (2025). cc-100-korean-processing [Dataset]. https://huggingface.co/datasets/CocoRoF/cc-100-korean-processing
    Dataset updated
    Mar 18, 2025
    Authors
    Haryeom
    Description

    CocoRoF/cc-100-korean-processing dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. cc-100-korean

    • huggingface.co
    Updated Mar 18, 2025
    Haryeom (2025). cc-100-korean [Dataset]. https://huggingface.co/datasets/CocoRoF/cc-100-korean
    Dataset updated
    Mar 18, 2025
    Authors
    Haryeom
    Description

    CC-100-ko

    Reference: (https://data.statmt.org/cc-100/)

  8. nepalitext-language-model-dataset

    • opendatalab.com
    • huggingface.co
    Updated Dec 28, 2023
    (2023). nepalitext-language-model-dataset [Dataset]. https://opendatalab.com/OpenDataLab/nepalitext-language-model-dataset
    Available download formats: zip
    Dataset updated
    Dec 28, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The "NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases, sentences, and paragraphs) extracted by combining the OSCAR and cc100 datasets with a set of scraped Nepali Wikipedia articles.


CC100 English

2 scholarly articles cite this dataset (View in Google Scholar)