This dataset was created by Krish Baisoya
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The CC100 dataset comprises monolingual data for 100+ languages and also includes data for romanized languages.
It was constructed using the URLs and paragraph indices provided by the CC-Net repository,
by processing the January-December 2018 Common Crawl snapshots.
Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline.
The data is generated using the open-source CC-Net repository.
This dataset loader implements streaming to iterate over the CC100 dataset.
It applies strict filtering criteria to remove short, noisy, or repetitive sentences
and keeps the language proportions similar to those used for XLM-R pre-training.
The filtered CC100 dataset is ~25 GB.
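The file layout described above (documents separated by a blank line, paragraphs separated by a single newline) can be read one document at a time without loading the whole file. The following is a minimal sketch under that assumption only; the function name `iter_documents` and the sample text are hypothetical and are not part of the dataset loader itself.

```python
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its paragraph strings.

    Assumes the layout described above: one paragraph per line,
    with a blank line marking the end of a document.
    """
    paragraphs: List[str] = []
    for line in lines:
        text = line.rstrip("\n")
        if text.strip():
            # Non-empty line: one more paragraph of the current document.
            paragraphs.append(text)
        elif paragraphs:
            # Blank line: the current document is complete.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Hypothetical two-document sample in the layout described above.
sample = io.StringIO("First paragraph.\nSecond paragraph.\n\nAnother document.\n")
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, the same generator could be given an open file handle, so that only one document is held in memory at a time.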
richard-park/cc100-ko-390M-uncleaned dataset hosted on Hugging Face and contributed by the HF Datasets community
lcw99/cc100-ko-only-5-of-5 dataset hosted on Hugging Face and contributed by the HF Datasets community
CocoRoF/cc-100-korean-processing dataset hosted on Hugging Face and contributed by the HF Datasets community
CC-100-ko
Reference: (https://data.statmt.org/cc-100/)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The "NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases/sentences/paragraphs) extracted by combining the OSCAR and cc100 datasets with a set of scraped Nepali articles from Wikipedia.