This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparing the corpus.
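A minimal sketch of parsing a file in this layout (the filename is hypothetical):

```python
def read_cc100(path):
    """Yield documents as lists of paragraphs from a CC100-style file."""
    with open(path, encoding="utf-8") as f:
        doc = []
        for line in f:
            line = line.rstrip("\n")
            if line:
                doc.append(line)
            elif doc:  # a blank line closes the current document
                yield doc
                doc = []
        if doc:  # the last document may lack a trailing blank line
            yield doc

for paragraphs in read_cc100("cc100.en.txt"):
    print(len(paragraphs), "paragraphs in the first document")
    break
```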
Latin part of cc100 corpus
This dataset contains parts of the Latin part of the cc100 dataset. It was used to train a RoBERTa-based language model with Hugging Face.
Preprocessing
I undertook the following preprocessing steps:
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Use of CLTK for sentence splitting and normalisation.
- Retention of only those lines containing letters of the Latin alphabet, numerals, and certain punctuation (via grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!-…

See the full description on the dataset page: https://huggingface.co/datasets/pstroe/cc100-latin.
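A minimal Python sketch of the line-filtering step, approximating the truncated character class above; the CLTK sentence-splitting step is omitted here:

```python
import re

# Approximation of the card's (truncated) grep character class; the exact
# set of allowed characters is abbreviated on the dataset page.
KEEP = re.compile(r"^[A-Za-z0-9ÄÖÜäöüÆæŒœᵫĀāŌōū.,;:?! \-]+$")

def filter_latin_lines(lines):
    """Drop pseudo-Latin filler and lines with disallowed characters."""
    for line in lines:
        if "lorem ipsum" in line.lower():  # remove "Lorem ipsum ..." text
            continue
        if KEEP.match(line.strip()):
            yield line
```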
toramaru-u/cc100-ja dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The CC100 dataset comprises monolingual data for 100+ languages and also includes data for romanized languages.
It was constructed using the URLs and paragraph indices provided by the CC-Net repository,
by processing January-December 2018 Common Crawl snapshots.
Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline.
The data is generated using the open-source CC-Net repository.
This dataset loader implements streaming to iterate over the CC100 dataset.
It applies strict filtering criteria to remove short, noisy, or repetitive sentences
and keeps the language proportions similar to those used for XLM-R pre-training.
The filtered CC100 dataset is ~7 GB.
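A minimal sketch of streaming iteration with a filter, following the load_dataset usage pattern shown on the statmt/cc100 card; the thresholds below are illustrative stand-ins, not this loader's exact filtering criteria:

```python
from datasets import load_dataset

# Stream the English portion of CC100 rather than downloading it in full.
stream = load_dataset("statmt/cc100", lang="en", split="train", streaming=True)

def looks_clean(example, min_chars=32):
    """Illustrative filter for short or highly repetitive sentences."""
    text = example["text"].strip()
    return len(text) >= min_chars and len(set(text)) > 8

for example in filter(looks_clean, stream):
    print(example["text"][:80])
    break
```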
Unknown license: https://choosealicense.com/licenses/unknown/
The cc100-samples dataset is a subset containing the first 10,000 lines of cc100.
Languages
To load a language that isn't part of the named configs, all you need to do is specify the language code in the config. You can find the valid languages in the Homepage section of the Dataset Description: https://data.statmt.org/cc-100/. E.g. dataset = load_dataset("cc100-samples", lang="en"). VALID_CODES = [ "am", "ar", "as", "az", "be", "bg", "bn", "bn_rom", "br", "bs", "ca", "cs", "cy", "da", "de", "el"… See the full description on the dataset page: https://huggingface.co/datasets/xu-song/cc100-samples.
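A runnable version of the example above, assuming the full repository id on the Hub:

```python
from datasets import load_dataset

# The repo id "xu-song/cc100-samples" is taken from the dataset page above.
dataset = load_dataset("xu-song/cc100-samples", lang="en", split="train")
print(dataset[0]["text"])
```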
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
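A minimal sketch of reading that one-sentence-per-line, empty-line-as-document-boundary format (the filename is hypothetical):

```python
from itertools import groupby

def documents(path):
    """Yield documents as lists of sentences."""
    with open(path, encoding="utf-8") as f:
        lines = (line.strip() for line in f)
        # bool("") is False, so empty lines separate the groups.
        for is_text, group in groupby(lines, key=bool):
            if is_text:
                yield list(group)

for sentences in documents("bertic-data.txt"):
    print(len(sentences), "sentences in the first document")
    break
```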
Please visit the GitHub repository for other Myanmar language datasets.
Myanmar CC100 Dataset
A preprocessed subset of the CC100 dataset containing only Myanmar language text, with consistent Unicode encoding.
Dataset Description
This dataset is derived from statmt/cc100, created by "Statistical and Neural Machine Translation". It contains only the Myanmar-language portion of the original CC100 dataset, with additional preprocessing to standardize the text encoding.… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-cc100-dataset.
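The card does not spell out the exact preprocessing; a hedged sketch of one common standardization step (Unicode NFC normalization) might look like the following. Myanmar text often also needs Zawgyi-to-Unicode conversion, which is omitted here:

```python
import unicodedata

def standardize(text):
    """Normalize to NFC so equivalent Myanmar character sequences compare equal."""
    return unicodedata.normalize("NFC", text)

print(standardize("\u1019\u103C\u1014\u103A\u1019\u102C"))  # မြန်မာ ("Myanmar")
```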
CC-100 zh-Hant (Traditional Chinese)
From https://data.statmt.org/cc-100/, only zh-Hant - Chinese (Traditional). Broken into paragraphs, with each paragraph as a row. Estimated to contain around 4B tokens when tokenized with the bigscience/bloom tokenizer. There's another version in which the text is split by lines instead of paragraphs: zetavg/CC-100-zh-Hant.
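A sketch of how such an estimate can be reproduced for a single paragraph (scaling a sampled count up to the whole corpus is left to the reader):

```python
from transformers import AutoTokenizer

# Count tokens for one example paragraph with the tokenizer named above.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
paragraph = "這是一個繁體中文的段落。"
print(len(tokenizer(paragraph)["input_ids"]))
```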
References
Please cite the following if you found the resources in the CC-100 corpus useful.
Unsupervised… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/CC-100-zh-Hant-merged.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
actuallysatya/odiallama-cc100-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
bowphs/cc-100-01-percent dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases/sentences/paragraphs) extracted by combining the datasets: OSCAR , cc100 and a set of scraped Nepali articles on Wikipedia.
neody/cc100-ja-cleaned-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
AlienKevin/cc100-yue-tagged dataset hosted on Hugging Face and contributed by the HF Datasets community
realtmxi/CC100-sinhala dataset hosted on Hugging Face and contributed by the HF Datasets community
lcw99/cc100-ko-only-1-of-5 dataset hosted on Hugging Face and contributed by the HF Datasets community
alamin05/cc100-hausa dataset hosted on Hugging Face and contributed by the HF Datasets community
richard-park/cc100-ko-390M-uncleaned dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tash-huggingface/cc100-ja dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
zerostratos/vi-cc100-parquet-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community