This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparing the corpus.
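A minimal sketch of parsing a file in this layout (the filename is hypothetical):

```python
def read_cc100(path):
    """Yield documents as lists of paragraphs from a CC100-style file."""
    with open(path, encoding="utf-8") as f:
        doc = []
        for line in f:
            line = line.rstrip("\n")
            if line:
                doc.append(line)
            elif doc:  # a blank line closes the current document
                yield doc
                doc = []
        if doc:  # the last document may lack a trailing blank line
            yield doc

for paragraphs in read_cc100("cc100.en.txt"):
    print(len(paragraphs), "paragraphs in the first document")
    break
```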
Latin part of cc100 corpus
This dataset contains parts of the Latin part of the cc100 dataset. It was used to train a RoBERTa-based language model with Hugging Face.
Preprocessing
I undertook the following preprocessing steps:
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Use of CLTK for sentence splitting and normalisation.
- Retention of only those lines containing letters of the Latin alphabet, numerals, and certain punctuation (via grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!-…

See the full description on the dataset page: https://huggingface.co/datasets/pstroe/cc100-latin.
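A minimal Python sketch of the line-filtering step, approximating the truncated character class above; the CLTK sentence-splitting step is omitted here:

```python
import re

# Approximation of the card's (truncated) grep character class; the exact
# set of allowed characters is abbreviated on the dataset page.
KEEP = re.compile(r"^[A-Za-z0-9ÄÖÜäöüÆæŒœᵫĀāŌōū.,;:?! \-]+$")

def filter_latin_lines(lines):
    """Drop pseudo-Latin filler and lines with disallowed characters."""
    for line in lines:
        if "lorem ipsum" in line.lower():  # remove "Lorem ipsum ..." text
            continue
        if KEEP.match(line.strip()):
            yield line
```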
toramaru-u/cc100-ja dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The CC100 dataset comprises monolingual data for 100+ languages and also includes data for romanized languages.
It was constructed using the URLs and paragraph indices provided by the CC-Net repository,
by processing January-December 2018 Common Crawl snapshots.
Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline.
The data is generated using the open-source CC-Net repository.
This dataset loader implements streaming to iterate over the CC100 dataset.
It applies strict filtering criteria to remove short, noisy, or repetitive sentences
and keeps the language proportions similar to those used for XLM-R pre-training.
The filtered CC100 dataset is ~7 GB.
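A minimal sketch of streaming iteration with a filter, following the load_dataset usage pattern shown on the statmt/cc100 card; the thresholds below are illustrative stand-ins, not this loader's exact filtering criteria:

```python
from datasets import load_dataset

# Stream the English portion of CC100 rather than downloading it in full.
stream = load_dataset("statmt/cc100", lang="en", split="train", streaming=True)

def looks_clean(example, min_chars=32):
    """Illustrative filter for short or highly repetitive sentences."""
    text = example["text"].strip()
    return len(text) >= min_chars and len(set(text)) > 8

for example in filter(looks_clean, stream):
    print(example["text"][:80])
    break
```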
Unknown license: https://choosealicense.com/licenses/unknown/
The cc100-samples dataset is a subset containing the first 10,000 lines of cc100.
Languages
To load a language that isn't part of the named configs, all you need to do is specify the language code in the config. You can find the valid languages in the Homepage section of the Dataset Description: https://data.statmt.org/cc-100/. E.g. dataset = load_dataset("cc100-samples", lang="en"). VALID_CODES = [ "am", "ar", "as", "az", "be", "bg", "bn", "bn_rom", "br", "bs", "ca", "cs", "cy", "da", "de", "el"… See the full description on the dataset page: https://huggingface.co/datasets/xu-song/cc100-samples.
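A runnable version of the example above, assuming the full repository id on the Hub:

```python
from datasets import load_dataset

# The repo id "xu-song/cc100-samples" is taken from the dataset page above.
dataset = load_dataset("xu-song/cc100-samples", lang="en", split="train")
print(dataset[0]["text"])
```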
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
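A minimal sketch of reading that one-sentence-per-line, empty-line-as-document-boundary format (the filename is hypothetical):

```python
from itertools import groupby

def documents(path):
    """Yield documents as lists of sentences."""
    with open(path, encoding="utf-8") as f:
        lines = (line.strip() for line in f)
        # bool("") is False, so empty lines separate the groups.
        for is_text, group in groupby(lines, key=bool):
            if is_text:
                yield list(group)

for sentences in documents("bertic-data.txt"):
    print(len(sentences), "sentences in the first document")
    break
```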
Please visit the GitHub repository for other Myanmar language datasets.
Myanmar CC100 Dataset
A preprocessed subset of the CC100 dataset containing only Myanmar language text, with consistent Unicode encoding.
Dataset Description
This dataset is derived from statmt/cc100, created by "Statistical and Neural Machine Translation". It contains only the Myanmar-language portion of the original CC100 dataset, with additional preprocessing to standardize the text encoding.… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-cc100-dataset.
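The card does not spell out the exact preprocessing; a hedged sketch of one common standardization step (Unicode NFC normalization) might look like the following. Myanmar text often also needs Zawgyi-to-Unicode conversion, which is omitted here:

```python
import unicodedata

def standardize(text):
    """Normalize to NFC so equivalent Myanmar character sequences compare equal."""
    return unicodedata.normalize("NFC", text)

print(standardize("\u1019\u103C\u1014\u103A\u1019\u102C"))  # မြန်မာ ("Myanmar")
```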
CC-100 zh-Hant (Traditional Chinese)
From https://data.statmt.org/cc-100/, only zh-Hant - Chinese (Traditional). Broken into paragraphs, with each paragraph as a row. Estimated to contain around 4B tokens when tokenized with the bigscience/bloom tokenizer. There's another version in which the text is split by lines instead of paragraphs: zetavg/CC-100-zh-Hant.
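A sketch of how such an estimate can be reproduced for a single paragraph (scaling a sampled count up to the whole corpus is left to the reader):

```python
from transformers import AutoTokenizer

# Count tokens for one example paragraph with the tokenizer named above.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
paragraph = "這是一個繁體中文的段落。"
print(len(tokenizer(paragraph)["input_ids"]))
```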
References
Please cite the following if you found the resources in the CC-100 corpus useful.
Unsupervised… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/CC-100-zh-Hant-merged.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
actuallysatya/odiallama-cc100-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
bowphs/cc-100-01-percent dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases/sentences/paragraphs) extracted by combining the datasets: OSCAR , cc100 and a set of scraped Nepali articles on Wikipedia.
neody/cc100-ja-cleaned-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
AlienKevin/cc100-yue-tagged dataset hosted on Hugging Face and contributed by the HF Datasets community
realtmxi/CC100-sinhala dataset hosted on Hugging Face and contributed by the HF Datasets community
lcw99/cc100-ko-only-1-of-5 dataset hosted on Hugging Face and contributed by the HF Datasets community
alamin05/cc100-hausa dataset hosted on Hugging Face and contributed by the HF Datasets community
richard-park/cc100-ko-390M-uncleaned dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tash-huggingface/cc100-ja dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
zerostratos/vi-cc100-parquet-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community