26 datasets found
  1. cc100

    • huggingface.co
    • opendatalab.com
    Cite
    SEACrowd, cc100 [Dataset]. https://huggingface.co/datasets/SEACrowd/cc100
    Explore at:
    Dataset authored and provided by
    SEACrowd
    Description

    This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
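
The file layout described above (a blank line between documents, one paragraph per line) can be read with a few lines of Python. This is a minimal sketch rather than the loader used by any of the listed datasets; the function name `split_documents` and the sample text are illustrative assumptions.

```python
def split_documents(raw: str) -> list[list[str]]:
    """Split CC-100-style text into documents, each a list of paragraphs.

    Assumes documents are separated by a blank line (double newline) and
    paragraphs within a document by a single newline.
    """
    documents = []
    for block in raw.split("\n\n"):
        # Drop empty lines left over from trailing newlines.
        paragraphs = [line for line in block.split("\n") if line.strip()]
        if paragraphs:
            documents.append(paragraphs)
    return documents


# Hypothetical sample text in the described layout.
sample = "Doc one, paragraph one.\nDoc one, paragraph two.\n\nDoc two, paragraph one.\n"
docs = split_documents(sample)
print(len(docs), [len(d) for d in docs])
```

For files too large to hold in memory, the same idea would need a line-by-line reader instead of a single `split` call.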

  2. cc100-latin

    • huggingface.co
    Updated Mar 2, 2022
    Cite
    Phillip Benjamin Ströbel (2022). cc100-latin [Dataset]. https://huggingface.co/datasets/pstroe/cc100-latin
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 2, 2022
    Authors
    Phillip Benjamin Ströbel
    Description

    Latin part of cc100 corpus

    This dataset contains parts of the Latin portion of the cc100 dataset. It was used to train a RoBERTa-based language model with Hugging Face.

    Preprocessing

    I undertook the following preprocessing steps:

    • Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
    • Use of CLTK for sentence splitting and normalisation.
    • Retaining only lines containing letters of the Latin alphabet, numerals, and certain punctuation (--> grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!-…

    See the full description on the dataset page: https://huggingface.co/datasets/pstroe/cc100-latin.
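
As a rough illustration of the first and third preprocessing steps, the sketch below drops "Lorem ipsum" lines and keeps only lines made up of an allowed character set. The grep pattern in the description is truncated, so the character class here is a hypothetical stand-in, not the author's pattern; the CLTK sentence-splitting step is not shown. Note that the range `A-z` in the quoted pattern also matches the ASCII characters between `Z` and `a` (such as `[` and `_`), so the sketch uses `A-Za-z` instead.

```python
import re

# Hypothetical character class: the original pattern is truncated in the
# description, so this is an illustrative assumption only.
ALLOWED_LINE = re.compile(r"^[A-Za-z0-9ĀāĒēĪīŌōŪūÆæŒœ .,;:?!-]+$")


def keep_line(line: str) -> bool:
    """Return True if the line passes both illustrative filters."""
    # Step 1: discard pseudo-Latin filler text.
    if "lorem ipsum" in line.lower():
        return False
    # Step 3: keep only lines consisting entirely of allowed characters.
    return ALLOWED_LINE.match(line) is not None


lines = [
    "Gallia est omnis divisa in partes tres.",
    "Lorem ipsum dolor sit amet",
    "Καλημέρα",
]
kept = [line for line in lines if keep_line(line)]
print(kept)
```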

  3. cc100-ja

    • huggingface.co
    Updated Jul 1, 2024
    Cite
    toramaru (2024). cc100-ja [Dataset]. https://huggingface.co/datasets/toramaru-u/cc100-ja
    Explore at:
    Dataset updated
    Jul 1, 2024
    Authors
    toramaru
    Description

    toramaru-u/cc100-ja dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. filtered_cc100_7gb

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Cite
    Xingming Li (2025). filtered_cc100_7gb [Dataset]. https://huggingface.co/datasets/xmli/filtered_cc100_7gb
    Explore at:
    Dataset updated
    Apr 26, 2025
    Authors
    Xingming Li
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


    The CC100 dataset comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. This dataset loader implements streaming to iterate over the CC100 dataset. It applies strict filtering criteria to remove short, noisy, or repetitive sentences and keeps the language proportions similar to the ones used for XLM-R pre-training. The filtered CC100 dataset is ~7 GB.
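
The description does not state the actual filtering criteria, so the following is only a sketch of what a streaming filter for short or repetitive lines might look like. The thresholds `MIN_CHARS` and `MAX_REPEAT_RATIO` and both function names are hypothetical, and the language-proportion balancing mentioned above is not covered.

```python
from typing import Iterable, Iterator

MIN_CHARS = 20          # hypothetical minimum length, not from the dataset card
MAX_REPEAT_RATIO = 0.5  # hypothetical cap on the share of the most frequent word


def is_clean(line: str) -> bool:
    """Reject lines that are too short or dominated by one repeated word."""
    text = line.strip()
    if len(text) < MIN_CHARS:
        return False
    words = text.split()
    most_common = max(words.count(word) for word in set(words))
    return most_common / len(words) <= MAX_REPEAT_RATIO


def stream_filtered(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield cleaned lines, so the input never has to fit in memory."""
    for line in lines:
        if is_clean(line):
            yield line.strip()


raw_lines = [
    "ok\n",
    "spam spam spam spam spam spam\n",
    "A reasonably long and varied sentence.\n",
]
print(list(stream_filtered(raw_lines)))
```

Because `stream_filtered` is a generator, it can be fed an open file handle or any other line iterator directly.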

  5. cc100-samples

    • huggingface.co
    Updated Mar 5, 2024
    Cite
    xu song (2024). cc100-samples [Dataset]. https://huggingface.co/datasets/xu-song/cc100-samples
    Explore at:
    Croissant
    Dataset updated
    Mar 5, 2024
    Authors
    xu song
    License

    https://choosealicense.com/licenses/unknown/

    Description

    cc100-samples is a subset containing the first 10,000 lines of cc100.

    Languages

    To load a language that isn't part of the config, specify the language code in the config. The valid languages are listed in the Homepage section of the Dataset Description: https://data.statmt.org/cc-100/

    E.g.

        dataset = load_dataset("cc100-samples", lang="en")
        VALID_CODES = [ "am", "ar", "as", "az", "be", "bg", "bn", "bn_rom", "br", "bs", "ca", "cs", "cy", "da", "de", "el"…

    See the full description on the dataset page: https://huggingface.co/datasets/xu-song/cc100-samples.

  6. Text collection for training the BERTić transformer model BERTić-data -...

    • b2find.eudat.eu
    Updated May 8, 2021
    + more versions
    Cite
    (2021). Text collection for training the BERTić transformer model BERTić-data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e765a108-6a8d-59a6-a5a3-d4eb70139f2e
    Explore at:
    Dataset updated
    May 8, 2021
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
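
The one-sentence-per-line, empty-line-as-document-boundary format described above can be read lazily by grouping consecutive non-empty lines. This is a minimal sketch under that assumption; the function name `iter_documents` and the sample sentences are illustrative, not taken from the collection.

```python
from itertools import groupby
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield each document as a list of sentences.

    Consecutive non-empty lines form one document; empty lines act as
    document boundaries and are discarded.
    """
    stripped = (line.strip() for line in lines)
    for is_text, group in groupby(stripped, key=bool):
        if is_text:
            yield list(group)


# Hypothetical sample in the described format.
sample_lines = ["Prva rečenica.\n", "Druga rečenica.\n", "\n", "Novi dokument.\n"]
for document in iter_documents(sample_lines):
    print(document)
```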

  7. myanmar-cc100-dataset

    • huggingface.co
    Updated Feb 24, 2018
    Cite
    Chuu Htet Naing (2018). myanmar-cc100-dataset [Dataset]. https://huggingface.co/datasets/chuuhtetnaing/myanmar-cc100-dataset
    Explore at:
    Dataset updated
    Feb 24, 2018
    Authors
    Chuu Htet Naing
    Area covered
    မြန်မာ
    Description

    Please visit the GitHub repository for other Myanmar language datasets.

    Myanmar CC100 Dataset

    A preprocessed subset of the CC100 dataset containing only Myanmar language text, with consistent Unicode encoding.

    Dataset Description

    This dataset is derived from statmt/cc100, created by "Statistical and Neural Machine Translation". It contains only the Myanmar language portion of the original CC100 dataset, with additional preprocessing to standardize text encoding.… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-cc100-dataset.

  8. CC-100-zh-Hant-merged

    • huggingface.co
    Updated Jul 1, 2018
    + more versions
    Cite
    Pokai Chang (2018). CC-100-zh-Hant-merged [Dataset]. https://huggingface.co/datasets/zetavg/CC-100-zh-Hant-merged
    Explore at:
    Croissant
    Dataset updated
    Jul 1, 2018
    Authors
    Pokai Chang
    Description

    CC-100 zh-Hant (Traditional Chinese)

    From https://data.statmt.org/cc-100/, only zh-Hant - Chinese (Traditional). Broken into paragraphs, with each paragraph as a row. Estimated to have around 4B tokens when tokenized with the bigscience/bloom tokenizer. Another version, in which the text is split by lines instead of paragraphs, is available: zetavg/CC-100-zh-Hant.

    References

    Please cite the following if you find the resources in the CC-100 corpus useful.

    Unsupervised… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/CC-100-zh-Hant-merged.

  9. Jan Japan Motors Cc 100 Richard Carte Rd Jacobs Durban Company profile with...

    • volza.com
    csv
    Updated Aug 29, 2025
    + more versions
    Cite
    Volza FZ LLC (2025). Jan Japan Motors Cc 100 Richard Carte Rd Jacobs Durban Company profile with phone,email, buyers, suppliers, price, export import shipments. [Dataset]. https://www.volza.com/company-profile/jan-japan-motors-cc-100-richard-carte-rd-jacobs-durban-43092218
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2014 - Sep 30, 2021
    Area covered
    Durban, Richard Carte Road
    Variables measured
    Count of exporters, Count of importers, Sum of export value, Sum of import value, Count of export shipments, Count of import shipments
    Description

    The credit report of Jan Japan Motors Cc 100 Richard Carte Rd Jacobs Durban contains unique and detailed export-import market intelligence, with its phone, email, LinkedIn, and details of each import and export shipment, such as product, quantity, price, buyer and supplier names, country, and date of shipment.

  10. odiallama-cc100-dataset

    • huggingface.co
    + more versions
    Cite
    Satyajit Pradhan, odiallama-cc100-dataset [Dataset]. https://huggingface.co/datasets/actuallysatya/odiallama-cc100-dataset
    Explore at:
    Croissant
    Authors
    Satyajit Pradhan
    Description

    actuallysatya/odiallama-cc100-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. cc-100-01-percent

    • huggingface.co
    Updated Mar 6, 2024
    Cite
    Frederick Riemenschneider (2024). cc-100-01-percent [Dataset]. https://huggingface.co/datasets/bowphs/cc-100-01-percent
    Explore at:
    Dataset updated
    Mar 6, 2024
    Authors
    Frederick Riemenschneider
    Description

    bowphs/cc-100-01-percent dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. nepalitext-language-model-dataset

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 28, 2023
    Cite
    (2023). nepalitext-language-model-dataset [Dataset]. https://opendatalab.com/OpenDataLab/nepalitext-language-model-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 28, 2023
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The "NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases/sentences/paragraphs) extracted by combining the following datasets: OSCAR, cc100, and a set of scraped Nepali articles on Wikipedia.

  13. cc100-ja-cleaned-sample

    • huggingface.co
    Updated Jun 7, 2024
    + more versions
    Cite
    neodyland (2024). cc100-ja-cleaned-sample [Dataset]. https://huggingface.co/datasets/neody/cc100-ja-cleaned-sample
    Explore at:
    Croissant
    Dataset updated
    Jun 7, 2024
    Dataset authored and provided by
    neodyland
    Description

    neody/cc100-ja-cleaned-sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. cc100-yue-tagged

    • huggingface.co
    Updated Sep 5, 2024
    Cite
    Xiang (Kevin) Li (2024). cc100-yue-tagged [Dataset]. https://huggingface.co/datasets/AlienKevin/cc100-yue-tagged
    Explore at:
    Croissant
    Dataset updated
    Sep 5, 2024
    Authors
    Xiang (Kevin) Li
    Description

    AlienKevin/cc100-yue-tagged dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. CC100-sinhala

    • huggingface.co
    Updated May 22, 2025
    Cite
    TianMuxin (2025). CC100-sinhala [Dataset]. https://huggingface.co/datasets/realtmxi/CC100-sinhala
    Explore at:
    Dataset updated
    May 22, 2025
    Authors
    TianMuxin
    Description

    realtmxi/CC100-sinhala dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. cc100-ko-only-1-of-5

    • huggingface.co
    Updated Apr 4, 2023
    + more versions
    Cite
    Chang W Lee (2023). cc100-ko-only-1-of-5 [Dataset]. https://huggingface.co/datasets/lcw99/cc100-ko-only-1-of-5
    Explore at:
    Croissant
    Dataset updated
    Apr 4, 2023
    Authors
    Chang W Lee
    Description

    lcw99/cc100-ko-only-1-of-5 dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. cc100-hausa

    • huggingface.co
    Updated Jun 25, 2022
    Cite
    Alamin Usman (2022). cc100-hausa [Dataset]. https://huggingface.co/datasets/alamin05/cc100-hausa
    Explore at:
    Dataset updated
    Jun 25, 2022
    Authors
    Alamin Usman
    Description

    alamin05/cc100-hausa dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. cc100-ko-390M-uncleaned

    • huggingface.co
    Updated Apr 4, 2023
    + more versions
    Cite
    WOO HWAN PARK (2023). cc100-ko-390M-uncleaned [Dataset]. https://huggingface.co/datasets/richard-park/cc100-ko-390M-uncleaned
    Explore at:
    Croissant
    Dataset updated
    Apr 4, 2023
    Authors
    WOO HWAN PARK
    Description

    richard-park/cc100-ko-390M-uncleaned dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. cc100-ja

    • huggingface.co
    Updated Nov 30, 2018
    Cite
    SHIGA_T (2018). cc100-ja [Dataset]. https://huggingface.co/datasets/tash-huggingface/cc100-ja
    Explore at:
    Croissant
    Dataset updated
    Nov 30, 2018
    Authors
    SHIGA_T
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    tash-huggingface/cc100-ja dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. vi-cc100-parquet-dataset

    • huggingface.co
    Updated May 12, 2017
    Cite
    Nguyễn Tiến Khôi (2017). vi-cc100-parquet-dataset [Dataset]. https://huggingface.co/datasets/zerostratos/vi-cc100-parquet-dataset
    Explore at:
    Dataset updated
    May 12, 2017
    Authors
    Nguyễn Tiến Khôi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    zerostratos/vi-cc100-parquet-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
