8 datasets found
  1. CCNet

    • huggingface.co
    Cite
    Jorge Gallego Feliciano, CCNet [Dataset]. https://huggingface.co/datasets/JorgeeGF/CCNet
    Authors
    Jorge Gallego Feliciano
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    CCNet Reproduced Split (4M rows, 3.7B Tokens (Mistral tokenizer))

    Overview

    This dataset is a reproduced subset of the larger CCNet dataset, tailored specifically to facilitate easier access and processing for researchers needing high-quality, web-crawled text data for natural language processing tasks. The CCNet dataset leverages data from the Common Crawl, a non-profit organization that crawls the web and freely provides its archives to the public. This subset contains 4… See the full description on the dataset page: https://huggingface.co/datasets/JorgeeGF/CCNet.

  2. CCNet

    • opendatalab.com
    zip
    Updated Mar 17, 2023
    Cite
    Facebook AI Research (2023). CCNet [Dataset]. https://opendatalab.com/OpenDataLab/CCNet
    Available download formats: zip
    Dataset updated
    Mar 17, 2023
    Dataset provided by
    Facebook AI Research
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    CCNet is a dataset extracted from Common Crawl using a different filtering process from the one used for OSCAR. It was built using a language model trained on Wikipedia to filter out low-quality text such as code or tables. On average, CCNet contains longer documents than OSCAR, because smaller (and often noisier) documents are weeded out.

  3. CCNet - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). CCNet - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/ccnet
    Dataset updated
    Dec 2, 2024
    Description

    The dataset used in the paper to train the Toolformer model.

  4. CC100

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jul 1, 2020
    Cite
    Facebook AI Research (2020). CC100 [Dataset]. https://opendatalab.com/OpenDataLab/CC100
    Available download formats: zip
    Dataset updated
    Jul 1, 2020
    Dataset provided by
    Facebook AI Research
    Description

    This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
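
The file layout described above (documents separated by a blank line, paragraphs within a document separated by a single newline) can be read with a few lines of Python. This is only a minimal sketch under that stated layout: the function name `iter_documents` and the sample text are hypothetical and are not taken from the CC-Net repository.

```python
from typing import Iterator, List


def iter_documents(text: str) -> Iterator[List[str]]:
    """Yield each document as a list of its paragraphs.

    Assumes the layout described for CC100 files: documents are
    separated by a double newline, paragraphs by a single newline.
    """
    for block in text.split("\n\n"):
        # Drop empty lines left over from trailing or extra newlines.
        paragraphs = [line for line in block.split("\n") if line.strip()]
        if paragraphs:
            yield paragraphs


# Hypothetical sample in the described layout: two documents.
sample = (
    "First paragraph of doc one.\n"
    "Second paragraph of doc one.\n"
    "\n"
    "Only paragraph of doc two.\n"
)

docs = list(iter_documents(sample))
for index, paragraphs in enumerate(docs, start=1):
    print(index, len(paragraphs))
```

For a real file, the text would have to be read in chunks or line by line rather than loaded whole, since the per-language files can be large.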

  5. ccnet.info - Historical whois Lookup

    • whoisdatacenter.com
    csv
    + more versions
    Cite
    AllHeart Web Inc, ccnet.info - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/ccnet.info/
    Available download formats: csv
    Dataset authored and provided by
    AllHeart Web Inc
    License

    https://whoisdatacenter.com/terms-of-use/

    Time period covered
    Mar 15, 1985 - Aug 12, 2025
    Description

    Explore the historical Whois records related to ccnet.info (Domain). Get insights into ownership history and changes over time.

  6. CCNet-Ltd. (Company) - Reverse Whois Lookup

    • whoisdatacenter.com
    csv
    + more versions
    Cite
    AllHeart Web Inc, CCNet-Ltd. (Company) - Reverse Whois Lookup [Dataset]. https://whoisdatacenter.com/company/CCNet-Ltd./
    Available download formats: csv
    Dataset provided by
    AllHeart Web
    Authors
    AllHeart Web Inc
    License

    https://whoisdatacenter.com/terms-of-use/

    Time period covered
    Mar 15, 1985 - Jul 21, 2025
    Description

    Uncover ownership history and changes over time by performing a reverse Whois lookup for the company CCNet-Ltd.

  7. tajimaya-cc.net Website Traffic, Ranking, Analytics [July 2025]

    • stb2.digiseotools.com
    Updated Aug 12, 2025
    Cite
    Semrush (2025). tajimaya-cc.net Website Traffic, Ranking, Analytics [July 2025] [Dataset]. https://stb2.digiseotools.com/website/tajimaya-cc.net/overview/
    Dataset updated
    Aug 12, 2025
    Dataset authored and provided by
    Semrush (https://fr.semrush.com/)
    License

    https://sem1.theseowheel.com/company/legal/terms-of-service/

    Time period covered
    Aug 12, 2025
    Area covered
    Worldwide
    Variables measured
    visits, backlinks, bounceRate, pagesPerVisit, authorityScore, organicKeywords, avgVisitDuration, referringDomains, trafficByCountry, paidSearchTraffic, and 3 more
    Measurement technique
    Semrush Traffic Analytics; Click-stream data
    Description

    tajimaya-cc.net is ranked #14751 in JP with 182.58K Traffic. Categories: Online Services. Learn more about website traffic, market share, and more!

  8. Financial data for CCNET SCIENTIFIC DOO NOVI SAD

    • companywall.rs
    Cite
    Agencija za privredne registre - APR, Finansijski podaci za CCNET SCIENTIFIC DOO NOVI SAD [Dataset]. https://www.companywall.rs/firma/ccnet-scientific-doo-novi-sad/MMhzlE1q
    Dataset authored and provided by
    Agencija za privredne registre - APR
    License

    http://www.companywall.rs/Home/Licence

    Area covered
    Нови Сад
    Description

    This dataset includes financial statements, accounts and account blocks, and real estate. The data covers revenue, expenses, profit, assets, liabilities, and information on real estate owned by the company. Financial data, financial summary, company summary, entrepreneur, craftsman, association, business entities.


CCNet

JorgeeGF/CCNet

CCNet split (4M)
