MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
CCNet Reproduced Split (4M rows, 3.7B Tokens (Mistral tokenizer))
Overview
This dataset is a reproduced subset of the larger CCNet dataset, tailored specifically to facilitate easier access and processing for researchers needing high-quality, web-crawled text data for natural language processing tasks. The CCNet dataset leverages data from the Common Crawl, a non-profit organization that crawls the web and freely provides its archives to the public. This subset contains 4… See the full description on the dataset page: https://huggingface.co/datasets/JorgeeGF/CCNet.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
CCNet is a dataset extracted from Common Crawl using a different filtering process than OSCAR. It was built using a language model trained on Wikipedia to filter out low-quality text such as code or tables. On average, CCNet contains longer documents than OSCAR, with smaller, often noisier documents weeded out.
This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
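The file layout described above (documents separated by a blank line, paragraphs within a document separated by a single newline) can be read with a few lines of Python. This is only a minimal sketch under that stated layout: the function name `iter_documents` and the sample text are hypothetical and are not part of the CC-Net repository.

```python
from typing import Iterator, List


def iter_documents(text: str) -> Iterator[List[str]]:
    """Yield each document as a list of its paragraphs.

    Assumes documents are separated by a blank line (double newline)
    and paragraphs inside a document by a single newline.
    """
    for block in text.split("\n\n"):
        # Drop empty lines left over from trailing or repeated newlines.
        paragraphs = [line.strip() for line in block.split("\n") if line.strip()]
        if paragraphs:
            yield paragraphs


# Hypothetical sample in the described layout: two documents.
sample = (
    "First paragraph of doc one.\n"
    "Second paragraph of doc one.\n"
    "\n"
    "Only paragraph of doc two.\n"
)
docs = list(iter_documents(sample))
```

For a real corpus file, the same function could be applied to the decoded file contents; very large files would be better handled by streaming line by line rather than reading everything into memory.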
https://whoisdatacenter.com/terms-of-use/
Explore the historical Whois records related to ccnet.info (Domain). Get insights into ownership history and changes over time.
https://whoisdatacenter.com/terms-of-use/
Uncover historical ownership history and changes over time by performing a reverse Whois lookup for the company CCNet-Ltd..
https://sem1.theseowheel.com/company/legal/terms-of-service/
tajimaya-cc.net is ranked #14751 in JP with 182.58K Traffic. Categories: Online Services. Learn more about website traffic, market share, and more!
http://www.companywall.rs/Home/Licence
This dataset includes financial statements, accounts and account freezes, and real estate. The data cover revenue, expenses, profit, assets, liabilities, and information on real estate owned by the company. Financial data, financial summary, company summary, entrepreneur, craftsman, association, business entities.