3 datasets found
  1. statistics

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Statistics

    Number of pages, distribution of top-level domains, crawl overlaps, etc.: basic metrics about the Common Crawl monthly crawl archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

      Charsets
    

    The character set or encoding of HTML pages only is identified by Tika's AutoDetectReader. The table shows the percentages with which character sets have been used to encode HTML pages… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.

  2. oscar_2201

    • huggingface.co
    Updated Jun 20, 2024
    Cite
    SEACrowd (2024). oscar_2201 [Dataset]. https://huggingface.co/datasets/SEACrowd/oscar_2201
    Dataset updated
    Jun 20, 2024
    Dataset authored and provided by
    SEACrowd
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the ungoliant architecture. Data is distributed by language in both original and deduplicated form.

  3. European Multilingual News Articles Dataset with Topic Annotation

    • data.niaid.nih.gov
    Updated Jul 7, 2024
    Cite
    Morini, Virginia; Bellomo, Lorenzo; Rossetti, Giulio; Pedreschi, Dino; Ferragina, Paolo (2024). European Multilingual News Articles Dataset with Topic Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10397399
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    National Research Council
    University of Pisa
    Scuola Normale Superiore
    Authors
    Morini, Virginia; Bellomo, Lorenzo; Rossetti, Giulio; Pedreschi, Dino; Ferragina, Paolo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The European Multilingual News Articles Dataset is composed of over 18 million European news articles from 205 media outlets in 27 European countries (i.e., all member states of the European Union), with the addition of the United Kingdom. Articles span the period from 2017 to 2021 and are written in their original languages, for a total of 23 different languages.

    After selecting reliable, nationwide European media outlets, each article (i.e., title, textual content, URL, and date and time of publication) was extracted from the Common Crawl News Corpus, which contains petabytes of raw web page data collected since 2016. The dataset is released without any text pre-processing other than a cleanup of XML tags. Further, we enriched it with media metadata (e.g., frequency of publication, distribution area, language, type of media).

    Moreover, we enhanced the dataset by adding, whenever possible, article-level topic annotation, using each article's URL as a proxy for the topic discussed. In the end, we were able to assign a topic to over 4 million articles (33 unique topics, e.g., politics, sport, entertainment), i.e., 23.2% of the entire dataset. Further, from the URLs we also extracted the types of over 4 million articles (15 unique article types, e.g., news, international, multimedia).

