3 datasets found

statistics
huggingface.co
Updated Nov 20, 2024
Cite
Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
Dataset updated
Nov 20, 2024
Dataset provided by
Common Crawl (http://commoncrawl.org/)
Authors
Common Crawl Foundation
Description
Common Crawl Statistics

Number of pages, distribution of top-level domains, crawl overlaps, etc.: basic metrics about the Common Crawl monthly crawl archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

Charsets

The character set (encoding) is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
oscar_2201
huggingface.co
Updated Jun 20, 2024
Cite
SEACrowd (2024). oscar_2201 [Dataset]. https://huggingface.co/datasets/SEACrowd/oscar_2201
Dataset updated
Jun 20, 2024
Dataset authored and provided by
SEACrowd
License
https://choosealicense.com/licenses/cc0-1.0/
Description
OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the ungoliant architecture. Data is distributed by language in both original and deduplicated form.
European Multilingual News Articles Dataset with Topic Annotation
data.niaid.nih.gov
Updated Jul 7, 2024
Cite
Morini, Virginia; Bellomo, Lorenzo; Rossetti, Giulio; Pedreschi, Dino; Ferragina, Paolo (2024). European Multilingual News Articles Dataset with Topic Annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10397399
Dataset updated
Jul 7, 2024
Dataset provided by
National Research Council
University of Pisa
Scuola Normale Superiore
Authors
Morini, Virginia; Bellomo, Lorenzo; Rossetti, Giulio; Pedreschi, Dino; Ferragina, Paolo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The European Multilingual News Articles Dataset comprises over 18 million European news articles from 205 media outlets in 27 European countries (i.e., all member states of the European Union), with the addition of the United Kingdom. Articles span the period from 2017 to 2021 and are written in their original languages, for a total of 23 different languages.

After selecting reliable, nationwide European media outlets, each article (i.e., title, textual content, URL, and date and time of publication) was extracted from the Common Crawl News Corpus, which contains petabytes of raw web page data collected since 2016. The dataset is released without any text pre-processing other than a cleanup of XML tags. Further, we enriched it by adding several media metadata fields (e.g., frequency of publication, distribution area, language, type of media).

Moreover, we enhanced the dataset by adding, whenever possible, article-level topic annotation, using each article's URL as a proxy for the topic discussed. In the end, we were able to assign a topic to over 4 million articles (33 unique topics, e.g., politics, sport, entertainment), thus 23.2% of the entire dataset. Further, from the URLs we also extracted the types of over 4 million articles (15 unique article types, e.g., news, international, multimedia).
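As a rough illustration of how a URL can serve as a proxy for an article's topic, the sketch below looks for a known keyword among the path segments of a URL. It is not the authors' method: the `SEGMENT_TO_TOPIC` mapping, the `topic_from_url` helper, and the example URLs are all hypothetical, and the dataset's actual topic vocabulary and matching rules are not described here.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from URL path segments to topic labels (an assumption
# for illustration; the dataset's real 33-topic vocabulary is not shown here).
SEGMENT_TO_TOPIC = {
    "politics": "politics",
    "politica": "politics",
    "sport": "sport",
    "sports": "sport",
    "entertainment": "entertainment",
}


def topic_from_url(url: str) -> Optional[str]:
    """Return the topic of the first path segment found in the mapping, else None."""
    path = urlparse(url).path.lower()
    for segment in path.split("/"):
        topic = SEGMENT_TO_TOPIC.get(segment)
        if topic is not None:
            return topic
    return None


# Made-up example URLs: one with a topical path segment, one without.
urls = [
    "https://example.com/sport/2020/05/some-match-report",
    "https://example.com/2019/11/untitled-article",
]
labels = [topic_from_url(u) for u in urls]

# Fraction of URLs that received a topic label.
coverage = sum(label is not None for label in labels) / len(urls)
```

URLs without a recognizable segment stay unlabeled in this sketch, which is consistent with the description's note that only part of the dataset could be annotated with a topic.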