Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
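As a usage sketch (not part of the dataset description itself), the statistics files in the Hugging Face repository can be listed and fetched with huggingface_hub; file names are not hard-coded below because the exact repository layout may change.

```python
# Sketch: inspect and fetch the statistics files hosted in the Hugging Face
# dataset repo. Assumes the `huggingface_hub` package is installed.
from huggingface_hub import list_repo_files, hf_hub_download

files = list_repo_files("commoncrawl/statistics", repo_type="dataset")
for name in files:
    print(name)  # e.g. the charset, TLD, and crawl-overlap tables

# Download any one of the listed files locally for analysis.
local_path = hf_hub_download("commoncrawl/statistics", filename=files[0], repo_type="dataset")
print(local_path)
```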
https://choosealicense.com/licenses/odc-by/
VietVault
VietVault is a large-scale Vietnamese language corpus, carefully filtered and curated from Common Crawl dataset dumps prior to 2023. This dataset is designed to serve as a high-quality resource for Vietnamese language model pretraining and various natural language processing tasks.
Dataset Statistics
Size: 80GB of raw text
Language: Vietnamese
Source: Common Crawl dataset (all dumps in 2013-2023)
Preprocessing: Cleaned, deduplicated, filtered for Vietnamese… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/vietvault.
This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository.
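A minimal parsing sketch for this layout follows; the file name is a placeholder.

```python
# Minimal sketch: split one corpus file into documents (blank-line separated)
# and paragraphs (newline separated). "data.txt" is a placeholder file name.
with open("data.txt", encoding="utf-8") as f:
    raw = f.read()

documents = [doc for doc in raw.split("\n\n") if doc.strip()]
for doc in documents[:3]:
    paragraphs = doc.split("\n")
    print(len(paragraphs), paragraphs[0][:80])
```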
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl. The corpus has the following structure:
wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content
without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)
The datasets were extracted in the work by Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
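A minimal sketch for streaming one of the bzip2-compressed JSONL files described above; the field order follows the file descriptions, and the file name is taken from the list.

```python
# Sketch: stream text reuse cases from one of the bzip2-compressed JSONL files.
# Each line is a JSON array [s_id, t_id, s_text, t_text] as described above.
import bz2
import json

with bz2.open("within-wikipedia-tr-01.jsonl.bz2", mode="rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        s_id, t_id, s_text, t_text = json.loads(line)
        print(s_id, t_id, s_text[:60], "->", t_text[:60])
        if i == 4:  # only preview a handful of cases
            break
```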
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
FastText embeddings built from the Common Crawl German dataset
| Parameters | Value(s) |
| --- | --- |
| Dimensions | 256 and 384 |
| Context window | 5 |
| Negative samples | 10 |
| Epochs | 1 |
| Number of buckets | 131072 or 262144 |
| Min n | 3 |
| Max n | 6 |
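As an illustration, here is a minimal training sketch with the hyperparameters listed above, using the official fasttext Python bindings. The input file path and the skip-gram objective are assumptions; only the parameters in the table come from the dataset description.

```python
# Sketch: train fastText vectors with the hyperparameters from the table above,
# using the official `fasttext` Python bindings.
import fasttext

model = fasttext.train_unsupervised(
    "common_crawl_de.txt",  # placeholder: plain text, one sentence/document per line
    model="skipgram",       # assumption; the card does not state the objective
    dim=256,        # or 384
    ws=5,           # context window
    neg=10,         # negative samples
    epoch=1,
    bucket=131072,  # or 262144
    minn=3,
    maxn=6,
)
model.save_model("cc_de_fasttext.bin")
```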
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
https://choosealicense.com/licenses/unknown/
Dataset Card for CC-News
Dataset Summary
The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please, an integrated web crawler and information extractor for news. It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.
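Since the card notes that the raw data sits in the public Common Crawl bucket under /crawl-data/CC-NEWS/, a minimal sketch for listing a few of those objects with anonymous S3 access could look as follows (boto3 is assumed to be installed; this snippet is not part of the dataset card itself).

```python
# Sketch: list a few CC-NEWS WARC files in the public Common Crawl S3 bucket
# using anonymous (unsigned) access. The prefix comes from the description above.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/CC-NEWS/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```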
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FastText embeddings built from the Common Crawl French dataset
Parameters

| Parameters | Value(s) |
| --- | --- |
| Dimensions | 512 |
| Context window | 5 |
| Negative samples | 10 |
| Epochs | 1 |
| Number of buckets | 262144 |
| Min n | 3 |
| Max n | 6 |
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from Common Crawl. The data processing pipeline is optimized for LLM performance and runs on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
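Given the dataset's size, streaming is the natural way to sample it. A minimal sketch with the datasets library follows; the config name "default" and the "train" split are assumptions about the repository layout.

```python
# Sketch: stream a slice of FineWeb rather than downloading all >15T tokens.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="default", split="train", streaming=True)
for i, row in enumerate(fw):
    # Field names such as "url" and "text" are assumed; .get() keeps this robust.
    print(row.get("url"), len(row.get("text", "")))
    if i == 4:
        break
```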
CCQA is a new web-scale question answering dataset for in-domain model pre-training, built on top of the Common Crawl project. Using the readily available schema.org annotations, around 130 million multilingual question-answer pairs were extracted, including about 60 million English data points.
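The CCQA extraction pipeline itself is not reproduced here; as a hedged illustration of the general idea, the sketch below pulls schema.org Question/acceptedAnswer pairs out of embedded JSON-LD blocks with BeautifulSoup (the FAQPage-style mainEntity structure is an assumption about the markup being scanned).

```python
# Illustrative only (not the CCQA pipeline): extract schema.org question/answer
# pairs from embedded JSON-LD blocks in an HTML page.
import json
from bs4 import BeautifulSoup

def qa_pairs_from_html(html: str):
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        items = data.get("mainEntity", []) if isinstance(data, dict) else []
        for item in items if isinstance(items, list) else [items]:
            if isinstance(item, dict) and item.get("@type") == "Question":
                answer = item.get("acceptedAnswer", {}) or {}
                pairs.append((item.get("name"), answer.get("text")))
    return pairs
```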
CCNet is a dataset extracted from Common Crawl with a different filtering process than for OSCAR. It was built using a language model trained on Wikipedia, in order to filter out bad quality texts such as code or tables. CCNet contains longer documents on average compared to OSCAR with smaller—and often noisier—documents weeded out.
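As a rough illustration of this kind of filtering (not the actual CCNet pipeline, which also tokenizes with SentencePiece and uses per-language models), one can score documents with a KenLM model trained on Wikipedia and drop high-perplexity ones; the model path and threshold below are placeholders.

```python
# Sketch of the idea behind CCNet-style quality filtering: score documents with
# a language model trained on Wikipedia and drop high-perplexity ones.
# Requires the `kenlm` bindings; paths and thresholds are placeholders.
import kenlm

lm = kenlm.Model("wikipedia.arpa.bin")  # placeholder path to a Wikipedia LM

def keep(document: str, max_perplexity: float = 1000.0) -> bool:
    """Keep documents whose perplexity under the Wikipedia LM is low enough."""
    return lm.perplexity(document) <= max_perplexity

docs = ["A well formed sentence about history .", "var x = 1 ; </td> </tr>"]
print([keep(d) for d in docs])
```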
https://choosealicense.com/licenses/cc0-1.0/
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
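The goclassy pipeline itself is written in Go; as an illustration of the language-classification step only, the sketch below runs fastText's publicly available lid.176.bin language-identification model from Python (downloading the model beforehand is assumed).

```python
# Illustration of language classification (not the goclassy code): classify
# text with fastText's public language-ID model lid.176.bin.
import fasttext

lid = fasttext.load_model("lid.176.bin")
labels, probs = lid.predict("Dies ist ein deutscher Beispielsatz.", k=1)
print(labels[0], probs[0])  # e.g. '__label__de' with a high probability
```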
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision using shared product identifiers from the Web.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
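A hedged sketch of the weak-supervision idea described above: offers sharing a product identifier are labelled as matches, and all other pairs as non-matches. The offer records and the identifier field below are purely hypothetical.

```python
# Hedged illustration of weak supervision via shared product identifiers:
# offers with the same identifier (e.g. a GTIN) form positive pairs.
from itertools import combinations

offers = [
    {"id": "shopA-1", "title": "Camera X100", "gtin": "0001"},
    {"id": "shopB-7", "title": "X100 digital camera", "gtin": "0001"},
    {"id": "shopC-3", "title": "Watch W2", "gtin": "0002"},
]

pairs = []
for a, b in combinations(offers, 2):
    label = "match" if a["gtin"] == b["gtin"] else "no match"
    pairs.append((a["id"], b["id"], label))
print(pairs)
```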
Datasets Overview
The dataset URLs and Domain Names are collected from the following sources:
mC4
Description: The Multilingual Colossal Common Crawl Corpus (mC4) is a cleaned version of the Common Crawl's web corpus, curated by the Allen Institute for Artificial Intelligence. It contains approximately 170 million URLs. Source: mC4 Dataset on Hugging Face
falcon-refinedweb
Description: A large-scale English dataset curated for large language model… See the full description on the dataset page: https://huggingface.co/datasets/amahdaouy/Web_DomURLs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2018) based on the Common Crawl November 2018 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.
Dataset Properties
The number of measurement rows can be counted with: gunzip -c measurements.csv.gz | wc -l
The measurements record, per property and datatype, whether literal values are precisely representable in xsd:float and xsd:double, and whether their lexical representations fall within the lexical spaces of xsd:date, xsd:dateTime, xsd:time, xsd:decimal, xsd:integer, xsd:float, xsd:double, and xsd:boolean, including the special notations INF, +INF, -INF and NaN for the floating-point types and true/false or 0/1 for xsd:boolean.
Note that xsd:double values in embedded JSON-LD were normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
Preview
"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","4"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","4"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://purl.org/goodrelations/v1#hasCurrencyValue","https://www.w3.org/2001/XMLSchema#float","6"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/goodrelations/v1#hasCurrencyValue","http://www.w3.org/2001/XMLSchema#floatfloat","8"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://opengraphprotocol.org/schema/latitude","http://www.w3.org/2001/XMLSchema#string","30"
…
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/numberOfItems","http://www.w3.org/2001/XMLSchema#integer","40"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","431"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","122"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","63"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/pageEnd","http://www.w3.org/2001/XMLSchema#integer","139"
Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
Reproduce
To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.1 and execute:
mvn clean package
java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2018-12/files/html-rdfa.list November2018
java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2018-12/files/html-embedded-jsonld.list November2018
./measure.sh November2018
# Wait until the scan has completed. This will take a few days
java -jar target/Scanner.jar --results ./November2018/measurements.csv.gz November2018
https://spdx.org/licenses/CC0-1.0.html
Concerns about gender bias in word embedding models have captured substantial attention in the algorithmic bias research literature. Other types of bias, however, have received far less scrutiny. This work describes a large-scale analysis of sentiment associations in popular word embedding models along the lines of gender and ethnicity, but also along the less frequently studied dimensions of socioeconomic status, age, physical appearance, sexual orientation, religious sentiment and political leanings. Consistent with previous scholarly literature, this work finds systemic bias against given names popular among African-Americans in most embedding models examined. Gender bias in embedding models, however, appears to be multifaceted and often reversed in polarity relative to what has been regularly reported. Interestingly, using the common operationalization of the term bias in the fairness literature, several previously unreported types of bias in word embedding models have also been identified. Specifically, the popular embedding models analyzed here display negative biases against middle- and working-class socioeconomic status, male children, senior citizens, plain physical appearance and intellectual phenomena such as Islamic religious faith, non-religiosity and conservative political orientation. The reasons for the paradoxical underreporting of these bias types in the relevant literature are probably manifold, but widely held blind spots when searching for algorithmic bias and the lack of widespread technical vocabulary to unambiguously describe such a variety of algorithmic associations could conceivably play a role. The causal origins of the multiplicity of loaded associations attached to distinct demographic groups within embedding models are often unclear, but the heterogeneity of these associations and their potentially multifactorial roots raise doubts about the validity of grouping them all under the umbrella term bias. Richer and more fine-grained terminology, as well as a more comprehensive exploration of the bias landscape, could help the fairness epistemic community to characterize and neutralize algorithmic discrimination more efficiently.
Methods
This dataset collects several popular pre-trained word embedding models:
- Word2vec Skip-Gram trained on the Google News corpus (100B tokens): https://code.google.com/archive/p/word2vec/
- GloVe trained on Wikipedia 2014 + Gigaword 5 (6B tokens): http://nlp.stanford.edu/data/glove.6B.zip
- GloVe trained on a Twitter corpus of 2B tweets (27B tokens): http://nlp.stanford.edu/data/glove.twitter.27B.zip
- GloVe trained on Common Crawl (42B tokens): http://nlp.stanford.edu/data/glove.42B.300d.zip
- GloVe trained on Common Crawl (840B tokens): http://nlp.stanford.edu/data/glove.840B.300d.zip
- FastText trained with subword information on Wikipedia 2017, the UMBC WebBase corpus and the statmt.org news dataset (16B tokens): https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
- FastText trained with subword information on Common Crawl (600B tokens): https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
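As a usage sketch (not part of the dataset itself), one of the GloVe files listed above can be loaded with gensim 4+ and probed for simple associations; the extracted file name is an assumption, and no_header=True is needed because GloVe text files have no header line.

```python
# Sketch: load a GloVe vector file with gensim (>=4.0) and probe associations.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False, no_header=True)
print(kv.similarity("pleasant", "flowers"))   # cosine similarity between two words
print(kv.most_similar("doctor", topn=5))      # nearest neighbours of a word
```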
NOTE: Newer (unfiltered) data is already available at
https://github.com/hplt-project/data-analytics-tool/blob/main/reports/mono-2.0/HPLT-v2-vie_Latn.lite.pdf https://hplt-project.org/datasets/v2.0
Vietnamese data from https://hplt-project.org/datasets/v1, with the data originating from Common Crawl (CC) removed. Statistics by domain:
SIZE DOCS DOMAIN
40855.5mb 3586.6k http://dongtrieu.edu.vn
30012.1mb 112.8k http://hamtruyentranh.net
… See the full description on the dataset page: https://huggingface.co/datasets/Symato/hplt-vi.
CCMatrix uses ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences.
https://choosealicense.com/licenses/other/
Dataset Summary
DCAD-2000 is a large-scale multilingual corpus built using newly extracted Common Crawl data (CC-MAIN-2024-46) and existing multilingual datasets. It includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. We propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/DCAD-2000.
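The card does not spell out the exact detector, so the following is only a hedged illustration of "data cleaning as anomaly detection", using scikit-learn's IsolationForest over a few simple document features; it is not the authors' pipeline.

```python
# Hedged illustration (not the DCAD-2000 pipeline): featurize documents and
# flag outliers with IsolationForest, keeping only inliers.
import numpy as np
from sklearn.ensemble import IsolationForest

def features(doc: str) -> list[float]:
    words = doc.split()
    n_chars = max(len(doc), 1)
    return [
        len(words),                               # length in words
        sum(c.isalpha() for c in doc) / n_chars,  # alphabetic character ratio
        sum(c.isdigit() for c in doc) / n_chars,  # digit ratio
        len(set(words)) / max(len(words), 1),     # lexical diversity
    ]

docs = ["Normal prose about a topic ...", "0 0 0 0 0 0 0 0 0", "<td></td><td></td>"]
X = np.array([features(d) for d in docs])
clf = IsolationForest(contamination=0.3, random_state=0).fit(X)
keep_mask = clf.predict(X) == 1  # 1 = inlier, -1 = anomaly
print(keep_mask)
```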
https://commoncrawl.org/terms-of-use/
Tracking the Trackers is a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus and aggregate them into a dataset containing more than 140 million third-party embeddings in over 41 million domains. We provide the data used in our recent large-scale analysis of third-party trackers on the web. We created an extractor that finds embedded third-party resources from HTML pages and ran it on the 3.5 billion web pages contained in the CommonCrawl 2012 web crawl.
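The original extractor is not reproduced here; as a minimal sketch of the idea, the snippet below collects external resource hosts referenced by a page and keeps those served from another domain (BeautifulSoup is assumed to be available).

```python
# Hedged sketch of third-party extraction (not the original extractor): collect
# hosts of embedded resources and keep those that differ from the page's host.
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def third_party_hosts(html: str, page_url: str) -> set[str]:
    page_host = urlparse(page_url).hostname or ""
    soup = BeautifulSoup(html, "html.parser")
    hosts = set()
    for tag, attr in (("script", "src"), ("img", "src"), ("iframe", "src"), ("link", "href")):
        for node in soup.find_all(tag):
            url = node.get(attr)
            host = urlparse(url).hostname if url else None
            if host and host != page_host:
                hosts.add(host)
    return hosts

html = '<script src="https://tracker.example/t.js"></script><img src="/local.png">'
print(third_party_hosts(html, "https://site.example/page"))
```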