Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
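As a usage sketch (not part of the dataset description itself), the statistics files in the Hugging Face repository can be listed and fetched with huggingface_hub; file names are not hard-coded below because the exact repository layout may change.

```python
# Sketch: inspect and fetch the statistics files hosted in the Hugging Face
# dataset repo. Assumes the `huggingface_hub` package is installed.
from huggingface_hub import list_repo_files, hf_hub_download

files = list_repo_files("commoncrawl/statistics", repo_type="dataset")
for name in files:
    print(name)  # e.g. the charset, TLD, and crawl-overlap tables

# Download any one of the listed files locally for analysis.
local_path = hf_hub_download("commoncrawl/statistics", filename=files[0], repo_type="dataset")
print(local_path)
```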
https://choosealicense.com/licenses/odc-by/
VietVault
VietVault is a large-scale Vietnamese language corpus, carefully filtered and curated from Common Crawl dataset dumps prior to 2023. This dataset is designed to serve as a high-quality resource for Vietnamese language model pretraining and various natural language processing tasks.
Dataset Statistics
Size: 80GB of raw text
Language: Vietnamese
Source: Common Crawl dataset (all dumps in 2013-2023)
Preprocessing: Cleaned, deduplicated, filtered for Vietnamese… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/vietvault.
This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository.
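A minimal parsing sketch for this layout follows; the file name is a placeholder.

```python
# Minimal sketch: split one corpus file into documents (blank-line separated)
# and paragraphs (newline separated). "data.txt" is a placeholder file name.
with open("data.txt", encoding="utf-8") as f:
    raw = f.read()

documents = [doc for doc in raw.split("\n\n") if doc.strip()]
for doc in documents[:3]:
    paragraphs = doc.split("\n")
    print(len(paragraphs), paragraphs[0][:80])
```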
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl. The corpus has the following structure:
wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content
without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)
The datasets were extracted in the work by Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
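A minimal sketch for streaming one of the bzip2-compressed JSONL files described above; the field order follows the file descriptions, and the file name is taken from the list.

```python
# Sketch: stream text reuse cases from one of the bzip2-compressed JSONL files.
# Each line is a JSON array [s_id, t_id, s_text, t_text] as described above.
import bz2
import json

with bz2.open("within-wikipedia-tr-01.jsonl.bz2", mode="rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        s_id, t_id, s_text, t_text = json.loads(line)
        print(s_id, t_id, s_text[:60], "->", t_text[:60])
        if i == 4:  # only preview a handful of cases
            break
```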
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
FastText embeddings built from the Common Crawl German dataset
| Parameters | Value(s) |
| --- | --- |
| Dimensions | 256 and 384 |
| Context window | 5 |
| Negative samples | 10 |
| Epochs | 1 |
| Number of buckets | 131072 or 262144 |
| Min n | 3 |
| Max n | 6 |
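As an illustration, here is a minimal training sketch with the hyperparameters listed above, using the official fasttext Python bindings. The input file path and the skip-gram objective are assumptions; only the parameters in the table come from the dataset description.

```python
# Sketch: train fastText vectors with the hyperparameters from the table above,
# using the official `fasttext` Python bindings.
import fasttext

model = fasttext.train_unsupervised(
    "common_crawl_de.txt",  # placeholder: plain text, one sentence/document per line
    model="skipgram",       # assumption; the card does not state the objective
    dim=256,        # or 384
    ws=5,           # context window
    neg=10,         # negative samples
    epoch=1,
    bucket=131072,  # or 262144
    minn=3,
    maxn=6,
)
model.save_model("cc_de_fasttext.bin")
```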
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
https://choosealicense.com/licenses/unknown/
Dataset Card for CC-News
Dataset Summary
The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please, an integrated web crawler and information extractor for news. It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.
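Since the card notes that the raw data sits in the public Common Crawl bucket under /crawl-data/CC-NEWS/, a minimal sketch for listing a few of those objects with anonymous S3 access could look as follows (boto3 is assumed to be installed; this snippet is not part of the dataset card itself).

```python
# Sketch: list a few CC-NEWS WARC files in the public Common Crawl S3 bucket
# using anonymous (unsigned) access. The prefix comes from the description above.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/CC-NEWS/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```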
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FastText embeddings built from the Common Crawl French dataset
Parameters

| Parameters | Value(s) |
| --- | --- |
| Dimensions | 512 |
| Context window | 5 |
| Negative samples | 10 |
| Epochs | 1 |
| Number of buckets | 262144 |
| Min n | 3 |
| Max n | 6 |
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from Common Crawl. The data processing pipeline is optimized for LLM performance and runs on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
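Given the dataset's size, streaming is the natural way to sample it. A minimal sketch with the datasets library follows; the config name "default" and the "train" split are assumptions about the repository layout.

```python
# Sketch: stream a slice of FineWeb rather than downloading all >15T tokens.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="default", split="train", streaming=True)
for i, row in enumerate(fw):
    # Field names such as "url" and "text" are assumed; .get() keeps this robust.
    print(row.get("url"), len(row.get("text", "")))
    if i == 4:
        break
```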
CCQA is a new web-scale question answering dataset for in-domain model pre-training, built on top of the Common Crawl project. Using the readily available schema.org annotations, around 130 million multilingual question-answer pairs were extracted, including about 60 million English data points.
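The CCQA extraction pipeline itself is not reproduced here; as a hedged illustration of the general idea, the sketch below pulls schema.org Question/acceptedAnswer pairs out of embedded JSON-LD blocks with BeautifulSoup (the FAQPage-style mainEntity structure is an assumption about the markup being scanned).

```python
# Illustrative only (not the CCQA pipeline): extract schema.org question/answer
# pairs from embedded JSON-LD blocks in an HTML page.
import json
from bs4 import BeautifulSoup

def qa_pairs_from_html(html: str):
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        items = data.get("mainEntity", []) if isinstance(data, dict) else []
        for item in items if isinstance(items, list) else [items]:
            if isinstance(item, dict) and item.get("@type") == "Question":
                answer = item.get("acceptedAnswer", {}) or {}
                pairs.append((item.get("name"), answer.get("text")))
    return pairs
```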
CCNet is a dataset extracted from Common Crawl with a different filtering process than for OSCAR. It was built using a language model trained on Wikipedia, in order to filter out bad quality texts such as code or tables. CCNet contains longer documents on average compared to OSCAR with smaller—and often noisier—documents weeded out.
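As a rough illustration of this kind of filtering (not the actual CCNet pipeline, which also tokenizes with SentencePiece and uses per-language models), one can score documents with a KenLM model trained on Wikipedia and drop high-perplexity ones; the model path and threshold below are placeholders.

```python
# Sketch of the idea behind CCNet-style quality filtering: score documents with
# a language model trained on Wikipedia and drop high-perplexity ones.
# Requires the `kenlm` bindings; paths and thresholds are placeholders.
import kenlm

lm = kenlm.Model("wikipedia.arpa.bin")  # placeholder path to a Wikipedia LM

def keep(document: str, max_perplexity: float = 1000.0) -> bool:
    """Keep documents whose perplexity under the Wikipedia LM is low enough."""
    return lm.perplexity(document) <= max_perplexity

docs = ["A well formed sentence about history .", "var x = 1 ; </td> </tr>"]
print([keep(d) for d in docs])
```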
https://choosealicense.com/licenses/cc0-1.0/
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
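The goclassy pipeline itself is written in Go; as an illustration of the language-classification step only, the sketch below runs fastText's publicly available lid.176.bin language-identification model from Python (downloading the model beforehand is assumed).

```python
# Illustration of language classification (not the goclassy code): classify
# text with fastText's public language-ID model lid.176.bin.
import fasttext

lid = fasttext.load_model("lid.176.bin")
labels, probs = lid.predict("Dies ist ein deutscher Beispielsatz.", k=1)
print(labels[0], probs[0])  # e.g. '__label__de' with a high probability
```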
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision using shared product identifiers from the Web.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
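A hedged sketch of the weak-supervision idea described above: offers sharing a product identifier are labelled as matches, and all other pairs as non-matches. The offer records and the identifier field below are purely hypothetical.

```python
# Hedged illustration of weak supervision via shared product identifiers:
# offers with the same identifier (e.g. a GTIN) form positive pairs.
from itertools import combinations

offers = [
    {"id": "shopA-1", "title": "Camera X100", "gtin": "0001"},
    {"id": "shopB-7", "title": "X100 digital camera", "gtin": "0001"},
    {"id": "shopC-3", "title": "Watch W2", "gtin": "0002"},
]

pairs = []
for a, b in combinations(offers, 2):
    label = "match" if a["gtin"] == b["gtin"] else "no match"
    pairs.append((a["id"], b["id"], label))
print(pairs)
```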
Datasets Overview
The dataset URLs and Domain Names are collected from the following sources:
mC4
Description: The Multilingual Colossal Common Crawl Corpus (mC4) is a cleaned version of the Common Crawl's web corpus, curated by the Allen Institute for Artificial Intelligence. It contains approximately 170 million URLs. Source: mC4 Dataset on Hugging Face
falcon-refinedweb
Description: A large-scale English dataset curated for large language model… See the full description on the dataset page: https://huggingface.co/datasets/amahdaouy/Web_DomURLs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2018) based on the Common Crawl November 2018 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.
Dataset Properties
The number of measurement rows can be counted with: gunzip -c measurements.csv.gz | wc -l
The measurements record, per property and datatype, whether literal values are precisely representable in xsd:float and xsd:double, and whether their lexical representations fall within the lexical spaces of xsd:date, xsd:dateTime, xsd:time, xsd:decimal, xsd:integer, xsd:float, xsd:double, and xsd:boolean, including the special notations INF, +INF, -INF and NaN for the floating-point types and true/false or 0/1 for xsd:boolean.
Note that xsd:double values in embedded JSON-LD were normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
Preview
"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","4"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","4"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://purl.org/goodrelations/v1#hasCurrencyValue","https://www.w3.org/2001/XMLSchema#float","6"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/goodrelations/v1#hasCurrencyValue","http://www.w3.org/2001/XMLSchema#floatfloat","8"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://opengraphprotocol.org/schema/latitude","http://www.w3.org/2001/XMLSchema#string","30"
…
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/numberOfItems","http://www.w3.org/2001/XMLSchema#integer","40"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","431"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","122"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","63"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/pageEnd","http://www.w3.org/2001/XMLSchema#integer","139"
Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
Reproduce
To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.1 and execute:
mvn clean package
java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2018-12/files/html-rdfa.list November2018
java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2018-12/files/html-embedded-jsonld.list November2018
./measure.sh November2018
# Wait until the scan has completed. This will take a few days
java -jar target/Scanner.jar --results ./November2018/measurements.csv.gz November2018
https://spdx.org/licenses/CC0-1.0.html
Concerns about gender bias in word embedding models have captured substantial attention in the algorithmic bias research literature. Other types of bias, however, have received far less scrutiny. This work describes a large-scale analysis of sentiment associations in popular word embedding models along the lines of gender and ethnicity, but also along the less frequently studied dimensions of socioeconomic status, age, physical appearance, sexual orientation, religious sentiment and political leanings. Consistent with previous scholarly literature, this work finds systemic bias against given names popular among African-Americans in most embedding models examined. Gender bias in embedding models, however, appears to be multifaceted and often reversed in polarity relative to what has been regularly reported. Interestingly, using the common operationalization of the term bias in the fairness literature, several previously unreported types of bias in word embedding models have also been identified. Specifically, the popular embedding models analyzed here display negative biases against middle- and working-class socioeconomic status, male children, senior citizens, plain physical appearance and intellectual phenomena such as Islamic religious faith, non-religiosity and conservative political orientation. The reasons for the paradoxical underreporting of these bias types in the relevant literature are probably manifold, but widely held blind spots when searching for algorithmic bias and the lack of widespread technical vocabulary to unambiguously describe such a variety of algorithmic associations could conceivably play a role. The causal origins of the multiplicity of loaded associations attached to distinct demographic groups within embedding models are often unclear, but the heterogeneity of these associations and their potentially multifactorial roots raise doubts about the validity of grouping them all under the umbrella term bias. Richer and more fine-grained terminology, as well as a more comprehensive exploration of the bias landscape, could help the fairness epistemic community to characterize and neutralize algorithmic discrimination more efficiently.
Methods
This dataset collects several popular pre-trained word embedding models:
- Word2vec Skip-Gram trained on the Google News corpus (100B tokens): https://code.google.com/archive/p/word2vec/
- GloVe trained on Wikipedia 2014 + Gigaword 5 (6B tokens): http://nlp.stanford.edu/data/glove.6B.zip
- GloVe trained on a Twitter corpus of 2B tweets (27B tokens): http://nlp.stanford.edu/data/glove.twitter.27B.zip
- GloVe trained on Common Crawl (42B tokens): http://nlp.stanford.edu/data/glove.42B.300d.zip
- GloVe trained on Common Crawl (840B tokens): http://nlp.stanford.edu/data/glove.840B.300d.zip
- FastText trained with subword information on Wikipedia 2017, the UMBC WebBase corpus and the statmt.org news dataset (16B tokens): https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
- FastText trained with subword information on Common Crawl (600B tokens): https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
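As a usage sketch (not part of the dataset itself), one of the GloVe files listed above can be loaded with gensim 4+ and probed for simple associations; the extracted file name is an assumption, and no_header=True is needed because GloVe text files have no header line.

```python
# Sketch: load a GloVe vector file with gensim (>=4.0) and probe associations.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False, no_header=True)
print(kv.similarity("pleasant", "flowers"))   # cosine similarity between two words
print(kv.most_similar("doctor", topn=5))      # nearest neighbours of a word
```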
NOTE: Newer (unfiltered) data is already available at
https://github.com/hplt-project/data-analytics-tool/blob/main/reports/mono-2.0/HPLT-v2-vie_Latn.lite.pdf https://hplt-project.org/datasets/v2.0
Vietnamese data from https://hplt-project.org/datasets/v1, with the data originating from Common Crawl (CC) removed. Statistics by domain:
SIZE DOCS DOMAIN
40855.5mb 3586.6k http://dongtrieu.edu.vn
30012.1mb 112.8k http://hamtruyentranh.net
… See the full description on the dataset page: https://huggingface.co/datasets/Symato/hplt-vi.
CCMatrix uses ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences.
https://choosealicense.com/licenses/other/
Dataset Summary
DCAD-2000 is a large-scale multilingual corpus built using newly extracted Common Crawl data (CC-MAIN-2024-46) and existing multilingual datasets. It includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. We propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/DCAD-2000.
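The card does not spell out the exact detector, so the following is only a hedged illustration of "data cleaning as anomaly detection", using scikit-learn's IsolationForest over a few simple document features; it is not the authors' pipeline.

```python
# Hedged illustration (not the DCAD-2000 pipeline): featurize documents and
# flag outliers with IsolationForest, keeping only inliers.
import numpy as np
from sklearn.ensemble import IsolationForest

def features(doc: str) -> list[float]:
    words = doc.split()
    n_chars = max(len(doc), 1)
    return [
        len(words),                               # length in words
        sum(c.isalpha() for c in doc) / n_chars,  # alphabetic character ratio
        sum(c.isdigit() for c in doc) / n_chars,  # digit ratio
        len(set(words)) / max(len(words), 1),     # lexical diversity
    ]

docs = ["Normal prose about a topic ...", "0 0 0 0 0 0 0 0 0", "<td></td><td></td>"]
X = np.array([features(d) for d in docs])
clf = IsolationForest(contamination=0.3, random_state=0).fit(X)
keep_mask = clf.predict(X) == 1  # 1 = inlier, -1 = anomaly
print(keep_mask)
```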
https://commoncrawl.org/terms-of-use/
Tracking the Trackers is a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus and aggregate them into a dataset containing more than 140 million third-party embeddings in over 41 million domains. We provide the data used in our recent large-scale analysis of third-party trackers on the web. We created an extractor that finds embedded third-party resources from HTML pages and ran it on the 3.5 billion web pages contained in the CommonCrawl 2012 web crawl.
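The original extractor is not reproduced here; as a minimal sketch of the idea, the snippet below collects external resource hosts referenced by a page and keeps those served from another domain (BeautifulSoup is assumed to be available).

```python
# Hedged sketch of third-party extraction (not the original extractor): collect
# hosts of embedded resources and keep those that differ from the page's host.
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def third_party_hosts(html: str, page_url: str) -> set[str]:
    page_host = urlparse(page_url).hostname or ""
    soup = BeautifulSoup(html, "html.parser")
    hosts = set()
    for tag, attr in (("script", "src"), ("img", "src"), ("iframe", "src"), ("link", "href")):
        for node in soup.find_all(tag):
            url = node.get(attr)
            host = urlparse(url).hostname if url else None
            if host and host != page_host:
                hosts.add(host)
    return hosts

html = '<script src="https://tracker.example/t.js"></script><img src="/local.png">'
print(third_party_hosts(html, "https://site.example/page"))
```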