11 datasets found
  1. Data from: mC4 Dataset

    • paperswithcode.com
    • opendatalab.com
    • + 1 more
    Updated Jun 8, 2022
    Cite
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://paperswithcode.com/dataset/mc4
    Explore at:
    Dataset updated
    Jun 8, 2022
    Authors
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
    Description

    mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.

  2. WDC LSPM Dataset

    • paperswithcode.com
    Updated May 31, 2022
    Cite
    WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Explore at:
    Dataset updated
    May 31, 2022
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, for each training set there is a set of IDs for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision using shared product identifiers from the Web.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites.
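
    As a rough illustration of the stratified validation split mentioned above, a minimal sketch assuming the pairs are available as a JSON-lines table with pair_id and label columns (the file name and column names are assumptions, not taken from the official release notes):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Minimal sketch: draw a stratified validation split from one of the training sets.
    # File name and column names ("pair_id", "label") are assumptions.
    pairs = pd.read_json("computers_train_medium.json.gz", lines=True)
    train_ids, val_ids = train_test_split(
        pairs["pair_id"],
        test_size=0.2,
        stratify=pairs["label"],   # keep the match / no-match ratio equal in both splits
        random_state=0,
    )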

  3. German CBOW FastText embeddings with min count 250

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 26, 2021
    + more versions
    Cite
    Bocharov, Victor (2021). German CBOW FastText embeddings with min count 250 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5598143
    Explore at:
    Dataset updated
    Oct 26, 2021
    Dataset authored and provided by
    Bocharov, Victor
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    FastText embeddings built from the German Common Crawl dataset.

    Parameters

    • Dimensions: 256 and 384
    • Context window: 5
    • Negative sampled: 10
    • Epochs: 1
    • Number of buckets: 131072 or 262144
    • Min n: 3
    • Max n: 6
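
    For reference, a training run with these parameters could look roughly like the sketch below, using the fasttext Python bindings (the corpus file name is hypothetical, and preprocessing of the Common Crawl text is omitted):

    import fasttext

    # Minimal sketch: CBOW training with the parameters listed above.
    # The corpus path is hypothetical; dim and bucket are shown for one of the published variants.
    model = fasttext.train_unsupervised(
        "corpus_de.txt",   # hypothetical plain-text German Common Crawl dump
        model="cbow",
        dim=256,           # the other published variant uses 384
        ws=5,              # context window
        neg=10,            # negatives sampled
        epoch=1,
        bucket=131072,     # or 262144
        minn=3,
        maxn=6,
        minCount=250,      # "min count 250" from the dataset title
    )
    model.save_model("cc.de.256.bin")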
    
  4. Web Data Commons (November 2019) Property and Datatype Usage Dataset

    • zenodo.org
    application/gzip
    Updated May 10, 2022
    + more versions
    Cite
    Jan Martin Keil (2022). Web Data Commons (November 2019) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6359895
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Martin Keil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2019) based on the Common Crawl November 2019 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset Properties

    • Size: 37.0 MiB compressed, 785.7 MiB uncompressed, 3 542 700 rows plus 1 header line (determined using gunzip -c measurements.csv.gz | wc -l)
    • Parsing Failures: The scanner failed to parse 5 080 473 triples (~0.053 %) of the source dataset (containing 9 547 365 107 triples).
    • Content:
      • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
      • FILE_URL: The URL of the Web Data Commons file that has been measured.
      • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        Note: The lexical representation of xsd:double values in embedded JSON-LD was normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
      • PROPERTY: The property that has been measured.
      • DATATYPE: The datatype that has been measured.
      • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/goodrelations/v1#hasCurrencyValue","http://www.w3.org/2001/XMLSchema#floatfloat","10"
    …
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-embedded-jsonld.nq-01426.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","109"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-embedded-jsonld.nq-01426.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","60"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-embedded-jsonld.nq-01426.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","68"

    Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
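
    For a quick look at the measurements, the compressed CSV can be loaded directly; a minimal sketch using pandas (the aggregation shown is only an illustrative example, not part of the published dataset):

    import pandas as pd

    # Load the compressed measurement CSV (pandas infers gzip from the file suffix).
    df = pd.read_csv("measurements.csv.gz")

    # Example aggregation: total statement counts per measurement and datatype,
    # summed over all files and properties.
    totals = (df.groupby(["MEASUREMENT", "DATATYPE"])["QUANTITY"]
                .sum()
                .sort_values(ascending=False))
    print(totals.head(20))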

    Reproduce

    To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.0 and execute:

    mvn clean package
    java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2019-12/files/html-rdfa.list November2019
    java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2019-12/files/html-embedded-jsonld.list November2019
    ./measure.sh November2019
    # Wait until the scan has completed. This will take a few days
    java -jar target/Scanner.jar --results ./November2019/measurements.csv.gz November2019
    
  5. oscar

    • huggingface.co
    Updated Sep 15, 2023
    Cite
    OSCAR (2023). oscar [Dataset]. https://huggingface.co/datasets/oscar-corpus/oscar
    Explore at:
    Dataset updated
    Sep 15, 2023
    Dataset authored and provided by
    OSCAR
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
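
    A minimal usage sketch with the Hugging Face datasets library; the configuration name below is just one example of the per-language configs, see the dataset page for the full list:

    from datasets import load_dataset

    # Stream one language configuration instead of downloading the full corpus.
    # "unshuffled_deduplicated_en" is an example config name; newer datasets
    # versions may additionally require trust_remote_code=True.
    ds = load_dataset("oscar-corpus/oscar", "unshuffled_deduplicated_en",
                      split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["text"][:200])
        if i == 2:
            break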

  6. web_graph

    • tensorflow.org
    Updated Nov 23, 2022
    Cite
    (2022). web_graph [Dataset]. http://identifiers.org/arxiv:2112.02194
    Explore at:
    Dataset updated
    Nov 23, 2022
    Description

    This dataset contains a sparse graph representing web link structure for a small subset of the Web.

    It's a processed version of a single crawl performed by CommonCrawl in 2021, where we strip everything and keep only the link->outlinks structure. The final dataset is basically in int -> List[int] format, with each integer id representing a url.

    Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in the sparsity pattern and locale. We took the following processing steps, in order:

    • We started with WAT files from June 2021 crawl.
    • Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
    • To study locale-specific graphs, we further filter based on 2 top-level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
    • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of K ∈ [10, 50] inlinks and outlinks. Note that we only do this processing once, thus this is still an approximation, i.e. the resulting graph might have nodes with fewer than K links.
    • Using both locale and count filters, we finalize 6 versions of the WebGraph dataset, summarized in the following table.
    Version     Top-level domain    Min count    Num nodes    Num edges
    sparse      -                   10           365.4M       30B
    dense       -                   50           136.5M       22B
    de-sparse   de                  10           19.7M        1.19B
    de-dense    de                  50           5.7M         0.82B
    in-sparse   in                  10           1.5M         0.14B
    in-dense    in                  50           0.5M         0.12B

    All versions of the dataset have the following features:

    • "row_tag": a unique identifier of the row (source link).
    • "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
    • "gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('web_graph', split='train')
    for ex in ds.take(4):
      print(ex)
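
      # Rough follow-up sketch (assumes the integer tensor layout implied by the
      # feature description above): turn this example into explicit
      # (source, destination) edge pairs.
      src = int(ex['row_tag'].numpy())
      edges = [(src, int(dst)) for dst in ex['col_tag'].numpy()]
      print(len(edges), 'outlinks for node', src)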
    

    See the guide for more information on tensorflow_datasets.

  7. Reddit Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2017
    Cite
    Reddit Dataset [Dataset]. https://paperswithcode.com/dataset/reddit
    Explore at:
    Dataset updated
    Jun 9, 2017
    Authors
    William L. Hamilton; Rex Ying; Jure Leskovec
    Description

    The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.

  8. MADLAD-400

    • huggingface.co
    • opendatalab.com
    Updated Oct 30, 2023
    + more versions
    Cite
    MADLAD-400 [Dataset]. https://huggingface.co/datasets/allenai/MADLAD-400
    Explore at:
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    ODC-BY: https://choosealicense.com/licenses/odc-by/

    Description

    MADLAD-400

      Dataset and Introduction
    

    MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

  9. Data from: CCMatrix Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 24, 2023
    Cite
    Holger Schwenk; Guillaume Wenzek; Sergey Edunov; Edouard Grave; Armand Joulin (2023). CCMatrix Dataset [Dataset]. https://paperswithcode.com/dataset/ccmatrix
    Explore at:
    Dataset updated
    May 24, 2023
    Authors
    Holger Schwenk; Guillaume Wenzek; Sergey Edunov; Edouard Grave; Armand Joulin
    Description

    CCMatrix uses ten snapshots of a curated common crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences.

  10. Web Data Commons (October 2016) Property and Datatype Usage Dataset

    • zenodo.org
    application/gzip
    Updated Aug 22, 2022
    + more versions
    Cite
    Jan Martin Keil (2022). Web Data Commons (October 2016) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6534413
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Aug 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Martin Keil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (October 2016) based on the Common Crawl October 2016 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset Properties

    • Size: 17.4 MiB compressed, 351.1 MiB uncompressed, 1 612 479 rows plus 1 header line (determined using gunzip -c measurements.csv.gz | wc -l)
    • Parsing Failures: The scanner failed to parse 28 326 152 triples (~0.69 %) of the source dataset (containing 4 097 655 302 triples).
    • Content:
      • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
      • FILE_URL: The URL of the Web Data Commons file that has been measured.
      • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        Note: The lexical representation of xsd:double values in embedded JSON-LD was normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
      • PROPERTY: The property that has been measured.
      • DATATYPE: The datatype that has been measured.
      • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/aggregateRating","http://www.w3.org/2001/XMLSchema#string","36"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://opengraphprotocol.org/schema/longitude","http://www.w3.org/2001/XMLSchema#string","1137"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://ogp.me/ns#title","http://www.w3.org/2001/XMLSchema#string","3"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://ogp.me/nslongitude","http://www.w3.org/2001/XMLSchema#string","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://ogp.me/ns#latitude","http://www.w3.org/2001/XMLSchema#string","884"
    […]
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/minPrice","http://www.w3.org/2001/XMLSchema#integer","12"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/highPrice","http://www.w3.org/2001/XMLSchema#integer","1"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/numberOfItems","http://www.w3.org/2001/XMLSchema#integer","44"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","139"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","76"
    

    Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.

    Reproduce

    To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.1 and execute:

    mvn clean package
    java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2016-10/files/rdfa.list October2016
    java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2016-10/files/html-embedded-jsonld.list October2016
    ./measure.sh October2016
    # Wait until the scan has completed. This will take a few days
    java -jar target/Scanner.jar --results ./October2016/measurements.csv.gz October2016
    
  11. Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 13, 2022
    Cite
    Anonymized (2022). Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News Media Headlines Using Automated Labelling with Transformer Language Models" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5144112
    Explore at:
    Dataset updated
    Sep 13, 2022
    Dataset authored and provided by
    Anonymized
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 news media outlets popular in the United States.

    The set of 47 news media outlets analysed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization 2019 Media Bias Chart v1.1. The human ratings of outlets’ ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.

    News articles headlines from the set of outlets analyzed in the manuscript are available in the outlets’ online domains and/or public cache repositories such as The Internet Wayback Machine, Google cache and Common Crawl. Articles headlines were located in articles’ HTML raw data using outlet-specific XPath expressions.

    The temporal coverage of headlines across news outlets is not uniform. For some media organizations, news article availability in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000s. Hence, our data set is sparser in headline sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S1 in the SI reports the number of headlines per outlet and per year in our analysis.

    In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the headline due to the heterogeneity of HTML elements and CSS styling combinations with which article text content is arranged in outlets' online domains. After manual testing, we determined that the percentage of headlines falling into this category is very small. Additionally, our method might miss some articles in the online domains of news outlets. In a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance, and one hundred percent accuracy at capturing headlines' content is elusive due to the small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our headline set is representative of print news media headlines for the studied time period and outlets analyzed.

    The compressed files in this data set are listed next:

    -analysisScripts.rar contains the analysis scripts used in the main manuscript, as well as aggregated automated sentiment and emotionality annotations of the headlines and the human annotations of a subset of headlines used as ground truth.

    -models.rar contains the Transformer sentiment and emotion annotation models used in the analysis (see the usage sketch after this list). Namely:

    Siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). See more information from the original authors at https://huggingface.co/siebert/sentiment-roberta-large-english

    DistilbertSST2.rar is the default sentiment classification model of the HuggingFace Transformers library (https://huggingface.co/). This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english.

    DistilRoberta j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. The model is a fine-tuned checkpoint of DistilRoBERTa-base that allows annotation of English text with Ekman's 6 basic emotions plus a neutral class. The model was trained on 6 diverse datasets; please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine-tuning.

    -headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar contains the URLs of the headlines analyzed and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model (https://huggingface.co/siebert/sentiment-roberta-large-english).

    -headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar contains the URLs of the headlines analyzed and the sentiment annotations of the default HuggingFace sentiment analysis model fine-tuned on the SST-2 dataset (https://huggingface.co/).

    -headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar contains the URLs of the headlines analyzed and the emotion category annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model (https://huggingface.co/j-hartmann/emotion-english-distilroberta-base).
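
    A minimal sketch of how such annotations can be produced with the Hugging Face transformers pipeline API, using the two models named above (the example headline is made up; top_k=None assumes a recent transformers version):

    from transformers import pipeline

    # Binary sentiment model used in the manuscript.
    sentiment = pipeline("sentiment-analysis",
                         model="siebert/sentiment-roberta-large-english")
    # Ekman's six emotions plus a neutral class.
    emotion = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base",
                       top_k=None)   # return scores for all classes

    headline = "Example headline about the economy"   # made-up input
    print(sentiment(headline))
    print(emotion(headline))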
