11 datasets found
  1. Data from: mC4 Dataset

    • paperswithcode.com
    • opendatalab.com
    • + 1 more
    Updated Jun 8, 2022
    Cite
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://paperswithcode.com/dataset/mc4
    Explore at:
    Dataset updated
    Jun 8, 2022
    Authors
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
    Description

    mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.

  2. WDC LSPM Dataset

    • paperswithcode.com
    Updated May 31, 2022
    Cite
    WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Explore at:
    Dataset updated
    May 31, 2022
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, for each training set there is a set of IDs for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision using shared product identifiers from the Web.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites.
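
    As a rough illustration of the stratified validation split mentioned above, a minimal sketch assuming the pairs are available as a JSON-lines table with pair_id and label columns (the file name and column names are assumptions, not taken from the official release notes):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Minimal sketch: draw a stratified validation split from one of the training sets.
    # File name and column names ("pair_id", "label") are assumptions.
    pairs = pd.read_json("computers_train_medium.json.gz", lines=True)
    train_ids, val_ids = train_test_split(
        pairs["pair_id"],
        test_size=0.2,
        stratify=pairs["label"],   # keep the match / no-match ratio equal in both splits
        random_state=0,
    )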

  3. German CBOW FastText embeddings with min count 250

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 26, 2021
    + more versions
    Cite
    Bocharov, Victor (2021). German CBOW FastText embeddings with min count 250 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5598143
    Explore at:
    Dataset updated
    Oct 26, 2021
    Dataset authored and provided by
    Bocharov, Victor
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    FastText embeddings built from the German Common Crawl dataset.

    Parameters

    • Dimensions: 256 and 384
    • Context window: 5
    • Negative sampled: 10
    • Epochs: 1
    • Number of buckets: 131072 or 262144
    • Min n: 3
    • Max n: 6
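
    For reference, a training run with these parameters could look roughly like the sketch below, using the fasttext Python bindings (the corpus file name is hypothetical, and preprocessing of the Common Crawl text is omitted):

    import fasttext

    # Minimal sketch: CBOW training with the parameters listed above.
    # The corpus path is hypothetical; dim and bucket are shown for one of the published variants.
    model = fasttext.train_unsupervised(
        "corpus_de.txt",   # hypothetical plain-text German Common Crawl dump
        model="cbow",
        dim=256,           # the other published variant uses 384
        ws=5,              # context window
        neg=10,            # negatives sampled
        epoch=1,
        bucket=131072,     # or 262144
        minn=3,
        maxn=6,
        minCount=250,      # "min count 250" from the dataset title
    )
    model.save_model("cc.de.256.bin")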
    
  4. Web Data Commons (November 2019) Property and Datatype Usage Dataset

    • zenodo.org
    application/gzip
    Updated May 10, 2022
    + more versions
    Cite
    Jan Martin Keil (2022). Web Data Commons (November 2019) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6359895
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Martin Keil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2019) based on the Common Crawl November 2019 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset Properties

    • Size: 37.0 MiB compressed, 785.7 MiB uncompressed, 3 542 700 rows plus 1 header line (determined using gunzip -c measurements.csv.gz | wc -l)
    • Parsing Failures: The scanner failed to parse 5 080 473 triples (~0.053 %) of the source dataset (containing 9 547 365 107 triples).
    • Content:
      • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
      • FILE_URL: The URL of the Web Data Commons file that has been measured.
      • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        Note: The lexical representation of xsd:double values in embedded JSON-LD was normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
      • PROPERTY: The property that has been measured.
      • DATATYPE: The datatype that has been measured.
      • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/goodrelations/v1#hasCurrencyValue","http://www.w3.org/2001/XMLSchema#floatfloat","10"
    …
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-embedded-jsonld.nq-01426.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","109"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-embedded-jsonld.nq-01426.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","60"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/dpef.html-embedded-jsonld.nq-01426.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","68"

    Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
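
    For a quick look at the measurements, the compressed CSV can be loaded directly; a minimal sketch using pandas (the aggregation shown is only an illustrative example, not part of the published dataset):

    import pandas as pd

    # Load the compressed measurement CSV (pandas infers gzip from the file suffix).
    df = pd.read_csv("measurements.csv.gz")

    # Example aggregation: total statement counts per measurement and datatype,
    # summed over all files and properties.
    totals = (df.groupby(["MEASUREMENT", "DATATYPE"])["QUANTITY"]
                .sum()
                .sort_values(ascending=False))
    print(totals.head(20))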

    Reproduce

    To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.0 and execute:

    mvn clean package
    java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2019-12/files/html-rdfa.list November2019
    java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2019-12/files/html-embedded-jsonld.list November2019
    ./measure.sh November2019
    # Wait until the scan has completed. This will take a few days
    java -jar target/Scanner.jar --results ./November2019/measurements.csv.gz November2019
    
  5. oscar

    • huggingface.co
    Updated Sep 15, 2023
    Cite
    OSCAR (2023). oscar [Dataset]. https://huggingface.co/datasets/oscar-corpus/oscar
    Explore at:
    Dataset updated
    Sep 15, 2023
    Dataset authored and provided by
    OSCAR
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
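
    A minimal usage sketch with the Hugging Face datasets library; the configuration name below is just one example of the per-language configs, see the dataset page for the full list:

    from datasets import load_dataset

    # Stream one language configuration instead of downloading the full corpus.
    # "unshuffled_deduplicated_en" is an example config name; newer datasets
    # versions may additionally require trust_remote_code=True.
    ds = load_dataset("oscar-corpus/oscar", "unshuffled_deduplicated_en",
                      split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["text"][:200])
        if i == 2:
            break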

  6. web_graph

    • tensorflow.org
    Updated Nov 23, 2022
    Cite
    (2022). web_graph [Dataset]. http://identifiers.org/arxiv:2112.02194
    Explore at:
    Dataset updated
    Nov 23, 2022
    Description

    This dataset contains a sparse graph representing web link structure for a small subset of the Web.

    It's a processed version of a single crawl performed by CommonCrawl in 2021, where we strip everything and keep only the link->outlinks structure. The final dataset is basically in int -> List[int] format, with each integer id representing a url.

    Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in the sparsity pattern and locale. We took the following processing steps, in order:

    • We started with WAT files from June 2021 crawl.
    • Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
    • To study locale-specific graphs, we further filter based on 2 top-level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
    • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of K ∈ [10, 50] inlinks and outlinks. Note that we only do this processing once, thus this is still an approximation, i.e. the resulting graph might have nodes with fewer than K links.
    • Using both locale and count filters, we finalize 6 versions of the WebGraph dataset, summarized in the following table.
    Version     Top-level domain    Min count    Num nodes    Num edges
    sparse      -                   10           365.4M       30B
    dense       -                   50           136.5M       22B
    de-sparse   de                  10           19.7M        1.19B
    de-dense    de                  50           5.7M         0.82B
    in-sparse   in                  10           1.5M         0.14B
    in-dense    in                  50           0.5M         0.12B

    All versions of the dataset have the following features:

    • "row_tag": a unique identifier of the row (source link).
    • "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
    • "gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('web_graph', split='train')
    for ex in ds.take(4):
      print(ex)
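
      # Rough follow-up sketch (assumes the integer tensor layout implied by the
      # feature description above): turn this example into explicit
      # (source, destination) edge pairs.
      src = int(ex['row_tag'].numpy())
      edges = [(src, int(dst)) for dst in ex['col_tag'].numpy()]
      print(len(edges), 'outlinks for node', src)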
    

    See the guide for more information on tensorflow_datasets.

  7. Reddit Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2017
    Cite
    Reddit Dataset [Dataset]. https://paperswithcode.com/dataset/reddit
    Explore at:
    Dataset updated
    Jun 9, 2017
    Authors
    William L. Hamilton; Rex Ying; Jure Leskovec
    Description

    The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.

  8. MADLAD-400

    • huggingface.co
    • opendatalab.com
    Updated Oct 30, 2023
    + more versions
    Cite
    MADLAD-400 [Dataset]. https://huggingface.co/datasets/allenai/MADLAD-400
    Explore at:
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    ODC-BY: https://choosealicense.com/licenses/odc-by/

    Description

    MADLAD-400

      Dataset and Introduction
    

    MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. This uses all snapshots of CommonCrawl available as of August 1, 2022. The primary advantage of this dataset over similar datasets is that it is more multilingual (419 languages), it is audited and more highly filtered, and it is document-level. The main… See the full description on the dataset page: https://huggingface.co/datasets/allenai/MADLAD-400.

  9. Data from: CCMatrix Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 24, 2023
    Cite
    Holger Schwenk; Guillaume Wenzek; Sergey Edunov; Edouard Grave; Armand Joulin (2023). CCMatrix Dataset [Dataset]. https://paperswithcode.com/dataset/ccmatrix
    Explore at:
    Dataset updated
    May 24, 2023
    Authors
    Holger Schwenk; Guillaume Wenzek; Sergey Edunov; Edouard Grave; Armand Joulin
    Description

    CCMatrix uses ten snapshots of a curated common crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences.

  10. Web Data Commons (October 2016) Property and Datatype Usage Dataset

    • zenodo.org
    application/gzip
    Updated Aug 22, 2022
    + more versions
    Cite
    Jan Martin Keil (2022). Web Data Commons (October 2016) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6534413
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Aug 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Martin Keil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (October 2016) based on the Common Crawl October 2016 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset Properties

    • Size: 17.4 MiB compressed, 351.1 MiB uncompressed, 1 612 479 rows plus 1 header line (determined using gunzip -c measurements.csv.gz | wc -l)
    • Parsing Failures: The scanner failed to parse 28 326 152 triples (~0.69 %) of the source dataset (containing 4 097 655 302 triples).
    • Content:
      • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
      • FILE_URL: The URL of the Web Data Commons file that has been measured.
      • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        Note: The lexical representation of xsd:double values in embedded JSON-LD was normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
      • PROPERTY: The property that has been measured.
      • DATATYPE: The datatype that has been measured.
      • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/aggregateRating","http://www.w3.org/2001/XMLSchema#string","36"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://opengraphprotocol.org/schema/longitude","http://www.w3.org/2001/XMLSchema#string","1137"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://ogp.me/ns#title","http://www.w3.org/2001/XMLSchema#string","3"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://ogp.me/nslongitude","http://www.w3.org/2001/XMLSchema#string","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://ogp.me/ns#latitude","http://www.w3.org/2001/XMLSchema#string","884"
    […]
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/minPrice","http://www.w3.org/2001/XMLSchema#integer","12"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/highPrice","http://www.w3.org/2001/XMLSchema#integer","1"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/numberOfItems","http://www.w3.org/2001/XMLSchema#integer","44"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","139"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2016-10/quads/dpef.html-embedded-jsonld.nq-00294.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","76"
    

    Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.

    Reproduce

    To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.1 and execute:

    mvn clean package
    java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2016-10/files/rdfa.list October2016
    java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2016-10/files/html-embedded-jsonld.list October2016
    ./measure.sh October2016
    # Wait until the scan has completed. This will take a few days
    java -jar target/Scanner.jar --results ./October2016/measurements.csv.gz October2016
    
  11. Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 13, 2022
    Cite
    Anonymized (2022). Data for manuscript: "Longitudinal Analysis of Sentiment and Emotion in News Media Headlines Using Automated Labelling with Transformer Language Models" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5144112
    Explore at:
    Dataset updated
    Sep 13, 2022
    Dataset authored and provided by
    Anonymized
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 news media outlets popular in the United States.

    The set of 47 news media outlets analysed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization 2019 Media Bias Chart v1.1. The human ratings of outlets’ ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.

    News articles headlines from the set of outlets analyzed in the manuscript are available in the outlets’ online domains and/or public cache repositories such as The Internet Wayback Machine, Google cache and Common Crawl. Articles headlines were located in articles’ HTML raw data using outlet-specific XPath expressions.

    The temporal coverage of headlines across news outlets is not uniform. For some media organizations, news article availability in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000s. Hence, our data set is sparser in headline sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S1 in the SI reports the number of headlines per outlet and per year in our analysis.

    In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the headline due to the heterogeneity of HTML elements and CSS styling combinations with which article text content is arranged in outlets' online domains. After manual testing, we determined that the percentage of headlines falling into this category is very small. Additionally, our method might miss some articles in the online domains of news outlets. In a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance, and one hundred percent accuracy at capturing headlines' content is elusive due to the small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our headline set is representative of print news media headlines for the studied time period and outlets analyzed.

    The compressed files in this data set are listed next:

    -analysisScripts.rar contains the analysis scripts used in the main manuscript, as well as aggregated automated sentiment and emotionality annotations of the headlines and the human annotations of a subset of headlines used as ground truth.

    -models.rar contains the Transformer sentiment and emotion annotation models used in the analysis (see the usage sketch after this list). Namely:

    Siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). See more information from the original authors at https://huggingface.co/siebert/sentiment-roberta-large-english

    DistilbertSST2.rar is the default sentiment classification model of the HuggingFace Transformers library (https://huggingface.co/). This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english.

    DistilRoberta j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. The model is a fine-tuned checkpoint of DistilRoBERTa-base that allows annotation of English text with Ekman's 6 basic emotions plus a neutral class. The model was trained on 6 diverse datasets; please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine-tuning.

    -headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar contains the URLs of the headlines analyzed and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model (https://huggingface.co/siebert/sentiment-roberta-large-english).

    -headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar contains the URLs of the headlines analyzed and the sentiment annotations of the default HuggingFace sentiment analysis model fine-tuned on the SST-2 dataset (https://huggingface.co/).

    -headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar contains the URLs of the headlines analyzed and the emotion category annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model (https://huggingface.co/j-hartmann/emotion-english-distilroberta-base).
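
    A minimal sketch of how such annotations can be produced with the Hugging Face transformers pipeline API, using the two models named above (the example headline is made up; top_k=None assumes a recent transformers version):

    from transformers import pipeline

    # Binary sentiment model used in the manuscript.
    sentiment = pipeline("sentiment-analysis",
                         model="siebert/sentiment-roberta-large-english")
    # Ekman's six emotions plus a neutral class.
    emotion = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base",
                       top_k=None)   # return scores for all classes

    headline = "Example headline about the economy"   # made-up input
    print(sentiment(headline))
    print(emotion(headline))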
