Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein (2020). Webis Gmane Email Corpus 2019 [Dataset].

Webis Gmane Email Corpus 2019

4 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated: Jun 3, 2020
Dataset provided by: Bauhaus-Universität Weimar; Leipzig University
Authors: Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein

The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails, crawled between February and May 2019 from Gmane and covering more than 20 years of public mailing lists. The dataset was published as a resource at ACL 2020.

The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:

{"index": {"_id": "

The first line is the Elasticsearch index action with a document UUID, the second one the actual parsed email with a (reduced and anonymized) set of headers, the detected language, the original Gmane group name and the predicted content segments as character spans. The Gzip files are splittable every 1,000 records (line pairs) for parallel processing in, e.g., Hadoop.
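
The two-line record layout described above can be read with a short loop. The following is a minimal sketch rather than an official loader: the function name `iter_records` is hypothetical, and it assumes only what the description states, namely that every record is an action line carrying `index` and `_id`, followed by one document line.

```python
import gzip
import json
from typing import Iterator, Tuple


def iter_records(path: str) -> Iterator[Tuple[str, dict]]:
    """Yield (document_id, email_document) pairs from one Gzip-compressed file.

    Hypothetical helper: assumes strictly alternating action and document lines.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Ignore empty lines so a trailing newline does not break the pairing.
        lines = (line for line in handle if line.strip())
        for action_line in lines:
            document_line = next(lines, None)
            if document_line is None:
                raise ValueError("action line without a following document line")
            action = json.loads(action_line)
            # The Elasticsearch index action carries the document UUID.
            yield action["index"]["_id"], json.loads(document_line)
```

Because the function is a generator, a file can be streamed record by record without loading it into memory, which matters at the scale of this corpus.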

Available email headers are:

  • message_id
  • date (yyyy-MM-dd HH:mm:ssZZ)
  • subject
  • from
  • to
  • cc
  • in_reply_to
  • references
  • list_id
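
The date pattern listed above can be parsed with the Python standard library. This sketch assumes the pattern follows Joda-Time conventions, in which `ZZ` denotes a UTC offset written with a colon (for example `+01:00`); that reading is an assumption, and the sample value is made up for illustration. The `%z` directive accepts offsets with or without a colon in Python 3.7 and later.

```python
from datetime import datetime

# Assumed Python equivalent of the documented pattern yyyy-MM-dd HH:mm:ssZZ.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S%z"


def parse_email_date(value: str) -> datetime:
    """Parse a date header value into a timezone-aware datetime."""
    return datetime.strptime(value, DATE_FORMAT)


# Made-up sample value, not taken from the corpus.
example = parse_email_date("2019-03-01 12:34:56+01:00")
```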

Available segment classes are:

  • paragraph
  • closing
  • inline_headers
  • log_data
  • mua_signature
  • patch
  • personal_signature
  • quotation
  • quotation_marker
  • raw_code
  • salutation
  • section_heading
  • tabular
  • technical
  • visual_separator
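
Since segments are stored as character spans, the text of each segment can be recovered by slicing the email body. The sketch below groups segment texts by class. The key names `label`, `begin`, and `end`, the end-exclusive span convention, and the sample email are all assumptions made for illustration; the actual field names should be checked against the dataset.

```python
from collections import defaultdict
from typing import Dict, List


def group_segments(text: str, segments: List[dict]) -> Dict[str, List[str]]:
    """Group segment texts by segment class.

    Assumes each span is a dict with hypothetical keys "label", "begin",
    and "end", where "end" is exclusive.
    """
    grouped = defaultdict(list)
    for segment in segments:
        grouped[segment["label"]].append(text[segment["begin"]:segment["end"]])
    return dict(grouped)


# Made-up email body and spans, not taken from the corpus.
body = "Hi all,\nthe patch works.\nCheers"
spans = [
    {"label": "salutation", "begin": 0, "end": 7},
    {"label": "paragraph", "begin": 8, "end": 24},
    {"label": "closing", "begin": 25, "end": 31},
]
by_class = group_segments(body, spans)
```

Filtering on classes such as quotation or mua_signature in the same way would leave only the newly written content of each message.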

Find more information about the dataset and the segmentation model in the publication cited below.

If you are using this resource in your work, please cite it as:

 author =       {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
 booktitle =      {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
 month =        jul,
 publisher =      {Association for Computational Linguistics},
 site =        {Seattle, USA},
 title =        {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
 year =        2020
