Cite
Stamatatos, Efstathios; Daelemans, Walter; Verhoeven, Ben; Juola, Patrick; López-López, Aurelio; Potthast, Martin; Stein, Benno (2015). PAN15 Author Identification: Verification [Dataset]. http://doi.org/10.5281/zenodo.3737563
PAN15 Author Identification: Verification

Dataset updated Sep 8, 2015
Dataset provided by
Leipzig University (http://www.uni-leipzig.de/)
Bauhaus-Universität Weimar (http://www.uni-weimar.de/)
Authors
Stamatatos, Efstathios; Daelemans, Walter; Verhoeven, Ben; Juola, Patrick; López-López, Aurelio; Potthast, Martin; Stein, Benno
Description

We provide you with a training corpus that comprises a set of author verification problems in several languages/genres. Each problem consists of some (up to five) known documents by a single person and exactly one questioned document. All documents within a single problem instance will be in the same language. However, their genre and/or topic may differ significantly. The document lengths vary from a few hundred to a few thousand words.

The documents of each problem are located in a separate folder, the name of which (problem ID) encodes the language of the documents. The following list shows the available sub-corpora, including their language, type (cross-genre or cross-topic), code, and examples of problem IDs:

Language; Type; Code; Problem IDs
Dutch; Cross-genre; DU; DU001, DU002, DU003, etc.
English; Cross-topic; EN; EN001, EN002, EN003, etc.
Greek; Cross-topic; GR; GR001, GR002, GR003, etc.
Spanish; Cross-genre; SP; SP001, SP002, SP003, etc.
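
Because the problem ID encodes the sub-corpus, the language and type of a problem can be recovered from the folder name alone. The following is a minimal Python sketch under stated assumptions: the function name describe_problem is hypothetical, and the pattern assumes every ID is a two-letter code followed by digits, as in the examples listed above.

```python
import re

# Sub-corpus codes exactly as listed in the table above.
SUBCORPORA = {
    "DU": ("Dutch", "cross-genre"),
    "EN": ("English", "cross-topic"),
    "GR": ("Greek", "cross-topic"),
    "SP": ("Spanish", "cross-genre"),
}

# Assumption: a problem ID is a two-letter code followed by digits (e.g. "EN001").
PROBLEM_ID_PATTERN = re.compile(r"^([A-Z]{2})(\d+)$")


def describe_problem(problem_id):
    """Return the language, type, and running number encoded in a problem ID."""
    match = PROBLEM_ID_PATTERN.match(problem_id)
    if match is None:
        raise ValueError(f"Unexpected problem ID format: {problem_id!r}")
    code, number = match.groups()
    if code not in SUBCORPORA:
        raise ValueError(f"Unknown sub-corpus code: {code!r}")
    language, corpus_type = SUBCORPORA[code]
    return {"language": language, "type": corpus_type, "number": int(number)}
```

For instance, describe_problem("GR002") would report a Greek, cross-topic problem with running number 2, while an ID with an unlisted code raises a ValueError instead of being silently accepted.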

The ground truth for the training corpus is in the file truth.txt, which contains one line per problem with the problem ID and the correct binary answer (Y means the known documents and the questioned document are by the same author; N means they are not). For example:

EN001 N
EN002 Y
EN003 N
...
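
A ground-truth file in this format can be read into a mapping from problem ID to a boolean label. The sketch below is illustrative only: the function names parse_truth and load_truth are hypothetical, the fields are assumed to be separated by whitespace as in the example above, and the UTF-8 encoding (with an optional byte-order mark) is an assumption that the description does not confirm.

```python
def parse_truth(lines):
    """Map each problem ID to True (same author, "Y") or False (different, "N")."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 2 or parts[1] not in ("Y", "N"):
            raise ValueError(f"Unexpected ground-truth line: {line!r}")
        problem_id, answer = parts
        labels[problem_id] = answer == "Y"
    return labels


def load_truth(path, encoding="utf-8-sig"):
    """Read a truth.txt file; the encoding is an assumption, adjust if needed."""
    with open(path, encoding=encoding) as handle:
        return parse_truth(handle)
```

Applied to the three example lines above, parse_truth yields False for EN001 and EN003 and True for EN002. Rejecting malformed lines outright, rather than skipping them, is a deliberate choice here so that a wrong encoding or delimiter surfaces early.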