7 datasets found

PAN Plagiarism Corpus 2011 (PAN-PC-11)
zenodo.org
live.european-language-grid.eu
bin
Updated Jun 11, 2022
Cite
Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso (2022). PAN Plagiarism Corpus 2011 (PAN-PC-11) [Dataset]. http://doi.org/10.5281/zenodo.3250095
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.3250095
Dataset updated
Jun 11, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The PAN plagiarism corpus 2011 (PAN-PC-11) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

The PAN-PC-11 contains documents in which plagiarism has been inserted automatically as well as documents in which plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
PAN Plagiarism Corpus 2010 (PAN-PC-10)
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Potthast, Martin; Stein, Benno; Eiselt, Andreas; Barrón-Cedeño, Alberto; Rosso, Paolo (2020). PAN Plagiarism Corpus 2010 (PAN-PC-10) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3250122
Dataset updated
Jan 24, 2020
Dataset provided by
Universitat Politècnica de València
Bauhaus-Universität Weimar
Authors
Potthast, Martin; Stein, Benno; Eiselt, Andreas; Barrón-Cedeño, Alberto; Rosso, Paolo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

The PAN-PC-10 contains documents in which artificial plagiarism has been inserted automatically as well as documents in which simulated plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
Webis Plagiarism Corpus 2008 (Webis-PC-08)
data.niaid.nih.gov
Updated Jun 11, 2022
Cite
Meyer zu Eißen, Sven; Stein, Benno (2022). Webis Plagiarism Corpus 2008 (Webis-PC-08) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3254617
Dataset updated
Jun 11, 2022
Dataset provided by
Bauhaus-Universität Weimar
Authors
Meyer zu Eißen, Sven; Stein, Benno
Description
This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

The Webis plagiarism corpus 2008 (Webis-PC-08) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge. However, since the documents in the corpus are not free of copyright, we need assurance that you have legal access to the ACM Digital Library.
Data from: Detecting Cross-Language Plagiarism using Open Knowledge Graphs
live.european-language-grid.eu
data.niaid.nih.gov
txt
Updated Apr 12, 2024
Cite
(2024). Detecting Cross-Language Plagiarism using Open Knowledge Graphs [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18288
Available download formats: txt
Dataset updated
Apr 12, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Details
ASPEC. The Asian Scientific Paper Excerpt Corpus comprises excerpts of scientific papers in Japanese that have been manually translated to English and Chinese. We use both subsets of the ASPEC corpus.
ASPEC-JC contains abstracts and paragraphs from the main text of research papers that were translated manually from Japanese to Chinese.
ASPEC-JE contains abstracts of approx. two million research papers that were translated manually from Japanese to English.
JRC-Acquis. The corpus consists of legislative texts in 22 languages, which the European Union's Joint Research Centre (JRC) selected from the cumulative body of EU laws (the so-called Acquis communautaire). We sampled our test cases from the 10,000 document pairs in the English-French subset of the corpus.
Europarl. The corpus contains transcripts of European Parliament proceedings in 21 European languages. We exclusively sampled test cases from the 9,443 document pairs in the English-French subset of the corpus.
PAN-PC-11. The corpus contains instances of simulated monolingual and cross-language plagiarism that were used for evaluating plagiarism detection methods as part of the workshop series Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN). Most of the 26,939 documents in the corpus were created by extracting text from openly available books. The documents are partially interspersed with instances of simulated plagiarism that were created and obfuscated automatically or by crowdsourced workers. We exclusively sampled test cases from the 2,921 Spanish-English aligned document pairs in the corpus, for which simulated plagiarism instances were either machine-generated or created manually by crowdsourced workers.
==========================================================================
File Structure
[corpus_documents] folder: Corpora of translation-aligned documents used in our experiments composed of:
aspec: Japanese and English
aspecx: Japanese and Chinese
jrc: English and French
europarl: English and French
pan: English and Spanish
Each sub-corpus consists of 4,000 translation-aligned files (2,000 per language); the entire corpus thus has 20,000 files.
Each set of translation-aligned documents was randomly selected from the original datasets (details in the paper).
The Japanese files in aspec and aspecx do not necessarily overlap even though they are from the same dataset.
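
The file counts stated above can be checked with a small calculation. The sketch below only reproduces the arithmetic from the description; the folder names and language pairs are taken from the list above, while the constant and function names are illustrative assumptions and not part of the dataset.

```python
# Sub-corpus folder names and language pairs, as listed in the description.
SUB_CORPORA = {
    "aspec": ("Japanese", "English"),
    "aspecx": ("Japanese", "Chinese"),
    "jrc": ("English", "French"),
    "europarl": ("English", "French"),
    "pan": ("English", "Spanish"),
}

# The description states 2,000 files per language in each sub-corpus.
FILES_PER_LANGUAGE = 2000


def expected_file_counts(sub_corpora, files_per_language):
    """Return the expected number of files for each sub-corpus."""
    return {
        name: files_per_language * len(languages)
        for name, languages in sub_corpora.items()
    }


counts = expected_file_counts(SUB_CORPORA, FILES_PER_LANGUAGE)
total = sum(counts.values())
print(counts["jrc"])  # 4000
print(total)          # 20000
```

Comparing these numbers against the files actually present after download is one way to detect an incomplete extraction.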

[corpus_paragraphs] folder: 2,000 translation-aligned paragraphs randomly selected from:
jrc: English and French
europarl: English and French
pan: English and Spanish

[vectors_documents] folder: Average vector representation of the documents in the datasets from two pre-trained models:
Universal Sentence Encoder - Multilingual (USE-ML)
ConceptNet Numberbatch
Two granularities are provided:
vector_paragraphs
vector_documents
The structure for each level of granularity follows the same pattern as their respective corpus.
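
As a rough illustration of what an "average vector representation" means, the sketch below computes the element-wise mean of a list of equal-length vectors. It assumes that each document or paragraph is represented by several embedding vectors (for example, one per word or sentence); the description does not say which units were averaged, and the function name is hypothetical.

```python
def average_vector(vectors):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Toy two-dimensional embeddings; real models use hundreds of dimensions.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0]]
print(average_vector(toy_embeddings))  # [2.0, 3.0]
```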

Naming convention:
Example: cn_jrc_es:
model: ConceptNet Numberbatch
corpus: JRC-Acquis
language: Spanish
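
The naming convention can be decoded mechanically. The following is a minimal sketch, assuming names always have the form model_corpus_language. Only the codes cn, jrc, and es are confirmed by the example above; the remaining corpus codes are inferred from the folder names, the other language codes are assumed to be ISO 639-1, and the function name is hypothetical. Unknown codes (such as the prefix used for USE-ML, which the description does not give) are returned unchanged.

```python
# Confirmed by the example in the description.
MODEL_CODES = {"cn": "ConceptNet Numberbatch"}

# "jrc" is confirmed; the other entries are inferred from the folder names.
CORPUS_CODES = {
    "jrc": "JRC-Acquis",
    "europarl": "Europarl",
    "pan": "PAN-PC-11",
    "aspec": "ASPEC-JE",
    "aspecx": "ASPEC-JC",
}

# "es" is confirmed; the other codes are an assumption (ISO 639-1).
LANGUAGE_CODES = {
    "es": "Spanish",
    "en": "English",
    "fr": "French",
    "ja": "Japanese",
    "zh": "Chinese",
}


def parse_vector_name(name):
    """Split a name such as 'cn_jrc_es' into model, corpus, and language."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected model_corpus_language, got {name!r}")
    model, corpus, language = parts
    return {
        "model": MODEL_CODES.get(model, model),
        "corpus": CORPUS_CODES.get(corpus, corpus),
        "language": LANGUAGE_CODES.get(language, language),
    }


print(parse_vector_name("cn_jrc_es"))
# {'model': 'ConceptNet Numberbatch', 'corpus': 'JRC-Acquis', 'language': 'Spanish'}
```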
Labels:
PAN-PC-09
webis.de
Updated 2009
Cite
Martin Potthast; Benno Stein; Andreas Eiselt (2009). PAN-PC-09 [Dataset]. http://doi.org/10.5281/zenodo.3250083
Unique identifier
https://doi.org/10.5281/zenodo.3250083
Dataset updated
2009
Dataset provided by
University of Kassel, hessian.AI, and ScaDS.AI
Universitat Politècnica de València
Bauhaus-Universität Weimar
The Web Technology & Information Systems Network
Authors
Martin Potthast; Benno Stein; Andreas Eiselt
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This corpus is outdated. Please use its successor PAN-PC-11.
PAN-PC-10
webis.de
Updated 2010
Cite
Martin Potthast; Benno Stein; Andreas Eiselt (2010). PAN-PC-10 [Dataset]. http://doi.org/10.5281/zenodo.3250123
Unique identifier
https://doi.org/10.5281/zenodo.3250123
Dataset updated
2010
Dataset provided by
University of Kassel, hessian.AI, and ScaDS.AI
Universitat Politècnica de València
Bauhaus-Universität Weimar
The Web Technology & Information Systems Network
Authors
Martin Potthast; Benno Stein; Andreas Eiselt
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This corpus is outdated. Please use its successor PAN-PC-11.
Webis-PC-08
webis.de
Updated 2008
Cite
Benno Stein; Sven Meyer zu Eissen (2008). Webis-PC-08 [Dataset]. http://doi.org/10.5281/zenodo.3254618
Unique identifier
https://doi.org/10.5281/zenodo.3254618
Dataset updated
2008
Dataset provided by
Bauhaus-Universität Weimar
The Web Technology & Information Systems Network
Authors
Benno Stein; Sven Meyer zu Eissen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This corpus is outdated. Please use its successor PAN-PC-11.

9 scholarly articles cite the PAN-PC-11 dataset (per Google Scholar).
