Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
Dataset Card for swan07/authorship-verification
Dataset for authorship verification, comprising 12 cleaned, modified, open-source authorship verification and attribution datasets.
Dataset Details
Code for cleaning and modifying the datasets can be found at https://github.com/swan-07/authorship-verification/blob/main/Authorship_Verification_Datasets.ipynb and is detailed in the paper. Datasets used to produce the final dataset are:
Reuters50
@misc{misc_reuter_50_50_217, author = {Liu… See the full description on the dataset page: https://huggingface.co/datasets/swan07/authorship-verification.
Task
Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles.
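As a rough illustration of this task (a minimal sketch only, not the approach of any particular PAN system; the character n-gram features and the 0.5 decision threshold are assumptions), one could compare the two texts' character n-gram profiles and threshold their cosine similarity:

```python
# Minimal same-author verification sketch: character n-gram TF-IDF + cosine similarity.
# Illustrative only; the features and the 0.5 decision threshold are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def same_author_score(text_a: str, text_b: str) -> float:
    """Return a style-similarity score in [0, 1] for the two texts."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
    vectors = vectorizer.fit_transform([text_a, text_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

score = same_author_score("I reckon we'll head out at dawn.",
                          "We will depart at first light, I believe.")
print("same author" if score >= 0.5 else "different authors", round(score, 3))
```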
Over the coming three years, at PAN 2020 to PAN 2022, we will develop a new experimental setup that addresses three key questions in authorship verification that have not been studied at scale to date:
Year 1 (PAN 2020): Closed-set verification.
Given a large training dataset comprising known authors who have written about a given set of topics, the test dataset contains verification cases from a subset of the authors and topics found in the training data.
Year 2 (PAN 2021): Open-set verification.
Given the training dataset of Year 1, the test dataset contains verification cases from previously unseen authors and topics.
Year 3 (PAN 2022): Surprise task.
The task of the last year of this evaluation cycle (to be announced at a later time) will be designed with an eye on realism and practical application.
This evaluation cycle on authorship verification provides for a renewed challenge of increasing difficulty within a large-scale evaluation. We invite you to plan ahead and participate in all three of these tasks.
More information at: PAN @ CLEF 2020 - Authorship Verification
Citing the Dataset
If you use this dataset for your research, please be sure to cite the following paper:
Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, and Martin Potthast. The Importance of Suppressing Domain Style in Authorship Analysis. CoRR, abs/2005.14714, May 2020.
Bibtex:
@Article{stein:2020k, author = {Sebastian Bischoff and Niklas Deckers and Marcel Schliebs and Ben Thies and Matthias Hagen and Efstathios Stamatatos and Benno Stein and Martin Potthast}, journal = {CoRR}, month = may, title = {{The Importance of Suppressing Domain Style in Authorship Analysis}}, url = {https://arxiv.org/abs/2005.14714}, volume = {abs/2005.14714}, year = 2020 }
Download
Access to our corpus can be requested via the Aston Institute for Forensic Linguistics Databank: https://fold.aston.ac.uk/handle/123456789/17
Task
Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles. In previous editions of PAN, we explored the effectiveness of authorship verification technology in several languages and text genres. In the two most recent editions, cross-domain authorship verification using fanfiction texts was examined. Despite certain differences between fandoms, the task of cross-fandom authorship verification has proved to be relatively feasible. In the current edition, we focus on more challenging scenarios where each authorship verification case considers two texts that belong to different discourse types (DTs), i.e., cross-DT authorship verification. This will allow us to study the ability of stylometric approaches to capture authorial characteristics that remain stable across DTs even when very different forms of expression are imposed by the DT norms.
Based on a new corpus in English, we provide cross-DT authorship verification cases using the following DTs:
Essays
Emails
Text messages
Business memos
The corpus comprises texts from around 100 individuals. All individuals are of a similar age (18-22) and are native English speakers. The topic of the text samples is not restricted, while the level of formality can vary within a certain DT (e.g., text messages may be addressed to family members or non-familial acquaintances).
More information at: Authorship Verification 2022
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset", transformed to match the standardized format of the "PAN Authorship Identification corpora" (http://pan.webis.de). The corpus consists of 80 authorship verification cases, evenly distributed between true and false authorships. Each authorship verification case comprises exactly 5 documents (plain text files): 4 documents are samples from the known (true) author, while the remaining document is the text of the unknown author (the subject of verification). The corpus is balanced not only in the number of known documents per case but also in the length of the texts, which is near-equal (3-4 kilobytes per text). It can be assumed that each document is aggregated from (short) mails by the same author, so that it has sufficient length to capture the author's writing style. All texts in the corpus have undergone the same preprocessing procedure: de-duplication, removal of URLs and newlines/tabs, normalization of UTF-8 symbols, and substitution of multiple successive blanks with a single blank. All e-mail headers and other metadata (including signatures) have been removed from each document, so that it contains only pure natural-language text from a single author. The intention behind this corpus is to give other researchers in the field of authorship verification the opportunity to compare their results with each other.
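The preprocessing steps described above can be approximated with a short script. The following is only a sketch under stated assumptions (the regular expressions and function names are illustrative, not the exact procedure used to build the corpus):

```python
import re
import unicodedata

def clean_document(text: str) -> str:
    """Approximate the corpus preprocessing: normalize UTF-8 symbols, remove URLs,
    replace newlines/tabs with blanks, and collapse repeated blanks."""
    text = unicodedata.normalize("NFKC", text)            # normalize UTF-8 symbols
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"[\r\n\t]+", " ", text)                # remove newlines and tabs
    text = re.sub(r" {2,}", " ", text)                    # collapse multiple blanks
    return text.strip()

def deduplicate(documents):
    """Keep only the first occurrence of each identical document (de-duplication)."""
    seen, unique = set(), []
    for doc in documents:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique
```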
VeriDark (Authorship Verification in the DarkNet) is a benchmark for evaluating authorship analysis methods in a cybersecurity context, by introducing datasets gathered from the DarkNet marketplace forums or from Darknet-related discussions on Reddit. This benchmark contains three datasets for authorship verification and one dataset for authorship identification.
https://www.isc.org/downloads/software-support-policy/isc-license/
The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
1. Preprocessing
We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria (a filtering sketch follows the list):
- cumulatively at least 10,000 characters,
- cumulatively at most 49,410 characters,
- cumulatively at least 16 posts,
- cumulatively at most 40 posts, and
- each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).
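A minimal sketch of this selection step, assuming a dataframe with 'id' (author) and 'text' columns and a set of Koppel512 function words (the column names and helper below are illustrative assumptions, not the notebook's code):

```python
import pandas as pd

def count_function_words(text: str, function_words: set) -> int:
    """Count tokens from the function-word list in a lowercased, whitespace-split text."""
    return sum(1 for token in text.lower().split() if token in function_words)

def select_blog1k_authors(df: pd.DataFrame, koppel512: set) -> pd.DataFrame:
    """Apply the per-text and cumulative per-author criteria listed above."""
    df = df[df["text"].str.len() >= 1000]                                    # drop short texts
    df = df[df["text"].apply(count_function_words, function_words=koppel512) >= 50]
    stats = df.groupby("id").agg(chars=("text", lambda s: s.str.len().sum()),
                                 posts=("text", "size"))
    keep = stats[stats.chars.between(10_000, 49_410) &
                 stats.posts.between(16, 40)].index
    return df[df["id"].isin(keep)]
```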
Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.
2. Statistics
Its creation and statistics can be found in the Jupyter Notebook.
| Split | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.) |
|---|---|---|---|---|---|
| Train | 1,000 | 16,132 | 30,092,057 | 30,092 (5,884) | 1,865 (1,007) |
| Validation | 935 | 2,017 | 3,755,362 | 4,016 (2,269) | 1,862 (999) |
| Test | 924 | 2,017 | 3,732,448 | 4,039 (2,188) | 1,850 (936) |
3. Usage
import pandas as pd

# read the full corpus
df = pd.read_csv('blog1000.csv.gz', compression='infer')

# extract the training split as parallel (text, author id) sequences
train_text, train_label = zip(*df.loc[df.split == 'train'][['text', 'id']].itertuples(index=False))
4. License
All the materials are licensed under the ISC License.
5. Contact
Please contact its maintainer for questions.
We provide you with a training corpus that comprises several different common attribution and verification scenarios. There are five training collections consisting of real-world texts (for authorship attribution), and three collections each consisting of texts by a single author (for authorship verification).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The linked repository contains a README that gives an overview of the files and the structure of the data.
Additionally, for LLAMA and GPT2, the files follow the human_{llm_name}{i}.jsonl naming scheme, where {llm_name} is the name of the LLM and {i} is the partition index; the partitions can be concatenated to form the full dataset for that LLM (a concatenation sketch follows).
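Because the partitions are plain JSON Lines files, they can be concatenated directly. A minimal sketch (the glob pattern mirrors the naming scheme described above; the function name is an illustrative assumption):

```python
import glob
import json

def load_llm_partitions(llm_name: str):
    """Read and concatenate all human_{llm_name}{i}.jsonl partitions into one list of records."""
    records = []
    for path in sorted(glob.glob(f"human_{llm_name}*.jsonl")):
        with open(path, encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records

# e.g. gpt2_rows = load_llm_partitions("gpt2")
```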
https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/NZ7VLC
Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators. The dataset contains text sequences as inputs and labels as the arbitrary vendor IDs obtained by linking the phone numbers mentioned in Backpage escort market advertisements. To protect privacy, all personal information, except for the pseudonyms used by the escorts and the post locations, has been redacted so that it cannot be retrieved. For more details, kindly refer to our research attached to the submission. It is important to emphasize that this dataset should only be used for its intended purpose, research on authorship attribution of vendors on escort markets, and not other commercial/non-commercial purposes.
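To make the open-set ranking setup concrete, the sketch below computes r-precision from cosine similarities between style embeddings; it is a generic illustration of the metric, not the authors' released evaluation code, and the variable names are assumptions:

```python
import numpy as np

def r_precision(query_vec, gallery_vecs, gallery_labels, query_label):
    """R-precision for one query ad: with R = number of gallery ads sharing the
    query's vendor label, return the fraction of the R most similar gallery ads
    that actually share that label."""
    sims = gallery_vecs @ query_vec / (
        np.linalg.norm(gallery_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    relevant = int((gallery_labels == query_label).sum())
    if relevant == 0:
        return 0.0
    top_r = np.argsort(-sims)[:relevant]
    return float((gallery_labels[top_r] == query_label).mean())

# The mean r-precision is then the average of this value over all query ads.
```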
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We make available MedLatin1 and MedLatin2, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatin1 and MedLatin2 consist of 294 and 30 curated texts, respectively, labelled by author, with MedLatin1 texts being of an epistolary nature and MedLatin2 texts consisting of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification.
We provide you with a training data set that consists of documents written in both English and Spanish. With regard to age, we will consider posts in three classes: 10s (13-17), 20s (23-27), and 30s (33-47). Moreover, documents from authors who pretend to be minors will be included (e.g., documents composed of chat lines from sexual predators will also be considered).
This is the dataset for the shared task on Voight-Kampff Generative AI Authorship Verification PAN@CLEF2024. Please consult the task's page for further details on the format, the dataset's creation, and links to baselines and utility code.
With Large Language Models (LLMs) improving at breakneck speed and seeing more widespread adoption every day, it is getting increasingly hard to discern whether a given text was authored by a human being or a machine. Many classification approaches have been devised to help humans distinguish between human- and machine-authored text, though often without questioning the fundamental and inherent feasibility of the task itself.
With years of experience in a related but much broader field—authorship verification—we set out to answer whether this task can be solved. We start with the simplest arrangement of a suitable task setup: given two texts, one authored by a human and one by a machine, pick out the human-authored one.
The Generative AI Authorship Verification Task @ PAN is organized in collaboration with the Voight-Kampff Task @ ELOQUENT Lab in a builder-breaker style. PAN participants will build systems to tell human and machine apart, while ELOQUENT participants will investigate novel text generation and obfuscation methods for avoiding detection.
We provide you with a training corpus that comprises a set of author verification problems in several languages/genres. Each problem consists of some (up to five) known documents by a single person and exactly one questioned document. All documents within a single problem instance will be in the same language. However, their genre and/or topic may differ significantly. The document lengths vary from a few hundred to a few thousand words.
The documents of each problem are located in a separate folder, the name of which (problem ID) encodes the language of the documents. The following list shows the available sub-corpora, including their language, type (cross-genre or cross-topic), code, and examples of problem IDs:
| Language | Type | Code | Problem IDs |
|---|---|---|---|
| Dutch | Cross-genre | DU | DU001, DU002, DU003, etc. |
| English | Cross-topic | EN | EN001, EN002, EN003, etc. |
| Greek | Cross-topic | GR | GR001, GR002, GR003, etc. |
| Spanish | Cross-genre | SP | SP001, SP002, SP003, etc. |
The ground truth data of the training corpus, found in the file truth.txt, includes one line per problem with the problem ID and the correct binary answer (Y means the known and the questioned documents are by the same author; N means the opposite). For example:
EN001 N
EN002 Y
EN003 N
...
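A short sketch of reading the ground truth and iterating the per-problem folders described above (only truth.txt and the folder layout are stated in the description; the known*.txt and unknown.txt filenames are assumptions for illustration):

```python
from pathlib import Path

def load_truth(corpus_dir: str) -> dict:
    """Map problem IDs (e.g. 'EN001') to True (same author) or False from truth.txt."""
    truth = {}
    for line in Path(corpus_dir, "truth.txt").read_text(encoding="utf-8").splitlines():
        if line.strip():
            problem_id, answer = line.split()
            truth[problem_id] = (answer == "Y")
    return truth

def iter_problems(corpus_dir: str):
    """Yield (problem_id, known_texts, unknown_text) for each problem folder."""
    for problem_dir in sorted(p for p in Path(corpus_dir).iterdir() if p.is_dir()):
        known = [p.read_text(encoding="utf-8")
                 for p in sorted(problem_dir.glob("known*.txt"))]   # assumed filenames
        unknown = (problem_dir / "unknown.txt").read_text(encoding="utf-8")
        yield problem_dir.name, known, unknown
```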
https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/UR3RVE
The MATCHED dataset is a novel multimodal collection of escort advertisements curated to support research in Authorship Attribution (AA) and related tasks. It comprises 27,619 unique text descriptions and 55,115 images (in jpg format) sourced from Backpage escort ads across seven major U.S. cities–Atlanta, Dallas, Detroit, Houston, Chicago, San Francisco, and New York. These cities are further categorized into four geographical regions—South, Midwest, West, and Northeast—offering a structured dataset that enables both in-distribution and out-of-distribution (OOD) evaluations. Each ad in the dataset contains metadata that links text and visual components, providing a rich resource for studying multimodal patterns, vendor identification, and verification tasks. The dataset is uniquely suited for multimodal authorship attribution, vendor linking, stylometric analysis, and understanding the interplay between textual and visual patterns in advertisements. All text descriptions are carefully processed to redact any explicit references to phone numbers, email addresses, advertisement IDs, age-related information, or other contact details that could be used to identify individuals or vendors. The structured metadata allows researchers to explore how multimodal features contribute to uncovering latent patterns in stylometry and vendor behaviors. A demi-data file showcasing the format and structure of our MATCHED dataset is attached with the entry. Given the sensitivity of the subject matter, the actual dataset resides securely on Maastricht University's servers. Only the metadata will be publicly released on Dataverse to ensure ethical use. Researchers interested in accessing the full dataset must sign a Non-Disclosure Agreement (NDA) and a Data Transfer Agreement with Prof. Dr. Gijs Van Dijck from Maastricht University. Access will only be granted under strict restrictions, and recipients must adhere to the ethical guidelines established by the university's committee. These guidelines emphasize the responsible use of the dataset to prevent misuse and to safeguard the privacy and dignity of all individuals involved.
This is the dataset VGDB-2016, built for the paper "From Impressionism to Expressionism: Automatically Identifying Van Gogh's Paintings", which was published at the 23rd IEEE International Conference on Image Processing (ICIP 2016).
To the best of our knowledge, this is the very first public and open dataset with high-quality images of paintings that also takes density (in Pixels Per Inch) into consideration. The main research question we wanted to address was: is it possible to distinguish Vincent van Gogh's paintings from those of his contemporaries? Our method achieved an F1-score of 92.3%.
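For reference, the reported F1-score is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$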
There are many possibilities for future work.
The paper is available at IEEE Xplore (free access until October 6, 2016): https://dx.doi.org/10.1109/icip.2016.7532335
The dataset has been originally published at figshare (CC BY 4.0): https://dx.doi.org/10.6084/m9.figshare.3370627
The source code is available at GitHub (Apache 2.0): https://github.com/gfolego/vangogh
If you find this work useful in your research, please cite the paper! :-)
@InProceedings{folego2016vangogh,
author = {Guilherme Folego and Otavio Gomes and Anderson Rocha},
booktitle = {2016 IEEE International Conference on Image Processing (ICIP)},
title = {From Impressionism to Expressionism: Automatically Identifying Van Gogh's Paintings},
year = {2016},
month = {Sept},
pages = {141--145},
doi = {10.1109/icip.2016.7532335}
}
Keywords: Art; Feature extraction; Painting; Support vector machines; Testing; Training; Visualization; CNN-based authorship attribution; Painter attribution; Data-driven painting characterization
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There is growing interest in possible applications of artificial neural networks in forensics. Extensive research has been published on this subject, especially in the field of handwriting examination. However, it seldom discusses forensic and legal standards, which are the most fundamental conditions for the acceptance of artificial neural networks in forensics. From the perspective of handwriting analysis, we have exemplified and systematized general methods for the informal falsification of artificial neural networks applied to the verification of the authorship of offline handwritten documents. These approaches, aimed at objectively exposing and proving models to be unreliable, should be generally effective against applications of neural networks in forensics.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Research Integrity Tools market size reached USD 1.24 billion in 2024, reflecting the increasing adoption of digital solutions to uphold ethical standards in research activities worldwide. The market is anticipated to grow at a robust CAGR of 13.1% from 2025 to 2033, with the market forecasted to reach USD 3.73 billion by 2033. This strong growth trajectory is driven by the escalating need for transparent, credible, and reliable research outputs in both academic and commercial environments, alongside tightening regulations and rising instances of research misconduct.
One of the primary growth factors fueling the Research Integrity Tools market is the exponential rise in scholarly publications and research data generation globally. As universities, research organizations, and publishers face mounting pressure to ensure the authenticity and originality of research, the demand for advanced integrity tools such as plagiarism detection, authorship verification, and data management solutions has surged. These tools are not only crucial in preventing academic misconduct but also play a pivotal role in streamlining the peer review process, ensuring compliance with international standards, and maintaining the reputation of institutions and publishers. The growing awareness around the consequences of research fraud, coupled with increased investment in digital infrastructure, has further catalyzed the adoption of these solutions across diverse end-user segments.
Another significant driver is the evolving regulatory landscape and the tightening of compliance requirements by governments and funding agencies worldwide. Regulatory bodies are increasingly mandating strict adherence to ethical guidelines, data transparency, and reproducibility in research outputs. This shift has prompted institutions to invest in comprehensive research integrity tools that can automate compliance monitoring, facilitate transparent reporting, and mitigate the risks associated with data fabrication, falsification, and plagiarism. Moreover, the proliferation of open-access publishing and international collaborations has accentuated the need for robust integrity tools capable of supporting multi-language and multi-disciplinary research environments, further expanding the addressable market.
Technological advancements have also played a crucial role in shaping the Research Integrity Tools market. The integration of artificial intelligence, machine learning, and natural language processing into these tools has significantly enhanced their accuracy, scalability, and usability. AI-powered plagiarism detection, automated peer review management, and advanced authorship verification are now enabling institutions to process vast volumes of research content with unprecedented speed and precision. Additionally, the shift towards cloud-based deployment models has made these solutions more accessible and cost-effective for a broader range of users, including smaller academic institutions and emerging research organizations. As digital transformation continues to permeate the research ecosystem, the market for research integrity tools is expected to witness sustained growth.
From a regional perspective, North America currently leads the global Research Integrity Tools market, accounting for the largest share in 2024, followed by Europe and the Asia Pacific region. The dominance of North America can be attributed to the presence of leading academic institutions, robust research funding, and stringent regulatory frameworks. Europe’s market is bolstered by strong government initiatives and a collaborative research culture, while Asia Pacific is emerging as a high-growth region driven by rapid digitalization and expanding research activities in countries like China, India, and Japan. Latin America and the Middle East & Africa, though smaller in market size, are witnessing steady adoption as awareness of research integrity and compliance grows.
The Research Integrity Tools market by component is primarily segmented into Software and Services. The software segment dominates the market, accounting for the majority of the global revenue in 2024. This is largely due to the growing need for automated, scalable, and user-friendly solutions that can efficiently detect plagiarism, veri
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We modeled a simultaneous multi-round auction with BPMN models, transformed the latter to Petri nets, and used a model checker to verify whether certain outcomes of the auction are possible or not.
Dataset Characteristics: Tabular
Subject Area: Computer Science
Associated Tasks: Classification, Regression
Instances: 2043
Features: 7
For what purpose was the dataset created? The dataset was created as part of a scientific study. The goal was to find out whether one could replace costly verification of complex process models (here: simultaneous multi-round auctions, as used for auctioning frequency spectra) with predictions of the outcome.
What do the instances in this dataset represent? Each instance represents one verification run. Verification checks whether a particular price is possible for a particular product, and (for only some of the instances) whether a particular bidder might win the product at that price.
Additional Information Our code to prepare the dataset and to make predictions is available here: https://github.com/Jakob-Bach/Analyzing-Auction-Verification
Has Missing Values? No
Title: Analyzing and Predicting Verification of Data-Aware Process Models – a Case Study with Spectrum Auctions
Authors: Elaheh Ordoni, Jakob Bach, Ann-Katrin Fleck (2022)
Verification techniques play an essential role in detecting undesirable behaviors in many applications like spectrum auctions. By verifying an auction design, one can detect the least favorable outcomes, e.g., the lowest revenue of an auctioneer. However, verification may be infeasible in practice, given the vast size of the state space on the one hand and the large number of properties to be verified on the other hand. To overcome this challenge, we leverage machine-learning techniques. In particular, we create a dataset by verifying properties of a spectrum auction first. Second, we use this dataset to analyze and predict outcomes of the auction and characteristics of the verification procedure. To evaluate the usefulness of machine learning in the given scenario, we consider prediction quality and feature importance. In our experiments, we observe that prediction models can capture relationships in our dataset well, though one needs to be careful to obtain a representative and sufficiently large training dataset. While the focus of this article is on a specific verification scenario, our analysis approach is general and can be adapted to other domains.
Citation: Ordoni, Elaheh, Bach, Jakob, and Fleck, Ann-Katrin (2022). Auction Verification. UCI Machine Learning Repository. https://doi.org/10.24432/C52K6N.
BibTeX: @misc{misc_auction_verification_713,
  author = {Ordoni, Elaheh and Bach, Jakob and Fleck, Ann-Katrin},
  title = {{Auction Verification}},
  year = {2022},
  howpublished = {UCI Machine Learning Repository},
  note = {{DOI}: https://doi.org/10.24432/C52K6N}
}
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch the Auction Verification dataset (UCI id 713)
auction_verification = fetch_ucirepo(id=713)

# data (as pandas dataframes)
X = auction_verification.data.features
y = auction_verification.data.targets

print(auction_verification.metadata)   # dataset metadata
print(auction_verification.variables)  # variable information
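As a purely illustrative follow-up to the snippet above (not code from the paper; the target column name and the model choice are assumptions), one could train a simple classifier to predict the verification outcome from the auction features:

```python
# Illustrative sketch: predict the binary verification outcome from the features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Assumed target column; fall back to the first target column if it is named differently.
target = y["verification.result"] if "verification.result" in y.columns else y.iloc[:, 0]

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("MCC:", matthews_corrcoef(y_test, model.predict(X_test)))
```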
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Blockchain data query: PROTOCOL IDENTIFICATION VERIFICATION
https://www.cognitivemarketresearch.com/privacy-policy
The global Digital Identity Verification market size was recorded at $7,874.26 million in 2021 and will reach $13,274.6 million by the end of 2025. According to the author, the Digital Identity Verification market size will reach $37,726.3 million by 2033, growing at a CAGR of 13.947% from 2025 to 2033.