43 datasets found
  1. authorship-verification

    • huggingface.co
    Updated Jul 14, 2024
    Cite
    swan (2024). authorship-verification [Dataset]. https://huggingface.co/datasets/swan07/authorship-verification
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 14, 2024
    Authors
    swan
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    Dataset for authorship verification, comprising 12 cleaned and modified open-source authorship verification and attribution datasets.

    Dataset Details

    Code for cleaning and modifying the datasets can be found at https://github.com/swan-07/authorship-verification/blob/main/Authorship_Verification_Datasets.ipynb and is detailed in the paper. The datasets used to produce the final dataset are:

    Reuters50

    @misc{misc_reuter_50_50_217, author = {Liu… See the full description on the dataset page: https://huggingface.co/datasets/swan07/authorship-verification.

  2. Data from: PAN20 Authorship Analysis: Authorship Verification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Nov 20, 2023
    Cite
    Janek Bevendorff; Mike Kestemont; Efstathios Stamatatos; Enrique Manjavacas; Martin Potthast; Benno Stein (2023). PAN20 Authorship Analysis: Authorship Verification [Dataset]. http://doi.org/10.5281/zenodo.3724096
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Janek Bevendorff; Mike Kestemont; Efstathios Stamatatos; Enrique Manjavacas; Martin Potthast; Benno Stein
    Description

    Task

    Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles.

    Over the three years from PAN 2020 to PAN 2022, we will develop a new experimental setup that addresses three key questions in authorship verification that have not been studied at scale to date:

    Year 1 (PAN 2020): Closed-set verification.
    Given a large training dataset comprising known authors who have written about a given set of topics, the test dataset contains verification cases from a subset of the authors and topics found in the training data.

    Year 2 (PAN 2021): Open-set verification.
    Given the training dataset of Year 1, the test dataset contains verification cases from previously unseen authors and topics.

    Year 3 (PAN 2022): Surprise task.
    The task of the last year of this evaluation cycle (to be announced at a later time) will be designed with an eye on realism and practical application.

    This evaluation cycle on authorship verification provides for a renewed challenge of increasing difficulty within a large-scale evaluation. We invite you to plan ahead and participate in all three of these tasks.

    More information at: PAN @ CLEF 2020 - Authorship Verification
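    To make the task concrete, the core verification decision (are two texts by the same author?) can be sketched with a minimal stylometric baseline: compare character 3-gram profiles of the two texts by cosine similarity and threshold the score. This is purely illustrative and not the official PAN baseline; all function names and the threshold are invented for this sketch.

```python
# Illustrative sketch only (not PAN's official baseline): score a verification
# pair by cosine similarity of character 3-gram profiles and threshold it.
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two frequency profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_author(text1, text2, threshold=0.5):
    """Decide a verification case; the 0.5 threshold is arbitrary for the demo."""
    return cosine(char_ngrams(text1), char_ngrams(text2)) >= threshold
```

    In practice, PAN systems output a calibrated score in [0, 1] rather than a hard decision, but the pairwise-comparison structure is the same.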

    Citing the Dataset

    If you use this dataset for your research, please be sure to cite the following paper:

    Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, and Martin Potthast. The Importance of Suppressing Domain Style in Authorship Analysis. CoRR, abs/2005.14714, May 2020.

    Bibtex:

    @Article{stein:2020k,
      author  = {Sebastian Bischoff and Niklas Deckers and Marcel Schliebs and Ben Thies and Matthias Hagen and Efstathios Stamatatos and Benno Stein and Martin Potthast},
      journal = {CoRR},
      month   = may,
      title   = {{The Importance of Suppressing Domain Style in Authorship Analysis}},
      url     = {https://arxiv.org/abs/2005.14714},
      volume  = {abs/2005.14714},
      year    = 2020
    }

  3. PAN22 Authorship Analysis: Authorship Verification

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Nov 30, 2022
    Cite
    Stamatatos, Efstathios; Kredens, Krzysztof; Pezik, Piotr; Heini, Annina; Kestemont, Mike; Bevendorff, Janek; Potthast, Martin; Stein, Benno (2022). PAN22 Authorship Analysis: Authorship Verification [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6337136
    Explore at:
    Dataset updated
    Nov 30, 2022
    Dataset provided by
    Leipzig University
    University of the Aegean
    Aston University
    University of Antwerp
    Bauhaus-Universität Weimar
    Authors
    Stamatatos, Efstathios; Kredens, Krzysztof; Pezik, Piotr; Heini, Annina; Kestemont, Mike; Bevendorff, Janek; Potthast, Martin; Stein, Benno
    Description

    Download

    Access to our corpus can be requested via the Aston Institute for Forensic Linguistics Databank: https://fold.aston.ac.uk/handle/123456789/17

    Task

    Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles. In previous editions of PAN, we explored the effectiveness of authorship verification technology in several languages and text genres. In the two most recent editions, cross-domain authorship verification using fanfiction texts was examined. Despite certain differences between fandoms, the task of cross-fandom authorship verification has proved to be relatively feasible. In the current edition, we focus on more challenging scenarios where each authorship verification case considers two texts that belong to different discourse types, or DTs (cross-DT authorship verification). This will allow us to study the ability of stylometric approaches to capture authorial characteristics that remain stable across DTs, even when very different forms of expression are imposed by the DT norms.

    Based on a new corpus in English, we provide cross-DT authorship verification cases using the following DTs:

    Essays

    Emails

    Text messages

    Business memos

    The corpus comprises texts from around 100 individuals. All individuals are of similar age (18-22) and are native English speakers. The topic of the text samples is not restricted, while the level of formality can vary within a certain DT (e.g., text messages may be addressed to family members or non-familial acquaintances).

    More information at: Authorship Verification 2022

  4. Enron Authorship Verification Corpus

    • search.datacite.org
    • data.mendeley.com
    Updated Oct 1, 2017
    Cite
    Oren Halvani (2017). Enron Authorship Verification Corpus [Dataset]. http://doi.org/10.17632/n77w7mygwg.1
    Explore at:
    Dataset updated
    Oct 1, 2017
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Mendeley
    Authors
    Oren Halvani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset", transformed to meet the same standardized format as the "PAN Authorship Identification corpora" (http://pan.webis.de). The corpus consists of 80 authorship verification cases, evenly distributed regarding true/false authorships. Each authorship verification case comprises exactly 5 documents (plain text files): 4 documents represent samples from the known (true) author, while the remaining document represents the text of the unknown author (the subject of verification). The corpus is balanced, not only in terms of the same number of known documents per case, but also regarding the length of the texts, which is near-equal (3-4 kilobytes per text). It can be assumed that each document is aggregated from (short) mails of the same author, in order to have a sufficient length that captures the author's writing style. All texts in the corpus have undergone the same preprocessing procedure: de-duplication, removal of URLs, newlines, and tabs, normalization of UTF-8 symbols, and substitution of multiple successive blanks with a single blank. All e-mail headers and other metadata (including signatures) have been removed from each document so that it contains only pure natural-language text from a single author. The intention behind this corpus is to give other researchers in the field of authorship verification the opportunity to compare their results with each other.
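    The per-case layout described above (4 known documents plus 1 unknown document per case folder) suggests a simple loader. The exact file names used below (known01.txt ... unknown.txt) are an assumption for illustration; consult the corpus itself for the actual naming. The demo builds a mock case directory so the sketch is self-contained.

```python
# Sketch of loading one PAN-style verification case: a folder holding 4
# known-author documents and 1 unknown document as plain text files.
# File names (known01.txt ... unknown.txt) are assumed for illustration.
import os
import tempfile

def load_case(case_dir):
    """Return ([known texts], unknown text) for one verification case."""
    known, unknown = [], None
    for name in sorted(os.listdir(case_dir)):
        with open(os.path.join(case_dir, name), encoding="utf-8") as f:
            if name.startswith("known"):
                known.append(f.read())
            elif name.startswith("unknown"):
                unknown = f.read()
    return known, unknown

# Demonstrate with a minimal mock case directory.
case_dir = tempfile.mkdtemp()
for i in range(1, 5):
    with open(os.path.join(case_dir, f"known{i:02d}.txt"), "w", encoding="utf-8") as f:
        f.write(f"sample text from the known author, mail {i}")
with open(os.path.join(case_dir, "unknown.txt"), "w", encoding="utf-8") as f:
    f.write("questioned text of unknown authorship")

known_docs, unknown_doc = load_case(case_dir)
```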

  5. VeriDark Agora Authorship Verification Dataset

    • zenodo.org
    Updated Aug 29, 2022
    + more versions
    Cite
    Andrei Manolache; Florin Brad; Antonio Barbalau; Radu Ionescu; Marius Popescu (2022). VeriDark Agora Authorship Verification Dataset [Dataset]. http://doi.org/10.5281/zenodo.7018853
    Explore at:
    Dataset updated
    Aug 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrei Manolache; Florin Brad; Antonio Barbalau; Radu Ionescu; Marius Popescu
    Description
    VeriDark (Authorship Verification in the DarkNet) is a benchmark for evaluating authorship analysis methods in a cybersecurity context, introducing datasets gathered from DarkNet marketplace forums and from DarkNet-related discussions on Reddit. The benchmark contains three datasets for authorship verification and one dataset for authorship identification.

  6. Blog-1K

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1more
    application/gzip
    Updated Dec 21, 2022
    Cite
    Haining Wang (2022). Blog-1K [Dataset]. http://doi.org/10.5281/zenodo.7455623
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Haining Wang
    License

    ISC License: https://www.isc.org/downloads/software-support-policy/isc-license/

    Description

    The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.

    1. Preprocessing

    We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
    - at least 10,000 characters in total,
    - at most 49,410 characters in total,
    - at least 16 posts,
    - at most 40 posts, and
    - each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).
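    The selection filters above can be sketched in plain Python. The Koppel512 function-word list is abbreviated here to a small stand-in set, and the function names are invented for this sketch; only the numeric thresholds come from the text.

```python
# Sketch of the Blog-1K author-selection filters. FUNCTION_WORDS is a tiny
# stand-in for the Koppel512 list; thresholds are those stated in the text.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it", "for"}

def enough_function_words(text, minimum=50):
    """Does the text contain at least `minimum` function-word tokens?"""
    return sum(1 for w in text.lower().split() if w in FUNCTION_WORDS) >= minimum

def keep_author(posts):
    """Apply the per-text and per-author criteria to one author's posts."""
    # Per-text filters: length >= 1,000 chars and >= 50 function words.
    posts = [p for p in posts if len(p) >= 1000 and enough_function_words(p)]
    total_chars = sum(len(p) for p in posts)
    # Per-author filters: 16-40 posts and 10,000-49,410 characters in total.
    return 16 <= len(posts) <= 40 and 10_000 <= total_chars <= 49_410

# A synthetic post long enough to pass the per-text filters (1,320 chars).
example_post = "the of and a in to is was it for " * 40
```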

    Blog-1K has three columns: 'id', 'text', and 'split', where 'id' is carried over from the parent corpus.

    2. Statistics

    Its creation and statistics can be found in the Jupyter Notebook.

    Split      | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.)
    Train      | 1,000     | 16,132  | 30,092,057   | 30,092 (5,884)                    | 1,865 (1,007)
    Validation | 935       | 2,017   | 3,755,362    | 4,016 (2,269)                     | 1,862 (999)
    Test       | 924       | 2,017   | 3,732,448    | 4,039 (2,188)                     | 1,850 (936)


    3. Usage

    import pandas as pd
    
    df = pd.read_csv('blog1000.csv.gz', compression='infer')
    
    # read in training data
    train_text, train_label = zip(*df.loc[df.split=='train'][['text', 'id']].itertuples(index=False))

    4. License
    All materials are licensed under the ISC License.


    5. Contact
    Please contact its maintainer for questions.

  7. PAN11 Author Identification: Attribution

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 25, 2023
    Cite
    Shlomo Argamon; Patrick Juola (2023). PAN11 Author Identification: Attribution [Dataset]. http://doi.org/10.5281/zenodo.3713246
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shlomo Argamon; Patrick Juola
    Description

    We provide you with a training corpus that comprises several different common attribution and verification scenarios. There are five training collections consisting of real-world texts (for authorship attribution), and three collections each consisting of texts by a single author (for authorship verification).

  8. Data from: Be sure to use the same writing style: Applying Authorship...

    • zenodo.org
    bin
    Updated Jan 23, 2025
    Cite
    Weerasinghe; Seepersaud; Smothers; Jose; Greenstadt (2025). Be sure to use the same writing style: Applying Authorship Verification on LLM-Generated Texts [Dataset]. http://doi.org/10.5281/zenodo.14714057
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Weerasinghe; Seepersaud; Smothers; Jose; Greenstadt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The repository link contains a README which gives an overview of the files along with the structure of the data.

    Additionally, for LLAMA and GPT2, the files follow the human_{llm_name}{i}.jsonl naming pattern, where {llm_name} is the name of the LLM and {i} is the file's partition index; the partitions can be concatenated to form the full dataset for that LLM.
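    Concatenating such partitioned JSONL files can be sketched as below. The glob pattern follows the human_{llm_name}{i}.jsonl naming described above; the record fields and partition count in the demo are invented for illustration, not the dataset's actual schema.

```python
# Sketch of concatenating partitioned JSONL files into one dataset.
import glob
import json
import os
import tempfile

def load_jsonl_parts(pattern):
    """Read every matching JSONL partition, in sorted order, into one list."""
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records

# Demonstrate with two tiny mock partitions for a hypothetical "gpt2" split.
tmp = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmp, f"human_gpt2{i}.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"part": i, "text": f"example {i}"}) + "\n")

dataset = load_jsonl_parts(os.path.join(tmp, "human_gpt2*.jsonl"))
```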

  9. IDTraffickers: An Authorship Attribution Dataset to link and connect...

    • dataverse.nl
    Updated Nov 3, 2023
    Cite
    Vageesh Saxena; Gijs Van Dijck; Gerasimos Spanakis; Benjamin Bashpole (2023). IDTraffickers: An Authorship Attribution Dataset to link and connect Potential Human-Trafficking Operations on Text Escort Advertisements [Dataset]. http://doi.org/10.34894/NZ7VLC
    Explore at:
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    DataverseNL
    Authors
    Vageesh Saxena; Gijs Van Dijck; Gerasimos Spanakis; Benjamin Bashpole
    License

    https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/NZ7VLC

    Time period covered
    Dec 1, 2015 - Apr 1, 2016
    Description

    Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators. The dataset contains text sequences as inputs and labels as the arbitrary vendor IDs obtained by linking the phone numbers mentioned in Backpage escort market advertisements. To protect privacy, all personal information, except for the pseudonyms used by the escorts and the post locations, has been redacted so that it cannot be retrieved. For more details, kindly refer to our research attached to the submission. 
It is important to emphasize that this dataset should only be used for its intended purpose, research on authorship attribution of vendors on escort markets, and not other commercial/non-commercial purposes.

  10. Two Datasets for the Computational Authorship Analysis of Medieval Latin...

    • zenodo.org
    zip
    Updated May 24, 2021
    Cite
    Silvia Corbara; Alejandro Moreo; Fabrizio Sebastiani; Mirko Tavoni (2021). Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts [Dataset]. http://doi.org/10.5281/zenodo.3903296
    Explore at:
    Available download formats: zip
    Dataset updated
    May 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Silvia Corbara; Alejandro Moreo; Fabrizio Sebastiani; Mirko Tavoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We make available MedLatin1 and MedLatin2, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatin1 and MedLatin2 consist of 294 and 30 curated texts, respectively, labelled by author, with MedLatin1 texts being of an epistolary nature and MedLatin2 texts consisting of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification.

  11. PAN13 Author Identification: Verification

    • data.niaid.nih.gov
    Updated Nov 18, 2023
    Cite
    Juola, Patrick; Stamatatos, Efstathios (2023). PAN13 Author Identification: Verification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3715998
    Explore at:
    Dataset updated
    Nov 18, 2023
    Authors
    Juola, Patrick; Stamatatos, Efstathios
    Description

    We provide you with a training data set that consists of documents written in both English and Spanish. With regard to age, we will consider posts of three classes: 10s (13-17), 20s (23-27), and 30s (33-47). Moreover, documents from authors who pretend to be minors will be included (e.g., documents composed of chat lines of sexual predators will also be considered).

  12. PAN24 Voight-Kampff Generative AI Authorship Verification

    • zenodo.org
    Updated Feb 27, 2024
    Cite
    Janek Bevendorff; Matti Wiegmann; Martin Potthast; Benno Stein; Efstathios Stamatatos (2024). PAN24 Voight-Kampff Generative AI Authorship Verification [Dataset]. http://doi.org/10.5281/zenodo.10718757
    Explore at:
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Janek Bevendorff; Matti Wiegmann; Martin Potthast; Benno Stein; Efstathios Stamatatos
    Description

    This is the dataset for the shared task on Voight-Kampff Generative AI Authorship Verification PAN@CLEF2024. Please consult the task's page for further details on the format, the dataset's creation, and links to baselines and utility code.

    Task

    With Large Language Models (LLMs) improving at breakneck speed and seeing more widespread adoption every day, it is getting increasingly hard to discern whether a given text was authored by a human being or a machine. Many classification approaches have been devised to help humans distinguish between human- and machine-authored text, though often without questioning the fundamental and inherent feasibility of the task itself.

    With years of experience in the related but much broader field of authorship verification, we set out to answer whether this task can be solved. We start with the simplest arrangement of a suitable task setup: given two texts, one authored by a human and one by a machine, pick out the human.

    The Generative AI Authorship Verification Task @ PAN is organized in collaboration with the Voight-Kampff Task @ ELOQUENT Lab in a builder-breaker style. PAN participants will build systems to tell human and machine apart, while ELOQUENT participants will investigate novel text generation and obfuscation methods for avoiding detection.

  13. PAN15 Author Identification: Verification

    • data.niaid.nih.gov
    Updated Nov 30, 2023
    + more versions
    Cite
    Stamatatos, Efstathios; Daelemans, Walter; Verhoeven, Ben; Juola, Patrick; López-López, Aurelio; Potthast, Martin; Stein, Benno (2023). PAN15 Author Identification: Verification [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3737562
    Explore at:
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Bauhaus-Universität Weimar
    Universität Leipzig
    Authors
    Stamatatos, Efstathios; Daelemans, Walter; Verhoeven, Ben; Juola, Patrick; López-López, Aurelio; Potthast, Martin; Stein, Benno
    Description

    We provide you with a training corpus that comprises a set of author verification problems in several languages/genres. Each problem consists of some (up to five) known documents by a single person and exactly one questioned document. All documents within a single problem instance will be in the same language. However, their genre and/or topic may differ significantly. The document lengths vary from a few hundred to a few thousand words.

    The documents of each problem are located in a separate folder, the name of which (problem ID) encodes the language of the documents. The following list shows the available sub-corpora, including their language, type (cross-genre or cross-topic), code, and examples of problem IDs:

    Language; Type; Code; Problem IDs
    Dutch; Cross-genre; DU; DU001, DU002, DU003, etc.
    English; Cross-topic; EN; EN001, EN002, EN003, etc.
    Greek; Cross-topic; GR; GR001, GR002, GR003, etc.
    Spanish; Cross-genre; SP; SP001, SP002, SP003, etc.

    The ground truth data of the training corpus, found in the file truth.txt, includes one line per problem with the problem ID and the correct binary answer (Y means the known and the questioned documents are by the same author, N means the opposite). For example:

    EN001 N
    EN002 Y
    EN003 N
    ...
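    The truth.txt format above (one problem ID and one Y/N answer per line) parses in a few lines; the function name is invented for this sketch.

```python
# Minimal parser for the truth.txt format: one "problemID answer" pair per
# line, where Y means same author and N means different authors.
def parse_truth(text):
    truth = {}
    for line in text.strip().splitlines():
        problem_id, answer = line.split()
        truth[problem_id] = answer == "Y"  # True: same author
    return truth

ground_truth = parse_truth("EN001 N\nEN002 Y\nEN003 N")
```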

  14. MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in...

    • dataverse.nl
    pdf
    Updated Dec 19, 2024
    Cite
    Vageesh Saxena (2024). MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data [Dataset]. http://doi.org/10.34894/UR3RVE
    Explore at:
    Available download formats: pdf (40,346 bytes)
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    DataverseNL
    Authors
    Vageesh Saxena
    License

    https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/UR3RVE

    Description

    The MATCHED dataset is a novel multimodal collection of escort advertisements curated to support research in Authorship Attribution (AA) and related tasks. It comprises 27,619 unique text descriptions and 55,115 images (in jpg format) sourced from Backpage escort ads across seven major U.S. cities: Atlanta, Dallas, Detroit, Houston, Chicago, San Francisco, and New York. These cities are further categorized into four geographical regions (South, Midwest, West, and Northeast), offering a structured dataset that enables both in-distribution and out-of-distribution (OOD) evaluations. Each ad in the dataset contains metadata that links text and visual components, providing a rich resource for studying multimodal patterns, vendor identification, and verification tasks. The dataset is uniquely suited for multimodal authorship attribution, vendor linking, stylometric analysis, and understanding the interplay between textual and visual patterns in advertisements. All text descriptions are carefully processed to redact any explicit references to phone numbers, email addresses, advertisement IDs, age-related information, or other contact details that could be used to identify individuals or vendors. The structured metadata allows researchers to explore how multimodal features contribute to uncovering latent patterns in stylometry and vendor behaviors. A demo data file showcasing the format and structure of the MATCHED dataset is attached to the entry. Given the sensitivity of the subject matter, the actual dataset resides securely on Maastricht University's servers. Only the metadata will be publicly released on Dataverse to ensure ethical use. Researchers interested in accessing the full dataset must sign a Non-Disclosure Agreement (NDA) and a Data Transfer Agreement with Prof. Dr. Gijs Van Dijck from Maastricht University. Access will only be granted under strict restrictions, and recipients must adhere to the ethical guidelines established by the university's committee.
These guidelines emphasize the responsible use of the dataset to prevent misuse and to safeguard the privacy and dignity of all individuals involved.

  15. Vincent van Gogh's paintings

    • kaggle.com
    zip
    Updated Aug 26, 2016
    Cite
    Guilherme Folego (2016). Vincent van Gogh's paintings [Dataset]. https://www.kaggle.com/datasets/gfolego/vangogh/discussion
    Explore at:
    Available download formats: zip (35,310 bytes)
    Dataset updated
    Aug 26, 2016
    Authors
    Guilherme Folego
    Description

    This is the dataset VGDB-2016, built for the paper "From Impressionism to Expressionism: Automatically Identifying Van Gogh's Paintings", which was published at the 23rd IEEE International Conference on Image Processing (ICIP 2016).

    To the best of our knowledge, this is the very first public and open dataset with high-quality images of paintings that also takes density (in pixels per inch) into consideration. The main research question we wanted to address was: is it possible to distinguish Vincent van Gogh's paintings from those of his contemporaries? Our method achieved an F1-score of 92.3%.

    There are many possibilities for future work, such as:

    • Increase the dataset, e.g., with images from Wikimedia Commons and WikiArt. Unfortunately, Google Art Project does not allow downloads.
    • Deal with density normalization. There is a lot of data available without such normalization (e.g., Painting-91 and Painter by Numbers). It is possible to analyze how this affects accuracy.
    • Experiment with multi-class and open-set recognition.
    • Try to identify the painting style, movement, or school.
    • Maybe study painting authorship verification: given two paintings, are they from the same author?
    • Is it possible to detect artificially generated paintings? Are they useful for dataset augmentation?

    The paper is available at IEEE Xplore (free access until October 6, 2016): https://dx.doi.org/10.1109/icip.2016.7532335

    The dataset has been originally published at figshare (CC BY 4.0): https://dx.doi.org/10.6084/m9.figshare.3370627

    The source code is available at GitHub (Apache 2.0): https://github.com/gfolego/vangogh

    If you find this work useful in your research, please cite the paper! :-)

    @InProceedings{folego2016vangogh,
      author = {Guilherme Folego and Otavio Gomes and Anderson Rocha},
      booktitle = {2016 IEEE International Conference on Image Processing (ICIP)},
      title = {From Impressionism to Expressionism: Automatically Identifying Van Gogh's Paintings},
      year = {2016},
      month = {Sept},
      pages = {141--145},
      doi = {10.1109/icip.2016.7532335}
    }
    

    Keywords: Art; Feature extraction; Painting; Support vector machines; Testing; Training; Visualization; CNN-based authorship attribution; Painter attribution; Data-driven painting characterization

  16. Data from: Evaluation of neural networks applied in forensics; handwriting...

    • tandf.figshare.com
    pdf
    Updated Oct 31, 2023
    Cite
    Maciej Marcinowski (2023). Evaluation of neural networks applied in forensics; handwriting verification example [Dataset]. http://doi.org/10.6084/m9.figshare.19946632.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Maciej Marcinowski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There is a growing interest in the possibility of applying artificial neural networks in forensics. Extensive research has been published on this subject, especially in the field of handwriting examination. However, it seldom discusses forensic and legal standards, which are the most fundamental conditions for the acceptance of artificial neural networks in forensics. From the perspective of handwriting analysis, we have exemplified and systematized general methods for informally falsifying artificial neural networks applied to the verification of offline handwritten documents' authorship. These approaches, aimed at objectively exposing and proving models unreliable, should be generally effective against applications of neural networks in forensics.

  17. Research Integrity Tools Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Research Integrity Tools Market Research Report 2033 [Dataset]. https://dataintelo.com/report/research-integrity-tools-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Research Integrity Tools Market Outlook



    According to our latest research, the global Research Integrity Tools market size reached USD 1.24 billion in 2024, reflecting the increasing adoption of digital solutions to uphold ethical standards in research activities worldwide. The market is anticipated to grow at a robust CAGR of 13.1% from 2025 to 2033, with the market forecasted to reach USD 3.73 billion by 2033. This strong growth trajectory is driven by the escalating need for transparent, credible, and reliable research outputs in both academic and commercial environments, alongside tightening regulations and rising instances of research misconduct.




    One of the primary growth factors fueling the Research Integrity Tools market is the exponential rise in scholarly publications and research data generation globally. As universities, research organizations, and publishers face mounting pressure to ensure the authenticity and originality of research, the demand for advanced integrity tools such as plagiarism detection, authorship verification, and data management solutions has surged. These tools are not only crucial in preventing academic misconduct but also play a pivotal role in streamlining the peer review process, ensuring compliance with international standards, and maintaining the reputation of institutions and publishers. The growing awareness around the consequences of research fraud, coupled with increased investment in digital infrastructure, has further catalyzed the adoption of these solutions across diverse end-user segments.




    Another significant driver is the evolving regulatory landscape and the tightening of compliance requirements by governments and funding agencies worldwide. Regulatory bodies are increasingly mandating strict adherence to ethical guidelines, data transparency, and reproducibility in research outputs. This shift has prompted institutions to invest in comprehensive research integrity tools that can automate compliance monitoring, facilitate transparent reporting, and mitigate the risks associated with data fabrication, falsification, and plagiarism. Moreover, the proliferation of open-access publishing and international collaborations has accentuated the need for robust integrity tools capable of supporting multi-language and multi-disciplinary research environments, further expanding the addressable market.




    Technological advancements have also played a crucial role in shaping the Research Integrity Tools market. The integration of artificial intelligence, machine learning, and natural language processing into these tools has significantly enhanced their accuracy, scalability, and usability. AI-powered plagiarism detection, automated peer review management, and advanced authorship verification are now enabling institutions to process vast volumes of research content with unprecedented speed and precision. Additionally, the shift towards cloud-based deployment models has made these solutions more accessible and cost-effective for a broader range of users, including smaller academic institutions and emerging research organizations. As digital transformation continues to permeate the research ecosystem, the market for research integrity tools is expected to witness sustained growth.




    From a regional perspective, North America currently leads the global Research Integrity Tools market, accounting for the largest share in 2024, followed by Europe and the Asia Pacific region. The dominance of North America can be attributed to the presence of leading academic institutions, robust research funding, and stringent regulatory frameworks. Europe’s market is bolstered by strong government initiatives and a collaborative research culture, while Asia Pacific is emerging as a high-growth region driven by rapid digitalization and expanding research activities in countries like China, India, and Japan. Latin America and the Middle East & Africa, though smaller in market size, are witnessing steady adoption as awareness of research integrity and compliance grows.



    Component Analysis



    The Research Integrity Tools market by component is primarily segmented into Software and Services. The software segment dominates the market, accounting for the majority of the global revenue in 2024. This is largely due to the growing need for automated, scalable, and user-friendly solutions that can efficiently detect plagiarism, veri…

  18. Auction Verification Dataset

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Rabie El Kharoua (2024). Auction Verification Dataset [Dataset]. https://www.kaggle.com/datasets/rabieelkharoua/auction-verification-dataset/data
    Explore at:
    Available download formats: zip (15678 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Rabie El Kharoua
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We modeled a simultaneous multi-round auction with BPMN models, transformed the latter to Petri nets, and used a model checker to verify whether certain outcomes of the auction are possible or not.

    Dataset Characteristics: Tabular

    Subject Area: Computer Science

    Associated Tasks: Classification, Regression

    Instances: 2043

    Features: 7

    Dataset Information

    For what purpose was the dataset created? The dataset was created as part of a scientific study. The goal was to find out whether one could replace costly verification of complex process models (here: simultaneous multi-round auctions, as used for auctioning frequency spectra) with predictions of the outcome.

    What do the instances in this dataset represent? Each instance represents one verification run. Verification checks whether a particular price is possible for a particular product, and (for only some of the instances) whether a particular bidder might win the product at that price.

    Additional Information Our code to prepare the dataset and to make predictions is available here: https://github.com/Jakob-Bach/Analyzing-Auction-Verification

    Has Missing Values? No

    Introductory Paper

    Title: Analyzing and Predicting Verification of Data-Aware Process Models – a Case Study with Spectrum Auctions

    Authors: Elaheh Ordoni, Jakob Bach, Ann-Katrin Fleck (2022)

    Abstract of Introductory Paper

    Verification techniques play an essential role in detecting undesirable behaviors in many applications like spectrum auctions. By verifying an auction design, one can detect the least favorable outcomes, e.g., the lowest revenue of an auctioneer. However, verification may be infeasible in practice, given the vast size of the state space on the one hand and the large number of properties to be verified on the other hand. To overcome this challenge, we leverage machine-learning techniques. In particular, we create a dataset by verifying properties of a spectrum auction first. Second, we use this dataset to analyze and predict outcomes of the auction and characteristics of the verification procedure. To evaluate the usefulness of machine learning in the given scenario, we consider prediction quality and feature importance. In our experiments, we observe that prediction models can capture relationships in our dataset well, though one needs to be careful to obtain a representative and sufficiently large training dataset. While the focus of this article is on a specific verification scenario, our analysis approach is general and can be adapted to other domains.

    Cite

    Citation: Ordoni, Elaheh, Bach, Jakob, and Fleck, Ann-Katrin (2022). Auction Verification. UCI Machine Learning Repository. https://doi.org/10.24432/C52K6N.

    BibTeX:

    @misc{misc_auction_verification_713,
      author = {Ordoni, Elaheh and Bach, Jakob and Fleck, Ann-Katrin},
      title = {{Auction Verification}},
      year = {2022},
      howpublished = {UCI Machine Learning Repository},
      note = {{DOI}: https://doi.org/10.24432/C52K6N}
    }

    Import in Python

    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    auction_verification = fetch_ucirepo(id=713)

    # data (as pandas dataframes)
    X = auction_verification.data.features
    y = auction_verification.data.targets

    # metadata
    print(auction_verification.metadata)

    # variable information
    print(auction_verification.variables)
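    As a loose, self-contained illustration of the idea behind this dataset (replacing costly verification runs with learned predictions of the outcome), here is a tiny 1-nearest-neighbour classifier over synthetic rows with the same 7-feature shape. The rows, labels, and function name are invented for the sketch; the study itself trained proper ML models on the real 2043 instances.

    ```python
    import math

    def nearest_neighbour_predict(train_X, train_y, x):
        """Return the label of the training row closest to x (Euclidean distance)."""
        best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
        return train_y[best]

    # Synthetic stand-ins: 7 numeric features per row, with a binary
    # "is this auction outcome verifiable?" label (invented values).
    train_X = [
        [1, 2, 0, 3, 1, 0, 5],
        [4, 0, 1, 1, 2, 3, 0],
    ]
    train_y = [True, False]

    print(nearest_neighbour_predict(train_X, train_y, [1, 2, 0, 3, 1, 0, 4]))  # → True
    ```

    The point of the sketch is only the interface: a feature vector describing a verification query goes in, a predicted verification outcome comes out, with no model checker in the loop.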

  19. PROTOCOL IDENTIFICATION VERIFICATION

    • dune.com
    Updated Jun 26, 2025
    Cite
    fungi_agents (2025). PROTOCOL IDENTIFICATION VERIFICATION [Dataset]. https://dune.com/discover/content/relevant?q=author%3Afungi_agents&resource-type=queries
    Explore at:
    Dataset updated
    Jun 26, 2025
    Authors
    fungi_agents
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: PROTOCOL IDENTIFICATION VERIFICATION

  20. Global Digital Identity Verification Market Report 2025 Edition, Market Size, Share, CAGR, Forecast, Revenue

    • cognitivemarketresearch.com
    pdf, excel, csv, ppt
    Updated Aug 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cognitive Market Research (2025). Global Digital Identity Verification Market Report 2025 Edition, Market Size, Share, CAGR, Forecast, Revenue [Dataset]. https://www.cognitivemarketresearch.com/digital-identity-verification-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    The global Digital Identity Verification market size was recorded at $7,874.26 million in 2021 and will reach $13,274.6 million by the end of 2025. According to the author, the market size will reach $37,726.3 million by 2033, growing at a CAGR of 13.947% from 2025 to 2033.

swan (2024). authorship-verification [Dataset]. https://huggingface.co/datasets/swan07/authorship-verification
2 scholarly articles cite this dataset (Google Scholar).
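Authorship verification asks whether two texts were written by the same person. As a purely illustrative baseline (not the method used to build or evaluate this dataset), the decision can be sketched with character 3-gram profiles compared by cosine similarity; the function names and the 0.5 threshold below are arbitrary choices for the sketch.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Frequency profile of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two Counter profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_author(text1, text2, threshold=0.5):
    """Verdict: True if the n-gram profiles are similar enough."""
    return cosine(char_ngrams(text1), char_ngrams(text2)) >= threshold

print(same_author("the quick brown fox jumps", "the quick brown dog jumps"))  # → True
```

Real verifiers trained on corpora like this one use far richer features and calibrated thresholds; the baseline only shows the shape of the task: two texts in, one same-author verdict out.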
