2 datasets found
  1. Wikipedia Multilingual Vandalism Detection Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Format: csv
    Updated May 22, 2025
    Cite
    Mykola Trokhymovych; Muniza Aslam; Ai-Jou Chou; Ricardo Baeza-Yates; Diego Saez-Trumper (2025). Wikipedia Multilingual Vandalism Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.8174336
    Available download formats: csv
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mykola Trokhymovych; Muniza Aslam; Ai-Jou Chou; Ricardo Baeza-Yates; Diego Saez-Trumper
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies a research paper that introduces a novel system designed to support the Wikipedia community in combating vandalism on the platform. The dataset has been prepared to enhance the accuracy and efficiency of Wikipedia patrolling in multiple languages.

    The release of this comprehensive dataset aims to encourage further research and development in vandalism detection techniques, fostering a safer and more inclusive environment for the Wikipedia community. Researchers and practitioners can utilize this dataset to train and validate their models for vandalism detection and contribute to improving online platforms' content moderation strategies.

    Dataset Details:

    • Number of Languages: 47
    • Observation period: 6 months training, one week hold-out testing
    • Use Case: The dataset is primarily intended for training and evaluating vandalism detection systems.
    • Features: Each record characterizes the corresponding revision of a Wikipedia page, including revision metadata, user details, the text inserted, removed, or changed, and features derived from multilingual masked language models (MLMs).
    • Data Filtering and Feature Engineering: Advanced filtering and feature engineering techniques were applied to ensure the dataset's quality and relevance for effectively training the vandalism detection system.
    • Files: training and hold-out test splits, provided separately for anonymous users and for all users (see the loading sketch after this list).
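
    A minimal sketch of how these files might be loaded and used for a baseline model, assuming CSV files with numeric per-revision feature columns and a binary target; the file names train_all_users.csv and holdout_all_users.csv and the column names revision_id and label are placeholders for illustration, not the dataset's actual names:

    # Hypothetical loading/training sketch -- file and column names are assumptions;
    # check the actual CSV headers on Zenodo before running.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    train = pd.read_csv("train_all_users.csv")      # placeholder file name
    test = pd.read_csv("holdout_all_users.csv")     # placeholder file name

    # Use the numeric columns as features, leaving out the ID and the assumed binary target.
    feature_cols = train.select_dtypes("number").columns.drop(["revision_id", "label"], errors="ignore")
    model = GradientBoostingClassifier()
    model.fit(train[feature_cols], train["label"])

    # Evaluate on the one-week hold-out split.
    scores = model.predict_proba(test[feature_cols])[:, 1]
    print("Hold-out ROC AUC:", roc_auc_score(test["label"], scores))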

    Related paper citation:

    @inproceedings{10.1145/3580305.3599823,
    author = {Trokhymovych, Mykola and Aslam, Muniza and Chou, Ai-Jou and Baeza-Yates, Ricardo and Saez-Trumper, Diego},
    title = {Fair Multilingual Vandalism Detection System for Wikipedia},
    year = {2023},
    isbn = {9798400701030},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3580305.3599823},
    doi = {10.1145/3580305.3599823},
    abstract = {This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.},
    booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
    pages = {4981–4990},
    numpages = {10},
    location = {Long Beach, CA, USA},
    series = {KDD '23}
    }

  2. Individual Edit Histories of All References in the English Wikipedia

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    Updated Feb 19, 2021
    Cite
    Weller, Katrin (2021). Individual Edit Histories of All References in the English Wikipedia [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3964989
    Dataset updated
    Feb 19, 2021
    Dataset provided by
    Flöck, Fabian
    Weller, Katrin
    Ulloa, Roberto
    Zagovora, Olga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes the historical versions of all individual references per article in the English Wikipedia. Each reference object also contains information about the editor who originally created it, the editors who subsequently changed it, and the timestamps of all actions (creations, modifications, deletions, and reinsertions) applied to the reference. Each historical version of a reference is represented as a list of tokens (≈ words), where each token has an individual creator and change history.

    The extraction process was meticulously vetted through crowdsourced evaluations, ensuring very high accuracy in contrast to standard textual difference algorithms. The dataset includes references that were created with the <ref> tag until June 2019. It contains 55,503,998 references with 164,530,374 actions. These references were found in 4,690,046 Wikipedia articles.

    The dataset consists of JSON files, one per article, where the article's page ID (here: article_id) is used as the file name. Each file is represented as a list of references, and each reference is a dictionary with the following keys (a minimal parsing sketch follows the key list below):

    "first_rev_id" type: Integer, first revision where the reference was inserted (the same value is represented in “ins” as the first element of the list and in "rev_id" of the first element in the "change_sequence"),

    "first_hash_id" type: String, the hash value of the first version of token_id (from WikiWho1, see below) list of the reference (the same value is represented as "hash_id" of the first element in the "change_sequence"),

    "first_editor_id" type: String, user_id or IP address of the first revision where the reference was inserted (the same value is represented as "editor_id" of the first element in the "change_sequence",

    "deleted" type: Boolean, an indicator if the reference exists in the last available revision,

    "ins" type: List of Integers, list of revisions where the reference was inserted (includes the first revision mentioned as "first_rev_id"),

    "ins_editor" type: List of Strings, list of user_id or IP addresses of editors where the reference was inserted (includes the first user mentioned as "first_editor_id"),

    "del" type: List of Integers, list of revisions where the reference was deleted from the article or reference was modified in a way that less than 25% of tokens remained,

    "del_editor“ type: List of Strings, list of user_id or IP addresses of editors where the reference was deleted or reference was modified in a way that less than 25% of tokens remained,

    "modif" type: List of Integers, list of revisions where the reference was modified, or reinserted with modification,

    "hashes": type: List of Strings, list of hash values of all versions of references,

    "first_rev_time": type: DateTime, the timestamp when the reference was created (the same value is represented in "ins_time” as the first element of the list and in "time" of the first element in the "change_sequence"),

    "ins_time" type: List of DateTime, list of timestamps when the reference was inserted or reinserted,

    "del_time" type: List of DateTime, list of timestamps when the reference was deleted,

    "change_sequence" type: List of dictionaries, with information about tokens, editors and revisions where the reference was modified (the first element representing the first revision where the reference was inserted), where:

    "hash_id" type: String, the hash value of the token_id (WikiWho1) list of the reference version,

    "rev_id" type: Integer, the revision number of the particular version of the reference,

    "editor_id" type: String, user_id or IP address of the revision editor,

    "time" type: DateTime, the timestamp when of this particular version of the reference,

    "tokens" type: List of Strings, ordered list of tokens (created by WikiWho1) that represents the particular version of the reference (the list has the same length as "token_editors"),

    "token_editors" type: List of Strings, ordered list of user_ids or IP addresses of editors that were first who added the corresponding token (see "tokens") to Wikipedia article.

    [1] WikiWho is a text-mining algorithm that extracts token-level changes from Wikipedia revisions. Each token is assigned a unique ID. More information: https://www.wikiwho.net/#technical_details
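
    As a minimal illustration of this structure (not the repository's official example code; see the links below), the following sketch reads one per-article JSON file and prints a rough summary of each reference. The file name 12345.json is a placeholder for an actual article_id:

    # Hypothetical parsing sketch based on the key list above.
    import json

    with open("12345.json", encoding="utf-8") as f:   # each file is named after the article's page ID
        references = json.load(f)                     # list of reference dictionaries

    for ref in references:
        # Rough count of recorded actions: insertions/reinsertions, deletions, and modifications.
        n_actions = len(ref["ins"]) + len(ref["del"]) + len(ref["modif"])
        latest = ref["change_sequence"][-1]           # most recent recorded version of the reference
        latest_text = " ".join(latest["tokens"])      # rebuild its text from the WikiWho tokens
        print(f"first inserted in rev {ref['first_rev_id']} by {ref['first_editor_id']}: "
              f"{n_actions} recorded actions, deleted={ref['deleted']}")
        print("  latest text:", latest_text[:80])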

    GitHub Repository with Python example code on how to process data and extract document identifiers: https://github.com/gesiscss/wikipedia_references

    To run the code on GESIS Notebooks, follow this link: https://notebooks.gesis.org/binder/v2/gh/gesiscss/wikipedia_references/master

