100+ datasets found
  1. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    Available download formats: zip (4293465577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes on English Wikipedia, output as JSON files (compressed in tar.gz).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy Included Fields:
    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
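    A minimal reading sketch for this archive. It assumes each member of wme_people_infobox.tar.gz is a JSON Lines file with one article object per line; adjust the parsing if the members turn out to be single JSON documents.

    ```
    import json
    import tarfile

    # Stream articles straight from the compressed archive without unpacking it to disk.
    with tarfile.open("wme_people_infobox.tar.gz", "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            with archive.extractfile(member) as fh:
                for line in fh:
                    article = json.loads(line)
                    # Fields documented above: name, identifier, abstract, infoboxes, sections
                    print(article.get("name"), article.get("identifier"))
            break  # drop this to process every member file
    ```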

    Stats

    Infoboxes:
    • Size of compressed file: 2 GB
    • Size of uncompressed file: 11 GB

    Infoboxes + sections + short description:
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • # people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # of people articles with infoboxes: 1,559,985

    • Total number of people articles in this dataset: 1,559,985
    • that have a short description: 1,416,701
    • that have an infobox: 1,559,985
    • that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia language edition (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  2. Archival Data for Page Protection: Another Missing Dimension of Wikipedia...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/P1VECE
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This dataset contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2015) “Page Protection: Another Missing Dimension of Wikipedia Research.” In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846

    This is an archival version of the data and software released with the paper. All of these data were (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-proetection/

    Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be “protected” so that only administrators or logged-in editors with a history of good editing can edit, move, or create it. Protection might involve “full protection,” where a page can only be edited by administrators (i.e., “sysops”), or “semi-protection,” where a page can only be edited by accounts with a history of good edits (i.e., “autoconfirmed” users).

    Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict can threaten to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the “Main Page” in English Wikipedia has been protected since February 2006, and all Featured Articles are protected at the time they appear on the site’s main page. Millions of viewers may never edit Wikipedia because they never see an edit button.

    Despite its widespread and influential nature, very little quantitative research on Wikipedia has taken page protection into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work. Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by the Wikimedia Foundation as dumps are incomplete. As a result, taking protection into account involves looking at several different sources of data. Much more detail can be found in our paper, Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate it if you cite the paper.

  3. Wikipedia SQLITE Portable DB, Huge 5M+ Rows

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Cite
    christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
    Explore at:
    Available download formats: zip (6064169983 bytes)
    Dataset updated
    Jun 29, 2024
    Authors
    christernyc
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

    I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it gives my agents something easy to query.

    Key Features:

    • Contains over 5 million rows of data from English Wikipedia and Wikidata
    • Stored in a portable SQLite database format for easy integration and querying
    • Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
    • Ideal for NLP tasks, machine learning, data analysis, and research projects

    The database consists of four main tables:

    • items: Contains information about Wikipedia items, including labels and descriptions
    • properties: Stores details about Wikidata properties, such as labels and descriptions
    • pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts
    • link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

    This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.

    Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.

    By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

    https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

    Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

    Usage with LIKE queries:

    ```
    import asyncio

    import aiosqlite


    class KenshoDatasetQuery:
        def __init__(self, db_file):
            self.db_file = db_file

        async def __aenter__(self):
            self.conn = await aiosqlite.connect(self.db_file)
            return self

        async def __aexit__(self, exc_type, exc_val, exc_tb):
            await self.conn.close()

        async def search_pages_by_title(self, title):
            query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
            """
            async with self.conn.execute(query, (f"%{title}%",)) as cursor:
                return await cursor.fetchall()

        async def search_items_by_label_or_description(self, keyword):
            query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
            """
            async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
                return await cursor.fetchall()

        async def search_items_by_label(self, label):
            query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
            """
            async with self.conn.execute(query, (f"%{label}%",)) as cursor:
                return await cursor.fetchall()

        async def search_properties_by_label_or_desc...
    ```
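    A short usage sketch for the class above (the database filename kensho.db is a placeholder for the downloaded SQLite file):

    ```
    import asyncio

    async def main():
        # "kensho.db" is an illustrative path; point it at the downloaded SQLite file.
        async with KenshoDatasetQuery("kensho.db") as db:
            rows = await db.search_pages_by_title("Ada Lovelace")
            for page_id, item_id, title, views, labels, description, sections in rows[:5]:
                print(page_id, title, views)

    asyncio.run(main())
    ```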
    
  4. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring; Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Explore at:
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors
    Markus Döring; Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  5. Data from: Politicians on Wikipedia and DBpedia

    • search.gesis.org
    Updated Dec 6, 2024
    + more versions
    Cite
    Wagner, Claudia (2024). Politicians on Wikipedia and DBpedia [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-1515
    Explore at:
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    Wagner, Claudia
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    This dataset contains information about politicians from DBpedia, a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. Important information about people available on DBpedia includes name, gender, nationality, occupation, birth date, death date, profession and, for many politicians, also the political party they belong to. This dataset is based on the English DBpedia dump from October 2015 and documents, at monthly resolution, the temporal evolution of the hyperlink network that articles about politicians formed on Wikipedia between 2001 and 2016. Wikipedia maintains revisions for each article to keep track of changes over time. The first revision of each month was used to construct the hyperlink network between articles about politicians.

  6. Wikipedia English Lucene Index

    • kaggle.com
    zip
    Updated Oct 1, 2023
    Cite
    corochann (2023). Wikipedia English Lucene Index [Dataset]. https://www.kaggle.com/datasets/corochann/lucene-wikipedia-en
    Explore at:
    Available download formats: zip (29115687581 bytes)
    Dataset updated
    Oct 1, 2023
    Authors
    corochann
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    See our solution write-up: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/446293

    This is an Apache Lucene index file for the English Wikipedia dataset. It can be used with the pyserini library.

    Original data (Wikipedia dump) license information: https://dumps.wikimedia.org/legal.html
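    A minimal retrieval sketch with pyserini, assuming the zip has been extracted to a local directory (the index path below is illustrative):

    ```
    from pyserini.search.lucene import LuceneSearcher

    # Point the searcher at the extracted Lucene index directory (placeholder path).
    searcher = LuceneSearcher("lucene-wikipedia-en/index")
    hits = searcher.search("history of the electron microscope", k=5)
    for hit in hits:
        # The raw stored document can be fetched with searcher.doc(hit.docid).raw()
        print(f"{hit.docid}\t{hit.score:.2f}")
    ```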

  7. Data for: Wikipedia as a gateway to biomedical research

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    application/gzip, txt
    Updated Sep 24, 2020
    Cite
    Joe Wass; Ryan Steinberg; Lauren Maggio; Joe Wass; Ryan Steinberg; Lauren Maggio (2020). Data for: Wikipedia as a gateway to biomedical research [Dataset]. http://doi.org/10.5281/zenodo.831459
    Explore at:
    Available download formats: txt, application/gzip
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joe Wass; Ryan Steinberg; Lauren Maggio; Joe Wass; Ryan Steinberg; Lauren Maggio
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.

    This data set includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not each DOI is biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813

    This dataset also includes a listing of Crossref DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636

    An article based on this data was published in PLOS One:

    Maggio LA, Willinsky JM, Steinberg RM, Mietchen D, Wass JL, Dong T. Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PloS one. 2017 Dec 21;12(12):e0190046.

    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046

  8. Data from: WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia

    • heidata.uni-heidelberg.de
    application/x-gzip +1
    Updated Apr 5, 2017
    + more versions
    Cite
    Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler; Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler (2017). WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia [Dataset]. http://doi.org/10.11588/DATA/10003
    Explore at:
    Available download formats: text/plain; charset=us-ascii (1858), application/x-gzip (887887912)
    Dataset updated
    Apr 5, 2017
    Dataset provided by
    heiDATA
    Authors
    Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler; Felix Hieber; Shigehiko Schamoni; Artem Sokolov; Stefan Riezler
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/10003

    Description

    WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models. The corpus contains training, development and testing subsets randomly split on the query level. Relevance judgments for Cross-Language Information Retrieval (CLIR) are constructed from the inter-language links between German and English Wikipedia articles. A relevance level of (3) is assigned to the (English) cross-lingual mate, and level (2) to all other (English) articles that link to the mate AND are linked by the mate. Our intuition for this level (2) is that articles in a bidirectional link relation to the mate are likely to either define similar concepts or be instances of the concept defined by the mate. For a more detailed description of the corpus construction process, see the publication cited above.

  9. Data from: Wikipedia on the CompTox Chemicals Dashboard: Connecting...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 3, 2022
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data [Dataset]. https://catalog.data.gov/dataset/wikipedia-on-the-comptox-chemicals-dashboard-connecting-resources-to-enrich-public-chemica
    Explore at:
    Dataset updated
    Nov 3, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    • Spreadsheet summaries of identifier availability and correctness in Wikipedia
    • Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes
    • Investigation of John W. Huffman cannabinoid dataset
    • Summary of Wikipedia pages linked to DSSTox records
    • Complete identifier data scraped from Wikipedia Chembox and Drugbox pages

    This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).

  10. Data from: Wikipedia Category Granularity (WikiGrain) data

    • zenodo.org
    csv, txt
    Updated Jan 24, 2020
    Cite
    Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
    Explore at:
    Available download formats: txt, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jürgen Lerner; Jürgen Lerner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

    The data has been generated from the database dump dated 20 October 2016 provided by the Wikimedia foundation licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

    WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

    The WikiGrain Data is analyzed in the paper

    Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

    ===============================================================
    Individual files (tables in comma-separated-values-format):

    ---------------------------------------------------------------
    * article_info.csv contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "granularity"
    (decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.

    - "is.FA"
    (boolean) True ('1') if the article is a featured article; false ('0') else.

    - "is.FA.or.GA"
    (boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

    - "is.top.importance"
    (boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

    - "number.of.revisions"
    (integer) Number of times a new version of the article has been uploaded.


    ---------------------------------------------------------------
    * article_to_tlc.csv
    is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
    The file contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "id.of.tlc"
    (integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

    - "title.of.tlc"
    (string) Title of the TLC in which the article is contained.

    ---------------------------------------------------------------
    * article_info_normalized.csv
    contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
    The file contains the following variables:

    - "id"
    Article id.

    - "is.FA"
    Boolean indicator for whether the article is featured.

    - "log1p.length"
    Length measured by the number of bytes.

    - "age"
    Age measured by the time since the first edit.

    - "log1p.number.of.edits"
    Number of times a new version of the article has been uploaded.

    - "log1p.number.of.reverts"
    Number of times a revision has been reverted to a previous one.

    - "log1p.number.of.contributors"
    Number of unique contributors to the article.

    - "number.of.characters.per.word"
    Average number of characters per word (one component of 'reading complexity').

    - "number.of.words.per.sentence"
    Average number of words per sentence (second component of 'reading complexity').

    - "number.of.level.1.sections"
    Number of first level sections in the article.

    - "number.of.level.2.sections"
    Number of second level sections in the article.

    - "number.of.categories"
    Number of categories the article is in.

    - "log1p.average.size.of.categories"
    Average size of the categories the article is in.

    - "log1p.number.of.intra.wiki.links"
    Number of links to pages in the English-language version of Wikipedia.

    - "log1p.number.of.external.references"
    Number of external references given in the article.

    - "log1p.number.of.images"
    Number of images in the article.

    - "log1p.number.of.templates"
    Number of templates that the article uses.

    - "log1p.number.of.inter.language.links"
    Number of links to articles in different language edition of Wikipedia.

    - "granularity"
    As in article_info.csv (but normalized to standard deviation one).
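    A small pandas sketch over the files described above (it assumes the CSVs sit in the working directory; the column names follow the tables above):

    ```
    import pandas as pd

    info = pd.read_csv("article_info.csv")
    tlc = pd.read_csv("article_to_tlc.csv")

    # Mean category granularity of featured vs. non-featured articles
    print(info.groupby("is.FA")["granularity"].mean())

    # Most common closest top-level categories (an article can belong to several TLCs)
    merged = tlc.merge(info[["id", "is.FA"]], on="id", how="left")
    print(merged["title.of.tlc"].value_counts().head(10))
    ```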

  11. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML...

    • zenodo.org
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Explore at:
    Available download formats: application/gzip, zip
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

    We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB in size, too large for Zenodo, and is therefore hosted externally on the Internet Archive; there are multiple options for downloading it.

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
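    A minimal reading sketch for one of these files, assuming one JSON revision object per line (the file name below is a placeholder, since the exact member names inside each directory are not listed here):

    ```
    import gzip
    import json

    path = "enwiki-20190301-pages-meta-history1.xml-p1p2062/part-000001.json.gz"  # placeholder name
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            rev = json.loads(line)
            # Fields listed above: id, parentid, timestamp, page_id, title, html, ...
            print(rev["page_id"], rev["id"], rev["timestamp"], len(rev["html"]))
    ```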

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump of https://dumps.wikimedia.org/enwiki/20190301 from wikitext to HTML. That old dump is not available anymore on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab .

  12. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Explore at:
    Available download formats: zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

    Invitation for Feedback

    The dataset is built as part of the Structured Contents initiative and is based on the Wikimedia Enterprise HTML snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project’s blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields

    The data fields are the same for every article. Noteworthy included fields:

    • name - title of the article.
    • identifier - ID of the article.
    • url - URL of the article.
    • version - metadata related to the latest specific revision of the article.
    • version.editor - editor-specific signals that can help contextualize the revision.
    • version.scores - assessments by ML models on the likelihood of a revision being reverted.
    • main entity - Wikidata QID the article is related to.
    • abstract - lead section, summarizing what the article is about.
    • description - one-sentence description of the article for quick reference.
    • image - main image representing the article's subject.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/

    Curation Rationale

    This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  13. Wiki Surveys: Open and Quantifiable Social Data Collection

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Matthew J. Salganik; Karen E. C. Levy (2023). Wiki Surveys: Open and Quantifiable Social Data Collection [Dataset]. http://doi.org/10.1371/journal.pone.0123483
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Matthew J. Salganik; Karen E. C. Levy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the social sciences, there is a longstanding tension between data collection methods that facilitate quantification and those that are open to unanticipated information. Advances in technology now enable new, hybrid methods that combine some of the benefits of both approaches. Drawing inspiration from online information aggregation systems like Wikipedia and from traditional survey research, we propose a new class of research instruments called wiki surveys. Just as Wikipedia evolves over time based on contributions from participants, we envision an evolving survey driven by contributions from respondents. We develop three general principles that underlie wiki surveys: they should be greedy, collaborative, and adaptive. Building on these principles, we develop methods for data collection and data analysis for one type of wiki survey, a pairwise wiki survey. Using two proof-of-concept case studies involving our free and open-source website www.allourideas.org, we show that pairwise wiki surveys can yield insights that would be difficult to obtain with other methods.

  14. Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their...

    • zenodo.org
    bin, zip
    Updated Feb 12, 2025
    Cite
    Jan Göpfert; Jan Göpfert; Patrick Kuckertz; Patrick Kuckertz; Jann M. Weinand; Jann M. Weinand; Detlef Stolten; Detlef Stolten (2025). Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia [Dataset]. http://doi.org/10.5281/zenodo.14858280
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Göpfert; Jan Göpfert; Patrick Kuckertz; Patrick Kuckertz; Jann M. Weinand; Jann M. Weinand; Detlef Stolten; Detlef Stolten
    Description

    The task of measurement extraction is typically approached in a pipeline manner, where 1) quantities are identified before 2) their individual measurement context is extracted (see our review paper). To support the development and evaluation of systems for measurement extraction, we present two large datasets that correspond to the two tasks:

    • Wiki-Quantities, a dataset for identifying quantities, and
    • Wiki-Measurements, a dataset for extracting measurement context for given quantities.

    The datasets are heuristically generated from Wikipedia articles and Wikidata facts. For a detailed description of the datasets, please refer to the upcoming corresponding paper:

    Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia. 2025. Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, and Detlef Stolten.

    Versions

    The datasets are released in different versions:

    • Processing level: the pre-processed versions can be used directly for training and evaluating models, while the raw versions can be used to create custom pre-processed versions or for other purposes. Wiki-Quantities is pre-processed for IOB sequence labeling, while Wiki-Measurements is pre-processed for SQuAD-style generative question answering.
    • Filtering level:
      • Wiki-Quantities is available in a raw, large, small, and tiny version: The raw version is the original version, which includes all the samples originally obtained. In the large version, all duplicates and near duplicates present in the raw version are removed. The small and tiny versions are subsets of the large version which are additionally filtered to balance the data with respect to units, properties, and topics.
      • Wiki-Measurements is available in a large, small, large_strict, small_strict, small_context, and large_strict_context version: The large version contains all examples minus a few duplicates. The small version is a subset of the large version with very similar examples removed. In the context versions, additional sentences are added around the annotated sentence. In the strict versions, the quantitative facts are more strictly aligned with the text.
    • Quality: all data has been automatically annotated using heuristics. In contrast to the silver data, the gold data has been manually curated.

    Format

    The datasets are stored in JSON format. The pre-processed versions are formatted for direct use for IOB sequence labeling or SQuAD-style generative question answering in NLP frameworks such as Huggingface Transformers. In the not pre-processed versions of the datasets, annotations are visualized using emojis to facilitate curation. For example:

    • Wiki-Quantities (only quantities annotated):
      • "In a 🍏100-gram🍏 reference amount, almonds supply 🍏579 kilocalories🍏 of food energy."
      • "Extreme heat waves can raise readings to around and slightly above 🍏38 °C🍏, and arctic blasts can drop lows to 🍏−23 °C to 0 °F🍏."
      • "This sail added another 🍏0.5 kn🍏."
    • Wiki-Measurements (measurement context for a single quantity; qualifiers and quantity modifiers are only sparsely annotated):
      • "The 🔭French national census🔭 of 📆2018📆 estimated the 🍊population🍊 of 🌶️Metz🌶️ to be 🍐116,581🍐, while the population of Metz metropolitan area was about 368,000."
      • "The 🍊surface temperature🍊 of 🌶️Triton🌶️ was 🔭recorded by Voyager 2🔭 as 🍐-235🍐 🍓°C🍓 (-391 °F)."
      • "🙋The Babylonians🙋 were able to find that the 🍊value🍊 of 🌶️pi🌶️ was ☎️slightly greater than☎️ 🍐3🍐, by simply 🔭making a big circle and then sticking a piece of rope onto the circumference and the diameter, taking note of their distances, and then dividing the circumference by the diameter🔭."

    The mapping of annotation types to emojis is as follows:

    • Basic quantitative statement:
      • Entity: 🌶️
      • Property: 🍊
      • Quantity: 🍏
      • Value: 🍐
      • Unit: 🍓
      • Quantity modifier: ☎️
    • Qualifier:
      • Temporal scope: 📆
      • Start time: ⏱️
      • End time: ⏰️
      • Location: 📍
      • Reference: 🙋
      • Determination method: 🔭
      • Criterion used: 📏
      • Applies to part: 🦵
      • Scope: 🔎
      • Some qualifier: 🛁

    Note that for each version of Wiki-Measurements sample IDs are randomly assigned. Therefore, they are not consistent, e.g., between silver small and silver large. The proportions of train, dev, and test sets are unusual because Wiki-Quantities and Wiki-Measurements are intended to be used in conjunction with other non-heuristically generated data.
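    For the not pre-processed versions, the emoji markers can be pulled out with plain regular expressions. A small sketch (using one of the example sentences above) that extracts the quantity spans:

    ```
    import re

    # Quantity spans are wrapped in 🍏 ... 🍏 in the emoji-annotated format described above.
    QUANTITY = re.compile(r"🍏(.*?)🍏")

    sentence = ("In a 🍏100-gram🍏 reference amount, almonds supply "
                "🍏579 kilocalories🍏 of food energy.")
    print(QUANTITY.findall(sentence))  # ['100-gram', '579 kilocalories']
    ```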

    Evaluation

    The evaluation directories contain the manually validated random samples used for evaluation. The evaluation is based on the large versions of the datasets. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements showed that 100% of the Wiki-Quantities samples and 94% (or 84% if strictly scored) of the Wiki-Measurements samples were correct.

    License

    In accordance with Wikipedia's and Wikidata's licensing terms, the datasets are released under the CC BY-SA 4.0 license, except for Wikidata facts in ./Wiki-Measurements/raw/additional_data.json, which are released under the CC0 1.0 license (the texts are still CC BY-SA 4.0).

    About Us

    We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis belonging to the Forschungszentrum Jülich. Our interdisciplinary department's research is focusing on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate performance, emissions and costs of energy systems. The results are used for performing comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government’s greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.

    Acknowledgements

    The authors would like to thank the German Federal Government, the German State Governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".

  15. A meta analysis of Wikipedia's coronavirus sources during the COVID-19...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    txt
    Updated Sep 8, 2022
    + more versions
    Cite
    (2022). A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7806
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 8, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke a record for most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read - and in some cases also contribute - knowledge, information and data about the virus to an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia’s coronavirus content, and how the scientific research in this field was represented on Wikipedia. Using citations as a readout, we try to map how COVID-19-related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which different sources informed the COVID-19 content, is key to understanding the digital knowledge echosphere during the pandemic.

    To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. For our second strategy, we searched EuroPMC for COVID-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints) and selected scientific papers from 2019 onwards, which we compared to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). This search led to 231 Wikipedia articles containing at least one citation from the EuroPMC search or belonging to the Wikipedia COVID-19 project pages containing DOIs.

    Next, from our corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method allowed us to produce annotated tables of the citations and extracted information (such as books, websites and newspapers) in each Wikipedia article.

    Files used as input and the extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code and other bash/python script utilities related to this project.
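    The paper's exact regular expressions are not reproduced here, but a commonly used Crossref-style DOI pattern illustrates the extraction step (the wikitext snippet is illustrative):

    ```
    import re

    # A widely used pattern for modern DOIs (Crossref's recommendation).
    DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

    wikitext = "{{cite journal | doi = 10.1371/journal.pone.0190046 | title = ...}}"
    print(DOI_PATTERN.findall(wikitext))  # ['10.1371/journal.pone.0190046']
    ```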

  16. Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/NQSHQD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2014) "Consider the Redirect: A Missing Dimension of Wikipedia Research." In Proceedings of the 10th International Symposium on Open Collaboration (OpenSym 2014). ACM Press. doi: 10.1145/2641580.2641616

    This is an archival version of the data and software released with the paper. All of these data were originally (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-redirects/

    In wikis, redirects are special pages that silently take readers from the page they are visiting to another page in the wiki. In the English Wikipedia, redirects make up more than half of all article pages. Different Wikipedia data sources handle redirects differently. For example, the MediaWiki API will automatically "follow" redirects, but the XML database dumps treat redirects like normal articles. In both cases, redirects are often invisible to researchers.

    Because redirects constitute a majority of all pages and see a large portion of all traffic, Wikipedia researchers need to take redirects into account or their findings may be incomplete or incorrect. For example, the histogram on this page shows the distribution of edits across pages in Wikipedia for every page, and for non-redirects only. Because redirects are almost never edited, the distributions are very different. Similarly, because redirects are viewed but almost never edited, any study of views over articles should also take redirects into account. Because redirects can change over time, the snapshots of redirects stored by Wikimedia and published by the Wikimedia Foundation are incomplete. Taking redirects into account fully involves looking at the content of every single revision of every article to determine both when and where pages redirect.

    Much more detail can be found in Consider the Redirect: A Missing Dimension of Wikipedia Research, a short paper that we have written to accompany this dataset and these tools. If you use this software or these data, we would appreciate it if you cite the paper. This dataset was previously hosted at this now obsolete URL: http://networkcollectiv.es/wiki-redirects/
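    A quick way to see the redirect-following behavior of the MediaWiki API mentioned above, using the public English Wikipedia API ("UK" is a redirect to "United Kingdom"):

    ```
    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": "UK", "redirects": 1, "format": "json"},
        headers={"User-Agent": "redirect-demo/0.1 (research example)"},
        timeout=30,
    )
    for r in resp.json()["query"].get("redirects", []):
        print(r["from"], "->", r["to"])  # UK -> United Kingdom
    ```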

  17. Data from: Robust clustering of languages across Wikipedia growth

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 19, 2017
    Cite
    Kristina Ban; Matjaž Perc; Zoran Levnajić (2017). Robust clustering of languages across Wikipedia growth [Dataset]. http://doi.org/10.5061/dryad.sk0q2
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 19, 2017
    Dataset provided by
    University of Maribor
    Faculty of Information Studies, Ljubljanska cesta 31A, 8000 Novo Mesto, Slovenia
    Authors
    Kristina Ban; Matjaž Perc; Zoran Levnajić
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Wikipedia is the largest existing knowledge repository that is growing through genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over 5 million articles, comparatively little is known about the behaviour and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here, we use a subset of these data, consisting of 14 962 different articles, each of which exists in 26 different languages, from Arabic to Ukrainian. We study the growth of Wikipedias in these languages over a time span of 15 years. We show that, while an average article follows a random path from one language to another, there exist six well-defined clusters of Wikipedias that share common growth patterns. The make-up of these clusters is remarkably robust against the method used for their determination, as we verify via four different clustering methods. Interestingly, the identified Wikipedia clusters have little correlation with language families and groups. Rather, the growth of Wikipedia across different languages is governed by different factors, ranging from similarities in culture to information literacy.

  18. People Wikipedia Data

    • kaggle.com
    zip
    Updated Nov 20, 2017
    Cite
    Sameer Mahajan (2017). People Wikipedia Data [Dataset]. https://www.kaggle.com/sameersmahajan/people-wikipedia-data
    Explore at:
    zip(30838712 bytes)Available download formats
    Dataset updated
    Nov 20, 2017
    Authors
    Sameer Mahajan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is Wikipedia data on people.

    Content

    It contains URIs, names of people, and text from their Wikipedia pages.

    Acknowledgements

    This dataset is used by the Coursera Machine Learning Foundations course, which is part of the Machine Learning Specialization. I have transformed it and made it available as a .csv file so that it can be used with open-source tools such as scikit-learn.

    Inspiration

    You can find the nearest neighbors for various people based on the text of their Wikipedia pages, as sketched below.
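    A minimal sketch of that idea, assuming the CSV is named people_wiki.csv and has the columns URI, name, and text suggested by the description; adjust the file and column names to the actual file.

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neighbors import NearestNeighbors

        # Sketch: find people whose Wikipedia text is most similar to a query person.
        # File and column names are assumptions based on the dataset description.
        df = pd.read_csv("people_wiki.csv")

        tfidf = TfidfVectorizer(stop_words="english")
        X = tfidf.fit_transform(df["text"])

        nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(X)
        idx = df.index[df["name"] == "Barack Obama"][0]   # any person present in the data
        _, neighbours = nn.kneighbors(X[idx])
        print(df["name"].iloc[neighbours[0]].tolist())    # the person plus 5 nearest neighbours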

  19. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    binAvailable download formats
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
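    A sketch of the kind of hashing described above; the exact encoding or salting used by the authors is not documented here, so treat this purely as an illustration.

        import hashlib

        # Illustration of SHA-256 anonymisation of a Reddit identifier.
        # The real pipeline's exact input format is an assumption.
        def anonymise_id(reddit_id: str) -> str:
            return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

        print(anonymise_id("t3_exampleid"))  # hypothetical post ID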

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    subreddit_id (TEXT): The unique identifier for the subreddit.
    crosspost_parent_id (TEXT): The ID of the original Reddit post if this post is a crosspost.
    post_id (TEXT): Unique identifier for the Reddit post.
    created_at (TIMESTAMP): The timestamp when the post was created.
    updated_at (TIMESTAMP): The timestamp when the post was last updated.
    language_code (TEXT): The language code of the post.
    score (INTEGER): The score (upvotes minus downvotes) of the post.
    upvote_ratio (REAL): The ratio of upvotes to total votes.
    gildings (INTEGER): Number of awards (gildings) received by the post.
    num_comments (INTEGER): Number of comments on the post.

    Table: comments

    subreddit_id (TEXT): The unique identifier for the subreddit.
    post_id (TEXT): The ID of the Reddit post the comment belongs to.
    parent_id (TEXT): The ID of the parent comment (if a reply).
    comment_id (TEXT): Unique identifier for the comment.
    created_at (TIMESTAMP): The timestamp when the comment was created.
    last_modified_at (TIMESTAMP): The timestamp when the comment was last modified.
    score (INTEGER): The score (upvotes minus downvotes) of the comment.
    upvote_ratio (REAL): The ratio of upvotes to total votes for the comment.
    gilded (INTEGER): Number of awards (gildings) received by the comment.

    Table: postlinks

    post_id (TEXT): Unique identifier for the Reddit post.
    end_processed_valid (INTEGER): Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the Reddit post.
    final_valid (INTEGER): Whether the final URL from the post resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
    final_url (TEXT): The final URL after redirections.
    redirected (INTEGER): Indicator of whether the posted URL was redirected (1) or not (0).
    in_title (INTEGER): Indicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    comment_id (TEXT): Unique identifier for the Reddit comment.
    end_processed_valid (INTEGER): Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the comment.
    final_valid (INTEGER): Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
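    As a quick illustration of how the tables fit together, the following sketch queries the SQLite database with Python's standard library; the file name wikireddit.db is an assumption, so substitute the database file actually shipped with the deposit.

        import sqlite3

        # Sketch: top-scoring posts together with the resolved Wikipedia URLs they link to,
        # joining the posts and postlinks tables documented above.
        conn = sqlite3.connect("wikireddit.db")  # hypothetical file name

        query = """
        SELECT p.post_id, p.score, p.num_comments, l.final_url
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1          -- keep links that resolved successfully
        ORDER BY p.score DESC
        LIMIT 10;
        """

        for post_id, score, num_comments, final_url in conn.execute(query):
            print(post_id, score, num_comments, final_url)

        conn.close()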

  20. Replication Data for: Measuring Wikipedia Article Quality in One Dimension...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 25, 2024
    Cite
    TeBlunthuis, Nathan (2024). Replication Data for: Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression [Dataset]. http://doi.org/10.7910/DVN/U5V0G1
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan
    Description

    This dataset provides code, data, and instructions for replicating the analysis of "Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression," published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single-dimensional measure of quality that is amenable to statistical analysis and well-calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use the method on a new dataset, you should obtain the actively maintained version of the code from this git repository. If you attempt to replicate part of this repository, please let me know via an email to nathante@uw.edu.

    Replicating the Analysis from the OpenSym Paper

    This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset. They can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages. The prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages. This should make it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz. Extracting this with tar xzvf code.tar.gz is the first step.

    Getting Set Up

    You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make installed, as they should be on any Unix system. To install brms you need a working C++ compiler. If you run into trouble, see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring, which was done on Ubuntu 18.04.5, and building, which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTeX. First, extract the articlequality.tar.gz, RemembR.tar.gz and pyRembr.tar.gz archives. Then, install the following:

    Python Packages

    Running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies. Then navigate into the pyRembr directory and run python3 setup.py install.

    R Packages

    Run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms, see the instructions for installing Rstan noted above.

    Drawing a Sample of Labeled Articles

    I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive. Those only interested in replicating the models and analyses should skip this section.

    Extracting Metadata from Wikipedia Dumps

    Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files. The version of wikiq that was used is provided here. Running wikiq on a full dump of English Wikipedia in a reasonable amount of time requires considerable computing resources. For this project, wikiq was run on Hyak, a high-performance computer at the University of Washington. The code for doing so is highly specific to Hyak. For transparency, and in case it helps others using similar academic computers, this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz.

    Obtaining Quality Labels for Articles

    We obtain up-to-date labels for each article using the articlequality python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the xml dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz.

    Taking a Sample of Quality Labels

    I used Apache Spark to merge the metadata from wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
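    For convenience, the documented setup commands can also be chained from Python; this is only a thin wrapper around the commands quoted above and assumes you run it from the directory containing the extracted archives.

        import subprocess

        # Run the setup steps quoted in the replication instructions above.
        # The description recommends doing the pip install inside a fresh virtual environment.
        steps = [
            ["tar", "xzvf", "code.tar.gz"],
            ["pip3", "install", "-r", "requirements.txt"],
            ["Rscript", "install_requirements.R"],
        ]
        for cmd in steps:
            subprocess.run(cmd, check=True)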
