100+ datasets found
  1. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Available download formats
    zip (4293465577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    This beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people articles with infoboxes on English Wikipedia; it is output as JSON files compressed in a tar.gz archive.

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy Included Fields:
    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
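
    The archive can be read in place without unpacking it. The Python sketch below is a minimal example, assuming each tar member is a JSON Lines file with one article object per line; that layout is not spelled out in the description above, so treat it as an assumption and inspect a member first.

    import tarfile
    import json

    # Minimal sketch: stream articles out of the people archive without
    # extracting it to disk. Assumes (not confirmed by the dataset docs)
    # that each tar member is a JSON Lines file, one article per line.
    ARCHIVE = "wme_people_infobox.tar.gz"  # local path to the download

    with tarfile.open(ARCHIVE, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            for raw in fh:
                article = json.loads(raw)
                # Field names taken from the list above; others may exist too.
                name = article.get("name")
                description = article.get("description")
                infoboxes = article.get("infoboxes") or []
                print(name, "-", description, f"({len(infoboxes)} infobox(es))")
                break  # drop this break to process every article
            break  # drop this break to process every file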

    Stats

    Infoboxes only:
    • Size of compressed file: 2 GB
    • Size of uncompressed file: 11 GB

    Infoboxes + sections + short description:
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • # people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # of people articles with infoboxes: 1,559,985

    Totals for this dataset:
    • Total number of people articles in this dataset: 1,559,985
    • ...that have a short description: 1,416,701
    • ...that have an infobox: 1,559,985
    • ...that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English language edition of Wikipedia (https://en.wikipedia.org/), written by that community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  2. wikimedia

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Common Pile (2024). wikimedia [Dataset]. https://huggingface.co/datasets/common-pile/wikimedia
    Dataset updated
    Jul 27, 2024
    Dataset authored and provided by
    Common Pile
    Description

    Wikimedia

      Description
    

    Official Wikimedia wikis are released under a CC BY-SA license. We downloaded the official database dumps from March 2025 of the English-language wikis that are directly managed by the Wikimedia Foundation. These database dumps include the wikitext (MediaWiki's custom markup language) for each page, as well as talk pages, where editors discuss changes made to a page. We only use the most recent version of each page. We converted wikitext to plain… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/wikimedia.

  3. Wikimedia Dump enwiki-20220901

    • kaggle.com
    zip
    Updated Sep 7, 2022
    Cite
    shy_ (2022). Wikimedia Dump enwiki-20220901 [Dataset]. https://www.kaggle.com/datasets/shyguy/wikimedia-dump-enwiki20220901
    Available download formats
    zip (25110528357 bytes)
    Dataset updated
    Sep 7, 2022
    Authors
    shy_
    Description

    The Wikimedia dump of English Wikipedia dated 1 September 2022: https://dumps.wikimedia.org/enwiki/20220901/. For more information, especially regarding the license, visit https://dumps.wikimedia.org/

  4. Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20 April 2017

    • datadryad.org
    • search.dataone.org
    zip
    Updated Aug 15, 2017
    Cite
    R. Stuart Geiger; Aaron Halfaker (2017). Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20 April 2017 [Dataset]. http://doi.org/10.6078/D1FD3K
    Available download formats
    zip
    Dataset updated
    Aug 15, 2017
    Dataset provided by
    Dryad
    Authors
    R. Stuart Geiger; Aaron Halfaker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 15, 2017
    Description

    See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.

  5. [deprecated] Reference and map usage across Wikimedia wiki pages

    • figshare.com
    Updated Dec 18, 2023
    Cite
    Adam Wight (2023). [deprecated] Reference and map usage across Wikimedia wiki pages [Dataset]. http://doi.org/10.6084/m9.figshare.24064941.v2
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Adam Wight
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Errata

    Please note that this data set includes some major inaccuracies and should not be used. The data files will be unpublished from their hosting and this metadata will eventually be unpublished as well. A short list of issues discovered:

    • Many dumps were truncated (T345176).
    • Pages appeared multiple times, with different revision numbers.
    • Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article.
    • Reference similarity was overcounted when more than two refs shared content.

    In particular, the truncation and duplication mean that the aggregate statistics are inaccurate and can't be compared to other data points.

    Overview

    This data was produced by Wikimedia Germany's Technical Wishes team, and focuses on real-world usage statistics for reference footnotes (Cite extension) and maps (Kartographer extension) across all main-namespace pages (articles) on about 700 Wikimedia wikis. It was produced by processing the Wikimedia Enterprise HTML dumps, which are a fully-parsed rendering of the pages, and by querying the MediaWiki query API to get more detailed information about maps. The data is also accompanied by several more general columns about each page for context.

    Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research, but the goal in our case was to understand the potential impact of improving the ways in which references can be reused within a page. Gathering the map data was to understand the actual impact of improvements made to how external data can be integrated in maps. Both tasks are complicated by the heavy use of wikitext templates, which obscures when and how the underlying tags are being used. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.

    License

    All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/. The source code is distributed under BSD-3-Clause.

    Source code and execution

    The program used to create these files is our HTML dump scraper, version 0.1, written in Elixir. It can be run locally, but we used the Wikimedia Cloud VPS in order to have intra-datacenter access to the HTML dump file inputs. Our production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs".

    Execution was interrupted and restarted many times in order to make small fixes to the code. We expect that the only class of inconsistency this could have caused is that a small number of article records may potentially be repeated in the per-page summary files, and these pages' statistics duplicated in the aggregates. Whatever the cause, we've found many of these duplicate errors, and counts are given in the "duplicates.txt" file.

    The program is pluggable and configurable; it can be extended by writing new analysis modules. Our team plans to continue development and to run it again in the near future to track evolution of the collected metrics over time.

    Format

    All fields are documented in metrics.md as part of the code repository. Outputs are mostly split into separate ND-JSON (newline-delimited JSON) and JSON files, and grand totals are gathered into a single CSV file.

    Per-page summary files

    The first phase of scraping produces a fine-grained report summarizing each page into a few statistics. Each file corresponds to a wiki (using its database name, for example "enwiki" for English Wikipedia) and each line of the file is a JSON object corresponding to a page.

    Example file name: enwiki-20230601-page-summary.ndjson.gz

    Example metrics:
    • How many tags are created from templates vs. directly in the article.
    • How many references contain a template transclusion to produce their content.
    • How many references are unnamed, automatically named, or manually named.
    • How often references are reused via their name.
    • Copy-pasted references that share the same or almost the same content, on the same page.
    • Whether an article has more than one list.

    Mapdata files

    Example file name: enwiki-20230601-mapdata.ndjson.gz

    These files give the count of different types of map "external data" on each page. A line will either be empty "{}" or it will include the revid and number of external data references for maps on that page. External data is tallied in 9 different buckets, starting with "page" meaning that the source is .map data from the Wikimedia Commons server, or geoline / geoshape / geomask / geopoint and the data source, either an "ids" (Wikidata Q-ID) or "query" (SPARQL query) source.

    Mapdata summary files

    Each wiki has a summary of map external data counts, which contains a sum for each type count.

    Example file name: enwiki-20230601-mapdata-summary.json

    Wiki summary files

    Per-page statistics are rolled up to the wiki level, and results are stored in a separate file for each wiki. Some statistics are summed, some are averaged; check the suffix on the column name for a hint.

    Example file name: enwiki-20230601-summary.json

    Top-level summary file

    There is one file which aggregates the wiki summary statistics, discarding non-numeric fields and formatting as a CSV for ease of use: all-wikis-20230601-summary.csv
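
    A minimal peek at one of the per-page summary files described above, in Python; it assumes only what the description states (gzip-compressed, newline-delimited JSON, one object per page) and prints a few raw records so the real field names can be checked against metrics.md.

    import gzip
    import json

    # Example file name taken from the description; adjust to your copy.
    PATH = "enwiki-20230601-page-summary.ndjson.gz"

    with gzip.open(PATH, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            page = json.loads(line)   # one JSON object per page
            print(page)               # inspect the keys before relying on them
            if i >= 4:                # only peek at the first few records
                break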

  6. Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia Research

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/NQSHQD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2014) "Consider the Redirect: A Missing Dimension of Wikipedia Research." In Proceedings of the 10th International Symposium on Open Collaboration (OpenSym 2014). ACM Press. doi: 10.1145/2641580.2641616

    This is an archival version of the data and software released with the paper. All of these data were originally (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-redirects/

    In wikis, redirects are special pages that silently take readers from the page they are visiting to another page in the wiki. In the English Wikipedia, redirects make up more than half of all article pages. Different Wikipedia data sources handle redirects differently. For example, the MediaWiki API will automatically "follow" redirects, but the XML database dumps treat redirects like normal articles. In both cases, redirects are often invisible to researchers. Because redirects constitute a majority of all pages and see a large portion of all traffic, Wikipedia researchers need to take redirects into account or their findings may be incomplete or incorrect.

    For example, the histogram on this page shows the distribution of edits across pages in Wikipedia for every page, and for non-redirects only. Because redirects are almost never edited, the distributions are very different. Similarly, because redirects are viewed but almost never edited, any study of views over articles should also take redirects into account. Because redirects can change over time, the snapshots of redirects stored by Wikimedia and published by the Wikimedia Foundation are incomplete. Taking redirects into account fully involves looking at the content of every single revision of every article to determine both when and where pages redirect.

    Much more detail can be found in Consider the Redirect: A Missing Dimension of Wikipedia Research, a short paper that we have written to accompany this dataset and these tools. If you use this software or these data, we would appreciate if you cite the paper. This dataset was previously hosted at this now obsolete URL: http://networkcollectiv.es/wiki-redirects/

  7. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring; Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Markus Döring; Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  8. Wikimedia Commons photos by prominent users and their usage across the web

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Nov 16, 2020
    Cite
    Leva, Federico (2020). Wikimedia Commons photos by prominent users and their usage across the web [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3355708
    Dataset updated
    Nov 16, 2020
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Leva, Federico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Extract from the Wikimedia Commons database containing a list of users selected by the community for having uploaded high quality photos; list of 310k photos of theirs and of the subset of 59k photos sent to Infringement.Report for matching; list of domains whose matches were ignored as not useful for copyleft license enforcement. Domains were then matched for their rank in the Tranco list and the number of image usages found, and ranked by a mix of the two criteria.

  9. Teahouse corpus

    • data.wu.ac.at
    .txt, csv
    Updated Apr 12, 2015
    Cite
    Wikimedia (2015). Teahouse corpus [Dataset]. https://data.wu.ac.at/schema/datahub_io/MmZiZjJmNWEtM2E2OS00NGZmLTgyMjUtMDk1MmVhNTQ0NGU1
    Available download formats
    .txt, csv
    Dataset updated
    Apr 12, 2015
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    The Teahouse corpus is a set of questions asked at the Wikipedia Teahouse, a peer support forum for new Wikipedia editors. This corpus contains data from its first two years of operation.

    The Teahouse started as an editor engagement initiative and Fellowship project. It was launched in February 2012 by a small team working with the Wikimedia Foundation. Our intention was to pilot a new, scalable model for teaching Wikipedia newcomers the ropes of editing in a friendly and engaging environment.

    The ultimate goal of the pilot project was to increase the retention of new Wikipedia editors (most of whom give up and leave within their first 24 hours post-registration) through early proactive outreach. The project was particularly focused on retaining female newcomers, who are woefully underrepresented among the regular contributors to the encyclopedia.

    The Teahouse lives on as a vibrant, self-sustaining and community-driven project. All Teahouse participants are volunteers: no one is told when, how, or how much they must contribute.

    See the README files associated with each datafile for a schema of the data fields in that file.

    Read on for more info on potential applications, the provenance of these data, and links to related resources.

    Potential Applications

    or, what is it good for?

    The Teahouse corpus consists of good quality data and rich metadata around social Q&A interactions in a particular setting: new user help requests in a large, collaborative online community.

    More generally, this corpus is a valuable resource for research on conversational dynamics in online, asynchronous discussions.

    Qualitative textual analysis could yield insights into the kinds of issues faced by newcomers in established online collaborations.

    Linguistic analysis could examine the impact of syntactic and semantic features related to politeness, sentiment, question framing, or other rhetorical strategies on discussion outcomes.

    Response patterns (questioner replies and answers) within each thread could be used to map network relationships, or to investigate how participation by the initiator of a thread, or the number of participants, correlates with thread length or interactivity (the interval of time between posts).

    The corpus is large and rich enough to provide both training and test data for machine learning applications.

    Finally, the data provided here can be extended and compared with other publicly available datasets of Wikipedia, allowing researchers to examine relationships between editors' participation within the Teahouse Q&A forum and their previous, concurrent, and subsequent editing activities within millions of other articles, meta-content, and discussion spaces on Wikipedia.

    Data hygiene

    or, how the research sausage was made

    Parsing wikitext presents many challenges: the MediaWiki editing interface is deliberately underspecified in order to maximize flexibility for contributors. This can make it difficult to tell the difference between different types of contribution--say, fixing a typo or answering a question.

    The Teahouse Q&A board was designed to provide a more structured workflow than normal wiki talk pages, and instrumented to identify certain kinds of contributions (questions and answers) and isolate them from the 'noisy' background datastream of incidental edits to the Q&A page. The post-processing of the data presented here favored precision over recall: to provide a good quality set of questions, rather than a complete one.

    In cases where it wasn't easy to identify whether an edit contained a question or answer, these data have not been included. However, it is hard to account for all ambiguous or invalid cases: caveat quaesitor!

    Our approach to data inclusion was conservative. The number of questioner replies and answers to any given question may be under-counted, but is unlikely to be over-counted. However, our spot checks and analysis of the data suggest that the majority of responses are accounted for, and that the distribution of "missed" responses is randomly distributed.

    The Teahouse corpus only contains questions and answers by registered users of Wikipedia who were logged in when they participated. IP addresses can be linked to an individual's physical location. On Wikipedia, edits by logged-out and unregistered users are identified by the user's current IP address. Although all edits to Wikipedia are legally public and freely licensed, we have redacted IP edits from this dataset in deference to user privacy. Researchers interested in those data can find them in other public Wikipedia datasets.

    Possible future additions

    Additional data about these Q&A interactions has been collected, and other data are retrievable. Examples of data that could be included in future revisions of the corpus at low cost include:

    • more metadata about the people asking questions:
      • how many edits had they made before asking their (first) question?
      • when did they join Wikipedia?
      • were they explicitly invited to participate in the Teahouse, or did they locate the forum by other means?
      • did the questioner also create a guest profile on the Teahouse introductions page?
    • more metadata about the people answering the questions:
      • were they a Teahouse host at the time they answered a question?

    Examples of data that could be included in future revisions of the corpus at reasonable cost:

    • full text of answers to questions, including replies by original questioner
    • full text of profiles created by Teahouse guests and hosts (some privacy considerations here; contact corpus maintainer directly if interested in these data)

    See also

  10. Wikimedia editor activity (monthly)

    • figshare.com
    bz2
    Updated Dec 17, 2019
    Cite
    Aaron Halfaker (2019). Wikimedia editor activity (monthly) [Dataset]. http://doi.org/10.6084/m9.figshare.1553296.v1
    Available download formats
    bz2
    Dataset updated
    Dec 17, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Halfaker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a row for every (wiki, user, month) that contains a count of all 'revisions' saved and a count of those revisions that were 'archived' when the page was deleted. For more information, see https://meta.wikimedia.org/wiki/Research:Monthly_wikimedia_editor_activity_dataset

    Fields:
    • wiki -- The dbname of the wiki in question ("enwiki" == English Wikipedia, "commonswiki" == Commons)
    • month -- YYYYMM
    • user_id -- The user's identifier in the local wiki
    • user_name -- The user name in the local wiki (from the 'user' table)
    • user_registration -- The recorded registration date for the user in the 'user' table
    • archived -- The count of deleted revisions saved in this month by this user
    • revisions -- The count of all revisions saved in this month by this user (archived or not)
    • attached_method -- The method by which this user attached this account to their global account
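
    A minimal loading sketch in Python/pandas, assuming the bz2 file decompresses to a tab-separated table with a header row matching the fields above; the file name is hypothetical and the actual layout should be checked against the download.

    import pandas as pd

    PATH = "editor_month.tsv.bz2"  # hypothetical file name

    # pandas can read bz2-compressed delimited files directly.
    df = pd.read_csv(PATH, sep="\t", compression="bz2")

    # Monthly count of English Wikipedia editors with at least 5 revisions.
    enwiki = df[df["wiki"] == "enwiki"]
    active = (
        enwiki[enwiki["revisions"] >= 5]
        .groupby("month")["user_id"]
        .nunique()
        .sort_index()
    )
    print(active.tail())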

  11. Data from: Wikipedia Category Granularity (WikiGrain) data

    • zenodo.org
    csv, txt
    Updated Jan 24, 2020
    Cite
    Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
    Available download formats
    txt, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jürgen Lerner; Jürgen Lerner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

    The data has been generated from the database dump dated 20 October 2016, provided by the Wikimedia Foundation and licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

    WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

    The WikiGrain Data is analyzed in the paper

    Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

    ===============================================================
    Individual files (tables in comma-separated-values-format):

    ---------------------------------------------------------------
    * article_info.csv contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "granularity"
    (decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.

    - "is.FA"
    (boolean) True ('1') if the article is a featured article; false ('0') else.

    - "is.FA.or.GA"
    (boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

    - "is.top.importance"
    (boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

    - "number.of.revisions"
    (integer) Number of times a new version of the article has been uploaded.


    ---------------------------------------------------------------
    * article_to_tlc.csv
    is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
    The file contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "id.of.tlc"
    (integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

    - "title.of.tlc"
    (string) Title of the TLC in which the article is contained.

    ---------------------------------------------------------------
    * article_info_normalized.csv
    contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
    The file contains the following variables:

    - "id"
    Article id.

    - "is.FA"
    Boolean indicator for whether the article is featured.

    - "log1p.length"
    Length measured by the number of bytes.

    - "age"
    Age measured by the time since the first edit.

    - "log1p.number.of.edits"
    Number of times a new version of the article has been uploaded.

    - "log1p.number.of.reverts"
    Number of times a revision has been reverted to a previous one.

    - "log1p.number.of.contributors"
    Number of unique contributors to the article.

    - "number.of.characters.per.word"
    Average number of characters per word (one component of 'reading complexity').

    - "number.of.words.per.sentence"
    Average number of words per sentence (second component of 'reading complexity').

    - "number.of.level.1.sections"
    Number of first level sections in the article.

    - "number.of.level.2.sections"
    Number of second level sections in the article.

    - "number.of.categories"
    Number of categories the article is in.

    - "log1p.average.size.of.categories"
    Average size of the categories the article is in.

    - "log1p.number.of.intra.wiki.links"
    Number of links to pages in the English-language version of Wikipedia.

    - "log1p.number.of.external.references"
    Number of external references given in the article.

    - "log1p.number.of.images"
    Number of images in the article.

    - "log1p.number.of.templates"
    Number of templates that the article uses.

    - "log1p.number.of.inter.language.links"
    Number of links to articles in different language edition of Wikipedia.

    - "granularity"
    As in article_info.csv (but normalized to standard deviation one).
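
    As a quick orientation to article_info.csv, the Python sketch below loads it with pandas and compares category granularity between featured and non-featured articles; it assumes only the columns documented above.

    import pandas as pd

    df = pd.read_csv("article_info.csv")

    # "is.FA" is stored as 0/1; group on it and summarize granularity.
    summary = df.groupby("is.FA")["granularity"].agg(["count", "mean", "median"])
    print(summary)

    # Correlation between granularity and revision count, for context.
    print(df["granularity"].corr(df["number.of.revisions"]))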

  12. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Available download formats
    zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles, including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

    Invitation for Feedback

    The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project's blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise's homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields

    The data fields are the same across languages. Noteworthy included fields:
    • name - title of the article.
    • identifier - ID of the article.
    • url - URL of the article.
    • version - metadata related to the latest specific revision of the article.
    • version.editor - editor-specific signals that can help contextualize the revision.
    • version.scores - assessments by ML models on the likelihood of a revision being reverted.
    • main entity - Wikidata QID the article is related to.
    • abstract - lead section, summarizing what the article is about.
    • description - one-sentence description of the article for quick reference.
    • image - main image representing the article's subject.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
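
    A minimal reading sketch in Python for one of the per-article JSON Lines files after unzipping the download; the file name and the "name" key on infobox objects are assumptions, so inspect a record (and the data dictionary linked above) before relying on them.

    import json

    PATH = "enwiki_structured_contents.jsonl"  # hypothetical file name

    with open(PATH, encoding="utf-8") as fh:
        for line in fh:
            article = json.loads(line)  # one full article per line
            print(article["name"], article.get("url"))
            version = article.get("version") or {}
            print("  revision scores:", version.get("scores"))
            for box in article.get("infoboxes") or []:
                print("  infobox:", box.get("name"))  # "name" is an assumed key
            break  # remove to scan the whole file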

    Curation Rationale

    This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts are both focused on pre-parsing Wikipedia snippets as well as connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  13. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

    • zenodo.org
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Available download formats
    application/gzip, zip
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

    We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia's full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7TB large -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive, where multiple download options are available.

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
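
    A minimal reading sketch in Python for one split directory of Part 1, assuming each gzipped file is newline-delimited JSON with the fields listed above; the directory name below only illustrates the naming scheme and is not an exact split.

    import gzip
    import json
    from pathlib import Path

    DIR = Path("enwiki-20190301-pages-meta-history1.xml-p1p1000")  # illustrative name

    for gz_path in sorted(DIR.glob("*.gz")):
        with gzip.open(gz_path, "rt", encoding="utf-8") as fh:
            for line in fh:
                rev = json.loads(line)  # one revision per line
                print(rev["page_id"], rev["id"], rev["timestamp"], rev["title"])
                html = rev["html"]      # parsed HTML content of this revision
                break  # remove to iterate all revisions
        break  # remove to iterate all files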

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump of https://dumps.wikimedia.org/enwiki/20190301 from wikitext to HTML. That old dump is not available anymore on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab .

  14. Archived Data from the Education Program Extension

    • data-staging.niaid.nih.gov
    Updated Jan 7, 2025
    Cite
    Varella, Flávia; Figueredo, Danielly (2025). Archived Data from the Education Program Extension [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14525245
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Universidade Federal de Santa Catarina
    Authors
    Varella, Flávia; Figueredo, Danielly
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Education Program Extension was a MediaWiki extension developed by the Wikimedia Foundation to support the Wikipedia Education Program. This extension aimed to facilitate the integration of Wikipedia into educational environments by enabling the tracking and management of groups of editors, such as students and instructors participating in educational projects.

    Launched in 2011 and first implemented on the English Wikipedia, the extension provided tools for monitoring editor contributions, organizing course pages, and managing assignments. Despite its initial promise, the tool faced significant challenges over time, including security vulnerabilities and usability issues, which ultimately led to its official discontinuation in 2018.

    The projects registered in the extension were archived and remain accessible for consultation here. This database represents an extraction of information preserved by the Wikimedia Foundation, encompassing educational projects conducted across Wikipedia, Wikiversity, Wikinews, Wikisource, and Wiktionary. These projects span 18 languages, showcasing a broad array of collaborative educational initiatives that contributed to the Wikimedia ecosystem.

  15. Wikipedia user preferences

    • data.wu.ac.at
    tsv
    Updated Oct 11, 2013
    Cite
    Wikimedia (2013). Wikipedia user preferences [Dataset]. https://data.wu.ac.at/odso/datahub_io/ZTlhZGE0MjctODkzYy00OGQzLWE1MWUtNGQxYmQ1YzFiOTY3
    Available download formats
    tsv (36 files, ranging from 1,194 to 466,106 bytes)
    Dataset updated
    Oct 11, 2013
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data on user preferences set by active Wikipedia editors.

    Active editors are defined as registered users with at least 5 edits per month in a given project. The dumps were generated on 2012-10-10 and include data for the top 10 Wikipedias (de, en, es, fr, it, ja, nl, pl, pt, ru).

    For each project, 4 different data dumps are available:

    [project]_active_20121010.tsv The list of active editors whose prefs were extracted, along with their edit count in the 2012-09-10 - 2012-10-10 period. This is non-aggregate, public data. Note that bots and globally attached users are included.

    [project]_prefs_all_20121010.tsv Unique user count for preferences set to non-empty value. Note that the way in which MediaWiki and various extensions handle defaults is not always consistent, sometimes a record is removed from the table, sometimes it's set to a null value. Any preference with less than 5 occurrences is removed from the dump.

    [project]_prefs_0_20121010.tsv Unique user count for preferences set to 0 or an empty string. Same caveats apply as above. This dump includes non-boolean preferences whose value has been set to 0 or empty. Any preference with less than 5 occurrences is removed from the dump.

    [project]_prefs_1_20121010.tsv Unique user count for preferences set to 1. Same caveats apply as above. This dump includes non-boolean preferences whose value has been set to 1. Any preference with less than 5 occurrences is removed from the dump.

  16. Dataset Wikipedia

    • figshare.com
    txt
    Updated Jul 9, 2021
    Cite
    Lucas Rizzo (2021). Dataset Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14939319.v1
    Available download formats
    txt
    Dataset updated
    Jul 9, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Rizzo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative features extracted from Wikipedia dumps for the inference of computational trust. Dumps provided at: https://dumps.wikimedia.org/

    Files used:
    • XML dump, Portuguese: ptwiki-20200820-stub-meta-history.xml
    • XML dump, Italian: itwiki-20200801-stub-meta-history.xml

  17. Data from: Wiki-based Communities of Interest: Demographics and Outliers

    • zenodo.org
    bin
    Updated Jan 15, 2023
    Cite
    Hiba Arnaout; Simon Razniewski; Jeff Z. Pan; Hiba Arnaout; Simon Razniewski; Jeff Z. Pan (2023). Wiki-based Communities of Interest: Demographics and Outliers [Dataset]. http://doi.org/10.5281/zenodo.7537200
    Available download formats
    bin
    Dataset updated
    Jan 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hiba Arnaout; Simon Razniewski; Jeff Z. Pan; Hiba Arnaout; Simon Razniewski; Jeff Z. Pan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets contain statements about demographics and outliers of Wiki-based Communities of Interest.

    Group-centric dataset (sample):

    {
      "title": "winners of Priestley Medal", 
      "recorded_members": 83, 
      "topics": ["STEM.Chemistry"], 
      "demographics": [
          "occupation-chemist",
          "gender-male", 
          "citizen-U.S."
      ], 
      "outliers": [
        {
          "reason": "NOT(chemist) unlike 82 recorded members", 
          "members": [
          "Francis Garvan (lawyer, art collector)"
          ]
        }, 
        {
          "reason": "NOT(male) unlike 80 recorded members", 
          "members": [
          "Mary L. Good (female)",
          "Darleane Hoffman (female)", 
          "Jacqueline Barton (female)"
          ]
        }
      ]
    }

    Subject-centric dataset (sample):

    {
      "subject": "Serena Williams", 
      "statements": [
        {
          "statement": "NOT(sport-basketball) but (tennis) unlike 4 recorded winners of Best Female Athlete ESPY Award.", 
          "score": 0.36
        },
      {
          "statement": "NOT(occupation-politician) but (tennis player, businessperson, autobiographer) unlike 20 recorded winners of Michigan Women's Hall of Fame.",
          "score": 0.17
        }
      ]
    }

    This data can be also browsed at: https://wikiknowledge.onrender.com/demographics/

  18. Wikipedia Article Topics for All Languages (based on article outlinks)

    • figshare.com
    bz2
    Updated Jul 20, 2021
    Cite
    Isaac Johnson (2021). Wikipedia Article Topics for All Languages (based on article outlinks) [Dataset]. http://doi.org/10.6084/m9.figshare.12619766.v3
    Available download formats
    bz2
    Dataset updated
    Jul 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Isaac Johnson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021) but earlier/future versions may be for other snapshots as indicated by the filename.

    The data is bzip-compressed and each row is tab-delimited and contains the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of these topics applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

    • wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
    • qid: if the article has a Wikidata item, what ID is it -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
    • pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
    • num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace 0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.

    For more information, see this model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

    Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
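
    A minimal streaming sketch in Python, assuming tab-delimited rows laid out as described above (four metadata columns followed by one probability column per topic); the file name is hypothetical and a possible header row is skipped defensively.

    import bz2
    import csv

    PATH = "article_topics_outlinks_2021-01.tsv.bz2"  # hypothetical file name

    shown = 0
    with bz2.open(PATH, "rt", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            wiki_db, qid, pid, num_outlinks = row[:4]
            try:
                probs = [float(p) for p in row[4:]]  # one probability per topic
            except ValueError:
                continue  # most likely a header row
            if wiki_db == "enwiki":
                print(qid, pid, "max topic probability:", max(probs))
                shown += 1
                if shown >= 5:
                    break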

  19. Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/P1VECE
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This dataset contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2015) "Page Protection: Another Missing Dimension of Wikipedia Research." In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846

    This is an archival version of the data and software released with the paper. All of these data were (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-proetection/

    Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be "protected" so that only administrators or logged-in editors with a history of good editing can edit, move, or create it. Protection might involve "full protection" where a page can only be edited by administrators (i.e., "sysops") or "semi-protection" where a page can only be edited by accounts with a history of good edits (i.e., "autoconfirmed" users).

    Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict can threaten to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the "Main Page" in English Wikipedia has been protected since February 2006, and all Featured Articles are protected at the time they appear on the site's main page. Millions of viewers may never edit Wikipedia because they never see an edit button.

    Despite its widespread and influential nature, very little quantitative research on Wikipedia has taken page protection into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work. Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by the Wikimedia Foundation as dumps are incomplete. As a result, taking protection into account involves looking at several different sources of data. Much more detail can be found in our paper Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate if you cite the paper.

  20. WikiWord Thesaurus Data

    • data.wu.ac.at
    Updated Jul 29, 2014
    Cite
    OWLG (2014). WikiWord Thesaurus Data [Dataset]. https://data.wu.ac.at/odso/datahub_io/NDkwYzI1NjgtMGYzMi00NWZlLTliMzAtYzAwMWMyNWE1Njkx
    Dataset updated
    Jul 29, 2014
    Dataset provided by
    OWLG
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    About

    Overview:

    The WikiWord-Thesaurus is a multilingual Thesaurus derived from Wikipedia by extracting lexical and semantic information. It was originally developed for a diploma thesis at the University of Leipzig. Development is continued by Wikimedia Deutschland.

    Note: only extracts for specific topics are available for download right now. This is due mainly to the sheer size of the dump files. Full SQL dumps are available upon request. For the next release, we plan to make full RDF dumps available again.

    Updates

    The original thesaurus was created in 2008, using data from late 2007. An updated thesaurus is due to be released soon. Wikimedia Deutschland plans to release new versions on a regular basis.

    Licensing

    The thesaurus as such is generated automatically and thus considered to be in the public domain. It is not created from textual content, but from the structure of Wikipedia articles, and Wikipedia as a whole. No database protection rights are claimed or enforced.

    Some data sets may however contain concept definitions taken directly from Wikipedia - these are licensed GFDL (for newer versions, this will be CC-BY-SA 3.0), the authorship can be determined by looking at the page history of the respective Wikipedia article.
