100+ datasets found
  1. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Available download formats
    zip (4293465577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    This beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people articles with infoboxes on English Wikipedia; it is output as JSON files compressed in a tar.gz archive.

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy Included Fields:
    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
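
    The archive can be read in place without unpacking it. The Python sketch below is a minimal example, assuming each tar member is a JSON Lines file with one article object per line; that layout is not spelled out in the description above, so treat it as an assumption and inspect a member first.

    import tarfile
    import json

    # Minimal sketch: stream articles out of the people archive without
    # extracting it to disk. Assumes (not confirmed by the dataset docs)
    # that each tar member is a JSON Lines file, one article per line.
    ARCHIVE = "wme_people_infobox.tar.gz"  # local path to the download

    with tarfile.open(ARCHIVE, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            for raw in fh:
                article = json.loads(raw)
                # Field names taken from the list above; others may exist too.
                name = article.get("name")
                description = article.get("description")
                infoboxes = article.get("infoboxes") or []
                print(name, "-", description, f"({len(infoboxes)} infobox(es))")
                break  # drop this break to process every article
            break  # drop this break to process every file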

    Stats

    Infoboxes only:
    • Size of compressed file: 2 GB
    • Size of uncompressed file: 11 GB

    Infoboxes + sections + short description:
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • # people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # of people articles with infoboxes: 1,559,985

    Totals for this dataset:
    • Total number of people articles in this dataset: 1,559,985
    • ...that have a short description: 1,416,701
    • ...that have an infobox: 1,559,985
    • ...that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English language edition of Wikipedia (https://en.wikipedia.org/), written by that community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  2. wikimedia

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Common Pile (2024). wikimedia [Dataset]. https://huggingface.co/datasets/common-pile/wikimedia
    Dataset updated
    Jul 27, 2024
    Dataset authored and provided by
    Common Pile
    Description

    Wikimedia

      Description
    

    Official Wikimedia wikis are released under a CC BY-SA license. We downloaded the official database dumps from March 2025 of the English-language wikis that are directly managed by the Wikimedia Foundation. These database dumps include the wikitext (MediaWiki's custom markup language) for each page, as well as talk pages, where editors discuss changes made to a page. We only use the most recent version of each page. We converted wikitext to plain… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/wikimedia.

  3. Wikimedia Dump enwiki-20220901

    • kaggle.com
    zip
    Updated Sep 7, 2022
    Cite
    shy_ (2022). Wikimedia Dump enwiki-20220901 [Dataset]. https://www.kaggle.com/datasets/shyguy/wikimedia-dump-enwiki20220901
    Available download formats
    zip (25110528357 bytes)
    Dataset updated
    Sep 7, 2022
    Authors
    shy_
    Description

    The Wikimedia dump of English Wikipedia dated 1 September 2022: https://dumps.wikimedia.org/enwiki/20220901/. For more information, especially regarding the license, visit https://dumps.wikimedia.org/

  4. Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20 April 2017

    • datadryad.org
    • search.dataone.org
    zip
    Updated Aug 15, 2017
    Cite
    R. Stuart Geiger; Aaron Halfaker (2017). Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20 April 2017 [Dataset]. http://doi.org/10.6078/D1FD3K
    Available download formats
    zip
    Dataset updated
    Aug 15, 2017
    Dataset provided by
    Dryad
    Authors
    R. Stuart Geiger; Aaron Halfaker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 15, 2017
    Description

    See https://meta.wikimedia.org/wiki/Data_dumps for more detail on using these dumps.

  5. [deprecated] Reference and map usage across Wikimedia wiki pages

    • figshare.com
    Updated Dec 18, 2023
    Cite
    Adam Wight (2023). [deprecated] Reference and map usage across Wikimedia wiki pages [Dataset]. http://doi.org/10.6084/m9.figshare.24064941.v2
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Adam Wight
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Errata

    Please note that this data set includes some major inaccuracies and should not be used. The data files will be unpublished from their hosting and this metadata will eventually be unpublished as well. A short list of issues discovered:

    • Many dumps were truncated (T345176).
    • Pages appeared multiple times, with different revision numbers.
    • Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article.
    • Reference similarity was overcounted when more than two refs shared content.

    In particular, the truncation and duplication mean that the aggregate statistics are inaccurate and can't be compared to other data points.

    Overview

    This data was produced by Wikimedia Germany's Technical Wishes team, and focuses on real-world usage statistics for reference footnotes (Cite extension) and maps (Kartographer extension) across all main-namespace pages (articles) on about 700 Wikimedia wikis. It was produced by processing the Wikimedia Enterprise HTML dumps, which are a fully-parsed rendering of the pages, and by querying the MediaWiki query API to get more detailed information about maps. The data is also accompanied by several more general columns about each page for context.

    Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research, but the goal in our case was to understand the potential impact of improving the ways in which references can be reused within a page. Gathering the map data was to understand the actual impact of improvements made to how external data can be integrated in maps. Both tasks are complicated by the heavy use of wikitext templates, which obscures when and how the underlying tags are being used. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.

    License

    All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/. The source code is distributed under BSD-3-Clause.

    Source code and execution

    The program used to create these files is our HTML dump scraper, version 0.1, written in Elixir. It can be run locally, but we used the Wikimedia Cloud VPS in order to have intra-datacenter access to the HTML dump file inputs. Our production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs".

    Execution was interrupted and restarted many times in order to make small fixes to the code. We expect that the only class of inconsistency this could have caused is that a small number of article records may potentially be repeated in the per-page summary files, and these pages' statistics duplicated in the aggregates. Whatever the cause, we've found many of these duplicate errors, and counts are given in the "duplicates.txt" file.

    The program is pluggable and configurable; it can be extended by writing new analysis modules. Our team plans to continue development and to run it again in the near future to track evolution of the collected metrics over time.

    Format

    All fields are documented in metrics.md as part of the code repository. Outputs are mostly split into separate ND-JSON (newline-delimited JSON) and JSON files, and grand totals are gathered into a single CSV file.

    Per-page summary files

    The first phase of scraping produces a fine-grained report summarizing each page into a few statistics. Each file corresponds to a wiki (using its database name, for example "enwiki" for English Wikipedia) and each line of the file is a JSON object corresponding to a page.

    Example file name: enwiki-20230601-page-summary.ndjson.gz

    Example metrics:
    • How many tags are created from templates vs. directly in the article.
    • How many references contain a template transclusion to produce their content.
    • How many references are unnamed, automatically named, or manually named.
    • How often references are reused via their name.
    • Copy-pasted references that share the same or almost the same content, on the same page.
    • Whether an article has more than one list.

    Mapdata files

    Example file name: enwiki-20230601-mapdata.ndjson.gz

    These files give the count of different types of map "external data" on each page. A line will either be empty "{}" or it will include the revid and number of external data references for maps on that page. External data is tallied in 9 different buckets, starting with "page" meaning that the source is .map data from the Wikimedia Commons server, or geoline / geoshape / geomask / geopoint and the data source, either an "ids" (Wikidata Q-ID) or "query" (SPARQL query) source.

    Mapdata summary files

    Each wiki has a summary of map external data counts, which contains a sum for each type count.

    Example file name: enwiki-20230601-mapdata-summary.json

    Wiki summary files

    Per-page statistics are rolled up to the wiki level, and results are stored in a separate file for each wiki. Some statistics are summed, some are averaged; check the suffix on the column name for a hint.

    Example file name: enwiki-20230601-summary.json

    Top-level summary file

    There is one file which aggregates the wiki summary statistics, discarding non-numeric fields and formatting as a CSV for ease of use: all-wikis-20230601-summary.csv
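
    A minimal peek at one of the per-page summary files described above, in Python; it assumes only what the description states (gzip-compressed, newline-delimited JSON, one object per page) and prints a few raw records so the real field names can be checked against metrics.md.

    import gzip
    import json

    # Example file name taken from the description; adjust to your copy.
    PATH = "enwiki-20230601-page-summary.ndjson.gz"

    with gzip.open(PATH, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            page = json.loads(line)   # one JSON object per page
            print(page)               # inspect the keys before relying on them
            if i >= 4:                # only peek at the first few records
                break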

  6. Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia Research

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/NQSHQD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2014) "Consider the Redirect: A Missing Dimension of Wikipedia Research." In Proceedings of the 10th International Symposium on Open Collaboration (OpenSym 2014). ACM Press. doi: 10.1145/2641580.2641616

    This is an archival version of the data and software released with the paper. All of these data were originally (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-redirects/

    In wikis, redirects are special pages that silently take readers from the page they are visiting to another page in the wiki. In the English Wikipedia, redirects make up more than half of all article pages. Different Wikipedia data sources handle redirects differently. For example, the MediaWiki API will automatically "follow" redirects, but the XML database dumps treat redirects like normal articles. In both cases, redirects are often invisible to researchers. Because redirects constitute a majority of all pages and see a large portion of all traffic, Wikipedia researchers need to take redirects into account or their findings may be incomplete or incorrect.

    For example, the histogram on this page shows the distribution of edits across pages in Wikipedia for every page, and for non-redirects only. Because redirects are almost never edited, the distributions are very different. Similarly, because redirects are viewed but almost never edited, any study of views over articles should also take redirects into account. Because redirects can change over time, the snapshots of redirects stored by Wikimedia and published by the Wikimedia Foundation are incomplete. Taking redirects into account fully involves looking at the content of every single revision of every article to determine both when and where pages redirect.

    Much more detail can be found in Consider the Redirect: A Missing Dimension of Wikipedia Research, a short paper that we have written to accompany this dataset and these tools. If you use this software or these data, we would appreciate if you cite the paper. This dataset was previously hosted at this now obsolete URL: http://networkcollectiv.es/wiki-redirects/

  7. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring; Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Markus Döring; Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  8. Wikimedia Commons photos by prominent users and their usage across the web

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Nov 16, 2020
    Cite
    Leva, Federico (2020). Wikimedia Commons photos by prominent users and their usage across the web [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3355708
    Dataset updated
    Nov 16, 2020
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Leva, Federico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Extract from the Wikimedia Commons database containing a list of users selected by the community for having uploaded high quality photos; list of 310k photos of theirs and of the subset of 59k photos sent to Infringement.Report for matching; list of domains whose matches were ignored as not useful for copyleft license enforcement. Domains were then matched for their rank in the Tranco list and the number of image usages found, and ranked by a mix of the two criteria.

  9. Teahouse corpus

    • data.wu.ac.at
    .txt, csv
    Updated Apr 12, 2015
    Cite
    Wikimedia (2015). Teahouse corpus [Dataset]. https://data.wu.ac.at/schema/datahub_io/MmZiZjJmNWEtM2E2OS00NGZmLTgyMjUtMDk1MmVhNTQ0NGU1
    Available download formats
    .txt, csv
    Dataset updated
    Apr 12, 2015
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    The Teahouse corpus is a set of questions asked at the Wikipedia Teahouse, a peer support forum for new Wikipedia editors. This corpus contains data from its first two years of operation.

    The Teahouse started as an editor engagement initiative and Fellowship project. It was launched in February 2012 by a small team working with the Wikimedia Foundation. Our intention was to pilot a new, scalable model for teaching Wikipedia newcomers the ropes of editing in a friendly and engaging environment.

    The ultimate goal of the pilot project was to increase the retention of new Wikipedia editors (most of whom give up and leave within their first 24 hours post-registration) through early proactive outreach. The project was particularly focused on retaining female newcomers, who are woefully underrepresented among the regular contributors to the encyclopedia.

    The Teahouse lives on as a vibrant, self-sustaining and community-driven project. All Teahouse participants are volunteers: no one is told when, how, or how much they must contribute.

    See the README files associated with each datafile for a schema of the data fields in that file.

    Read on for more info on potential applications, the provenance of these data, and links to related resources.

    Potential Applications

    or, what is it good for?

    The Teahouse corpus consists of good quality data and rich metadata around social Q&A interactions in a particular setting: new user help requests in a large, collaborative online community.

    More generally, this corpus is a valuable resource for research on conversational dynamics in online, asynchronous discussions.

    Qualitative textual analysis could yield insights into the kinds of issues faced by newcomers in established online collaborations.

    Linguistic analysis could examine the impact of syntactic and semantic features related to politeness, sentiment, question framing, or other rhetorical strategies on discussion outcomes.

    Response patterns (questioner replies and answers) within each thread could be used to map network relationships, or to investigate how participation by the initiator of a thread, or the number of participants, correlates with thread length or interactivity (the interval of time between posts).

    The corpus is large and rich enough to provide both training and test data for machine learning applications.

    Finally, the data provided here can be extended and compared with other publicly available datasets of Wikipedia, allowing researchers to examine relationships between editors' participation within the Teahouse Q&A forum and their previous, concurrent, and subsequent editing activities within millions of other articles, meta-content, and discussion spaces on Wikipedia.

    Data hygiene

    or, how the research sausage was made

    Parsing wikitext presents many challenges: the MediaWiki editing interface is deliberately underspecified in order to maximize flexibility for contributors. This can make it difficult to tell the difference between different types of contribution--say, fixing a typo or answering a question.

    The Teahouse Q&A board was designed to provide a more structured workflow than normal wiki talk pages, and instrumented to identify certain kinds of contributions (questions and answers) and isolate them from the 'noisy' background datastream of incidental edits to the Q&A page. The post-processing of the data presented here favored precision over recall: to provide a good quality set of questions, rather than a complete one.

    In cases where it wasn't easy to identify whether an edit contained a question or answer, these data have not been included. However, it is hard to account for all ambiguous or invalid cases: caveat quaesitor!

    Our approach to data inclusion was conservative. The number of questioner replies and answers to any given question may be under-counted, but is unlikely to be over-counted. However, our spot checks and analysis of the data suggest that the majority of responses are accounted for, and that the distribution of "missed" responses is randomly distributed.

    The Teahouse corpus only contains questions and answers by registered users of Wikipedia who were logged in when they participated. IP addresses can be linked to an individual's physical location. On Wikipedia, edits by logged-out and unregistered users are identified by the user's current IP address. Although all edits to Wikipedia are legally public and freely licensed, we have redacted IP edits from this dataset in deference to user privacy. Researchers interested in those data can find them in other public Wikipedia datasets.

    Possible future additions

    Additional data about these Q&A interactions has been collected, and other data are retrievable. Examples of data that could be included in future revisions of the corpus at low cost include:

    • more metadata about the people asking questions:
      • how many edits had they made before asking their (first) question?
      • when did they join Wikipedia?
      • were they explicitly invited to participate in the Teahouse, or did they locate the forum by other means?
      • did the questioner also create a guest profile on the Teahouse introductions page?
    • more metadata about the people answering the questions:
      • were they a Teahouse host at the time they answered a question?

    Examples of data that could be included in future revisions of the corpus at reasonable cost:

    • full text of answers to questions, including replies by original questioner
    • full text of profiles created by Teahouse guests and hosts (some privacy considerations here; contact corpus maintainer directly if interested in these data)

    See also

  10. Wikimedia editor activity (monthly)

    • figshare.com
    bz2
    Updated Dec 17, 2019
    Cite
    Aaron Halfaker (2019). Wikimedia editor activity (monthly) [Dataset]. http://doi.org/10.6084/m9.figshare.1553296.v1
    Available download formats
    bz2
    Dataset updated
    Dec 17, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Halfaker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a row for every (wiki, user, month) that contains a count of all 'revisions' saved and a count of those revisions that were 'archived' when the page was deleted. For more information, see https://meta.wikimedia.org/wiki/Research:Monthly_wikimedia_editor_activity_dataset

    Fields:
    • wiki -- The dbname of the wiki in question ("enwiki" == English Wikipedia, "commonswiki" == Commons)
    • month -- YYYYMM
    • user_id -- The user's identifier in the local wiki
    • user_name -- The user name in the local wiki (from the 'user' table)
    • user_registration -- The recorded registration date for the user in the 'user' table
    • archived -- The count of deleted revisions saved in this month by this user
    • revisions -- The count of all revisions saved in this month by this user (archived or not)
    • attached_method -- The method by which this user attached this account to their global account
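
    A minimal loading sketch in Python/pandas, assuming the bz2 file decompresses to a tab-separated table with a header row matching the fields above; the file name is hypothetical and the actual layout should be checked against the download.

    import pandas as pd

    PATH = "editor_month.tsv.bz2"  # hypothetical file name

    # pandas can read bz2-compressed delimited files directly.
    df = pd.read_csv(PATH, sep="\t", compression="bz2")

    # Monthly count of English Wikipedia editors with at least 5 revisions.
    enwiki = df[df["wiki"] == "enwiki"]
    active = (
        enwiki[enwiki["revisions"] >= 5]
        .groupby("month")["user_id"]
        .nunique()
        .sort_index()
    )
    print(active.tail())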

  11. Data from: Wikipedia Category Granularity (WikiGrain) data

    • zenodo.org
    csv, txt
    Updated Jan 24, 2020
    Cite
    Jürgen Lerner; Jürgen Lerner (2020). Wikipedia Category Granularity (WikiGrain) data [Dataset]. http://doi.org/10.5281/zenodo.1005175
    Available download formats
    txt, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jürgen Lerner; Jürgen Lerner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).

    The data has been generated from the database dump dated 20 October 2016, provided by the Wikimedia Foundation and licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

    WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.

    The WikiGrain Data is analyzed in the paper

    Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.

    ===============================================================
    Individual files (tables in comma-separated-values-format):

    ---------------------------------------------------------------
    * article_info.csv contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "granularity"
    (decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are less general, narrower, more specific.

    - "is.FA"
    (boolean) True ('1') if the article is a featured article; false ('0') else.

    - "is.FA.or.GA"
    (boolean) True ('1') if the article is a featured article or a good article; false ('0') else.

    - "is.top.importance"
    (boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') else.

    - "number.of.revisions"
    (integer) Number of times a new version of the article has been uploaded.


    ---------------------------------------------------------------
    * article_to_tlc.csv
    is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLC. An article can thus be member of several TLC.
    The file contains the following variables:

    - "id"
    (integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.

    - "id.of.tlc"
    (integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.

    - "title.of.tlc"
    (string) Title of the TLC in which the article is contained.

    ---------------------------------------------------------------
    * article_info_normalized.csv
    contains more variables associated with articles than article_info.csv. All variables, except "id" and "is.FA" are normalized to standard deviation equal to one. Variables whose name has prefix "log1p." have been transformed by the mapping x --> log(1+x) to make distributions that are skewed to the right 'more normal'.
    The file contains the following variables:

    - "id"
    Article id.

    - "is.FA"
    Boolean indicator for whether the article is featured.

    - "log1p.length"
    Length measured by the number of bytes.

    - "age"
    Age measured by the time since the first edit.

    - "log1p.number.of.edits"
    Number of times a new version of the article has been uploaded.

    - "log1p.number.of.reverts"
    Number of times a revision has been reverted to a previous one.

    - "log1p.number.of.contributors"
    Number of unique contributors to the article.

    - "number.of.characters.per.word"
    Average number of characters per word (one component of 'reading complexity').

    - "number.of.words.per.sentence"
    Average number of words per sentence (second component of 'reading complexity').

    - "number.of.level.1.sections"
    Number of first level sections in the article.

    - "number.of.level.2.sections"
    Number of second level sections in the article.

    - "number.of.categories"
    Number of categories the article is in.

    - "log1p.average.size.of.categories"
    Average size of the categories the article is in.

    - "log1p.number.of.intra.wiki.links"
    Number of links to pages in the English-language version of Wikipedia.

    - "log1p.number.of.external.references"
    Number of external references given in the article.

    - "log1p.number.of.images"
    Number of images in the article.

    - "log1p.number.of.templates"
    Number of templates that the article uses.

    - "log1p.number.of.inter.language.links"
    Number of links to articles in different language edition of Wikipedia.

    - "granularity"
    As in article_info.csv (but normalized to standard deviation one).
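
    As a quick orientation to article_info.csv, the Python sketch below loads it with pandas and compares category granularity between featured and non-featured articles; it assumes only the columns documented above.

    import pandas as pd

    df = pd.read_csv("article_info.csv")

    # "is.FA" is stored as 0/1; group on it and summarize granularity.
    summary = df.groupby("is.FA")["granularity"].agg(["count", "mean", "median"])
    print(summary)

    # Correlation between granularity and revision count, for context.
    print(df["granularity"].corr(df["number.of.revisions"]))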

  12. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Available download formats
    zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles, including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

    Invitation for Feedback

    The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project's blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise's homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields

    The data fields are the same across languages. Noteworthy included fields:
    • name - title of the article.
    • identifier - ID of the article.
    • url - URL of the article.
    • version - metadata related to the latest specific revision of the article.
    • version.editor - editor-specific signals that can help contextualize the revision.
    • version.scores - assessments by ML models on the likelihood of a revision being reverted.
    • main entity - Wikidata QID the article is related to.
    • abstract - lead section, summarizing what the article is about.
    • description - one-sentence description of the article for quick reference.
    • image - main image representing the article's subject.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
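
    A minimal reading sketch in Python for one of the per-article JSON Lines files after unzipping the download; the file name and the "name" key on infobox objects are assumptions, so inspect a record (and the data dictionary linked above) before relying on them.

    import json

    PATH = "enwiki_structured_contents.jsonl"  # hypothetical file name

    with open(PATH, encoding="utf-8") as fh:
        for line in fh:
            article = json.loads(line)  # one full article per line
            print(article["name"], article.get("url"))
            version = article.get("version") or {}
            print("  revision scores:", version.get("scores"))
            for box in article.get("infoboxes") or []:
                print("  infobox:", box.get("name"))  # "name" is an assumed key
            break  # remove to scan the whole file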

    Curation Rationale

    This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts are both focused on pre-parsing Wikipedia snippets as well as connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  13. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

    • zenodo.org
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Available download formats
    application/gzip, zip
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

    We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia's full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7TB large -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive, where multiple download options are available.

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
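
    A minimal reading sketch in Python for one split directory of Part 1, assuming each gzipped file is newline-delimited JSON with the fields listed above; the directory name below only illustrates the naming scheme and is not an exact split.

    import gzip
    import json
    from pathlib import Path

    DIR = Path("enwiki-20190301-pages-meta-history1.xml-p1p1000")  # illustrative name

    for gz_path in sorted(DIR.glob("*.gz")):
        with gzip.open(gz_path, "rt", encoding="utf-8") as fh:
            for line in fh:
                rev = json.loads(line)  # one revision per line
                print(rev["page_id"], rev["id"], rev["timestamp"], rev["title"])
                html = rev["html"]      # parsed HTML content of this revision
                break  # remove to iterate all revisions
        break  # remove to iterate all files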

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump of https://dumps.wikimedia.org/enwiki/20190301 from wikitext to HTML. That old dump is not available anymore on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab .

  14. Archived Data from the Education Program Extension

    • data-staging.niaid.nih.gov
    Updated Jan 7, 2025
    Cite
    Varella, Flávia; Figueredo, Danielly (2025). Archived Data from the Education Program Extension [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14525245
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Universidade Federal de Santa Catarina
    Authors
    Varella, Flávia; Figueredo, Danielly
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Education Program Extension was a MediaWiki extension developed by the Wikimedia Foundation to support the Wikipedia Education Program. This extension aimed to facilitate the integration of Wikipedia into educational environments by enabling the tracking and management of groups of editors, such as students and instructors participating in educational projects.

    Launched in 2011 and first implemented on the English Wikipedia, the extension provided tools for monitoring editor contributions, organizing course pages, and managing assignments. Despite its initial promise, the tool faced significant challenges over time, including security vulnerabilities and usability issues, which ultimately led to its official discontinuation in 2018.

    The projects registered in the extension were archived and remain accessible for consultation here. This database represents an extraction of information preserved by the Wikimedia Foundation, encompassing educational projects conducted across Wikipedia, Wikiversity, Wikinews, Wikisource, and Wiktionary. These projects span 18 languages, showcasing a broad array of collaborative educational initiatives that contributed to the Wikimedia ecosystem.

  15. Wikipedia user preferences

    • data.wu.ac.at
    tsv
    Updated Oct 11, 2013
    Cite
    Wikimedia (2013). Wikipedia user preferences [Dataset]. https://data.wu.ac.at/odso/datahub_io/ZTlhZGE0MjctODkzYy00OGQzLWE1MWUtNGQxYmQ1YzFiOTY3
    Available download formats
    tsv (36 files, ranging from 1,194 to 466,106 bytes)
    Dataset updated
    Oct 11, 2013
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data on user preferences set by active Wikipedia editors.

    Active editors are defined as registered users with at least 5 edits per month in a given project. The dumps were generated on 2012-10-10 and include data for the top 10 Wikipedias (de, en, es, fr, it, ja, nl, pl, pt, ru).

    For each project, 4 different data dumps are available:

    [project]_active_20121010.tsv The list of active editors whose prefs were extracted, along with their edit count in the 2012-09-10 - 2012-10-10 period. This is non-aggregate, public data. Note that bots and globally attached users are included.

    [project]_prefs_all_20121010.tsv Unique user count for preferences set to non-empty value. Note that the way in which MediaWiki and various extensions handle defaults is not always consistent, sometimes a record is removed from the table, sometimes it's set to a null value. Any preference with less than 5 occurrences is removed from the dump.

    [project]_prefs_0_20121010.tsv Unique user count for preferences set to 0 or an empty string. Same caveats apply as above. This dump includes non-boolean preferences whose value has been set to 0 or empty. Any preference with less than 5 occurrences is removed from the dump.

    [project]_prefs_1_20121010.tsv Unique user count for preferences set to 1. Same caveats apply as above. This dump includes non-boolean preferences whose value has been set to 1. Any preference with less than 5 occurrences is removed from the dump.

  16. Dataset Wikipedia

    • figshare.com
    txt
    Updated Jul 9, 2021
    Cite
    Lucas Rizzo (2021). Dataset Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14939319.v1
    Available download formats
    txt
    Dataset updated
    Jul 9, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Rizzo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative features extracted from Wikipedia dumps for the inference of computational trust. Dumps provided at: https://dumps.wikimedia.org/

    Files used:
    • XML dump, Portuguese: ptwiki-20200820-stub-meta-history.xml
    • XML dump, Italian: itwiki-20200801-stub-meta-history.xml

  17. Data from: Wiki-based Communities of Interest: Demographics and Outliers

    • zenodo.org
    bin
    Updated Jan 15, 2023
    Cite
    Hiba Arnaout; Simon Razniewski; Jeff Z. Pan; Hiba Arnaout; Simon Razniewski; Jeff Z. Pan (2023). Wiki-based Communities of Interest: Demographics and Outliers [Dataset]. http://doi.org/10.5281/zenodo.7537200
    Available download formats
    bin
    Dataset updated
    Jan 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hiba Arnaout; Simon Razniewski; Jeff Z. Pan; Hiba Arnaout; Simon Razniewski; Jeff Z. Pan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets contain statements about demographics and outliers of Wiki-based Communities of Interest.

    Group-centric dataset (sample):

    {
      "title": "winners of Priestley Medal", 
      "recorded_members": 83, 
      "topics": ["STEM.Chemistry"], 
      "demographics": [
          "occupation-chemist",
          "gender-male", 
          "citizen-U.S."
      ], 
      "outliers": [
        {
          "reason": "NOT(chemist) unlike 82 recorded members", 
          "members": [
          "Francis Garvan (lawyer, art collector)"
          ]
        }, 
        {
          "reason": "NOT(male) unlike 80 recorded members", 
          "members": [
          "Mary L. Good (female)",
          "Darleane Hoffman (female)", 
          "Jacqueline Barton (female)"
          ]
        }
      ]
    }

    Subject-centric dataset (sample):

    {
      "subject": "Serena Williams", 
      "statements": [
        {
          "statement": "NOT(sport-basketball) but (tennis) unlike 4 recorded winners of Best Female Athlete ESPY Award.", 
          "score": 0.36
        },
      {
          "statement": "NOT(occupation-politician) but (tennis player, businessperson, autobiographer) unlike 20 recorded winners of Michigan Women's Hall of Fame.",
          "score": 0.17
        }
      ]
    }

    This data can be also browsed at: https://wikiknowledge.onrender.com/demographics/

  18. Wikipedia Article Topics for All Languages (based on article outlinks)

    • figshare.com
    bz2
    Updated Jul 20, 2021
    Cite
    Isaac Johnson (2021). Wikipedia Article Topics for All Languages (based on article outlinks) [Dataset]. http://doi.org/10.6084/m9.figshare.12619766.v3
    Available download formats
    bz2
    Dataset updated
    Jul 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Isaac Johnson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021) but earlier/future versions may be for other snapshots as indicated by the filename.

    The data is bzip-compressed and each row is tab-delimited and contains the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of these topics applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

    • wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
    • qid: if the article has a Wikidata item, what ID is it -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
    • pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
    • num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace 0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.

    For more information, see this model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

    Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
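
    A minimal streaming sketch in Python, assuming tab-delimited rows laid out as described above (four metadata columns followed by one probability column per topic); the file name is hypothetical and a possible header row is skipped defensively.

    import bz2
    import csv

    PATH = "article_topics_outlinks_2021-01.tsv.bz2"  # hypothetical file name

    shown = 0
    with bz2.open(PATH, "rt", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            wiki_db, qid, pid, num_outlinks = row[:4]
            try:
                probs = [float(p) for p in row[4:]]  # one probability per topic
            except ValueError:
                continue  # most likely a header row
            if wiki_db == "enwiki":
                print(qid, pid, "max topic probability:", max(probs))
                shown += 1
                if shown >= 5:
                    break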

  19. Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Hill, Benjamin Mako; Shaw, Aaron (2023). Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research [Dataset]. http://doi.org/10.7910/DVN/P1VECE
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hill, Benjamin Mako; Shaw, Aaron
    Description

    This dataset contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2015) "Page Protection: Another Missing Dimension of Wikipedia Research." In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846

    This is an archival version of the data and software released with the paper. All of these data were (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-proetection/

    Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be "protected" so that only administrators or logged-in editors with a history of good editing can edit, move, or create it. Protection might involve "full protection" where a page can only be edited by administrators (i.e., "sysops") or "semi-protection" where a page can only be edited by accounts with a history of good edits (i.e., "autoconfirmed" users).

    Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict can threaten to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the "Main Page" in English Wikipedia has been protected since February 2006, and all Featured Articles are protected at the time they appear on the site's main page. Millions of viewers may never edit Wikipedia because they never see an edit button.

    Despite its widespread and influential nature, very little quantitative research on Wikipedia has taken page protection into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work. Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by the Wikimedia Foundation as dumps are incomplete. As a result, taking protection into account involves looking at several different sources of data. Much more detail can be found in our paper Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate if you cite the paper.

  20. WikiWord Thesaurus Data

    • data.wu.ac.at
    Updated Jul 29, 2014
    Cite
    OWLG (2014). WikiWord Thesaurus Data [Dataset]. https://data.wu.ac.at/odso/datahub_io/NDkwYzI1NjgtMGYzMi00NWZlLTliMzAtYzAwMWMyNWE1Njkx
    Dataset updated
    Jul 29, 2014
    Dataset provided by
    OWLG
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    About

    Overview:

    The WikiWord-Thesaurus is a multilingual Thesaurus derived from Wikipedia by extracting lexical and semantic information. It was originally developed for a diploma thesis at the University of Leipzig. Development is continued by Wikimedia Deutschland.

    Note: only extracts for specific topics are available for download right now. This is due mainly to the sheer size of the dump files. Full SQL dumps are available upon request. For the next release, we plan to make full RDF dumps available again.

    Updates

    The original thesaurus was created in 2008, using data from late 2007. An updated thesaurus is due to be released soon. Wikimedia Deutschland plans to release new versions on a regular basis.

    Licensing

    The thesaurus as such is generated automatically and thus considered to be in the public domain. It is not created from textual content, but from the structure of Wikipedia articles, and Wikipedia as a whole. No database protection rights are claimed or enforced.

    Some data sets may however contain concept definitions taken directly from Wikipedia - these are licensed GFDL (for newer versions, this will be CC-BY-SA 3.0), the authorship can be determined by looking at the page history of the respective Wikipedia article.
