100+ datasets found
  1. Kensho Derived Wikimedia Dataset

    • www.kaggle.com
    zip
    Updated Jan 24, 2020
  2. Wikipedia Clickstream

    • datahub.io
    tsv.gz
    Updated Apr 7, 2016
  3. Wikidata

    • triplydb.com
    • live.european-language-grid.eu
    application/n-quads +3
    Updated Apr 26, 2022
  4. Plaintext Wikipedia dump 2018

    • lindat.mff.cuni.cz
    Updated Feb 25, 2018
  5. WIKIMEDIA FOUNDATION ORG, fiscal year ending June 2019

    • projects.propublica.org
  6. Wikidata item quality labels

    • figshare.com
    • search.datacite.org
    txt
    Updated Dec 17, 2019
  7. COVID-19 Pandemic Wikipedia Readership

    • figshare.com
    txt
    Updated Jun 7, 2021
  8. Tamil Wikipedia Articles

    • www.kaggle.com
    zip
    Updated Dec 25, 2019
  9. Wikipedia Promotional Articles

    • www.kaggle.com
    zip
    Updated Oct 27, 2019
  10. Wizard of Wikipedia Dataset

    • paperswithcode.com
  11. Data from: Wikipedia Citations: A comprehensive dataset of citations with...

    • zenodo.org
    zip
    Updated Jul 14, 2020
  12. Wikipedia Article Networks

    • www.kaggle.com
    zip
    Updated Nov 12, 2019
  13. English Wikipedia pageviews by second

    • figshare.com
    • datahub.io
    • +1 more
    application/gzip
    Updated Jan 19, 2016
  14. Accessibility and topics of citations with identifiers in Wikipedia

    • figshare.com
    • search.datacite.org
    application/gzip
    Updated Jul 16, 2018
  15. News Article and Wiki Pairings

    • data.world
    csv, zip
    Updated Jan 5, 2022
  16. Wikidata-Disamb Dataset

    • paperswithcode.com
    Updated Jan 27, 2021
  17. Translated Wikipedia Biographies

    • research.google
    Updated Jun 25, 2021
  18. Wikipedia XML revision history data dumps (stub-meta-history.xml.gz) from 20...

    • explore.openaire.eu
    • zenodo.org
    • +2 more
    Updated Jan 1, 2017
  19. Supplementary Data for Wikipedia

    • dataverse.harvard.edu
    tsv, zip
    Updated Sep 30, 2020
  20. Wikipedia Title Dataset

    • paperswithcode.com
Cite
Kensho R&D (2020). Kensho Derived Wikimedia Dataset [Dataset]. https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data

Kensho Derived Wikimedia Dataset

English Wikipedia corpus and Wikidata knowledge graph for NLP

7 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (8,760,044,227 bytes)
Dataset updated
Jan 24, 2020
Authors
Kensho R&D
License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

Kensho Derived Wikimedia Dataset

Wikipedia, the free encyclopedia, and Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page), the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old (https://wikipedia20.pubpub.org/) and recently added its six millionth article in English. Wikidata, its younger, machine-readable sister project (https://wikipedia20.pubpub.org/pub/s2t6abfh), was created in 2012 but has been growing rapidly and currently contains more than 75 million items.

These projects contribute to the Wikimedia Foundation's mission (https://wikimediafoundation.org/about/mission/) of empowering people to develop and disseminate educational content under a free license. They are also heavily used by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these come in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group (https://www.kensho.com/), we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.

This release consists of two main components: a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202, indicating that this KDWD was built from the English Wikipedia snapshot of December 1, 2019 and the Wikidata snapshot of December 2, 2019. Below we describe these components in more detail.

Example Notebooks

Dive right in by checking out some of our example notebooks:

Updates / Changelog

  • initial release 2020-01-31

File Summary

  • Wikipedia
    • page.csv (page metadata and Wikipedia-to-Wikidata mapping)
    • link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)
  • Wikidata
    • item.csv (item labels and descriptions in English)
    • item_aliases.csv (item aliases in English)
    • property.csv (property labels and descriptions in English)
    • property_aliases.csv (property aliases in English)
    • statements.csv (truthy qpq statements)
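
For orientation, here is a minimal loading sketch using pandas that reads the tabular files and maps a Wikipedia page to its Wikidata item via page.csv. The file names come from the summary above, but the column names (page_id, item_id, en_label) are assumptions for illustration and may differ from the actual headers.

# Minimal loading sketch; column names are assumed, check the actual CSV headers.
import pandas as pd

pages = pd.read_csv("page.csv")             # page metadata + Wikipedia-to-Wikidata mapping
items = pd.read_csv("item.csv")             # English item labels and descriptions
properties = pd.read_csv("property.csv")    # English property labels and descriptions
statements = pd.read_csv("statements.csv")  # truthy qpq statements

# Hypothetical lookup: map the example Wikipedia page_id 12 to its Wikidata item label.
page_to_item = pages.set_index("page_id")["item_id"]
item_label = items.set_index("item_id")["en_label"]
item_id = page_to_item.get(12)
if item_id is not None:
    print(item_label.get(item_id))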

Three Layers of Data

The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.

[Figure: kensho_wiki_triple_layer - diagram of the three connected layers of data]

Wikipedia Sample

The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:

From these pages we construct a corpus of link annotated text. We store this data in a single JSON Lines file with one page per line. Each page object has the following format:

page = {
  "page_id": 12,      # wikipedia page id of annotated page
  "sections": [...],  # list of section objects
}

section = {
  "name": "Introduction",                           # section header
  "text": "Anarchism is an ...",                    # plaintext of section
  "link_offsets": [16, 35, 49, ...],                # list of anchor text offsets
  "link_lengths": [18, 9, 17, ...],                 # list of anchor text lengths
  "target_page_ids": [867979, 23040, 586276, ...],  # list of link target page ids
}

The text attribute of each section object contains our parse of the section's wikitext markup (https://en.wikipedia.org/wiki/Help:Wikitext) into plaintext. Text spans that represent links are identified via the attributes link_offsets, link_lengths, and target_page_ids.
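
To make the offset and length annotations concrete, here is a small sketch that streams link_annotated_text.jsonl and reconstructs each link's anchor text. It relies only on the page and section fields shown in the schema above.

# Stream the corpus and recover link anchor texts from the offset/length annotations.
import json

def iter_links(path="link_annotated_text.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            for section in page["sections"]:
                text = section["text"]
                for offset, length, target in zip(
                    section["link_offsets"],
                    section["link_lengths"],
                    section["target_page_ids"],
                ):
                    # The anchor text is the span [offset, offset + length).
                    yield page["page_id"], text[offset:offset + length], target

# Example: print the first few (page_id, anchor_text, target_page_id) triples.
for i, link in enumerate(iter_links()):
    print(link)
    if i >= 4:
        break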

Wikidata Sample

The second part of the KDWD is derived from Wikidata. Because more people are familiar with Wikipedia than Wikidata, we provide more background here than in the previous section. Wikidata provides centralized storage of structured data for all Wikimedia projects. The core Wikidata concepts are items (https://www.wikidata.org/wiki/Help:Items), properties, and statements (https://www.wikidata.org/wiki/Help:Statements).

In Wikidata, items are used to represent all the things in human knowledge, including topics, concepts, and objects. For example, the "1988 Summer Olympics", "love", "Elvis Presley", and "gorilla" are all items in Wikidata.

-- https://www.wikidata.org/wiki/Help:Items

A property describes the data value of a statement and can be thought of as a category of data, for example "color" for the data value "blue".

-- https://www.wikidata.org/wiki/Help:Properties

A statement is how the information we know about an item - the data we have about it - gets recorded in Wikidata. This happens by pairing a property with at least one data value.

-- https://www.wikidata.org/wiki/Help:Statements

The image above shows several statements from the Wikidata item for Grace Hopper. We can think about these statements as triples with the form (item, property, data value).

In the first statement (Grace Hopper [https://www.wikidata.org/wiki/Q11641], date of birth, 9 December 1906), the data value represents a time. However, data values can have several different types (https://www.wikidata.org/wiki/Special:ListDatatypes), e.g., time, string, globecoordinate, item, and so on. If the data value in a statement triple is a Wikidata item, we call it a qpq-statement (note that each item has a unique ID beginning with Q and each property has a unique ID beginning with P). We can think of qpq-statements as triples of the form (source item, property, target item). The qpq-statements in the image above are:

In order to construct a compact Wikidata sample that is relevant to our Wikipedia sample, we start with all statements in Wikidata and filter down to those that:

  • have a data value that is a Wikidata item (i.e., qpq-statements)
  • have a source item associated with a Wikipedia page from our Wikipedia sample
  • are truthy
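
To illustrate how the resulting qpq statements can be read back as human-readable triples, here is a hedged sketch that joins statements.csv with the English labels in item.csv and property.csv. The column names used (source_item_id, edge_property_id, target_item_id, item_id, property_id, en_label) are guesses for illustration, not documented names; adjust them to the actual headers.

# Resolve each (source item, property, target item) triple to English labels.
import pandas as pd

statements = pd.read_csv("statements.csv")
item_label = pd.read_csv("item.csv").set_index("item_id")["en_label"]
prop_label = pd.read_csv("property.csv").set_index("property_id")["en_label"]

triples = pd.DataFrame({
    "source": statements["source_item_id"].map(item_label),
    "property": statements["edge_property_id"].map(prop_label),
    "target": statements["target_item_id"].map(item_label),
})
print(triples.head())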