Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of certain types of data which can be used by Wikimedia projects such as Wikipedia. Wikidata functions as a document-oriented database, centred on individual items. Items represent topics, for which basic information is stored that identifies each topic.
The Wikidata-Disamb dataset is intended to allow a clean and scalable evaluation of named entity disambiguation (NED) with Wikidata entries, and to be used as a reference in future research.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains corrections for Wikidata constraint violations extracted from the July 1st 2018 Wikidata full history dump.

The following constraints are considered:
* conflicts with: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Conflicts_with
* distinct values: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Unique_value
* inverse and symmetric: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Inverse https://www.wikidata.org/wiki/Help:Property_constraints_portal/Symmetric
* item requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Item
* one of: https://www.wikidata.org/wiki/Help:Property_constraints_portal/One_of
* single value: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Single_value
* type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Type
* value requires statement: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Target_required_claim
* value type: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Value_type

The constraints.tsv file contains the list of most of the Wikidata constraints considered in this dataset (beware, there could be some discrepancies for type, valueType, itemRequiresClaim and valueRequiresClaim constraints). It is a tab-separated file with the following columns:
* constraint id: the URI of the Wikidata statement describing the constraint
* property id: the URI of the property that is constrained
* type id: the URI of the constraint type (type, value type...). It is a Wikidata item.
* 15 columns for the possible attributes of the constraint. If an attribute has multiple values, they are in the same cell but separated by a space.
The columns are:
** regex: https://www.wikidata.org/wiki/Property:P1793
** exceptions: https://www.wikidata.org/wiki/Property:P2303
** group by: https://www.wikidata.org/wiki/Property:P2304
** items: https://www.wikidata.org/wiki/Property:P2305
** property: https://www.wikidata.org/wiki/Property:P2306
** namespace: https://www.wikidata.org/wiki/Property:P2307
** class: https://www.wikidata.org/wiki/Property:P2308
** relation: https://www.wikidata.org/wiki/Property:P2309
** minimal date: https://www.wikidata.org/wiki/Property:P2310
** maximum date: https://www.wikidata.org/wiki/Property:P2311
** maximum value: https://www.wikidata.org/wiki/Property:P2312
** minimal value: https://www.wikidata.org/wiki/Property:P2313
** status: https://www.wikidata.org/wiki/Property:P2316
** separator: https://www.wikidata.org/wiki/Property:P4155
** scope: https://www.wikidata.org/wiki/Property:P5314

The other files provide, for each constraint type, the list of all corrections extracted from the edit history. The format of each file is one line per correction with the following tab-separated values:
* URI of the statement describing the constraint in Wikidata
* URI of the revision that has solved the constraint violation
* subject, predicate and object of the triple that was violating the constraint (separated by a tab)
* the string "->"
* subject, predicate and object of the triple(s) of the correction, each followed by "http://wikiba.se/history/ontology#deletion" if the triple has been removed or "http://wikiba.se/history/ontology#addition" if the triple has been added. Each component of these values is separated by a tab.

More detailed explanations are provided in a soon-to-be-published paper.
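The correction-file layout above can be read with a few lines of code. The following is a minimal sketch, not part of the dataset's tooling; the function name and the returned dictionary keys are our own choices for illustration.

```python
def parse_correction(line):
    """Parse one tab-separated correction line of the format described above."""
    fields = line.rstrip("\n").split("\t")
    arrow = fields.index("->")  # separates the violation from the correction
    correction = {
        "constraint": fields[0],          # URI of the constraint statement
        "revision": fields[1],            # URI of the fixing revision
        "violating_triple": tuple(fields[2:5]),
        "changes": [],
    }
    # After "->", triples come in groups of four fields:
    # subject, predicate, object, and an addition/deletion marker URI.
    rest = fields[arrow + 1:]
    for i in range(0, len(rest), 4):
        s, p, o, marker = rest[i:i + 4]
        op = "deletion" if marker.endswith("#deletion") else "addition"
        correction["changes"].append((s, p, o, op))
    return correction
```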
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Profiles of politically exposed persons from Wikidata, the structured data version of Wikipedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For an updated list, see
Matching OpenAlex venues to Wikidata identifiers
Motivation: the selective/inclusive approach in bibliometric databases
An important difference between bibliometric databases is their “inclusion policy”.
Some databases like Web Of Science and Scopus select the sources they index, while others like Dimensions and OpenAlex are more inclusive (they index for example all data from a given source such as Crossref).
“selectivity remained a hallmark of coverage because Garfield had decided early on to focus on internationally influential journals.” (...)
“Serial content (i.e., journals, conference proceedings, and book series) submitted for possible inclusion in Scopus by editors and publishers is reviewed and selected, based on criteria of scientific quality and rigor. This selection process is carried out by an external Content Selection and Advisory Board (CSAB) of editorially independent scientists, each of which are subject matter experts in their respective fields. This ensures that only high-quality curated content is indexed in the database and affirms the trustworthiness of Scopus”
We have decided to take an “inclusive” approach to the publications we index in Dimensions. We believe that Dimensions should be a comprehensive data source, not a judgment call, and so we index as broad a swath of content as possible and have developed a number of features (e.g., the Dimensions API, journal list filters that limit search results to journals that appear in sources such as Pubmed or the 2015 Australian ERA6 journal list) that allow users to filter and select the data that is most relevant to their specific needs.
Using Wikidata to enable the filtering of “venue subsets” in OpenAlex
We are interested in creating subsets of venues in OpenAlex (for example for comparative analysis with inclusive databases or other use cases). This would require matching identifiers of OpenAlex venues to other identifiers.
Thanks to WikiCite, a project to record and link scholarly data, Wikidata has a large collection of metadata related to Scholarly journals. This repository provides a subset of the scholarly journals in Wikidata, focusing mainly on external identifiers.
The dataset will be used to explore the extent to which Wikidata journal external identifiers can be used to select content in OpenAlex.
(see here a list of openly available lists of journals)
Dataset creation & Documentation
Wikidata dump from 2022-02-21
Extract entities belonging to the following classes:
https://www.wikidata.org/wiki/Q5633421 # scientific journal (Q5633421)
https://www.wikidata.org/wiki/Q737498 # academic journal (Q737498)
Extract the properties related to (selected) external identifiers
Some numbers:
Number of journals in Wikidata: 113,797; with ISSN-L: 95,888; with OpenAlex venue id: 29,150
external identifiers
https://www.wikidata.org/wiki/Property:P236 # ext_id_issn
https://www.wikidata.org/wiki/Property:P7363 # ext_id_issn_l
https://www.wikidata.org/wiki/Property:P8375 # ext_id_crossref_journal_id
https://www.wikidata.org/wiki/Property:P1055 # ext_id_nlm_unique_id
https://www.wikidata.org/wiki/Property:P1058 # ext_id_era_journal_id
https://www.wikidata.org/wiki/Property:P1250 # ext_id_danish_bif_id
https://www.wikidata.org/wiki/Property:P10283 # ext_id_openalex_id
https://www.wikidata.org/wiki/Property:P1156 # ext_id_scopus_source_id
Indexing services
https://www.wikidata.org/wiki/Property:P8875
https://www.wikidata.org/wiki/Q371467 # Scopus
https://www.wikidata.org/wiki/Q104047209 # Science Citation Index Expanded
https://www.wikidata.org/wiki/Q22908122 # Emerging Sources Citation Index
https://www.wikidata.org/wiki/Q1090953 # Social Sciences Citation Index
https://www.wikidata.org/wiki/Q713927 # Arts and Humanities Citation index
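The extraction step above (pulling selected external identifiers out of dump entities) can be sketched as follows. This is only an illustration of working with the Wikidata JSON entity format, not the actual extraction code; the property-to-name mapping mirrors the list above, and the function name and sample entity are our own.

```python
# Subset of the external-identifier properties listed above (assumed mapping).
ID_PROPERTIES = {
    "P236": "issn",
    "P7363": "issn_l",
    "P8375": "crossref_journal_id",
    "P1055": "nlm_unique_id",
    "P10283": "openalex_id",
    "P1156": "scopus_source_id",
}

def external_ids(entity):
    """Collect external-identifier values from a Wikidata JSON entity dict."""
    ids = {}
    for prop, name in ID_PROPERTIES.items():
        for claim in entity.get("claims", {}).get(prop, []):
            value = claim["mainsnak"].get("datavalue", {}).get("value")
            if value is not None:
                ids.setdefault(name, []).append(value)
    return ids
```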
Wikidata-14M is a recommender system dataset for recommending items to Wikidata editors. It consists of 220,000 editors responsible for 14 million interactions with 4 million items.
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. It acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of 300 instances from each of the 100 most important classes in Wikidata, for a total of around 30,000 entities and 390,000 triples. The dataset is geared towards knowledge graph refinement models that leverage edit history information from the graph. There are two versions of the dataset: a static version and a dynamic version.
Each version is split into three subsets: train, validation (val), and test. Each split contains every entity from the dataset. The train split contains the first 70% of revisions made to each entity, the validation split contains the revisions from 70% to 85%, and the test split contains the last 15%.
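The per-entity 70/15/15 chronological split can be sketched in a few lines (an illustration of the scheme described above, not the dataset's own code):

```python
def split_revisions(revisions):
    """Split an entity's chronologically ordered revisions 70/15/15."""
    n = len(revisions)
    train_end = int(n * 0.70)
    val_end = int(n * 0.85)
    return revisions[:train_end], revisions[train_end:val_end], revisions[val_end:]
```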
This is a sample from the static datasets:
wd:Q217432 a uo:entity ;
wdt:P1082 1.005904e+06 ;
wdt:P1296 "0052280" ;
wdt:P1791 wd:Q18704103 ;
wdt:P18 "Pitakwa.jpg" ;
wdt:P244 "n80066826" ;
wdt:P571 "+1912-00-00T00:00:00Z" ;
wdt:P6766 "421180027" .
Each entity has the type uo:entity, and contains the statements added during that time period following Wikidata's data model.
In the following code snippet we show an example from the dynamic dataset:
uo:rev703872813 a uo:revision ;
uo:timestamp "2018-06-28T22:31:32Z" .
uo:op703872813_0 a uo:operation ;
uo:fromRevision uo:rev703872813 ;
uo:newObject wd:Q82955 ;
uo:opType uo:add ;
uo:revProp wdt:P106 ;
uo:revSubject wd:Q6097419 .
uo:op703878666_0 a uo:operation ;
uo:fromRevision uo:rev703878666 ;
uo:opType uo:remove ;
uo:prevObject wd:Q1108445 ;
uo:revProp wdt:P460 ;
uo:revSubject wd:Q1147883 .
This dataset is composed of revisions, each of which has a timestamp. Each revision is composed of 1 to n operations, each of which changes one statement of the entity. There are two types of operations: uo:add and uo:remove. In both cases, the property and the subject being modified are given by the uo:revProp and uo:revSubject properties. In the case of additions, a uo:newObject property shows the added object (and, when the addition replaces an existing value, a uo:prevObject property records the previous one). In the case of removals, a uo:prevObject property records the object that was removed.
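The add/remove operations above can be replayed to reconstruct an entity's statements at a point in time. The following is a minimal sketch under the assumption that each operation is represented as a dict whose keys mirror the uo: properties; it is not part of the dataset's tooling.

```python
def apply_operations(statements, operations):
    """Replay operations over a set of (subject, property, object) triples."""
    for op in operations:
        if op["opType"] == "add":
            statements.add((op["revSubject"], op["revProp"], op["newObject"]))
        else:  # "remove"
            statements.discard((op["revSubject"], op["revProp"], op["prevObject"]))
    return statements
```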
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
RDF dump of Wikidata produced with wdumper.
entity count: 425468, statement count: 11624839, triple count: 25332332
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the complete revision history of every instance of the 100 most important classes in Wikidata. It contains 9.3 million entities and around 450 million revisions made to those entities. This dataset was exported from a MongoDB database. After decompressing the files, the resulting JSON files can be imported into MongoDB using the following commands:
mongoimport --db=db_name --collection=wd_entities --file=wd_entities.json
mongoimport --db=db_name --collection=wd_revisions --file=wd_revisions.json
Make sure that db_name is replaced by the database where this data will be imported.
Documents within the wd_entities collection have the following schema:
Documents within the wd_revisions collection have the following schema:
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.
These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.
This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202
indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.
Dive right in by checking out some of our example notebooks:
* page.csv (page metadata and Wikipedia-to-Wikidata mapping)
* link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)
* item.csv (item labels and descriptions in English)
* item_aliases.csv (item aliases in English)
* property.csv (property labels and descriptions in English)
* property_aliases.csv (property aliases in English)
* statements.csv (truthy qpq statements)

The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.
[Figure: the three connected layers of the KDWD]
The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:
From these pages we construct a corpus of link annotated text. We store this data in a single JSON Lines file with one page per line. Each page object has the following format:
page = {
"page_id": 12, # wikipedia page id of annotated page
"sections": [...] # list of section objects
}
section = {
"name": "Introduction", # section header
"text": "Anarchism is an ...", # plaintext of section
"link_offsets": [16, 35, 49, ...], # list of anchor text offsets
"link_lengths": [18, 9, 17, ...], # list of anchor text lengths
"target_page_ids": [867979, 23040, 586276, ...] # list of link target page ids
}
The text attribute of each section object contains our parse of the section’s wikitext markup into plaintext. Text spans that represent links are identified via the attributes link_offsets, link_lengths, and target_page_ids.
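Given the parallel offset/length/target lists in the section format above, the anchor texts can be recovered by slicing the plaintext. A minimal sketch (the function name is our own):

```python
def section_links(section):
    """Return (anchor_text, target_page_id) pairs for one section object."""
    links = []
    for offset, length, target in zip(
        section["link_offsets"], section["link_lengths"], section["target_page_ids"]
    ):
        anchor = section["text"][offset:offset + length]
        links.append((anchor, target))
    return links
```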
The second part of the KDWD is derived from Wikidata. Because more people are familiar with Wikipedia than Wikidata, we provide more background here than in the previous section. Wikidata provides centralized storage of structured data for all Wikimedia projects. The core Wikidata concepts are items, properties, and statements.
In Wikidata, items are used to represent all the things in human knowledge, including topics, concepts, and objects. For example, the "1988 Summer Olympics", "love", "Elvis Presley", and "gorilla" are all items in Wikidata.
A property describes the data value of a statement and can be thought of as a category of data, for example "color" for the data value "blue".
A statement is how the information we know about an item - the data we have about it - gets recorded in Wikidata. This happens by pairing a property with at least one data value.
[Image: statements from the Wikidata item for Grace Hopper]
The image above shows several statements from the Wikidata item for Grace Hopper. We can think about these statements as triples with the form (item, property, data value).
[Image: table of statements for Grace Hopper]
In the first statement (Grace Hopper, date of birth, 9 December 1906) the data value represents a time. However, data values can have several different types (e.g., time, string, globecoordinate, item, …). If the data value in a statement triple is a Wikidata item, we call it a qpq-statement (note that each item has a unique ID beginning with Q and each property has a unique ID beginning with P). We can think of qpq-statements as triples of the form (source item, property, target item). The qpq-statements in the image above are:
In order to construct a compact Wikidata sample that is relevant to our Wikipedia sample, we start with all statements in Wikidata and filter down to those that:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
derenrich/wikidata-en-descriptions dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikidata offers a wide range of general data about our universe as well as links to other databases. The data is published under the CC0 "Public domain dedication" license. It can be edited by anyone and is maintained by Wikidata's editor community.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A BitTorrent file to download data with the title 'wikidata-20220103-all.json.gz'
Wikidata is a free and open knowledge base which can be read and edited by both humans and machines. It acts as a central storage for the structured data of several Wikimedia projects. To improve the process of manually inserting new facts, the Wikidata platform features an association rule-based tool to recommend additional suitable properties. In this work, we introduce a novel approach to provide such recommendations based on frequentist inference. We introduce a trie-based method that can efficiently learn and represent property set probabilities in RDF graphs. We extend the method by adding type information to improve recommendation precision and introduce backoff strategies which further increase the performance of the initial approach for entities with rare property combinations. We investigate how the captured structure can be employed for property recommendation, analogously to the Wikidata PropertySuggester. We evaluate our approach on the full Wikidata dataset and compare its performance to the state-of-the-art Wikidata PropertySuggester, outperforming it in all evaluated metrics. Notably we could reduce the average rank of the first relevant recommendation by 71%.
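The idea of learning property co-occurrence statistics for recommendation can be illustrated with a toy example. This is emphatically not the paper's trie-based method or the Wikidata PropertySuggester; it is only a simplified sketch of frequency-based property recommendation, with class and method names of our own choosing.

```python
from collections import defaultdict

class PropertyRecommender:
    """Toy recommender: rank unseen properties by co-occurrence counts."""

    def __init__(self):
        self.cooccur = defaultdict(lambda: defaultdict(int))

    def learn(self, property_set):
        """Record pairwise co-occurrences of properties on one entity."""
        for p in property_set:
            for q in property_set:
                if p != q:
                    self.cooccur[p][q] += 1

    def recommend(self, observed, k=3):
        """Suggest up to k properties not yet on the entity."""
        scores = defaultdict(int)
        for p in observed:
            for q, count in self.cooccur[p].items():
                if q not in observed:
                    scores[q] += count
        ranked = sorted(scores.items(), key=lambda x: (-x[1], x[0]))
        return [q for q, _ in ranked][:k]
```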
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikidata dump retrieved from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 on 27 Dec 2017
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
RDF dump of Wikidata produced with wdumper.
basic filter
View on wdumper
entity count: 0, statement count: 0, triple count: 38
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
RDF dump of Wikidata produced with wdumper.
View on wdumper (https://tools.wmflabs.org/wdumps/dump/22)
entity count: 0, statement count: 0, triple count: 0
RDF dump of Wikidata produced with wdumper.
companies, simple statement off, full statement mode complete, KR, EN
View on wdumper
entity count: 0, statement count: 0, triple count: 0