Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.
The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both transductive and inductive data splits used in the original paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
KeySearchWiki is a dataset for evaluating keyword search systems over Wikidata.
The dataset was automatically generated by leveraging Wikidata and Wikipedia set categories (e.g., Category:American television directors) as data sources for both relevant entities and queries.
Relevant entities are gathered by carefully navigating the Wikipedia set categories hierarchy in all available languages. Furthermore, those categories are refined and combined to derive more complex queries.
Detailed information about KeySearchWiki and its generation can be found on the GitHub page.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Node2Vec embedding model trained on Czech Wikidata labels (from October 2020) using the gensim implementation of Word2Vec, with the following parameters for the random walks:
length of walk = 160
number of random walks = 40
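As a rough illustration of this setup, the sketch below generates plain random walks over a toy graph and feeds them to gensim's Word2Vec. The graph construction and the node2vec bias parameters (p, q) are assumptions; only the walk length (160) and the number of walks (40) come from the description above.

```python
# Minimal sketch (not the released training pipeline): Node2Vec-style training
# via random walks fed to gensim's Word2Vec. Assumptions: a toy graph stands in
# for the Czech Wikidata label graph, and walks are unbiased (no p/q bias).
import random
import networkx as nx
from gensim.models import Word2Vec

def generate_walks(graph, num_walks=40, walk_length=160, seed=0):
    """Generate unbiased random walks; node2vec's p/q bias is omitted for brevity."""
    rng = random.Random(seed)
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Toy graph standing in for the actual label graph.
g = nx.karate_club_graph()
walks = generate_walks(g)

# gensim 4.x API (older versions use `size` instead of `vector_size`).
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=1, sg=1, workers=4)
vector = model.wv["0"]  # embedding of node "0"
```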
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
A Wikidata sample containing all the entities with Russian labels as of March 2020. It consists of about 212M triples with 8.1M unique entities.
This snapshot is intended to be used with the RuBQ dataset. It mitigates the problem of Wikidata's dynamics: a reference answer may change over time as the knowledge base evolves. The sample guarantees the correctness of the queries and answers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.
Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.
wikidata-authors
This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.
The files are as follows:
- entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
  - eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
  - label: A human-readable label of this entity (extracted from Wikidata).
- relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
  - rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.
  - label: A human-readable label of this relation (extracted from Wikidata).
- triples.tsv: A tab-separated file of all triples in the dataset, each in the form <head eid>, <rid>, <tail eid>.
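A minimal loading sketch for these files follows; it assumes headerless, tab-separated files with the column order documented above, so adjust if the files ship with header rows.

```python
# Minimal sketch for loading the wikidata-authors TSV files with pandas.
# Assumptions: no header rows; columns ordered as in the field descriptions above.
import pandas as pd

entities = pd.read_csv("entities.tsv", sep="\t", header=None, names=["eid", "label"])
relations = pd.read_csv("relations.tsv", sep="\t", header=None, names=["rid", "label"])
triples = pd.read_csv("triples.tsv", sep="\t", header=None, names=["head", "rid", "tail"])

# Attach human-readable labels to each triple.
labeled = (
    triples
    .merge(entities.rename(columns={"eid": "head", "label": "head_label"}), on="head")
    .merge(relations.rename(columns={"label": "rel_label"}), on="rid")
    .merge(entities.rename(columns={"eid": "tail", "label": "tail_label"}), on="tail")
)
print(labeled.head())
```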
fb15krr-linked
This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.
The files are as follows:
- entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
  - mid: The Freebase machine ID (MID) of this entity.
  - wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.
  - label: A human-readable label of this entity (extracted from Wikidata).
  - types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].
- relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
  - label: The hierarchical Freebase label of this relation.
- triples.tsv: A tab-separated file of all triples in the dataset, each in the form <head mid>, <relation label>, <tail mid>.
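To make the reverse-relation removal mentioned above more concrete, here is a rough sketch of the kind of inverse-pair filtering described by [Toutanova and Chen CVSC2015]. The overlap threshold and tie-breaking are assumptions, not the exact procedure used to build this dataset.

```python
# Hypothetical sketch: two relations are treated as inverses if (almost) every
# (head, tail) pair of one appears as (tail, head) under the other, in which
# case one of them would be dropped. The 0.97 threshold is an assumption.
from collections import defaultdict

def find_reverse_pairs(triples, threshold=0.97):
    """triples: iterable of (head, relation, tail) tuples."""
    pairs_by_rel = defaultdict(set)
    for h, r, t in triples:
        pairs_by_rel[r].add((h, t))
    reverse_pairs = []
    rels = list(pairs_by_rel)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            inv = {(t, h) for h, t in pairs_by_rel[r2]}
            overlap = len(pairs_by_rel[r1] & inv)
            denom = min(len(pairs_by_rel[r1]), len(inv))
            if denom and overlap / denom >= threshold:
                reverse_pairs.append((r1, r2))
    return reverse_pairs

# Toy example: r2 mirrors r1, so the pair is flagged and one would be removed.
toy = [("a", "r1", "b"), ("b", "r2", "a"), ("c", "r1", "d"), ("d", "r2", "c")]
print(find_reverse_pairs(toy))  # [('r1', 'r2')]
```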
MIT License https://opensource.org/licenses/MIT
Dataset Summary

The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields

Each row in the dataset consists of the following fields:

- subject (str): The subject entity of the knowledge graph triple.
- rel (str): The relation that connects the subject and object.
- object (str): The object entity of the knowledge graph triple.
- text (str): A natural language sentence that entails the given triple.
- validation (str): LLM-based validation results, including: Fluent Sentence(s): TRUE/FALSE; Subject mentioned in Text: TRUE/FALSE; Relation mentioned in Text: TRUE/FALSE; Object mentioned in Text: TRUE/FALSE; Fact Entailed By Text: TRUE/FALSE; Final Answer: TRUE/FALSE.
- reference_url (str): URL of the web source from which the text was extracted.
- subj_qid (str): Wikidata QID for the subject entity.
- rel_id (str): Wikidata Property ID for the relation.
- obj_qid (str): Wikidata QID for the object entity.

Dataset Creation

The dataset was created through the following process:

1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
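Since the validation field is stored as a single string of labeled TRUE/FALSE entries, a small parsing sketch may be useful; the exact line formatting assumed here is inferred from the field description above, not from the released files.

```python
# Minimal sketch: parse the `validation` string into booleans and keep only rows
# whose "Final Answer" is TRUE. The "Label: TRUE/FALSE"-per-line format is an
# assumption based on the field description, not on the actual data files.
def parse_validation(validation: str) -> dict:
    checks = {}
    for line in validation.splitlines():
        if ":" in line:
            label, _, value = line.rpartition(":")
            checks[label.strip()] = value.strip().upper() == "TRUE"
    return checks

def keep_row(row: dict) -> bool:
    return parse_validation(row["validation"]).get("Final Answer", False)

# Toy row mirroring the documented fields.
row = {
    "subject": "Douglas Adams",
    "rel": "educated at",
    "object": "St John's College",
    "text": "Douglas Adams was educated at St John's College, Cambridge.",
    "validation": "Fluent Sentence(s): TRUE\nSubject mentioned in Text: TRUE\n"
                  "Relation mentioned in Text: TRUE\nObject mentioned in Text: TRUE\n"
                  "Fact Entailed By Text: TRUE\nFinal Answer: TRUE",
}
print(keep_row(row))  # True
```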
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.
For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
For the 2024 edition, see: https://sem-tab-challenge.github.io/2024/
Note on License: This data includes data from the following sources. Refer to each source for license details:
- Wikidata https://www.wikidata.org/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This dataset contains triples curated from Wikidata surrounding news events with causal relations, and is released as part of our WWW'23 paper, "Event Prediction using Case-Based Reasoning over Knowledge Graphs".
Starting from a set of classes that we consider to be types of "events", we queried Wikidata to collect entities that were an instanceOf an event class and that were connected to another such event entity by a causal triple (https://www.wikidata.org/wiki/Wikidata:List_of_properties/causality). For all such cause-effect event pairs, we then collected a 3-hop neighborhood of outgoing triples.
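As an illustration of the kind of Wikidata query involved, the sketch below retrieves cause-effect pairs between event entities from the public SPARQL endpoint. The event class (Q1190554, occurrence) and causal property (P1542, has effect) are examples only, not the paper's exact class list or full causal-property set.

```python
# Illustrative sketch only: collect cause-effect pairs between "event" entities
# from the public Wikidata SPARQL endpoint. Q1190554 ("occurrence") and P1542
# ("has effect") stand in for the paper's event classes and causal properties.
# On the live endpoint this query may need narrowing (or a local dump) to avoid timeouts.
import requests

QUERY = """
SELECT ?cause ?effect WHERE {
  ?cause wdt:P31/wdt:P279* wd:Q1190554 .
  ?effect wdt:P31/wdt:P279* wd:Q1190554 .
  ?cause wdt:P1542 ?effect .
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "event-cbr-sketch/0.1"},
)
resp.raise_for_status()
for b in resp.json()["results"]["bindings"]:
    print(b["cause"]["value"], "->", b["effect"]["value"])
```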
CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Relation extraction dataset with its knowledge graph.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Experiment results together with queries, runs, and relevance judgments produced in the context of evaluating different retrieval methods using the KeySearchWiki dataset.
Detailed information about KeySearchWiki and its generation can be found on the GitHub page.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Statistics about the Wikidata SPARQL logs used.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
WikiWiki is a dataset for understanding entities and their place in a taxonomy of knowledge—their types. It consists of entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
ConvQuestions is the first realistic benchmark for conversational question answering over knowledge graphs. It contains 11,200 conversations which can be evaluated over Wikidata. They are compiled from the inputs of 70 Master crowdworkers on Amazon Mechanical Turk, with conversations from five domains: Books, Movies, Soccer, Music, and TV Series. The questions feature a variety of complex question phenomena like comparisons, aggregations, compositionality, and temporal reasoning. Answers are grounded in Wikidata entities to enable fair comparison across diverse methods. The data gathering setup was kept as natural as possible, with the annotators selecting entities of their choice from each of the five domains, and formulating the entire conversation in one session. All questions in a conversation are from the same Turker, who also provided gold answers to the questions. For suitability to knowledge graphs, questions were constrained to be objective or factoid in nature, but no other restrictive guidelines were set. A notable property of ConvQuestions is that several questions are not answerable by Wikidata alone (as of September 2019), but the required facts can, for example, be found in the open Web or in Wikipedia. For details, please refer to our CIKM 2019 full paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9. QALD-9-Plus makes it possible to train and test KGQA systems over DBpedia and Wikidata using questions in 8 different languages. Some of the questions have several alternative formulations in particular languages, which makes it possible to evaluate the robustness of KGQA systems and to train paraphrasing models. As the questions' translations were provided by native speakers, they are considered a "gold standard"; therefore, machine translation tools can also be trained and evaluated on the dataset. Please see also the GitHub repository: https://github.com/Perevalov/qald_9_plus
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
SQLite cache used to store the KGTK Kypher (https://kgtk.readthedocs.io/en/dev/transform/query/) queries for the paper "Creating and Querying Personalized Versions of Wikidata on a Laptop".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
A collection of SQLite database files containing all the data retrieved from Wikidata JSON Dump and Wikipedia SQL Dumps of 2021-09-20 in the context of KeySearchWiki dataset generation.
Detailed information about KeySearchWiki can be found on the GitHub page.
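A minimal sketch for inspecting one of these SQLite files with Python's built-in sqlite3 module is shown below; the file name is a placeholder, and the table layout is whatever the generation pipeline produced.

```python
# Minimal sketch: list the tables and their schemas in one of the SQLite files.
# "keysearchwiki.sqlite" is a placeholder name, not the actual file name.
import sqlite3

con = sqlite3.connect("keysearchwiki.sqlite")
for name, sql in con.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
):
    print(name)
    print(sql)
con.close()
```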
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
With the widespread use of knowledge graphs (KG) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs, as well as employed several metrics for measuring those biases. However, such studies lack a systematic exploration of how sensitive the bias measurements are to the source of data or to the embedding algorithms used. To address this research gap, in this work we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms, TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KGs can be altered by the specific algorithmic bias incorporated by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. We observe that the similarity of the biased occupations across demographics is minimal, which reflects the socio-cultural differences around the globe. We believe that this full-scale audit of the bias measurement pipeline will raise awareness in the community, provide insights related to the design choices of both data and algorithms, and encourage moving away from the popular dogma of "one-size-fits-all".
MIT License https://opensource.org/licenses/MIT
Dataset Card for Triple-to-Text Alignment Dataset
Dataset Summary
The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific… See the full description on the dataset page: https://huggingface.co/datasets/sven-h/wikidata_reference.
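Assuming the repository works with the standard datasets loader, the data can be pulled from the Hub as in the sketch below (the split name is an assumption).

```python
# Minimal sketch: load the dataset from the Hugging Face Hub with the `datasets`
# library. The split name ("train") is an assumption; field names follow the
# Data Fields list in the full description.
from datasets import load_dataset

ds = load_dataset("sven-h/wikidata_reference", split="train")
print(ds[0]["subject"], ds[0]["rel"], ds[0]["object"])
```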
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
Wikidata Thematic Subgraph Selection
These datasets have been designed to train and evaluate algorithms that select thematic subgraphs of interest in a large knowledge graph from seed entities of interest. Specifically, we consider Wikidata. Given a set of seed QIDs of interest, a graph expansion is performed following P31, P279, and (-)P279 (inverse P279) edges. Traversed classes that thematically deviate from the seed QIDs of interest should be pruned. Datasets thus consist of classes reached from seed QIDs that are labeled as "to prune" or "to keep".
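For illustration, a one-hop version of this expansion against the public Wikidata SPARQL endpoint could look like the sketch below; the endpoint usage and query shape are assumptions, and only the followed properties (P31, P279, inverse P279) come from the description above.

```python
# Illustrative sketch (not the authors' code) of one expansion step from a seed
# QID: follow P31 (instance of), P279 (subclass of), and inverse P279 edges via
# the public Wikidata SPARQL endpoint.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def expand_one_hop(qid):
    query = f"""
    SELECT DISTINCT ?reached WHERE {{
      {{ wd:{qid} wdt:P31 ?reached . }}
      UNION {{ wd:{qid} wdt:P279 ?reached . }}
      UNION {{ ?reached wdt:P279 wd:{qid} . }}
    }}"""
    resp = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                        headers={"User-Agent": "subgraph-expansion-sketch/0.1"})
    resp.raise_for_status()
    return [b["reached"]["value"].rsplit("/", 1)[-1]
            for b in resp.json()["results"]["bindings"]]

print(expand_one_hop("Q42"))  # classes reached from Douglas Adams (Q42)
```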
Available datasets
| Dataset | # Seed QIDs | # Labeled decisions | # Prune decisions | Min prune depth | Max prune depth | # Keep decisions | Min keep depth | Max keep depth | # Reached nodes up | # Reached nodes down |
|---|---|---|---|---|---|---|---|---|---|---|
| dataset1 | 455 | 5233 | 3464 | 1 | 4 | 1769 | 1 | 4 | 1507 | 2593609 |
| dataset2 | 105 | 982 | 388 | 1 | 2 | 594 | 1 | 3 | 1159 | 1247385 |
Each dataset folder contains:

- datasetX.csv: a CSV file containing one seed QID per line (not the complete URL, just the QID). This CSV file has no header.
- datasetX_labels.csv: a CSV file containing one seed QID per line and its label (not the complete URL, just the QID).
- datasetX_gold_decisions.csv: a CSV file with seed QIDs, reached QIDs, and the labeled decision (1: keep, 0: prune).
- datasetX_Y_folds.pkl: folds to train and test models based on the labeled decisions.

dataset1-2 consists of using dataset1 for training and dataset2 for testing.
License
Datasets are available under the CC BY-NC license.