Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.
The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both transductive and inductive data splits used in the original paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
KeySearchWiki is a dataset for evaluating keyword search systems over Wikidata.
The dataset was automatically generated by leveraging Wikidata and Wikipedia set categories (e.g., Category:American television directors) as data sources for both relevant entities and queries.
Relevant entities are gathered by carefully navigating the Wikipedia set categories hierarchy in all available languages. Furthermore, those categories are refined and combined to derive more complex queries.
Detailed information about KeySearchWiki and its generation can be found on the GitHub page.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Node2Vec embedding model trained on Czech Wikidata labels (from October 2020) using the gensim implementation of Word2Vec, with the following parameters for the random walks:
length of walk = 160
number of random walks = 40
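As a rough illustration of this setup, the sketch below generates plain random walks over a toy graph and feeds them to gensim's Word2Vec. The graph construction and the node2vec bias parameters (p, q) are assumptions; only the walk length (160) and the number of walks (40) come from the description above.

```python
# Minimal sketch (not the released training pipeline): Node2Vec-style training
# via random walks fed to gensim's Word2Vec. Assumptions: a toy graph stands in
# for the Czech Wikidata label graph, and walks are unbiased (no p/q bias).
import random
import networkx as nx
from gensim.models import Word2Vec

def generate_walks(graph, num_walks=40, walk_length=160, seed=0):
    """Generate unbiased random walks; node2vec's p/q bias is omitted for brevity."""
    rng = random.Random(seed)
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Toy graph standing in for the actual label graph.
g = nx.karate_club_graph()
walks = generate_walks(g)

# gensim 4.x API (older versions use `size` instead of `vector_size`).
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=1, sg=1, workers=4)
vector = model.wv["0"]  # embedding of node "0"
```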
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
A Wikidata sample containing all the entities with Russian labels as of March 2020. It consists of about 212M triples with 8.1M unique entities.
This snapshot is intended to be used with the RuBQ dataset. It mitigates the problem of Wikidata's dynamics: a reference answer may change over time as the knowledge base evolves. The sample guarantees the correctness of the queries and answers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.
Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.
wikidata-authors
This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.
The files are as follows:
- entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
  - eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
  - label: A human-readable label of this entity (extracted from Wikidata).
- relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
  - rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.
  - label: A human-readable label of this relation (extracted from Wikidata).
- triples.tsv: A tab-separated file of all triples in the dataset, each in the form <head eid>, <rid>, <tail eid>.
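A minimal loading sketch for these files follows; it assumes headerless, tab-separated files with the column order documented above, so adjust if the files ship with header rows.

```python
# Minimal sketch for loading the wikidata-authors TSV files with pandas.
# Assumptions: no header rows; columns ordered as in the field descriptions above.
import pandas as pd

entities = pd.read_csv("entities.tsv", sep="\t", header=None, names=["eid", "label"])
relations = pd.read_csv("relations.tsv", sep="\t", header=None, names=["rid", "label"])
triples = pd.read_csv("triples.tsv", sep="\t", header=None, names=["head", "rid", "tail"])

# Attach human-readable labels to each triple.
labeled = (
    triples
    .merge(entities.rename(columns={"eid": "head", "label": "head_label"}), on="head")
    .merge(relations.rename(columns={"label": "rel_label"}), on="rid")
    .merge(entities.rename(columns={"eid": "tail", "label": "tail_label"}), on="tail")
)
print(labeled.head())
```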
fb15krr-linked
This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.
The files are as follows:
- entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:
  - mid: The Freebase machine ID (MID) of this entity.
  - wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.
  - label: A human-readable label of this entity (extracted from Wikidata).
  - types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].
- relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:
  - label: The hierarchical Freebase label of this relation.
- triples.tsv: A tab-separated file of all triples in the dataset, each in the form <head mid>, <relation label>, <tail mid>.
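To make the reverse-relation removal mentioned above more concrete, here is a rough sketch of the kind of inverse-pair filtering described by [Toutanova and Chen CVSC2015]. The overlap threshold and tie-breaking are assumptions, not the exact procedure used to build this dataset.

```python
# Hypothetical sketch: two relations are treated as inverses if (almost) every
# (head, tail) pair of one appears as (tail, head) under the other, in which
# case one of them would be dropped. The 0.97 threshold is an assumption.
from collections import defaultdict

def find_reverse_pairs(triples, threshold=0.97):
    """triples: iterable of (head, relation, tail) tuples."""
    pairs_by_rel = defaultdict(set)
    for h, r, t in triples:
        pairs_by_rel[r].add((h, t))
    reverse_pairs = []
    rels = list(pairs_by_rel)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            inv = {(t, h) for h, t in pairs_by_rel[r2]}
            overlap = len(pairs_by_rel[r1] & inv)
            denom = min(len(pairs_by_rel[r1]), len(inv))
            if denom and overlap / denom >= threshold:
                reverse_pairs.append((r1, r2))
    return reverse_pairs

# Toy example: r2 mirrors r1, so the pair is flagged and one would be removed.
toy = [("a", "r1", "b"), ("b", "r2", "a"), ("c", "r1", "d"), ("d", "r2", "c")]
print(find_reverse_pairs(toy))  # [('r1', 'r2')]
```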
MIT License https://opensource.org/licenses/MIT
Dataset Summary

The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields

Each row in the dataset consists of the following fields:

- subject (str): The subject entity of the knowledge graph triple.
- rel (str): The relation that connects the subject and object.
- object (str): The object entity of the knowledge graph triple.
- text (str): A natural language sentence that entails the given triple.
- validation (str): LLM-based validation results, including: Fluent Sentence(s): TRUE/FALSE; Subject mentioned in Text: TRUE/FALSE; Relation mentioned in Text: TRUE/FALSE; Object mentioned in Text: TRUE/FALSE; Fact Entailed By Text: TRUE/FALSE; Final Answer: TRUE/FALSE.
- reference_url (str): URL of the web source from which the text was extracted.
- subj_qid (str): Wikidata QID for the subject entity.
- rel_id (str): Wikidata Property ID for the relation.
- obj_qid (str): Wikidata QID for the object entity.

Dataset Creation

The dataset was created through the following process:

1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
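Since the validation field is stored as a single string of labeled TRUE/FALSE entries, a small parsing sketch may be useful; the exact line formatting assumed here is inferred from the field description above, not from the released files.

```python
# Minimal sketch: parse the `validation` string into booleans and keep only rows
# whose "Final Answer" is TRUE. The "Label: TRUE/FALSE"-per-line format is an
# assumption based on the field description, not on the actual data files.
def parse_validation(validation: str) -> dict:
    checks = {}
    for line in validation.splitlines():
        if ":" in line:
            label, _, value = line.rpartition(":")
            checks[label.strip()] = value.strip().upper() == "TRUE"
    return checks

def keep_row(row: dict) -> bool:
    return parse_validation(row["validation"]).get("Final Answer", False)

# Toy row mirroring the documented fields.
row = {
    "subject": "Douglas Adams",
    "rel": "educated at",
    "object": "St John's College",
    "text": "Douglas Adams was educated at St John's College, Cambridge.",
    "validation": "Fluent Sentence(s): TRUE\nSubject mentioned in Text: TRUE\n"
                  "Relation mentioned in Text: TRUE\nObject mentioned in Text: TRUE\n"
                  "Fact Entailed By Text: TRUE\nFinal Answer: TRUE",
}
print(keep_row(row))  # True
```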
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.
For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
For the 2024 edition, see: https://sem-tab-challenge.github.io/2024/
Note on License: This data includes data from the following sources. Refer to each source for license details:
- Wikidata https://www.wikidata.org/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This dataset contains triples curated from Wikidata surrounding news events with causal relations, and is released as part of our WWW'23 paper, "Event Prediction using Case-Based Reasoning over Knowledge Graphs".
Starting from a set of classes that we consider to be types of "events", we queried Wikidata to collect entities that were an instanceOf an event class and that were connected to another such event entity by a causal triple (https://www.wikidata.org/wiki/Wikidata:List_of_properties/causality). For all such cause-effect event pairs, we then collected a 3-hop neighborhood of outgoing triples.
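As an illustration of the kind of Wikidata query involved, the sketch below retrieves cause-effect pairs between event entities from the public SPARQL endpoint. The event class (Q1190554, occurrence) and causal property (P1542, has effect) are examples only, not the paper's exact class list or full causal-property set.

```python
# Illustrative sketch only: collect cause-effect pairs between "event" entities
# from the public Wikidata SPARQL endpoint. Q1190554 ("occurrence") and P1542
# ("has effect") stand in for the paper's event classes and causal properties.
# On the live endpoint this query may need narrowing (or a local dump) to avoid timeouts.
import requests

QUERY = """
SELECT ?cause ?effect WHERE {
  ?cause wdt:P31/wdt:P279* wd:Q1190554 .
  ?effect wdt:P31/wdt:P279* wd:Q1190554 .
  ?cause wdt:P1542 ?effect .
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "event-cbr-sketch/0.1"},
)
resp.raise_for_status()
for b in resp.json()["results"]["bindings"]:
    print(b["cause"]["value"], "->", b["effect"]["value"])
```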
CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Relation extraction dataset with its knowledge graph.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Experiment results together with queries, runs, and relevance judgments produced in the context of evaluating different retrieval methods using the KeySearchWiki dataset.
Detailed information about KeySearchWiki and its generation can be found on the GitHub page.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Statistics about the Wikidata SPARQL logs used.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
WikiWiki is a dataset for understanding entities and their place in a taxonomy of knowledge—their types. It consists of entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
ConvQuestions is the first realistic benchmark for conversational question answering over knowledge graphs. It contains 11,200 conversations which can be evaluated over Wikidata. They are compiled from the inputs of 70 Master crowdworkers on Amazon Mechanical Turk, with conversations from five domains: Books, Movies, Soccer, Music, and TV Series. The questions feature a variety of complex question phenomena like comparisons, aggregations, compositionality, and temporal reasoning. Answers are grounded in Wikidata entities to enable fair comparison across diverse methods. The data gathering setup was kept as natural as possible, with the annotators selecting entities of their choice from each of the five domains, and formulating the entire conversation in one session. All questions in a conversation are from the same Turker, who also provided gold answers to the questions. For suitability to knowledge graphs, questions were constrained to be objective or factoid in nature, but no other restrictive guidelines were set. A notable property of ConvQuestions is that several questions are not answerable by Wikidata alone (as of September 2019), but the required facts can, for example, be found in the open Web or in Wikipedia. For details, please refer to our CIKM 2019 full paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9. QALD-9-Plus makes it possible to train and test KGQA systems over DBpedia and Wikidata using questions in 8 different languages. Some of the questions have several alternative formulations in particular languages, which makes it possible to evaluate the robustness of KGQA systems and to train paraphrasing models. As the questions' translations were provided by native speakers, they are considered a "gold standard"; therefore, machine translation tools can also be trained and evaluated on the dataset. Please see also the GitHub repository: https://github.com/Perevalov/qald_9_plus
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
SQLite cache used to store the KGTK Kypher (https://kgtk.readthedocs.io/en/dev/transform/query/) queries for the paper "Creating and Querying Personalized Versions of Wikidata on a Laptop".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
A collection of SQLite database files containing all the data retrieved from Wikidata JSON Dump and Wikipedia SQL Dumps of 2021-09-20 in the context of KeySearchWiki dataset generation.
Detailed information about KeySearchWiki can be found on the GitHub page.
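A minimal sketch for inspecting one of these SQLite files with Python's built-in sqlite3 module is shown below; the file name is a placeholder, and the table layout is whatever the generation pipeline produced.

```python
# Minimal sketch: list the tables and their schemas in one of the SQLite files.
# "keysearchwiki.sqlite" is a placeholder name, not the actual file name.
import sqlite3

con = sqlite3.connect("keysearchwiki.sqlite")
for name, sql in con.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
):
    print(name)
    print(sql)
con.close()
```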
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
With the widespread use of knowledge graphs (KG) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs, as well as employed several metrics for measuring those biases. However, such studies lack a systematic exploration of how sensitive the bias measurements are to the source of data or to the embedding algorithms used. To address this research gap, in this work we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms, TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KGs can be altered by the specific algorithmic bias incorporated by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. We observe that the similarity of the biased occupations across demographics is minimal, which reflects the socio-cultural differences around the globe. We believe that this full-scale audit of the bias measurement pipeline will raise awareness in the community, provide insights related to the design choices of both data and algorithms, and encourage moving away from the popular dogma of "one-size-fits-all".
MIT License https://opensource.org/licenses/MIT
Dataset Card for Triple-to-Text Alignment Dataset
Dataset Summary
The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific… See the full description on the dataset page: https://huggingface.co/datasets/sven-h/wikidata_reference.
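Assuming the repository works with the standard datasets loader, the data can be pulled from the Hub as in the sketch below (the split name is an assumption).

```python
# Minimal sketch: load the dataset from the Hugging Face Hub with the `datasets`
# library. The split name ("train") is an assumption; field names follow the
# Data Fields list in the full description.
from datasets import load_dataset

ds = load_dataset("sven-h/wikidata_reference", split="train")
print(ds[0]["subject"], ds[0]["rel"], ds[0]["object"])
```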
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
Wikidata Thematic Subgraph Selection
These datasets have been designed to train and evaluate algorithms that select thematic subgraphs of interest in a large knowledge graph from seed entities of interest. Specifically, we consider Wikidata. Given a set of seed QIDs of interest, a graph expansion is performed following P31, P279, and (-)P279 (inverse P279) edges. Traversed classes that thematically deviate from the seed QIDs of interest should be pruned. Datasets thus consist of classes reached from seed QIDs that are labeled as "to prune" or "to keep".
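For illustration, a one-hop version of this expansion against the public Wikidata SPARQL endpoint could look like the sketch below; the endpoint usage and query shape are assumptions, and only the followed properties (P31, P279, inverse P279) come from the description above.

```python
# Illustrative sketch (not the authors' code) of one expansion step from a seed
# QID: follow P31 (instance of), P279 (subclass of), and inverse P279 edges via
# the public Wikidata SPARQL endpoint.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def expand_one_hop(qid):
    query = f"""
    SELECT DISTINCT ?reached WHERE {{
      {{ wd:{qid} wdt:P31 ?reached . }}
      UNION {{ wd:{qid} wdt:P279 ?reached . }}
      UNION {{ ?reached wdt:P279 wd:{qid} . }}
    }}"""
    resp = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                        headers={"User-Agent": "subgraph-expansion-sketch/0.1"})
    resp.raise_for_status()
    return [b["reached"]["value"].rsplit("/", 1)[-1]
            for b in resp.json()["results"]["bindings"]]

print(expand_one_hop("Q42"))  # classes reached from Douglas Adams (Q42)
```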
Available datasets
| Dataset | # Seed QIDs | # Labeled decisions | # Prune decisions | Min prune depth | Max prune depth | # Keep decisions | Min keep depth | Max keep depth | # Reached nodes up | # Reached nodes down |
|---|---|---|---|---|---|---|---|---|---|---|
| dataset1 | 455 | 5233 | 3464 | 1 | 4 | 1769 | 1 | 4 | 1507 | 2593609 |
| dataset2 | 105 | 982 | 388 | 1 | 2 | 594 | 1 | 3 | 1159 | 1247385 |
Each dataset folder contains:

- datasetX.csv: a CSV file containing one seed QID per line (not the complete URL, just the QID). This CSV file has no header.
- datasetX_labels.csv: a CSV file containing one seed QID per line and its label (not the complete URL, just the QID).
- datasetX_gold_decisions.csv: a CSV file with seed QIDs, reached QIDs, and the labeled decision (1: keep, 0: prune).
- datasetX_Y_folds.pkl: folds to train and test models based on the labeled decisions.

dataset1-2 consists of using dataset1 for training and dataset2 for testing.
License
Datasets are available under the CC BY-NC license.