44 datasets found
  1. Wikidata5M Dataset

    • paperswithcode.com
    Updated Jun 14, 2023
    Cite
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang (2023). Wikidata5M Dataset [Dataset]. https://paperswithcode.com/dataset/wikidata5m
    Dataset updated
    Jun 14, 2023
    Authors
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang
    Description

    Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.

    The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both transductive and inductive data splits used in the original paper.
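    For illustration, a minimal Python sketch for reading a Wikidata5M-style triple file follows, assuming the common distribution layout of one tab-separated (head QID, relation PID, tail QID) triple per line; the file name used is a placeholder, not taken from the record.

```python
# Reads tab-separated (head QID, relation PID, tail QID) triples.
from collections import defaultdict

def load_triples(path):
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, rel, tail = line.rstrip("\n").split("\t")
            triples.append((head, rel, tail))
    return triples

if __name__ == "__main__":
    # Hypothetical local file name for the transductive training split.
    triples = load_triples("wikidata5m_transductive_train.txt")
    out_degree = defaultdict(int)
    for head, _, _ in triples:
        out_degree[head] += 1
    print(len(triples), "triples,", len(out_degree), "distinct head entities")
```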

  2. KeySearchWiki

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Feb 14, 2022
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2022). KeySearchWiki [Dataset]. http://doi.org/10.5281/zenodo.4955200
    Available download formats: zip
    Dataset updated
    Feb 14, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KeySearchWiki is a dataset for evaluating keyword search systems over Wikidata.

    The dataset was automatically generated by leveraging Wikidata and Wikipedia set categories (e.g., Category:American television directors) as data sources for both relevant entities and queries.
    Relevant entities are gathered by carefully navigating the Wikipedia set categories hierarchy in all available languages. Furthermore, those categories are refined and combined to derive more complex queries.

    Detailed information about KeySearchWiki and its generation can be found on the Github page.
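    As an illustration of the category-based idea (not the authors' dump-based generation pipeline), the sketch below uses the public MediaWiki API to list members of one Wikipedia set category; the category name is the example mentioned above.

```python
# Lists direct members of one Wikipedia set category via the public MediaWiki API.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # paginate with cmcontinue

for title in category_members("Category:American television directors"):
    print(title)
```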

  3. Node2Vec model - Czech Wikidata (knowledge graph / labels / l160 / rw40)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 13, 2021
    Cite
    Skopal, Tomáš (2021). Node2Vec model - Czech Wikidata (knowledge graph / labels / l160 / rw40) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4433736
    Dataset updated
    Jan 13, 2021
    Dataset provided by
    Bernhauer, David
    Skopal, Tomáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Node2Vec embedding model trained on Czech Wikidata (from October 2020) labels, using the gensim implementation of Word2Vec, with the following parameters for the random walks:

    length of walk = 160

    number of random walks = 40
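    A minimal sketch of this setup follows: random walks over a graph fed to gensim's Word2Vec. Only the walk length (160) and the number of walks per node (40) come from the record; the graph used, the embedding size, and the remaining Word2Vec hyperparameters are placeholder assumptions, and the walks are unbiased first-order walks rather than Node2Vec's p/q-biased second-order walks.

```python
# Uniform random walks over a graph, trained with gensim Word2Vec (skip-gram).
import random
import networkx as nx
from gensim.models import Word2Vec

WALK_LENGTH = 160      # from the record
WALKS_PER_NODE = 40    # from the record

def random_walks(graph, walk_length, walks_per_node):
    walks = []
    nodes = list(graph.nodes())
    for _ in range(walks_per_node):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(node) for node in walk])
    return walks

graph = nx.karate_club_graph()  # stand-in for the Czech Wikidata label graph
walks = random_walks(graph, WALK_LENGTH, WALKS_PER_NODE)
model = Word2Vec(walks, vector_size=128, window=5, min_count=0, sg=1, workers=4)
model.save("node2vec_czech_wikidata.model")  # hypothetical output path
```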

  4. RuWikidata8M

    • zenodo.org
    zip
    Updated Apr 18, 2020
    Cite
    Vladislav Korablinov (2020). RuWikidata8M [Dataset]. http://doi.org/10.5281/zenodo.3751761
    Available download formats: zip
    Dataset updated
    Apr 18, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vladislav Korablinov
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A Wikidata sample containing all the entities with Russian labels as of March 2020. It consists of about 212M triples with 8.1M unique entities.

    This snapshot is intended to be used with the RuBQ dataset. It mitigates the problem of Wikidata's dynamics: a reference answer may change over time as the knowledge base evolves. The sample guarantees the correctness of the queries and answers.

  5. Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with Calibration

    • zenodo.org
    zip
    Updated Apr 2, 2020
    Cite
    Tara Safavi; Danai Koutra; Edgar Meij (2020). Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with Calibration [Dataset]. http://doi.org/10.5281/zenodo.3738264
    Available download formats: zip
    Dataset updated
    Apr 2, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tara Safavi; Danai Koutra; Edgar Meij
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.

    Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.

    wikidata-authors

    This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.

    The files are as follows:

    entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:

    • eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
    • label: A human-readable label of this entity (extracted from Wikidata).

    relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:

    • rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.
    • label: A human-readable label of this relation (extracted from Wikidata).

    triples.tsv: A tab-separated file of all triples in the dataset, each line in the form of head entity, relation, tail entity.

    fb15krr-linked

    This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.

    The files are as follows:

    entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:

    • mid: The Freebase machine ID (MID) of this entity.
    • wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.
    • label: A human-readable label of this entity (extracted from Wikidata).
    • types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].

    relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:

    • label: The hierarchical Freebase label of this relation.

    triples.tsv: A tab-separated file of all triples in the dataset, each line in the form of head entity, relation, tail entity.
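    A minimal loading and splitting sketch follows, assuming the files are plain tab-separated tables without header rows (the record does not state this); the column names are taken from the field lists above.

```python
# Load the TSV files and perform one random 80/10/10 train/validation/test split,
# as the authors recommend (they repeat this five times with different seeds).
import pandas as pd

entities = pd.read_csv("entities.tsv", sep="\t", header=None, names=["eid", "label"])
relations = pd.read_csv("relations.tsv", sep="\t", header=None, names=["rid", "label"])
triples = pd.read_csv("triples.tsv", sep="\t", header=None, names=["head", "rel", "tail"])

shuffled = triples.sample(frac=1.0, random_state=0).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.8 * n)]
valid = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
test = shuffled.iloc[int(0.9 * n) :]
print(len(train), len(valid), len(test))
```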

  6. Wikidata Reference

    • figshare.com
    application/gzip
    Updated Mar 17, 2025
    Cite
    Sven Hertling; Nandana Mihindukulasooriya (2025). Wikidata Reference [Dataset]. http://doi.org/10.6084/m9.figshare.28602170.v2
    Available download formats: application/gzip
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    figshare
    Authors
    Sven Hertling; Nandana Mihindukulasooriya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Summary

    The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

    Data Fields

    Each row in the dataset consists of the following fields:

    • subject (str): The subject entity of the knowledge graph triple.
    • rel (str): The relation that connects the subject and object.
    • object (str): The object entity of the knowledge graph triple.
    • text (str): A natural language sentence that entails the given triple.
    • validation (str): LLM-based validation results, including: Fluent Sentence(s) (TRUE/FALSE), Subject mentioned in Text (TRUE/FALSE), Relation mentioned in Text (TRUE/FALSE), Object mentioned in Text (TRUE/FALSE), Fact Entailed By Text (TRUE/FALSE), Final Answer (TRUE/FALSE).
    • reference_url (str): URL of the web source from which the text was extracted.
    • subj_qid (str): Wikidata QID for the subject entity.
    • rel_id (str): Wikidata Property ID for the relation.
    • obj_qid (str): Wikidata QID for the object entity.

    Dataset Creation

    The dataset was created through the following process:

    1. Triple-Reference Sampling and Extraction: All relations from Wikidata were extracted using SPARQL queries. A sample of KG triples with associated reference URLs was collected for each relation.
    2. Domain Analysis and Web Scraping: URLs were grouped by domain, and sampled pages were analyzed to determine their primary language. English-language web pages were scraped and processed to extract plaintext content.
    3. LLM-Based Text Span Selection and Validation: LLMs were used to identify text spans from web content that correspond to KG triples. A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple. The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
    4. Final Dataset Statistics: 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs. After filtering for English content, 458K triple-web content pairs were processed with LLMs. 80.5K validated triple-text alignments were included in the final dataset.
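    A minimal filtering sketch follows, assuming the gzipped archive unpacks to a CSV whose columns match the field list above; the file name, delimiter, and the exact textual form of the validation field are assumptions.

```python
# Keep only alignments whose LLM validation marked the fact as entailed overall.
import pandas as pd

df = pd.read_csv("wikidata_reference.csv.gz", compression="gzip")  # hypothetical file name
validated = df[df["validation"].str.contains("Final Answer: TRUE", na=False)]

print(len(df), "rows total,", len(validated), "validated triple-text alignments")
print(validated[["subject", "rel", "object", "reference_url"]].head())
```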

  7. SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets - WikidataTables2024R1 and WikidataTables2024R2

    • zenodo.org
    • explore.openaire.eu
    application/gzip
    Updated Nov 22, 2024
    Cite
    Oktie Hassanzadeh; Vasilis Efthymiou; Jiaoyan Chen (2024). SemTab 2024: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets - WikidataTables2024R1 and WikidataTables2024R2 [Dataset]. http://doi.org/10.5281/zenodo.14207232
    Available download formats: application/gzip
    Dataset updated
    Nov 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oktie Hassanzadeh; Vasilis Efthymiou; Jiaoyan Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.

    For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

    For 2024 edition, see: https://sem-tab-challenge.github.io/2024/

    Note on License: This data includes data from the following sources. Refer to each source for license details:
    - Wikidata https://www.wikidata.org/

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  8. Wikidata Causal Event Triple Data

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 7, 2023
    Cite
    Sola; Debarun; Oktie (2023). Wikidata Causal Event Triple Data [Dataset]. http://doi.org/10.5281/zenodo.7196049
    Available download formats: bin
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sola; Debarun; Oktie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains triples curated from Wikidata surrounding news events with causal relations, and is released as part of our WWW'23 paper, "Event Prediction using Case-Based Reasoning over Knowledge Graphs".

    Starting from a set of classes that we consider to be types of "events", we queried Wikidata to collect entities that were an instanceOf an event class and that were connected to another such event entity by a causal triple (https://www.wikidata.org/wiki/Wikidata:List_of_properties/causality). For all such cause-effect event pairs, we then collected a 3-hop neighborhood of outgoing triples.
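    The sketch below is an illustrative query against the public Wikidata SPARQL endpoint, not the authors' exact extraction code: Q1656682 ("event") stands in for their set of event classes, and only one causal property from the referenced list is shown (P828, "has cause").

```python
# Find pairs of event entities connected by a causal property on Wikidata.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?cause ?effect WHERE {
  ?effect wdt:P828 ?cause .                 # effect "has cause" cause
  ?cause  wdt:P31/wdt:P279* wd:Q1656682 .   # cause is an instance of an event class
  ?effect wdt:P31/wdt:P279* wd:Q1656682 .   # effect is an instance of an event class
}
LIMIT 100
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "causal-event-sketch/0.1"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["cause"]["value"], "->", row["effect"]["value"])
```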

  9. CoDEx Medium Dataset

    • paperswithcode.com
    Cite
    Tara Safavi; Danai Koutra, CoDEx Medium Dataset [Dataset]. https://paperswithcode.com/dataset/codex-medium
    Authors
    Tara Safavi; Danai Koutra
    Description

    CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.

  10. Wiki80

    • figshare.com
    json
    Updated Oct 1, 2022
    Cite
    Hongmin Xiao (2022). Wiki80 [Dataset]. http://doi.org/10.6084/m9.figshare.19323371.v1
    Available download formats: json
    Dataset updated
    Oct 1, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hongmin Xiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Relation extraction dataset with its knowledge graph.

  11. KeySearchWiki-experiments

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 14, 2022
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2022). KeySearchWiki-experiments [Dataset]. http://doi.org/10.5281/zenodo.5761138
    Available download formats: zip
    Dataset updated
    Feb 14, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experiment results together with queries, runs, and relevance judgments produced in the context of evaluating different retrieval methods using the KeySearchWiki dataset.

    Detailed information about KeySearchWiki and its generation can be found on the Github page.

  12. Statistics about the Wikidata SPARQL logs used.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Daniel Fernández-Álvarez; Johannes Frey; Jose Emilio Labra Gayo; Daniel Gayo-Avello; Sebastian Hellmann (2023). Statistics about the Wikidata SPARQL logs used. [Dataset]. http://doi.org/10.1371/journal.pone.0252862.t003
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Daniel Fernández-Álvarez; Johannes Frey; Jose Emilio Labra Gayo; Daniel Gayo-Avello; Sebastian Hellmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistics about the Wikidata SPARQL logs used.

  13. Data from: WikiWiki

    • opendatalab.com
    zip
    Updated Apr 1, 2023
    Cite
    Amazon Research (2023). WikiWiki [Dataset]. https://opendatalab.com/OpenDataLab/WikiWiki
    Available download formats: zip (25410 bytes)
    Dataset updated
    Apr 1, 2023
    Dataset provided by
    Amazon Research
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WikiWiki is a dataset for understanding entities and their place in a taxonomy of knowledge—their types. It consists of entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types.

  14. ConvQuestions

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Oct 9, 2019
    Cite
    Max Planck Institute for Informatics (2019). ConvQuestions [Dataset]. https://opendatalab.com/OpenDataLab/ConvQuestions
    Available download formats: zip (35322132 bytes)
    Dataset updated
    Oct 9, 2019
    Dataset provided by
    Amazon (http://amazon.com/)
    Max Planck Institute for Informatics
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ConvQuestions is the first realistic benchmark for conversational question answering over knowledge graphs. It contains 11,200 conversations which can be evaluated over Wikidata. They are compiled from the inputs of 70 Master crowdworkers on Amazon Mechanical Turk, with conversations from five domains: Books, Movies, Soccer, Music, and TV Series. The questions feature a variety of complex question phenomena like comparisons, aggregations, compositionality, and temporal reasoning. Answers are grounded in Wikidata entities to enable fair comparison across diverse methods. The data gathering setup was kept as natural as possible, with the annotators selecting entities of their choice from each of the five domains, and formulating the entire conversation in one session. All questions in a conversation are from the same Turker, who also provided gold answers to the questions. For suitability to knowledge graphs, questions were constrained to be objective or factoid in nature, but no other restrictive guidelines were set. A notable property of ConvQuestions is that several questions are not answerable by Wikidata alone (as of September 2019), but the required facts can, for example, be found in the open Web or in Wikipedia. For details, please refer to our CIKM 2019 full paper.

  15. QALD-9-Plus

    • figshare.com
    • paperswithcode.com
    txt
    Updated Dec 21, 2021
    Cite
    Aleksandr Perevalov; Andreas Both; Dennis Diefenbach; Ricardo Usbeck (2021). QALD-9-Plus [Dataset]. http://doi.org/10.6084/m9.figshare.16864273.v7
    Available download formats: txt
    Dataset updated
    Dec 21, 2021
    Dataset provided by
    figshare
    Authors
    Aleksandr Perevalov; Andreas Both; Dennis Diefenbach; Ricardo Usbeck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9. QALD-9-Plus makes it possible to train and test KGQA systems over DBpedia and Wikidata using questions in 8 different languages. Some of the questions have several alternative formulations in particular languages, which enables evaluating the robustness of KGQA systems and training paraphrasing models. As the questions' translations were provided by native speakers, they are considered a "gold standard"; therefore, machine translation tools can also be trained and evaluated on the dataset. Please see also the GitHub repository: https://github.com/Perevalov/qald_9_plus
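    A minimal reading sketch follows, assuming the usual QALD JSON layout (a top-level "questions" list whose entries carry multilingual "question" strings and a SPARQL "query"); the file name is a placeholder.

```python
# Print the English question and the start of its SPARQL query for a few entries.
import json

with open("qald_9_plus_train_wikidata.json", encoding="utf-8") as f:  # hypothetical file name
    data = json.load(f)

for q in data["questions"][:5]:
    texts = {t["language"]: t["string"] for t in q["question"]}
    sparql = q["query"]["sparql"]
    print(texts.get("en"), "->", sparql[:80], "...")
```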

  16. Cache for KGTK queries in Creating and Querying Personalized Versions of Wikidata on a Laptop

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jul 30, 2021
    Cite
    Pedro Szekely (2021). Cache for KGTK queries in Creating and Querying Personalized Versions of Wikidata on a Laptop [Dataset]. http://doi.org/10.5281/zenodo.5146407
    Available download formats: application/gzip
    Dataset updated
    Jul 30, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pedro Szekely
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SQLite cache used to store the KGTK Kypher (https://kgtk.readthedocs.io/en/dev/transform/query/) queries for the paper "Creating and Querying Personalized Versions of Wikidata on a Laptop".
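    A schema-agnostic sketch for peeking into such a cache follows; nothing is assumed about its internal table layout beyond it being an ordinary SQLite file, and the file name is a placeholder.

```python
# List the tables in the cache and their row counts.
import sqlite3

conn = sqlite3.connect("kgtk_kypher_cache.sqlite3.db")  # hypothetical file name
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
for table in tables:
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    print(f"{table}: {count} rows")
conn.close()
```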

  17. KeySearchWiki-cache

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 9, 2023
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2023). KeySearchWiki-cache [Dataset]. http://doi.org/10.5281/zenodo.5752018
    Available download formats: zip
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of SQLite database files containing all the data retrieved from Wikidata JSON Dump and Wikipedia SQL Dumps of 2021-09-20 in the context of KeySearchWiki dataset generation.

    Detailed information about KeySearchWiki can be found on the Github page.

  18. Data from: Diversity matters: Robustness of bias measurements in Wikidata

    • zenodo.org
    • data.niaid.nih.gov
    tsv
    Updated May 1, 2023
    Cite
    Paramita das; Sai Keerthana Karnam; Anirban Panda; Bhanu Prakash Reddy Guda; Soumya Sarkar; Animesh Mukherjee (2023). Diversity matters: Robustness of bias measurements in Wikidata [Dataset]. http://doi.org/10.48550/arxiv.2302.14027
    Available download formats: tsv
    Dataset updated
    May 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paramita das; Sai Keerthana Karnam; Anirban Panda; Bhanu Prakash Reddy Guda; Soumya Sarkar; Animesh Mukherjee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the widespread use of knowledge graphs (KG) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs, as well as employed several metrics for measuring the biases. However, such studies lack the systematic exploration of the sensitivity of the bias measurements, through varying sources of data, or the embedding algorithms used. To address this research gap, in this work, we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms, TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KG can be altered by specific algorithm bias as incorporated by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. We observe that the similarity of the biased occupations across demographics is minimal, which reflects the socio-cultural differences around the globe. We believe that this full-scale audit of the bias measurement pipeline will raise awareness among the community while deriving insights related to design choices of both data and algorithms, and refrain from the popular dogma of "one-size-fits-all".

  19. wikidata_reference

    • huggingface.co
    Cite
    Sven Hertling, wikidata_reference [Dataset]. https://huggingface.co/datasets/sven-h/wikidata_reference
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Sven Hertling
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Triple-to-Text Alignment Dataset

      Dataset Summary
    

    The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific… See the full description on the dataset page: https://huggingface.co/datasets/sven-h/wikidata_reference.
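    A minimal loading sketch with the Hugging Face datasets library follows; the dataset id comes from the URL above, while the split name is an assumption.

```python
from datasets import load_dataset

# Load the Triple-to-Text Alignment data from the Hugging Face Hub.
ds = load_dataset("sven-h/wikidata_reference", split="train")  # split name assumed
print(ds)
print(ds[0])  # one aligned (triple, text, validation, reference_url) record
```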

  20. Wikidata Thematic Subgraph Selection

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 24, 2024
    Cite
    Lucas Jarnac; Miguel Couceiro; Pierre Monnin (2024). Wikidata Thematic Subgraph Selection [Dataset]. http://doi.org/10.5281/zenodo.8091584
    Available download formats: zip
    Dataset updated
    May 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucas Jarnac; Miguel Couceiro; Pierre Monnin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Wikidata Thematic Subgraph Selection

    These datasets have been designed to train and evaluate algorithms that select thematic subgraphs of interest in a large knowledge graph from seed entities of interest. Specifically, we consider Wikidata. Given a set of seed QIDs of interest, a graph expansion is performed following P31, P279, and (-)P279 edges. Traversed classes that thematically deviate from the seed QIDs of interest should be pruned. The datasets thus consist of classes reached from seed QIDs that are labeled as "to prune" or "to keep".

    Available datasets

    Dataset | # Seed QIDs | # Labeled decisions | # Prune decisions | Min prune depth | Max prune depth | # Keep decisions | Min keep depth | Max keep depth | # Reached nodes up | # Reached nodes down
    dataset1 | 455 | 5233 | 3464 | 1 | 4 | 1769 | 1 | 4 | 1507 | 2593609
    dataset2 | 105 | 982 | 388 | 1 | 2 | 594 | 1 | 3 | 1159 | 1247385

    Each dataset folder contains

    • datasetX.csv: a CSV file containing one seed QID per line (not the complete URL, just the QID). This CSV file has no header.
    • datasetX_labels.csv: a CSV file containing one seed QID per line and its label (not the complete URL, just the QID)
    • datasetX_gold_decisions.csv: a CSV file with seed QIDs, reached QIDs, and the labeled decision (1: keep, 0: prune)
    • datasetX_Y_folds.pkl: folds to train and test models based on the labeled decisions

    The dataset1-2 setting uses dataset1 for training and dataset2 for testing.
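    A minimal loading sketch based on the file descriptions above follows; header presence, the column order in datasetX_gold_decisions.csv, and the fold-file name are assumptions.

```python
# Load seed QIDs, gold prune/keep decisions, and the pickled folds for dataset1.
import pickle
import pandas as pd

seeds = pd.read_csv("dataset1/dataset1.csv", header=None, names=["seed_qid"])
decisions = pd.read_csv(
    "dataset1/dataset1_gold_decisions.csv",
    header=None,
    names=["seed_qid", "reached_qid", "decision"],  # 1 = keep, 0 = prune
)

with open("dataset1/dataset1_5_folds.pkl", "rb") as f:  # hypothetical fold-file name
    folds = pickle.load(f)

print(len(seeds), "seed QIDs,", len(decisions), "labeled decisions")
print(decisions["decision"].value_counts())
```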

    License

    Datasets are available under the CC BY-NC license.
