65 datasets found
  1. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +2 more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo - http://zenodo.org/
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data about its own contents and the interactions around them, as well as several types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting Wikipedia data and to expand its analytical potential, we collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, limited in this case to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all in tsv format, built under a relational structure. The main file, which acts as the core of the dataset, is the page file; it is accompanied by 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 further files that act as "intermediate tables", making it possible to connect the pages both with those entities and with each other (the page_category, page_url, page_pub and page_link files). A minimal loading sketch is given below.
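    As an illustration only of how the relational structure can be navigated: the file names and key columns below (page_id, category_id) are assumptions; the real schema is documented in Dataset_summary.

    import pandas as pd

    # Hypothetical file names and key columns; check Dataset_summary for the actual schema.
    pages = pd.read_csv("page.tsv", sep="\t")
    categories = pd.read_csv("category.tsv", sep="\t")
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    # Use the intermediate table to attach category information to pages.
    pages_with_categories = (
        page_category
        .merge(pages, on="page_id", how="left")
        .merge(categories, on="category_id", how="left")
    )
    print(pages_with_categories.head())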

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  2. Wikipedia Knowledge Graph

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Vivian Silva (2023). Wikipedia Knowledge Graph [Dataset]. http://doi.org/10.6084/m9.figshare.9896399.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare - http://figshare.com/
    Authors
    Vivian Silva
    License

    MIT License - https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Knowledge graph generated from definitions extracted from Wikipedia articles.

  3. wiki-kg-dataset

    • huggingface.co
    Updated Nov 21, 2025
    Cite
    Automated Scientist (2025). wiki-kg-dataset [Dataset]. https://huggingface.co/datasets/AutomatedScientist/wiki-kg-dataset
    Explore at:
    Dataset updated
    Nov 21, 2025
    Authors
    Automated Scientist
    License

    MIT License - https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Wikipedia Knowledge Graph Dataset

    This dataset contains 49897 Wikipedia articles processed with two different models (sherlock_think and polaris_alpha) to extract structured knowledge.

      Data Format
    

    The knowledge graphs are stored in Wolfram Language format, containing structured entities, relations, properties, and timeline events extracted from Wikipedia articles.

      Usage
    

    from datasets import load_dataset

    dataset = …

    See the full description on the dataset page: https://huggingface.co/datasets/AutomatedScientist/wiki-kg-dataset.
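    A minimal loading sketch, assuming the Hugging Face datasets library and the repository path cited above (no split names or record fields are assumed):

    from datasets import load_dataset

    # Download the dataset from the Hugging Face Hub and inspect its splits.
    dataset = load_dataset("AutomatedScientist/wiki-kg-dataset")
    print(dataset)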

  4. Wikidata

    • live.european-language-grid.eu
    json
    Updated Oct 28, 2012
    + more versions
    Cite
    (2012). Wikidata [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/7268
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 28, 2012
    License

    CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
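    As an example of machine-readable access, a single item can be fetched as JSON through the Special:EntityData endpoint; a small sketch (Q42 is used purely as an illustrative item):

    import requests

    # Fetch the full JSON record for one Wikidata item.
    resp = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.json", timeout=30)
    entity = resp.json()["entities"]["Q42"]
    print(entity["labels"]["en"]["value"])          # English label
    print(len(entity.get("claims", {})), "property groups")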

  5. KeySearchWiki

    • zenodo.org
    zip
    Updated Feb 14, 2022
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2022). KeySearchWiki [Dataset]. http://doi.org/10.5281/zenodo.5751978
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 14, 2022
    Dataset provided by
    Zenodo - http://zenodo.org/
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KeySearchWiki is a dataset for evaluating keyword search systems over Wikidata.

    The dataset was automatically generated by leveraging Wikidata and Wikipedia set categories (e.g., Category:American television directors) as data sources for both relevant entities and queries.
    Relevant entities are gathered by carefully navigating the Wikipedia set categories hierarchy in all available languages. Furthermore, those categories are refined and combined to derive more complex queries.

    Detailed information about KeySearchWiki and its generation can be found on the Github page.

  6. WikiCSSH - Computer Science Subject Headings from Wikipedia

    • databank.illinois.edu
    Updated Apr 18, 2024
    Cite
    Kanyao Han; Pingjing Yang; Shubhanshu Mishra; Jana Diesner (2024). WikiCSSH - Computer Science Subject Headings from Wikipedia [Dataset]. http://doi.org/10.13012/B2IDB-0424970_V1
    Explore at:
    Dataset updated
    Apr 18, 2024
    Authors
    Kanyao Han; Pingjing Yang; Shubhanshu Mishra; Jana Diesner
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WikiCSSH

    If you are using WikiCSSH, please cite the following:

    > Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia." In Workshop on Scientific Knowledge Graphs (SKG 2020). https://skg.kmi.open.ac.uk/SKG2020/papers/HAN_et_al_SKG_2020.pdf
    > Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH - Computer Science Subject Headings from Wikipedia". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0424970_V1

    Download the WikiCSSH files from: https://doi.org/10.13012/B2IDB-0424970_V1

    More details about the WikiCSSH project can be found at: https://github.com/uiuc-ischool-scanr/WikiCSSH

    This folder contains the following files:

    • WikiCSSH_categories.csv - Categories in WikiCSSH
    • WikiCSSH_category_links.csv - Links between categories in WikiCSSH
    • Wikicssh_core_categories.csv - Core categories as mentioned in the paper
    • WikiCSSH_category_links_all.csv - Links between categories in WikiCSSH (includes a dummy category called

  7. Classes Knowledge Graph

    • kaggle.com
    zip
    Updated Aug 31, 2024
    Cite
    Afroz (2024). Classes Knowledge Graph [Dataset]. https://www.kaggle.com/datasets/pythonafroz/dbpedia-classes-knowledge-graph
    Explore at:
    Available download formats: zip (174050111 bytes)
    Dataset updated
    Aug 31, 2024
    Authors
    Afroz
    License

    CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DBPedia Classes

    DBpedia is a knowledge graph extracted from Wikipedia, providing structured data about real-world entities and their relationships. DBpedia Classes are the core building blocks of this knowledge graph, representing different categories or types of entities.

    Key Concepts:

    • Entity: A real-world object, such as a person, place, thing, or concept.
    • Class: A group of entities that share common properties or characteristics.
    • Instance: A specific member of a class.

    Examples of DBPedia Classes:

    • Person: Represents individuals, e.g., "Barack Obama," "Albert Einstein."
    • Place: Represents locations, e.g., "Paris," "Mount Everest."
    • Organization: Represents groups, institutions, or companies, e.g., "Google," "United Nations."
    • Event: Represents occurrences, e.g., "World Cup," "French Revolution."
    • Artwork: Represents creative works, e.g., "Mona Lisa," "Star Wars."

    Hierarchy and Relationships:

    DBpedia classes often have a hierarchical structure, where subclasses inherit properties from their parent classes. For example, the class "Person" might have subclasses like "Politician," "Scientist," and "Artist."

    Relationships between classes are also important. For instance, a "Person" might have a "birthPlace" relationship with a "Place," or an "Artist" might have a "hasArtwork" relationship with an "Artwork."
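    Class membership and such relationships can be queried from the public DBpedia SPARQL endpoint. A small sketch follows; dbo:Person and dbo:birthPlace are standard DBpedia ontology terms, but the exact results depend on the current release:

    import requests

    # A few people and their birth places from DBpedia.
    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?place WHERE {
      ?person a dbo:Person ;
              dbo:birthPlace ?place .
    } LIMIT 5
    """
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=60,
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["person"]["value"], "->", row["place"]["value"])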

    Applications of DBPedia Classes:

    Semantic Search: DBPedia classes can be used to enhance search results by understanding the context and meaning of queries.

    Knowledge Graph Construction: DBPedia classes form the foundation of knowledge graphs, which can be used for various applications like question answering, recommendation systems, and data integration.

    Data Analysis: DBPedia classes can be used to analyze and extract insights from large datasets.

  8. Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Apr 2, 2020
    Cite
    Safavi, Tara; Koutra, Danai; Meij, Edgar (2020). Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with Calibration [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3738263
    Explore at:
    Dataset updated
    Apr 2, 2020
    Dataset provided by
    Bloomberg
    University of Michigan
    Authors
    Safavi, Tara; Koutra, Danai; Meij, Edgar
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains two public knowledge graph datasets used in our paper Improving the Utility of Knowledge Graph Embeddings with Calibration. Each dataset is described below.

    Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.
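    A sketch of such a random 80/10/10 split, assuming the triples.tsv layout described below and no header row:

    import pandas as pd

    # Shuffle all triples, then slice into train/validation/test.
    triples = pd.read_csv("triples.tsv", sep="\t", header=None).sample(frac=1, random_state=0)
    n = len(triples)
    train = triples.iloc[: int(0.8 * n)]
    valid = triples.iloc[int(0.8 * n): int(0.9 * n)]
    test = triples.iloc[int(0.9 * n):]

    for name, split in [("train", train), ("valid", valid), ("test", test)]:
        split.to_csv(f"{name}.tsv", sep="\t", header=False, index=False)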

    wikidata-authors

    This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.

    The files are as follows:

    entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:

    eid: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.

    label: A human-readable label of this entity (extracted from Wikidata).

    relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:

    rid: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.

    label: A human-readable label of this relation (extracted from Wikidata).

    triples.tsv: A tab-separated file of all triples in the dataset, in the form <head eid>, <rid>, <tail eid>.
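    A sketch of joining the three files into human-readable triples; the column order follows the field descriptions above, and the absence of header rows is an assumption:

    import pandas as pd

    entities = pd.read_csv("entities.tsv", sep="\t", header=None, names=["eid", "label"])
    relations = pd.read_csv("relations.tsv", sep="\t", header=None, names=["rid", "label"])
    triples = pd.read_csv("triples.tsv", sep="\t", header=None, names=["head", "rid", "tail"])

    # Map identifiers to labels for readability.
    eid2label = dict(zip(entities["eid"], entities["label"]))
    rid2label = dict(zip(relations["rid"], relations["label"]))

    for _, t in triples.head(5).iterrows():
        print(eid2label.get(t["head"]), rid2label.get(t["rid"]), eid2label.get(t["tail"]))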

    fb15krr-linked

    This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.

    The files are as follows:

    entities.tsv: A tab-separated file of all unique entities in the dataset. The fields are as follows:

    mid: The Freebase machine ID (MID) of this entity.

    wiki: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<wiki>.

    label: A human-readable label of this entity (extracted from Wikidata).

    types: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].

    relations.tsv: A tab-separated file of all unique relations in the dataset. The fields are as follows:

    label: The hierarchical Freebase label of this relation.

    triples.tsv: A tab-separated file of all triples in the dataset, in the form <head mid>, <relation label>, <tail mid>.

  9. CaLiGraph - A Large-Scale Semantic Knowledge Graph compiled from Wikipedia...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Jun 25, 2023
    Cite
    Heist, Nicolas; Paulheim, Heiko (2023). CaLiGraph - A Large-Scale Semantic Knowledge Graph compiled from Wikipedia Categories and List Pages [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3484511
    Explore at:
    Dataset updated
    Jun 25, 2023
    Dataset provided by
    University of Mannheim
    Authors
    Heist, Nicolas; Paulheim, Heiko
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CaLiGraph is a large-scale semantic knowledge graph with a rich ontology compiled from the DBpedia ontology and Wikipedia categories & list pages. For more information, visit http://caligraph.org

    Information about the uploaded files (all files are bzip2-compressed and in the N-Triples format):

    caligraph-metadata.nt.bz2 Metadata about the dataset which is described using void vocabulary.

    caligraph-ontology.nt.bz2 Class definitions, property definitions, restrictions, and labels of the CaLiGraph ontology.

    caligraph-ontology_dbpedia-mapping.nt.bz2 Mapping of classes and properties to the DBpedia ontology.

    caligraph-ontology_provenance.nt.bz2 Provenance information about classes (i.e. which Wikipedia category or list page has been used to create this class).

    caligraph-instances_types.nt.bz2 Definition of instances and (non-transitive) types.

    caligraph-instances_transitive-types.nt.bz2 Transitive types for instances (can also be induced by a reasoner).

    caligraph-instances_labels.nt.bz2 Labels for instances.

    caligraph-instances_relations.nt.bz2 Relations between instances derived from the class restrictions of the ontology (can also be induced by a reasoner).

    caligraph-instances_dbpedia-mapping.nt.bz2 Mapping of instances to respective DBpedia instances.

    caligraph-instances_provenance.nt.bz2 Provenance information about instances (e.g. if the instance has been extracted from a Wikipedia list page).

    dbpedia_caligraph-instances.nt.bz2 Additional instances of CaLiGraph that are not in DBpedia. Note: this file is not part of CaLiGraph but should rather be used as an extension to DBpedia. The triples use the DBpedia namespace and can thus be used to directly extend DBpedia.

    dbpedia_caligraph-types.nt.bz2 Additional types of CaLiGraph that are not in DBpedia. Note: this file is not part of CaLiGraph but should rather be used as an extension to DBpedia. The triples use the DBpedia namespace and can thus be used to directly extend DBpedia.

    dbpedia_caligraph-relations.nt.bz2 Additional relations of CaLiGraph that are not in DBpedia. Note: this file is not part of CaLiGraph but should rather be used as an extension to DBpedia. The triples use the DBpedia namespace and can thus be used to directly extend DBpedia.
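    The bzip2-compressed N-Triples files can be inspected with standard RDF tooling; a small sketch using rdflib follows (it reads a whole file into memory, so it is only practical for the smaller files):

    import bz2
    from rdflib import Graph

    g = Graph()
    with bz2.open("caligraph-ontology.nt.bz2", "rt", encoding="utf-8") as f:
        # Parse the decompressed N-Triples text into an in-memory graph.
        g.parse(data=f.read(), format="nt")

    print(len(g), "triples loaded")
    for s, p, o in list(g)[:5]:
        print(s, p, o)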

    Changelog

    v3.1.1

    Fixed an encoding issue in caligraph-ontology.nt.bz2

    v3.1.0

    Fixed several issues related to ontology consistency and structure

    v3.0.0

    Added functionality to group mentions of unknown entities into distinct entities

    v2.1.0

    Fixed an error that led to a class inheriting from a disjoint class

    Introduced owl:ObjectProperty and owl:DataProperty instead of rdf:Property

    Several cosmetic fixes

    v2.0.2

    Fixed incorrect formatting of some properties

    v2.0.1

    Better entity extraction and representation

    Small cosmetic fixes

    v2.0.0

    Entity extraction from arbitrary tables and enumerations in Wikipedia pages

    v1.4.0

    BERT-based recognition of subject entities and improved language models from spaCy 3.0

    v1.3.1

    Fixed minor encoding errors and improved formatting

    v1.3.0

    CaLiGraph is now based on a recent version of Wikipedia and DBpedia from November 2020

    v1.1.0

    Improved the CaLiGraph type hierarchy

    Many small bugfixes and improvements

    v1.0.9

    Additional alternative labels for CaLiGraph instances

    v1.0.8

    Small cosmetic changes to URIs to be closer to DBpedia URIs

    v1.0.7

    Mappings from CaLiGraph classes to DBpedia classes are now realised via rdfs:subClassOf instead of owl:equivalentClass

    Entities are now URL-encoded to improve accessibility

    v1.0.6

    Fixed a bug in the ontology creation step that led to a substantially lower number of sub-type relationships than actually exist. The new version provides a richer type hierarchy, which also leads to an increased number of types for resources.

    v1.0.5

    Fixed a bug that declared CaLiGraph predicates as subclasses of owl:Predicate instead of instances of the type owl:Predicate.

  10. Wiki Sentences

    • kaggle.com
    zip
    Updated Nov 15, 2019
    Cite
    Ved Prakash (2019). Wiki Sentences [Dataset]. https://www.kaggle.com/datasets/ved1104/wiki-sentences
    Explore at:
    Available download formats: zip (94294 bytes)
    Dataset updated
    Nov 15, 2019
    Authors
    Ved Prakash
    Description

    Dataset

    This dataset was created by Ved Prakash

    Contents

  11. KeySearchWiki-cache

    • zenodo.org
    zip
    Updated Jun 9, 2023
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2023). KeySearchWiki-cache [Dataset]. http://doi.org/10.5281/zenodo.4965398
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Zenodo - http://zenodo.org/
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of SQLite database files containing all the data retrieved from the Wikidata/Wikipedia endpoints via the SPARQL and MediaWiki APIs in the context of KeySearchWiki dataset generation.

    Detailed information about KeySearchWiki can be found on the Github page.

  12. RKD-Knowledge-Graph

    • rkd.triply.cc
    application/n-quads +5
    Updated Nov 16, 2025
    Cite
    RKD (2025). RKD-Knowledge-Graph [Dataset]. https://rkd.triply.cc/rkd/RKD-Knowledge-Graph
    Explore at:
    Available download formats: application/sparql-results+json, ttl, application/n-quads, application/n-triples, jsonld, application/trig
    Dataset updated
    Nov 16, 2025
    Dataset authored and provided by
    RKD
    License

    Open Data Commons Attribution License (ODC-By) v1.0 - https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    We manage unique archives, documentation and photographic material and the largest art historical library on Western art from the Late Middle Ages to the present, with the focus on Netherlandish art. Our collections cover not only paintings, drawings and sculptures, but also monumental art, modern media and design. The collections are present in both digital and analogue form (the latter in our study rooms).

    This knowledge graph represents our collection as Linked Data, primarily using the CIDOC-CRM and LinkedArt vocabularies.

  13. Wikidata Causal Event Triple Data

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 7, 2023
    Cite
    Sola; Debarun; Oktie (2023). Wikidata Causal Event Triple Data [Dataset]. http://doi.org/10.5281/zenodo.7196049
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Zenodo - http://zenodo.org/
    Authors
    Sola; Debarun; Oktie
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains triples curated from Wikidata surrounding news events with causal relations, and is released as part of our WWW'23 paper, "Event Prediction using Case-Based Reasoning over Knowledge Graphs".

    Starting from a set of classes that we consider to be types of "events", we queried Wikidata to collect entities that were an instanceOf an event class and that were connected to another such event entity by a causal triple (https://www.wikidata.org/wiki/Wikidata:List_of_properties/causality). For all such cause-effect event pairs, we then collected a 3-hop neighborhood of outgoing triples.
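    A comparable query can be issued against the public Wikidata SPARQL endpoint. The sketch below uses the "has cause" property P828 purely for illustration; the actual event classes and causal properties used for this dataset are described in the paper and on the linked Wikidata page:

    import requests

    # Pairs of entities connected by a "has cause" statement.
    query = """
    SELECT ?effect ?cause WHERE {
      ?effect wdt:P828 ?cause .
    } LIMIT 10
    """
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kg-dataset-browsing-example/0.1"},
        timeout=60,
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["effect"]["value"], "caused by", row["cause"]["value"])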

  14. Kensho Derived Wikimedia Dataset

    • kaggle.com
    zip
    Updated Jan 24, 2020
    Cite
    Kensho R&D (2020). Kensho Derived Wikimedia Dataset [Dataset]. https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
    Explore at:
    Available download formats: zip (8760044227 bytes)
    Dataset updated
    Jan 24, 2020
    Authors
    Kensho R&D
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) - https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Kensho Derived Wikimedia Dataset

    Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.

    These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.


    This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202 indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.

    Example Notebooks

    Dive right in by checking out some of our example notebooks:

    Updates / Changelog

    • initial release 2020-01-31

    File Summary

    • Wikipedia
      • page.csv (page metadata and Wikipedia-to-Wikidata mapping)
      • link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)
    • Wikidata
      • item.csv (item labels and descriptions in English)
      • item_aliases.csv (item aliases in English)
      • property.csv (property labels and descriptions in English)
      • property_aliases.csv (property aliases in English)
      • statements.csv (truthy qpq statements)

    Three Layers of Data

    The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.
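    A sketch of opening the two Wikipedia-layer files listed above; no field names are assumed, the snippet only inspects the schema that ships with the dataset:

    import json
    import pandas as pd

    # Page metadata, including the Wikipedia-to-Wikidata mapping.
    pages = pd.read_csv("page.csv")
    print(pages.columns.tolist())

    # Stream the link-annotated plain text one page at a time.
    with open("link_annotated_text.jsonl", encoding="utf-8") as f:
        first_page = json.loads(next(f))
    print(first_page.keys())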


    Wikipedia Sample

    The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:

  15. DBkWik Plus Plus

    • figshare.com
    bin
    Updated Sep 29, 2022
    Cite
    Sven Hertling; Heiko Paulheim (2022). DBkWik Plus Plus [Dataset]. http://doi.org/10.6084/m9.figshare.20407864.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Figshare - http://figshare.com/
    Authors
    Sven Hertling; Heiko Paulheim
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large knowledge graphs like DBpedia and YAGO are always based on the same source, namely Wikipedia. But there are more wikis that contain information about long-tail entities, such as wiki hosting platforms like Fandom. In this paper, we present the approach and analysis of DBkWik++, a fused knowledge graph built from thousands of wikis. A modified version of the DBpedia framework is applied to each wiki, which results in many isolated knowledge graphs. With an incremental merge-based approach, we reuse one-to-one matching systems to solve the multi-source KG matching task. Based on this alignment we create a consolidated knowledge graph with more than 15 million instances.

  16. wiki sentences and got dataset for knowledge graph

    • kaggle.com
    zip
    Updated Apr 24, 2023
    Cite
    Hiten (2023). wiki sentences and got dataset for knowledge graph [Dataset]. https://www.kaggle.com/datasets/hitens/kgdataset
    Explore at:
    Available download formats: zip (30939 bytes)
    Dataset updated
    Apr 24, 2023
    Authors
    Hiten
    Description

    Dataset

    This dataset was created by Hiten

    Contents

  17. Wikidata5m - knowledge graph (inductive)

    • data.niaid.nih.gov
    Updated Oct 3, 2021
    Cite
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang (2021). Wikidata5m - knowledge graph (inductive) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5546386
    Explore at:
    Dataset updated
    Oct 3, 2021
    Dataset provided by
    Tsinghua University
    Mila - Quebec AI Institute
    Authors
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wikidata5m is a million-scale knowledge graph dataset with an aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.

    This file contains the inductive split of Wikidata5m knowledge graph.

  18. WikiEvents Dataset from January 2020 to December 2022

    • fdr.uni-hamburg.de
    zip
    Updated Feb 7, 2023
    Cite
    Michaelis, Lars (2023). WikiEvents Dataset from January 2020 to December 2022 [Dataset]. http://doi.org/10.25592/uhhfdm.11447
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 7, 2023
    Authors
    Michaelis, Lars
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WikiEvents is a knowledge-graph-based dataset for NLP and event-related machine learning tasks.

    This dataset includes RDF data in JSON-LD about events between January 2020 and December 2022. It was extracted from the Wikipedia Current events portal, Wikidata, OpenStreetMap Nominatim and Falcon 2.0. The extractor is available on GitHub under semantic-systems/current-events-to-kg.

    The RDF data for each month is split into four graph modules:

    • The base graph module contains events and event summaries, with references from named entities to Wikipedia articles.
    • The ohg graph module contains all one-hop graphs (ohg) around the referenced Wikidata entities.
    • The osm graph module contains spatial data from OpenStreetMap (OSM).
    • The raw graph module contains the raw HTML objects of events and article infoboxes.

    This repository additionally includes two JSON files with training samples used for entity linking and event-related location extraction. They were created using queries to the WikiEvents dataset uploaded into this repository.
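    Since the modules are plain JSON-LD, they can be loaded with generic RDF tooling; a sketch using rdflib follows (the file name is a placeholder for whichever graph module file is extracted from a monthly archive; rdflib 6+ includes a JSON-LD parser):

    from rdflib import Graph

    g = Graph()
    # Parse one extracted JSON-LD graph module into an in-memory graph.
    g.parse("base_2022-12.jsonld", format="json-ld")

    print(len(g), "triples")
    for s, p, o in list(g)[:5]:
        print(s, p, o)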

  19. PheKnowLator Human Disease KG Benchmarks: Class-Inverse Relations-OWLNETS

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 24, 2023
    + more versions
    Cite
    Tiffany J Callahan (2023). PheKnowLator Human Disease KG Benchmarks: Class-Inverse Relations-OWLNETS [Dataset]. http://doi.org/10.5281/zenodo.7029922
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo - http://zenodo.org/
    Authors
    Tiffany J Callahan
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PKT Human Disease Knowledge Graph Benchmark Builds (v2.0.0)

    Build Date: May 10, 2020

    Please note that all resources linked below redirect to a publicly accessible Google Cloud Storage bucket where all data are available. Routing users from this wiki page is perfectly safe and allows us to avoid requiring users to have a Google account and log in to download data. If you have any questions or concerns, please email the project maintainer at callahantiff@gmail.com.

    If you have a Google account you can access the data directly via 👉 here

    📚 For additional information on the builds please see the following README
    🗂 For additional information on the KG file types please see the following Wiki page

    🚨 AVAILABLE FILES 🚨

    Available KG benchmark files are zipped and listed below. For additional details on what each file contains, please see the associated Wiki page 👉 here.

  20. QBLink-KG: QBLink Adapted to DBpedia Knowledge Graph

    • figshare.com
    json
    Updated Feb 21, 2024
    Cite
    Mona Zamiri; Yao Qiang; Fedor Nikolaev; Dongxiao Zhu; Alexander Kotov (2024). QBLink-KG: QBLink Adapted to DBpedia Knowledge Graph [Dataset]. http://doi.org/10.6084/m9.figshare.25256290.v3
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Mona Zamiri; Yao Qiang; Fedor Nikolaev; Dongxiao Zhu; Alexander Kotov
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    QBLink-KG is a modified version of QBLink, which is a high-quality benchmark for evaluating conversational understanding of Wikipedia content. QBLink consists of sequences of up to three hand-crafted queries, with responses being single named entities that match the titles of Wikipedia articles.

    For QBLink-KG, the English subset of the DBpedia snapshot from September 2021 was used as the target Knowledge Graph. QBLink answers provided as the titles of Wikipedia infoboxes can be easily mapped to DBpedia entity URIs - if the corresponding entities are present in DBpedia - since DBpedia is constructed through the extraction of information from Wikipedia infoboxes.

    QBLink, in its original format, is not directly applicable to Conversational Entity Retrieval from a Knowledge Graph (CER-KG) because knowledge graphs contain considerably less information than Wikipedia. A named entity serving as an answer to a QBLink query may not be present as an entity in DBpedia. To modify QBLink for CER over DBpedia, we implemented two filtering steps: 1) we removed all queries for which the wiki_page field is empty, or the answer cannot be mapped to a DBpedia entity or does not match a Wikipedia page; 2) for the evaluation of a model with specific techniques for entity linking and candidate selection, we excluded queries with answers that do not belong to the set of candidate entities derived using that model.

    The original QBLink dataset files before filtering are:

    • QBLink-train.json
    • QBLink-dev.json
    • QBLink-test.json

    The final QBLink-KG files after filtering are:

    • QBLink-Filtered-train.json
    • QBLink-Filtered-dev.json
    • QBLink-Filtered-test.json

    We used the following references to construct QBLink-KG:

    Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. A dataset and baselines for sequential open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1077–1083, Brussels, Belgium. Association for Computational Linguistics.

    https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-09

    Lehmann, Jens et al. 'DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia'. 1 Jan. 2015: 167–195.

    For more details about QBLink-KG, please read our research paper: Zamiri, Mona, et al. "Benchmark and Neural Architecture for Conversational Entity Retrieval from a Knowledge Graph", The Web Conference 2024.
