47 datasets found
  1. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +1 more
    pdf, tsv
    Updated Jul 17, 2024
    + more versions
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, it offers a large amount of data on all of its contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all in TSV format, built under a relational structure. The page file acts as the core of the dataset; 4 files contain entities related to the Wikipedia pages (the category, url, pub and page_property files), and 4 further files act as "intermediate tables" that connect the pages both with those entities and with other pages (the page_category, page_url, page_pub and page_link files).
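
    As a quick orientation, the sketch below shows one way these relational files might be joined with pandas. The file names (page.tsv, category.tsv, page_category.tsv) and join keys (page_id, category_id) are assumptions for illustration; the Dataset_summary document describes the actual schema.

    import pandas as pd

    # Assumed file names; see Dataset_summary for the real ones.
    pages = pd.read_csv("page.tsv", sep="\t")                    # core table: one row per Wikipedia page
    categories = pd.read_csv("category.tsv", sep="\t")           # category entities
    page_category = pd.read_csv("page_category.tsv", sep="\t")   # intermediate table linking pages to categories

    # Join pages to their categories through the intermediate table
    # ("page_id" and "category_id" are assumed join keys).
    pages_with_categories = (
        page_category
        .merge(pages, on="page_id", how="left")
        .merge(categories, on="category_id", how="left")
    )
    print(pages_with_categories.head())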

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  2. Wikipedia Knowledge Graph

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Vivian Silva (2023). Wikipedia Knowledge Graph [Dataset]. http://doi.org/10.6084/m9.figshare.9896399.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Vivian Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Knowledge graph generated from definitions extracted from Wikipedia articles.

  3. Wikidata5M Dataset

    • paperswithcode.com
    Updated Jun 14, 2023
    + more versions
    Cite
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang (2023). Wikidata5M Dataset [Dataset]. https://paperswithcode.com/dataset/wikidata5m
    Explore at:
    Dataset updated
    Jun 14, 2023
    Authors
    Xiaozhi Wang; Tianyu Gao; Zhaocheng Zhu; Zhengyan Zhang; Zhiyuan Liu; Juanzi Li; Jian Tang
    Description

    Wikidata5m is a million-scale knowledge graph dataset with an aligned corpus. It integrates the Wikidata knowledge graph and Wikipedia pages. Each entity in Wikidata5m is described by a corresponding Wikipedia page, which enables the evaluation of link prediction over unseen entities.

    The dataset is distributed as a knowledge graph, a corpus, and aliases. We provide both transductive and inductive data splits used in the original paper.
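
    As an illustration of working with the knowledge graph part, here is a minimal loading sketch for one triple split. It assumes a tab-separated head/relation/tail text file; the file name wikidata5m_transductive_train.txt is an assumption and may differ in the actual distribution.

    from collections import Counter

    triples = []
    with open("wikidata5m_transductive_train.txt", encoding="utf-8") as f:
        for line in f:
            # assumed layout: one triple per line, tab-separated Wikidata IDs
            head, relation, tail = line.rstrip("\n").split("\t")
            triples.append((head, relation, tail))

    relation_counts = Counter(relation for _, relation, _ in triples)
    print(f"{len(triples)} triples, {len(relation_counts)} distinct relations")
    print(relation_counts.most_common(5))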

  4. CaLiGraph - A Large-Scale Semantic Knowledge Graph compiled from Wikipedia Categories and List Pages

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 25, 2023
    Cite
    Paulheim, Heiko (2023). CaLiGraph - A Large-Scale Semantic Knowledge Graph compiled from Wikipedia Categories and List Pages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3484511
    Explore at:
    Dataset updated
    Jun 25, 2023
    Dataset provided by
    Paulheim, Heiko
    Heist, Nicolas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CaLiGraph is a large-scale semantic knowledge graph with a rich ontology compiled from the DBpedia ontology and Wikipedia categories & list pages. For more information, visit http://caligraph.org

    Information about the uploaded files (all files are bzip2-compressed and in N-Triples format; a minimal loading sketch follows the file list below):

    caligraph-metadata.nt.bz2 Metadata about the dataset which is described using void vocabulary.

    caligraph-ontology.nt.bz2 Class definitions, property definitions, restrictions, and labels of the CaLiGraph ontology.

    caligraph-ontology_dbpedia-mapping.nt.bz2 Mapping of classes and properties to the DBpedia ontology.

    caligraph-ontology_provenance.nt.bz2 Provenance information about classes (i.e. which Wikipedia category or list page has been used to create this class).

    caligraph-instances_types.nt.bz2 Definition of instances and (non-transitive) types.

    caligraph-instances_transitive-types.nt.bz2 Transitive types for instances (can also be induced by a reasoner).

    caligraph-instances_labels.nt.bz2 Labels for instances.

    caligraph-instances_relations.nt.bz2 Relations between instances derived from the class restrictions of the ontology (can also be induced by a reasoner).

    caligraph-instances_dbpedia-mapping.nt.bz2 Mapping of instances to respective DBpedia instances.

    caligraph-instances_provenance.nt.bz2 Provenance information about instances (e.g. if the instance has been extracted from a Wikipedia list page).

    dbpedia_caligraph-instances.nt.bz2 Additional instances of CaLiGraph that are not in DBpedia. ! This file is not part of CaLiGraph but should rather be used as an extension to DBpedia. The triples use the DBpedia namespace and can thus be used to directly extend DBpedia. !

    dbpedia_caligraph-types.nt.bz2 Additional types of CaLiGraph that are not in DBpedia. ! This file is not part of CaLiGraph but should rather be used as an extension to DBpedia. The triples use the DBpedia namespace and can thus be used to directly extend DBpedia. !

    dbpedia_caligraph-relations.nt.bz2 Additional relations of CaLiGraph that are not in DBpedia. ! This file is not part of CaLiGraph but should rather be used as an extension to DBpedia. The triples use the DBpedia namespace and can thus be used to directly extend DBpedia. !
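
    A minimal loading sketch, assuming the rdflib package is available: it parses one of the smaller bzip2-compressed N-Triples files listed above. The large instance files may need a streaming parser instead of reading everything into memory.

    import bz2
    from itertools import islice

    from rdflib import Graph
    from rdflib.namespace import RDFS

    g = Graph()
    # Decompress the N-Triples file and parse it; "nt" is rdflib's N-Triples format id.
    with bz2.open("caligraph-ontology.nt.bz2", "rt", encoding="utf-8") as f:
        g.parse(data=f.read(), format="nt")

    print(f"{len(g)} triples in the ontology file")
    # Print a few class labels as a sanity check.
    for subject, _, label in islice(g.triples((None, RDFS.label, None)), 5):
        print(subject, "->", label)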

    Changelog

    v3.1.1

    Fixed an encoding issue in caligraph-ontology.nt.bz2

    v3.1.0

    Fixed several issues related to ontology consistency and structure

    v3.0.0

    Added functionality to group mentions of unknown entities into distinct entities

    v2.1.0

    Fixed an error that led to a class inheriting from a disjoint class

    Introduced owl:ObjectProperty and owl:DataProperty instead of rdf:Property

    Several cosmetic fixes

    v2.0.2

    Fixed incorrect formatting of some properties

    v2.0.1

    Better entity extraction and representation

    Small cosmetic fixes

    v2.0.0

    Entity extraction from arbitrary tables and enumerations in Wikipedia pages

    v1.4.0

    BERT-based recognition of subject entities and improved language models from spaCy 3.0

    v1.3.1

    Fixed minor encoding errors and improved formatting

    v1.3.0

    CaLiGraph is now based on a recent version of Wikipedia and DBpedia from November 2020

    v1.1.0

    Improved the CaLiGraph type hierarchy

    Many small bugfixes and improvements

    v1.0.9

    Additional alternative labels for CaLiGraph instances

    v1.0.8

    Small cosmetic changes to URIs to be closer to DBpedia URIs

    v1.0.7

    Mappings from CaLiGraph classes to DBpedia classes are now realised via rdfs:subClassOf instead of owl:equivalentClass

    Entities are now URL-encoded to improve accessibility

    v1.0.6

    Fixed a bug in the ontology creation step that led to a substantially lower number of sub-type relationships than actually exist. The new version provides a richer type hierarchy that also leads to an increased number of types for resources.

    v1.0.5

    Fixed a bug that declared CaLiGraph predicates as subclasses of owl:Predicate instead of being of type owl:Predicate.

  5. Wikipedia Articles Dataset

    • opendatabay.com
    Updated May 25, 2025
    Cite
    Bright Data (2025). Wikipedia Articles Dataset [Dataset]. https://www.opendatabay.com/data/premium/b6292674-e94d-4a7e-93c0-00cf1474ffdd
    Explore at:
    Dataset updated
    May 25, 2025
    Dataset authored and provided by
    Bright Data
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    Access a wealth of information, including article titles, raw text, images, and structured references. Popular use cases include knowledge extraction, trend analysis, and content development.

    Use our Wikipedia Articles dataset to access a vast collection of articles across a wide range of topics, from history and science to culture and current events. This dataset offers structured data on articles, categories, and revision histories, enabling deep analysis into trends, knowledge gaps, and content development.

    Tailored for researchers, data scientists, and content strategists, this dataset allows for in-depth exploration of article evolution, topic popularity, and interlinking patterns. Whether you are studying public knowledge trends, performing sentiment analysis, or developing content strategies, the Wikipedia Articles dataset provides a rich resource to understand how information is shared and consumed globally.

    Dataset Features
    - url: Direct URL to the original Wikipedia article.
    - title: The title or name of the Wikipedia article.
    - table_of_contents: A list or structure outlining the article's sections and hierarchy.
    - raw_text: Unprocessed full text content of the article.
    - cataloged_text: Cleaned and structured version of the article’s content, optimized for analysis.
    - images: Links or data on images embedded in the article.
    - see_also: Related articles linked under the “See Also” section.
    - references: Sources cited in the article for credibility.
    - external_links: Links to external websites or resources mentioned in the article.
    - categories: Tags or groupings classifying the article by topic or domain.
    - timestamp: Last edit date or revision time of the article snapshot.

    Distribution
    - Data Volume: 11 columns and 2.19M rows
    - Format: CSV (a minimal loading sketch follows below)
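
    A minimal loading sketch with pandas, using the column names listed under Dataset Features; the file name wikipedia_articles.csv is an assumption, and the delivery is read in chunks so 2.19M rows do not have to fit in memory at once.

    import pandas as pd

    columns = ["url", "title", "table_of_contents", "raw_text", "cataloged_text",
               "images", "see_also", "references", "external_links", "categories", "timestamp"]

    # Stream the CSV delivery in chunks and count the article records.
    n_rows = 0
    for chunk in pd.read_csv("wikipedia_articles.csv", usecols=columns, chunksize=100_000):
        n_rows += len(chunk)
    print(f"{n_rows} article records")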

    Usage
    This dataset supports a wide range of applications:
    - Knowledge Extraction: Identify key entities, relationships, or events from Wikipedia content.
    - Content Strategy & SEO: Discover trending topics and content gaps.
    - Machine Learning: Train NLP models (e.g., summarisation, classification, QA systems).
    - Historical Trend Analysis: Study how public interest in topics changes over time.
    - Link Graph Modeling: Understand how information is interconnected.

    Coverage
    - Geographic Coverage: Global (multi-language Wikipedia versions also available)
    - Time Range: Continuous updates; snapshots available from early 2000s to present.

    License

    CUSTOM

    Please review the respective licenses below:

    1. Data Provider's License

    Who Can Use It
    - Data Scientists: For training or testing NLP and information retrieval systems.
    - Researchers: For computational linguistics, social science, or digital humanities.
    - Businesses: To enhance AI-powered content tools or customer insight platforms.
    - Educators/Students: For building projects, conducting research, or studying knowledge systems.

    Suggested Dataset Names
    1. Wikipedia Corpus+
    2. Wikipedia Stream Dataset
    3. Wikipedia Knowledge Bank
    4. Open Wikipedia Dataset

    Pricing

    Based on Delivery frequency

    ~Up to $0.0025 per record. Min order $250

    Approximately 283 new records are added each month. Approximately 1.12M records are updated each month. Get the complete dataset each delivery, including all records. Retrieve only the data you need with the flexibility to set Smart Updates.

    • Monthly

    New snapshot each month, 12 snapshots/year Paid monthly

    • Quarterly

    New snapshot each quarter, 4 snapshots/year Paid quarterly

    • Bi-annual

    New snapshot every 6 months, 2 snapshots/year Paid twice-a-year

    • One-time purchase

    New snapshot one-time delivery Paid once

  6. KeySearchWiki

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    zip
    Updated Feb 14, 2022
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2022). KeySearchWiki [Dataset]. http://doi.org/10.5281/zenodo.4955200
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 14, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KeySearchWiki is a dataset for evaluating keyword search systems over Wikidata.

    The dataset was automatically generated by leveraging Wikidata and Wikipedia set categories (e.g., Category:American television directors) as data sources for both relevant entities and queries.
    Relevant entities are gathered by carefully navigating the Wikipedia set categories hierarchy in all available languages. Furthermore, those categories are refined and combined to derive more complex queries.

    Detailed information about KeySearchWiki and its generation can be found on the GitHub page.

  7. RKD-Knowledge-Graph

    • rkd.triply.cc
    application/n-quads +5
    Updated Jul 2, 2025
    Cite
    RKD (2025). RKD-Knowledge-Graph [Dataset]. https://rkd.triply.cc/rkd/RKD-Knowledge-Graph
    Explore at:
    Available download formats: application/sparql-results+json, ttl, application/n-quads, application/n-triples, jsonld, application/trig
    Dataset updated
    Jul 2, 2025
    Dataset authored and provided by
    RKD
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    We manage unique archives, documentation and photographic material and the largest art historical library on Western art from the Late Middle Ages to the present, with the focus on Netherlandish art. Our collections cover not only paintings, drawings and sculptures, but also monumental art, modern media and design. The collections are present in both digital and analogue form (the latter in our study rooms).

    This knowledge graph represents our collection as Linked Data, primarily using the CIDOC-CRM and LinkedArt vocabularies.

  8. KeySearchWiki-experiments

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 14, 2022
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2022). KeySearchWiki-experiments [Dataset]. http://doi.org/10.5281/zenodo.5761138
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 14, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experiment results together with queries, runs, and relevance judgments produced in the context of evaluating different retrieval methods using the KeySearchWiki dataset.

    Detailed information about KeySearchWiki and its generation can be found on the GitHub page.

  9. WikiOFGraph Dataset

    • paperswithcode.com
    Updated Sep 10, 2024
    Cite
    Daehee Kim; Deokhyung Kang; Sangwon Ryu; Gary Geunbae Lee (2024). WikiOFGraph Dataset [Dataset]. https://paperswithcode.com/dataset/wikiofgraph
    Explore at:
    Dataset updated
    Sep 10, 2024
    Authors
    Daehee Kim; Deokhyung Kang; Sangwon Ryu; Gary Geunbae Lee
    Description

    We introduce WikiOFGraph, a novel large-scale, domain-diverse dataset synthesized by LLMs, ensuring superior graph-text consistency to advance general-domain graph-to-text (G2T) generation.

    The scarcity of high-quality, general-domain G2T generation datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages a Large Language Model (LLM) and Data-QuestEval.

    Potential use cases: general-domain knowledge graph-to-text generation.

  10. KeySearchWiki-cache

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 9, 2023
    Cite
    Leila Feddoul; Frank Löffler; Sirko Schindler (2023). KeySearchWiki-cache [Dataset]. http://doi.org/10.5281/zenodo.5752018
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Leila Feddoul; Frank Löffler; Sirko Schindler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of SQLite database files containing all the data retrieved from the Wikidata JSON dump and Wikipedia SQL dumps of 2021-09-20 in the context of KeySearchWiki dataset generation.

    Detailed information about KeySearchWiki can be found on the GitHub page.

  11. Knowledge Graph

    • phys-techsciences.datastations.nl
    pdf, zip
    Updated Jul 31, 2018
    Cite
    Wikipedia (2018). Knowledge Graph [Dataset]. http://doi.org/10.17026/DANS-ZQA-URC7
    Explore at:
    Available download formats: zip (11279), pdf (246946)
    Dataset updated
    Jul 31, 2018
    Dataset provided by
    DANS Data Station Physical and Technical Sciences
    Authors
    Wikipedia
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This Wikipedia entry describes the Knowledge Graph, a knowledge base by Google that enhances the search engine's results by gathering information from a variety of sources. The entry consists of the history, a description, criticism, references, and links to external sources. Modified: 2018-06-30. This Wikipedia article is deposited in EASY in order to assign a persistent identifier to it.

  12. WikiCausal Corpus for Evaluation of Causal Knowledge Graph Construction

    • data.niaid.nih.gov
    Updated Jun 14, 2023
    Cite
    Hassanzadeh, Oktie (2023). WikiCausal Corpus for Evaluation of Causal Knowledge Graph Construction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7897995
    Explore at:
    Dataset updated
    Jun 14, 2023
    Dataset authored and provided by
    Hassanzadeh, Oktie
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Documentation on the data format and how it can be used can be found at https://github.com/IBM/wikicausal, as well as in our paper:

    @unpublished{,
      author = {Oktie Hassanzadeh and Mark Feblowitz},
      title  = {{WikiCausal}: Corpus and Evaluation Framework for Causal Knowledge Graph Construction},
      year   = {2023},
      doi    = {10.5281/zenodo.7897996}
    }

    Corpus derived from Wikipedia and Wikidata.

    Refer to Wikipedia and Wikidata license and terms of use for more details:

    Permission is granted to copy, distribute and/or modify Wikipedia's text under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License and, unless otherwise noted, the GNU Free Documentation License, unversioned, with no invariant sections, front-cover texts, or back-cover texts.

    A copy of the Creative Commons Attribution-ShareAlike 3.0 Unported License is included in the section entitled "Wikipedia:Text of Creative Commons Attribution-ShareAlike 3.0 Unported License"

    A copy of the GNU Free Documentation License is included in the section entitled "GNU Free Documentation License".

    Content on Wikipedia is covered by disclaimers.

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  13. Uncovering the Semantics of Wikipedia Categories - Axioms and Assertions

    • zenodo.org
    • data.niaid.nih.gov
    bz2, csv
    Updated Jan 24, 2020
    Cite
    Nicolas Heist; Heiko Paulheim (2020). Uncovering the Semantics of Wikipedia Categories - Axioms and Assertions [Dataset]. http://doi.org/10.5281/zenodo.3482775
    Explore at:
    Available download formats: bz2, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Nicolas Heist; Heiko Paulheim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Resulting axioms and assertions from applying the Cat2Ax approach to the DBpedia knowledge graph.
    The methodology is described in the conference publication "N. Heist, H. Paulheim: Uncovering the Semantics of Wikipedia Categories, International Semantic Web Conference, 2019".

  14. CoDEx Small

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Sep 22, 2022
    + more versions
    Cite
    University of Michigan (2022). CoDEx Small [Dataset]. https://opendatalab.com/OpenDataLab/CoDEx_Small
    Explore at:
    Available download formats: zip (299981270 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    University of Michigan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.

  15. DBkWik Plus Plus

    • figshare.com
    bin
    Updated Sep 29, 2022
    Cite
    Sven Hertling; Heiko Paulheim (2022). DBkWik Plus Plus [Dataset]. http://doi.org/10.6084/m9.figshare.20407864.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Sven Hertling; Heiko Paulheim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large knowledge graphs like DBpedia and YAGO are always based on the same source, namely Wikipedia. But there are many more wikis that contain information about long-tail entities, for example on wiki hosting platforms like Fandom. In this paper, we present the approach and analysis of DBkWik++, a fused knowledge graph built from thousands of wikis. A modified version of the DBpedia framework is applied to each wiki, which results in many isolated knowledge graphs. With an incremental merge-based approach, we reuse one-to-one matching systems to solve the multi-source KG matching task. Based on this alignment, we create a consolidated knowledge graph with more than 15 million instances.

  16. QBLink-KG: QBLink Adapted to DBpedia Knowledge Graph

    • figshare.com
    json
    Updated Feb 21, 2024
    Cite
    Mona Zamiri; Yao Qiang; Fedor Nikolaev; Dongxiao Zhu; Alexander Kotov (2024). QBLink-KG: QBLink Adapted to DBpedia Knowledge Graph [Dataset]. http://doi.org/10.6084/m9.figshare.25256290.v3
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Mona Zamiri; Yao Qiang; Fedor Nikolaev; Dongxiao Zhu; Alexander Kotov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    QBLink-KG is a modified version of QBLink, a high-quality benchmark for evaluating conversational understanding of Wikipedia content. QBLink consists of sequences of up to three hand-crafted queries, with responses being single named entities that match the titles of Wikipedia articles.

    For QBLink-KG, the English subset of the DBpedia snapshot from September 2021 was used as the target knowledge graph. QBLink answers provided as the titles of Wikipedia infoboxes can easily be mapped to DBpedia entity URIs - if the corresponding entities are present in DBpedia - since DBpedia is constructed by extracting information from Wikipedia infoboxes.

    QBLink, in its original format, is not directly applicable to Conversational Entity Retrieval from a Knowledge Graph (CER-KG), because knowledge graphs contain considerably less information than Wikipedia: a named entity serving as an answer to a QBLink query may not be present as an entity in DBpedia. To adapt QBLink for CER over DBpedia, we implemented two filtering steps: 1) we removed all queries for which the wiki_page field is empty, or whose answer cannot be mapped to a DBpedia entity or does not match a Wikipedia page (a minimal sketch of this step is shown below); 2) for the evaluation of a model with specific techniques for entity linking and candidate selection, we excluded queries with answers that do not belong to the set of candidate entities derived using that model.

    The original QBLink dataset files before filtering are:
    - QBLink-train.json
    - QBLink-dev.json
    - QBLink-test.json

    The final QBLink-KG files after filtering are:
    - QBLink-Filtered-train.json
    - QBLink-Filtered-dev.json
    - QBLink-Filtered-test.json

    References used to construct QBLink-KG:
    - Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. A dataset and baselines for sequential open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1077–1083, Brussels, Belgium. Association for Computational Linguistics.
    - https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-09
    - Lehmann, Jens et al. 'DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia'. 1 Jan. 2015: 167–195.

    For more details about QBLink-KG, please read our research paper: Zamiri, Mona, et al. "Benchmark and Neural Architecture for Conversational Entity Retrieval from a Knowledge Graph", The Web Conference 2024.
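
    A hedged sketch of filtering step 1, assuming QBLink-train.json holds a list of query records with a wiki_page field. The answer-to-DBpedia mapping is represented here by a hypothetical lookup set; the actual mapping uses the DBpedia snapshot from September 2021.

    import json

    # Hypothetical: titles of entities present in the DBpedia 2021-09 snapshot,
    # e.g. loaded from a local extract of the snapshot.
    dbpedia_entity_titles = set()

    def maps_to_dbpedia_entity(wiki_page_title):
        """Hypothetical check that a Wikipedia page title maps to a DBpedia entity."""
        return wiki_page_title in dbpedia_entity_titles

    with open("QBLink-train.json", encoding="utf-8") as f:
        records = json.load(f)  # assumed: a list of query records

    # Step 1: drop queries whose wiki_page is empty or whose answer cannot be mapped to DBpedia.
    filtered = [r for r in records
                if r.get("wiki_page") and maps_to_dbpedia_entity(r["wiki_page"])]

    with open("QBLink-Filtered-train.json", "w", encoding="utf-8") as f:
        json.dump(filtered, f)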

  17. Wikidata5M-KG

    • huggingface.co
    Cite
    Dingjun Wu, Wikidata5M-KG [Dataset]. https://huggingface.co/datasets/Alphonse7/Wikidata5M-KG
    Explore at:
    Authors
    Dingjun Wu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Wikidata5M-KG

    Wikidata5M-KG is an open-domain knowledge graph constructed from Wikipedia and Wikidata. It contains approximately 4.6 million entities and 21 million triples. Wikidata5M-KG is built based on the Wikidata5M dataset.

    📦 Contents: wikidata5m_kg.tar.gz

    This is the processed knowledge graph used in our experiments. It contains:

    4,665,331 entities, 810 relations, and 20,987,217 triples

    After extraction, it yields a single file: wikidata5m_kg.jsonl, each… See the full description on the dataset page: https://huggingface.co/datasets/Alphonse7/Wikidata5M-KG.
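
    A minimal reading sketch; it only assumes that each line of wikidata5m_kg.jsonl is one JSON object, since the per-record schema is described on the dataset page.

    import json

    n_records = 0
    with open("wikidata5m_kg.jsonl", encoding="utf-8") as f:
        for line in f:
            json.loads(line)  # parse to validate; one JSON object per line
            n_records += 1
    print(f"{n_records} records")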

  18. Data from: WikiWiki

    • opendatalab.com
    zip
    Updated Apr 1, 2023
    Cite
    Amazon Research (2023). WikiWiki [Dataset]. https://opendatalab.com/OpenDataLab/WikiWiki
    Explore at:
    Available download formats: zip (25410 bytes)
    Dataset updated
    Apr 1, 2023
    Dataset provided by
    Amazon Research
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WikiWiki is a dataset for understanding entities and their place in a taxonomy of knowledge—their types. It consists of entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types.

  19. Wikidata

    • live.european-language-grid.eu
    json
    Updated Oct 28, 2012
    + more versions
    Cite
    (2012). Wikidata [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/7268
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 28, 2012
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

  20. Data from: Entity Typing Datasets

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Mar 1, 2023
    Cite
    Russa Biswas (2023). Entity Typing Datasets [Dataset]. http://doi.org/10.5281/zenodo.7688589
    Explore at:
    Dataset updated
    Mar 1, 2023
    Authors
    Russa Biswas
    Description

    These are the datasets used in the Entity Type Prediction task for Knowledge Graph Completion.

    - DB630k_Fine-grained_Hierarchical.zip has been used in papers [1] and [2]. It is an extended version of the DBpedia630k dataset originally created for text classification and is available here.
    - FIGER.zip has also been used in papers [1] and [2].
    - MultilingualETdata.zip has been used in paper [3].
    - NamesETdata.zip has been used in paper [4]. The CaLiGraph test dataset can also be downloaded here.

    [1] Biswas R, Sofronova R, Sack H, Alam M. Cat2Type: Wikipedia category embeddings for entity typing in knowledge graphs. In Proceedings of the 11th Knowledge Capture Conference, 2021 Dec 2 (pp. 81-88).
    [2] Biswas R, Portisch J, Paulheim H, Sack H, Alam M. Entity type prediction leveraging graph walks and entity descriptions. In The Semantic Web – ISWC 2022: 21st International Semantic Web Conference, Virtual Event, October 23–27, 2022, Proceedings (pp. 392-410). Cham: Springer International Publishing.
    [3] Biswas R, Chen Y, Paulheim H, Sack H, Alam M. It's All in the Name: Entity Typing Using Multilingual Language Models. In The Semantic Web: ESWC 2022 Satellite Events, Hersonissos, Crete, Greece, May 29–June 2, 2022, Proceedings (pp. 36-41). Cham: Springer International Publishing.
    [4] Biswas R, Sofronova R, Alam M, Heist N, Paulheim H, Sack H. Do Judge an Entity by Its Name! Entity Typing Using Language Models. In The Semantic Web: ESWC 2021 Satellite Events, Virtual Event, June 6–10, 2021, Revised Selected Papers (pp. 65-70). Springer International Publishing.
